Pure Utils

Warning

This documentation needs improvements by the code’s author.

Box Extractor

Functions to extract boxes from text.

dodfminer.extract.pure.utils.box_extractor.compare_blocks(block1, block2)[source]
Implements a comparison heuristic between blocks.

The block in the uppermost and leftmost position is ordered before the other block in the comparison.

Parameters
  • block1 – a block tuple to be compared.

  • block2 – a block tuple to be compared to.

Returns

Int
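The heuristic described above can be sketched as a classic three-way comparator. This is only an illustration, not the library's implementation: it assumes each block tuple starts with (x0, y0, ...), as in the tuples returned by get_doc_text_boxes, and that a smaller y0 means higher on the page. The name compare_blocks_sketch and the sample blocks are hypothetical, and the real function may apply tolerances or other rules.

```python
from functools import cmp_to_key


def compare_blocks_sketch(block1, block2):
    """Return a negative, zero, or positive int, like a classic comparator."""
    x0_a, y0_a = block1[0], block1[1]
    x0_b, y0_b = block2[0], block2[1]
    # Assumption: a smaller y0 means the block sits higher on the page.
    if y0_a != y0_b:
        return -1 if y0_a < y0_b else 1
    # Same vertical position: the leftmost block comes first.
    if x0_a != x0_b:
        return -1 if x0_a < x0_b else 1
    return 0


# Hypothetical blocks shaped like (x0, y0, x1, y1, text, int, int).
blocks = [
    (300.0, 50.0, 500.0, 80.0, "right, top", 1, 0),
    (40.0, 200.0, 250.0, 230.0, "left, lower", 2, 0),
    (40.0, 50.0, 250.0, 80.0, "left, top", 0, 0),
]
ordered = sorted(blocks, key=cmp_to_key(compare_blocks_sketch))
```

A comparator of this shape can be handed to sorted through functools.cmp_to_key, which is one way a function like sort_blocks could build on it.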

dodfminer.extract.pure.utils.box_extractor.draw_doc_text_boxes(doc: fitz.Document, doc_boxes, save_path=None)[source]
Draws rectangles around the extracted text blocks.

As a result, a PDF file with rectangle shapes added, representing the extracted blocks, is saved.

Parameters
  • doc – an opened fitz document

  • doc_boxes – the list of blocks on a document, separated by pages

  • save_path – a custom path for saving the result pdf

Returns

None

dodfminer.extract.pure.utils.box_extractor.get_doc_img_boxes(doc: fitz.Document)[source]

Returns a list of lists of bounding boxes of the extracted images.

Parameters

doc – an opened fitz document

Returns

List[List[Rect(float, float, float, float)]]. Each Rect represents an image bounding box.

dodfminer.extract.pure.utils.box_extractor.get_doc_text_boxes(doc: fitz.Document)[source]

Returns a list of lists of extracted text blocks, one inner list per page.

Parameters

doc – an opened fitz document.

Returns

List[List[tuple(float, float, float, float, str, int, int)]]

dodfminer.extract.pure.utils.box_extractor.get_doc_text_lines(doc: fitz.Document)[source]

Returns a list of lists of extracted text lines.

Parameters

doc – an opened fitz document.

Returns

List[List[tuple(float, float, float, str)]]

dodfminer.extract.pure.utils.box_extractor.sort_blocks(page_blocks)[source]

Sort blocks by their vertical and horizontal position.

Parameters

page_blocks – a list of blocks within a page.

Returns

List[tuple(float, float, float, float, str, int, int)]

dodfminer.extract.pure.utils.box_extractor._extract_page_lines_content(page)[source]

Extracts page lines.

Parameters

page – fitz.fitz.Page object to have its lines extracted.

Returns

List[tuple(float, float, float, float, str)] A list containing the content of each line on the page, along with its bounding box.

dodfminer.extract.pure.utils.box_extractor._get_doc_img(doc: fitz.Document)[source]

Returns a list of lists of image items.

Note

This function is intended for internal use, not for end users. Image items are described at:

https://pymupdf.readthedocs.io/en/latest/page/#Page.getImageBbox

Parameters

doc – an opened fitz document

Returns

List[List[tuple(int, int, int, int, str, str, str, str, int)]] (xref, smask, width, height, bpc, colorspace, alt. colorspace, filter, invoker)

Title Filter

Find titles using a Filter.

class dodfminer.extract.pure.utils.title_filter.BoldUpperCase[source]

Filter functions useful for bold and upper case text.

Note

This class is static and should not be instantiated.

classmethod dict_bold(data)[source]

Check if the text is bold.

Evaluates to True if d[‘flags’] matches the following condition:

  • is one of the values in BoldUpperCase.BOLD_FLAGS
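A minimal sketch of this membership test follows. The actual contents of BoldUpperCase.BOLD_FLAGS are not listed here, so the set below is a hypothetical stand-in. It relies on the assumption that PyMuPDF reports span flags as a bit field in which bit 4 (value 16) marks bold text; dict_bold_sketch and the sample spans are illustrative names only.

```python
# Assumption: in PyMuPDF span dictionaries, bit 4 of "flags" (value 16)
# marks bold text. The real BOLD_FLAGS values may be chosen differently.
BOLD_BIT = 1 << 4

# Hypothetical stand-in for BoldUpperCase.BOLD_FLAGS: every 5-bit flag
# value that has the bold bit set.
BOLD_FLAGS_SKETCH = frozenset(f for f in range(32) if f & BOLD_BIT)


def dict_bold_sketch(data):
    """True when the span's flags value is one of the accepted bold values."""
    return data["flags"] in BOLD_FLAGS_SKETCH


# Hypothetical spans shaped like PyMuPDF span dictionaries.
spans = [
    {"text": "SECRETARIA DE ESTADO", "flags": 20},  # 16 (bold) + 4 (serifed)
    {"text": "texto comum", "flags": 4},            # serifed only
]
bold_texts = [s["text"] for s in spans if dict_bold_sketch(s)]
```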

classmethod dict_text(data)[source]

Check if text is title.

Evaluates to true if d[‘text’] matches the following conditions:

  • all letters are uppercase;

  • does not contain 4 or more consecutive spaces;

  • has a length greater than BoldUpperCase.TEXT_MIN.

Returns

Boolean indicating if text is title.
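The three conditions listed above can be sketched as follows. This is an illustration under stated assumptions, not the library's code: the value of BoldUpperCase.TEXT_MIN is not given here, so TEXT_MIN_SKETCH is a hypothetical placeholder, and requiring at least one letter is an extra assumption of this sketch.

```python
import re

# Hypothetical placeholder; the real BoldUpperCase.TEXT_MIN is not shown here.
TEXT_MIN_SKETCH = 4


def dict_text_sketch(data):
    """Apply the three documented conditions to data['text']."""
    text = data["text"]
    letters = [ch for ch in text if ch.isalpha()]
    # Assumption of this sketch: text without any letter is not a title.
    all_upper = bool(letters) and all(ch.isupper() for ch in letters)
    # Reject runs of 4 or more consecutive spaces.
    no_long_gap = re.search(r" {4,}", text) is None
    long_enough = len(text) > TEXT_MIN_SKETCH
    return all_upper and no_long_gap and long_enough


is_title = dict_text_sketch({"text": "PODER EXECUTIVO"})
```

Digits and punctuation are ignored by the uppercase check, so a text such as a numbered heading can still pass as long as all of its letters are uppercase.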

Title Extractor

Extract Title and Subtitles.

class dodfminer.extract.pure.utils.title_extractor.BBox(bbox)
property bbox

Alias for field number 0

class dodfminer.extract.pure.utils.title_extractor.Box(x0, y0, x1, y1)
property x0

Alias for field number 0

property x1

Alias for field number 2

property y0

Alias for field number 1

property y1

Alias for field number 3

class dodfminer.extract.pure.utils.title_extractor.ExtractorTitleSubtitle(path)[source]

Use this class as follows:

    >> path = "path_to_pdf"
    >> extractor = ExtractorTitleSubtitle(path)
    >> # To extract titles
    >> titles = extractor.titles
    >> # To extract subtitles
    >> subtitles = extractor.subtitles
    >> # To dump titles and subtitles to a JSON file
    >> json_path = "valid_file_name"
    >> extractor.dump_json(json_path)

dump_json(path)[source]

Writes the JSON representation of the extracted titles and subtitles to the file specified by path.

Dumps the titles and subtitles according to the hierarchy found in the document.

The output file must be specified; the “.json” suffix is appended if the path does not already end with it.

Parameters
  • path – string containing the path to the .json file where the dump will be done. It is suffixed with ".json" if it is not already.
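The suffix rule can be sketched as below. This is a sketch only: with_json_suffix and dump_json_sketch are hypothetical helpers, not part of dodfminer, and the indent and encoding choices are assumptions.

```python
import json
import os
import tempfile


def with_json_suffix(path):
    """Append '.json' unless the path already ends with it."""
    return path if path.endswith(".json") else path + ".json"


def dump_json_sketch(data, path):
    """Write data as JSON to path (suffixed if needed); return the final path."""
    target = with_json_suffix(path)
    with open(target, "w", encoding="utf-8") as handle:
        json.dump(data, handle, ensure_ascii=False, indent=4)
    return target


# Write to a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    written = dump_json_sketch(
        {"PODER EXECUTIVO": []}, os.path.join(tmp, "valid_file_name")
    )
    written_name = os.path.basename(written)
    with open(written, encoding="utf-8") as handle:
        round_trip = json.load(handle)
```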

property json

All titles with its subtitles associated.

All subtitles under the same title are at the same level. Deprecated: prefer titles_subtitles or titles_subtitles_hierarchy.

reset()[source]

Sets cache to False and resets the other internal attributes. Use it when the internal state has been modified by the user.

property subtitles

All subtitles extracted from the file specified by self._path.

Returns

List[TextTypeBboxPageTuple], each with its type attribute equal to _TYPE_SUBTITLE

property titles

All titles extracted from the file specified by self._path.

Returns

List[TextTypeBboxPageTuple], each with its type attribute equal to _TYPE_TITLE

property titles_subtitles

A list of titles and subtitles, sorted according to their reading order.

property titles_subtitles_hierarchy: TitlesSubtitles(titles=<class 'str'>, subtitles=typing.List[str])

All titles and subtitles extracted from the file specified by self._path, hierarchically organized.

Returns

the titles and their respective subtitles

Return type

List[TitlesSubtitles(str, List[str])]
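One way to picture the hierarchical organization is to walk the elements in reading order and attach each subtitle to the most recent title. The sketch below is an assumption about the general idea, not the library's algorithm: the namedtuples are redefined locally to mirror the documented fields, and TYPE_TITLE, TYPE_SUBTITLE, and build_hierarchy are hypothetical names.

```python
from collections import namedtuple

# Local stand-ins mirroring the documented namedtuple fields.
TextTypeBboxPageTuple = namedtuple("TextTypeBboxPageTuple", "text type bbox page")
TitlesSubtitles = namedtuple("TitlesSubtitles", "titles subtitles")

# Hypothetical type markers; the real _TYPE_TITLE/_TYPE_SUBTITLE values
# are not documented here.
TYPE_TITLE, TYPE_SUBTITLE = "title", "subtitle"


def build_hierarchy(elements):
    """Attach each subtitle to the closest preceding title."""
    hierarchy = []
    for element in elements:
        if element.type == TYPE_TITLE:
            hierarchy.append(TitlesSubtitles(element.text, []))
        elif hierarchy:
            hierarchy[-1].subtitles.append(element.text)
        # A subtitle that appears before any title is dropped in this sketch.
    return hierarchy


elements = [
    TextTypeBboxPageTuple("PODER EXECUTIVO", TYPE_TITLE, None, 0),
    TextTypeBboxPageTuple(
        "SECRETARIA DE ESTADO DE FAZENDA, PLANEJAMENTO, ORÇAMENTO E GESTÃO",
        TYPE_TITLE, None, 0,
    ),
    TextTypeBboxPageTuple("SUBSECRETARIA DA RECEITA", TYPE_SUBTITLE, None, 1),
]
hierarchy = build_hierarchy(elements)
```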

class dodfminer.extract.pure.utils.title_extractor.TextTypeBboxPageTuple(text, type, bbox, page)
property bbox

Alias for field number 2

property page

Alias for field number 3

property text

Alias for field number 0

property type

Alias for field number 1

class dodfminer.extract.pure.utils.title_extractor.TitlesSubtitles(titles, subtitles)
property subtitles

Alias for field number 1

property titles

Alias for field number 0

dodfminer.extract.pure.utils.title_extractor.extract_titles_subtitles(path)[source]

Extracts titles and subtitles from DODF pdf.

Parameters

path – str indicating the path for the pdf to have its content extracted.

Returns

List[TextTypeBboxPageTuple] containing all titles and subtitles.

dodfminer.extract.pure.utils.title_extractor.gen_hierarchy_base(dir_path='.', folder='hierarchy', indent=4, forced=False)[source]
Generates a JSON base from all PDFs immediately under the dir_path directory.

The hierarchy files are generated under the dir_path directory.

Parameters
  • dir_path – path to the folder containing the PDFs

  • base_name – titles’ base file name

  • forced – proceed even if the folder base_name already exists

  • indent – how many spaces are used for indentation

Returns

List[Dict[str, List[Dict[str, List[Dict[str, str]]]]]], e.g.:

    [
        {
            "22012019": [
                {
                    "PODER EXECUTIVO": []
                },
                {
                    "SECRETARIA DE ESTADO DE FAZENDA, PLANEJAMENTO, ORÇAMENTO E GESTÃO": [
                        {
                            "SUBSECRETARIA DA RECEITA": ""
                        }
                    ]
                }
            ]
        }
    ]

In case of an error while trying to create the base_name folder, returns None.
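To make the nested return shape easier to read, the sketch below flattens it into (date, title, subtitle) triples. It assumes the structure is exactly as in the example above; iter_title_paths is a hypothetical helper, not a dodfminer function.

```python
def iter_title_paths(base):
    """Yield (date, title, subtitle) triples; subtitle is None when a title has none."""
    for entry in base:
        for date, titles in entry.items():
            for title_entry in titles:
                for title, subtitles in title_entry.items():
                    if not subtitles:
                        yield (date, title, None)
                    for sub_entry in subtitles:
                        # Each subtitle is stored as a dict key with an empty string value.
                        for subtitle in sub_entry:
                            yield (date, title, subtitle)


# Sample data copied from the documented return example.
base = [
    {
        "22012019": [
            {"PODER EXECUTIVO": []},
            {
                "SECRETARIA DE ESTADO DE FAZENDA, PLANEJAMENTO, ORÇAMENTO E GESTÃO": [
                    {"SUBSECRETARIA DA RECEITA": ""}
                ]
            },
        ]
    }
]
paths = list(iter_title_paths(base))
```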

dodfminer.extract.pure.utils.title_extractor.gen_title_base(dir_path='.', base_name='titles', indent=4, forced=False)[source]

Generates a titles base from all PDFs immediately under the dir_path directory. The base is generated under the dir_path directory.

Parameters
  • dir_path – path to the directory; base_name will contain all titles from the PDFs under dir_path

  • base_name – titles’ base file name

  • indent – how many spaces are used for indentation

Returns

dict containing “titles” as key and a list of titles, the same as stored at base_name[.json]

dodfminer.extract.pure.utils.title_extractor.group_by_column(elements, width)[source]

Groups elements by their columns. The sorting assumes they are on the same page and in a 2-column layout.

Essentially a “groupby” where the key is the page number of each span.

Parameters

elements – Iterable[TextTypeBboxPageTuple] sorted by their page number to be grouped.

Returns

A dict with the spans of each page, the keys being the page numbers.

dodfminer.extract.pure.utils.title_extractor.group_by_page(elements)[source]

Groups elements by page number.

Essentially a “groupby” where the key is the page number of each span.

Parameters

elements – Iterable[TextTypeBboxPageTuple] sorted by their page number to be grouped.

Returns

A dict with the spans of each page, the keys being the page numbers.
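The "groupby" idea can be sketched with itertools.groupby, which only groups consecutive items; that is why the input is expected to be sorted by page number. The namedtuple is redefined locally to mirror the documented fields, and group_by_page_sketch is a hypothetical name rather than the library's implementation.

```python
from collections import namedtuple
from itertools import groupby
from operator import attrgetter

# Local stand-in mirroring the documented namedtuple fields.
TextTypeBboxPageTuple = namedtuple("TextTypeBboxPageTuple", "text type bbox page")


def group_by_page_sketch(elements):
    """Map each page number to the list of elements found on that page."""
    # groupby only merges neighbours, so elements must already be sorted by page.
    return {
        page: list(group)
        for page, group in groupby(elements, key=attrgetter("page"))
    }


# Hypothetical elements, already sorted by page number.
elements = [
    TextTypeBboxPageTuple("PODER EXECUTIVO", "title", None, 0),
    TextTypeBboxPageTuple("SUBSECRETARIA DA RECEITA", "subtitle", None, 0),
    TextTypeBboxPageTuple("PODER LEGISLATIVO", "title", None, 1),
]
by_page = group_by_page_sketch(elements)
```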

dodfminer.extract.pure.utils.title_extractor.invert_text_type_bbox_page_tuple(text_type_bbox_page_tuple)[source]

Reverses the type between _TYPE_TITLE and _TYPE_SUBTITLE.

Parameters

text_type_bbox_page_tuple – instance of TextTypeBboxPageTuple.

Returns

A copy of text_type_bbox_page_tuple with its type field reversed.
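Because TextTypeBboxPageTuple is a namedtuple, a copy with one field changed can be produced with _replace. The sketch below illustrates that approach; the type marker values and the name invert_type_sketch are hypothetical, since the real _TYPE_TITLE and _TYPE_SUBTITLE values are not documented here.

```python
from collections import namedtuple

# Local stand-in mirroring the documented namedtuple fields.
TextTypeBboxPageTuple = namedtuple("TextTypeBboxPageTuple", "text type bbox page")

# Hypothetical type markers.
TYPE_TITLE, TYPE_SUBTITLE = "title", "subtitle"


def invert_type_sketch(element):
    """Return a copy of element with title and subtitle types swapped."""
    new_type = TYPE_SUBTITLE if element.type == TYPE_TITLE else TYPE_TITLE
    # _replace builds a new tuple; the original element is left untouched.
    return element._replace(type=new_type)


original = TextTypeBboxPageTuple("PODER EXECUTIVO", TYPE_TITLE, None, 0)
inverted = invert_type_sketch(original)
```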

dodfminer.extract.pure.utils.title_extractor.load_blocks_list(path)[source]

Loads list of blocks list from the file specified.

Parameters

path – string with path to DODF pdf file

Returns

A list of pages, each element being the list of blocks on the corresponding page.

dodfminer.extract.pure.utils.title_extractor.sort_2column(elements, width_lis)[source]

Sorts TextTypeBboxPageTuple iterable.

Sorts sequence of TextTypeBboxPageTuple objects, assuming a full 2-columns layout over them.

Parameters

elements – Iterable[TextTypeBboxPageTuple]

Returns

dictionary mapping page number to its elements sorted by column (assuming there are always 2 columns per page)

dodfminer.extract.pure.utils.title_extractor.sort_by_column(elements, width)[source]

Sorts list elements by columns.

Parameters
  • elements – Iterable[TextTypeBboxPageTuple].

  • width – the page width (the context in which all list elements were originally).

Returns

List[TextTypeBboxPageTuple] containing the list elements sorted according to:

  1. columns

  2. position on column

Assumes a 2-column page layout. All elements in the left column are placed before any element in the right one. Within each column, the reading order is expected to be kept.
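The ordering rule can be sketched as a stable partition around the middle of the page. This is an assumption about how the rule might be applied, not the library's code: it treats an element as belonging to the left column when its bbox.x0 is smaller than half the page width, assumes bbox is a Box, and uses the hypothetical name sort_by_column_sketch.

```python
from collections import namedtuple

# Local stand-ins mirroring the documented namedtuple fields.
Box = namedtuple("Box", "x0 y0 x1 y1")
TextTypeBboxPageTuple = namedtuple("TextTypeBboxPageTuple", "text type bbox page")


def sort_by_column_sketch(elements, width):
    """Left-column elements first, then right-column ones, keeping input order."""
    middle = width / 2
    # Assumption: an element belongs to the left column if it starts before
    # the middle of the page.
    left = [e for e in elements if e.bbox.x0 < middle]
    right = [e for e in elements if e.bbox.x0 >= middle]
    return left + right


# Hypothetical elements on a page 600 units wide, given in top-to-bottom order.
elements = [
    TextTypeBboxPageTuple("A (left)", "title", Box(40, 50, 280, 70), 0),
    TextTypeBboxPageTuple("B (right)", "title", Box(320, 60, 560, 80), 0),
    TextTypeBboxPageTuple("C (left)", "subtitle", Box(40, 300, 280, 320), 0),
]
ordered = sort_by_column_sketch(elements, width=600)
```

Partitioning rather than re-sorting keeps whatever order the elements already had within each column, which matches the expectation that reading order is preserved.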

dodfminer.extract.pure.utils.title_extractor._extract_bold_upper_page(page)[source]

Extracts the page content that is both bold and uppercase.

Parameters

page – fitz.fitz.Page object to have its bold content extracted.

Returns

A list containing all bold (and simultaneously upper) content at the page.

dodfminer.extract.pure.utils.title_extractor._extract_bold_upper_pdf(doc)[source]

Extracts bold content from DODF pdf.

Parameters

doc – DODF pdf file returned by fitz.open

Returns

A list of lists of bold span text.

dodfminer.extract.pure.utils.title_extractor._get_titles_subtitles(elements, width_lis)[source]

Extracts titles and subtitles from list. WARNING: Based on font size and heuristic.

Parameters

elements – a list of dicts, each of them having the keys: size -> float, text -> str, bbox -> Box, page -> int

Returns

TitlesSubtitles[List[TextTypeBboxPageTuple], List[TextTypeBboxPageTuple]].

dodfminer.extract.pure.utils.title_extractor._get_titles_subtitles_smart(doc, width_lis)[source]

Extracts titles and subtitles. Makes use of heuristics.

Wraps _get_titles_subtitles, removing most of the impurities (spans which aren’t titles/subtitles).

Parameters

doc – DODF pdf file returned by fitz.open

Returns

TitlesSubtitles(List[TextTypeBboxPageTuple], List[TextTypeBboxPageTuple]).
