Pure Utils ¶

Warning

This documentation needs improvements by the code’s author.

Box Extractor ¶

Functions to extract boxes from text.

dodfminer.extract.pure.utils.box_extractor.compare_blocks(block1, block2)[source]¶

Implements a comparison heuristic between blocks. Blocks in the uppermost and leftmost positions should be inserted before the block they are compared with.

Parameters

block1 – a block tuple to be compared.
block2 – a block tuple to be compared to.

Returns

int
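
One possible reading of this heuristic is sketched below. The name compare_blocks_sketch and the exact tie-breaking are assumptions, and the real function may apply tolerances that this version omits. Blocks are assumed to be tuples starting with (x0, y0, x1, y1), with y growing downwards as in PyMuPDF page coordinates.

```python
from functools import cmp_to_key


def compare_blocks_sketch(block1, block2):
    """Return -1, 0 or 1 so that upper blocks, then leftmost blocks, come first."""
    x0_a, y0_a = block1[0], block1[1]
    x0_b, y0_b = block2[0], block2[1]
    if y0_a != y0_b:
        # A smaller y0 means the block starts higher on the page.
        return -1 if y0_a < y0_b else 1
    if x0_a != x0_b:
        # Same height: the block further to the left comes first.
        return -1 if x0_a < x0_b else 1
    return 0


# Hypothetical blocks: (x0, y0, x1, y1, text, block_no, block_type)
blocks = [
    (300.0, 50.0, 550.0, 80.0, "right", 1, 0),
    (40.0, 200.0, 290.0, 230.0, "lower", 2, 0),
    (40.0, 50.0, 290.0, 80.0, "left", 0, 0),
]
ordered = sorted(blocks, key=cmp_to_key(compare_blocks_sketch))
```

A comparator that returns an int in this way can be passed to sorted through functools.cmp_to_key, as shown in the last line.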

dodfminer.extract.pure.utils.box_extractor.draw_doc_text_boxes(doc: fitz.Document, doc_boxes, save_path=None)[source]¶

Draws rectangles around the extracted text blocks. As a result, a PDF file with rectangle shapes added, representing the extracted blocks, is saved.

Parameters

doc – an opened fitz document
doc_boxes – the list of blocks on a document, separated by pages
save_path – a custom path for saving the result pdf

Returns

None

dodfminer.extract.pure.utils.box_extractor.get_doc_img_boxes(doc: fitz.Document)[source]¶

Returns a list of lists of bounding boxes of extracted images.

Parameters

doc – an opened fitz document

Returns

List[List[Rect(float, float, float, float)]]. Each Rect represents an image bounding box.

dodfminer.extract.pure.utils.box_extractor.get_doc_text_boxes(doc: fitz.Document)[source]¶

Returns a list of lists of extracted text blocks.

Parameters: doc – an opened fitz document.
Returns: List[List[tuple(float, float, float, float, str, int, int)]]
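
The nested return value can be walked page by page. The sketch below is illustrative only: page_texts is a hypothetical helper, the sample data is invented, and the field order (x0, y0, x1, y1, text, block_no, block_type) is assumed from the tuple layout PyMuPDF uses for text blocks.

```python
from typing import List, Tuple

# Assumed field order: x0, y0, x1, y1, text, block_no, block_type
Block = Tuple[float, float, float, float, str, int, int]


def page_texts(doc_boxes: List[List[Block]]) -> List[str]:
    """Join the text of every block on each page, one string per page."""
    texts = []
    for page_blocks in doc_boxes:
        texts.append(
            "\n".join(text.strip() for _, _, _, _, text, _, _ in page_blocks)
        )
    return texts


# Invented sample with two pages.
sample = [
    [(40.0, 50.0, 290.0, 80.0, "PODER EXECUTIVO\n", 0, 0)],
    [
        (40.0, 60.0, 290.0, 90.0, "SUBSECRETARIA DA RECEITA\n", 0, 0),
        (40.0, 100.0, 290.0, 130.0, "Some body text\n", 1, 0),
    ],
]
texts = page_texts(sample)
```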

dodfminer.extract.pure.utils.box_extractor.get_doc_text_lines(doc: fitz.Document)[source]¶

Returns a list of lists of extracted text lines.

Parameters: doc – an opened fitz document.
Returns: List[List[tuple(float, float, float, str)]]

dodfminer.extract.pure.utils.box_extractor.sort_blocks(page_blocks)[source]¶

Sort blocks by their vertical and horizontal position.

Parameters: page_blocks – a list of blocks within a page.
Returns: List[tuple(float, float, float, float, str, int, int)]

dodfminer.extract.pure.utils.box_extractor._extract_page_lines_content(page)[source]¶

Extracts page lines.

Parameters: page – fitz.fitz.Page object to have its lines extracted.
Returns: List[tuple(float, float, float, float, str)] A list containing the content of the lines on the page, along with their bounding boxes.

dodfminer.extract.pure.utils.box_extractor._get_doc_img(doc: fitz.Document)[source]¶

Returns a list of lists of image items.

Note

This function is not intended for end users; it is used internally. Image items are described at:

https://pymupdf.readthedocs.io/en/latest/page/#Page.getImageBbox

Parameters: doc – an opened fitz document
Returns: List[List[tuple(int, int, int, int, str, str, str, str, int)]] (xref, smask, width, height, bpc, colorspace, alt. colorspace, filter, invoker)

Title Filter ¶

Find titles using a Filter.

class dodfminer.extract.pure.utils.title_filter.BoldUpperCase[source]¶

Filter functions useful for bold and upper case text.

Note

This class is static and should not be instantiated.

classmethod dict_bold(data)[source]¶

Check if text is bold.

Evaluates to True if d[‘flags’] is one of the values in BoldUpperCase.BOLD_FLAGS.

classmethod dict_text(data)[source]¶

Check if text is title.

Evaluates to True if d[‘text’] matches all of the following conditions:

all letters are uppercase;

does not contain 4 or more consecutive spaces;

has a length greater than BoldUpperCase.TEXT_MIN.

Returns: Boolean indicating if text is title.
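
The three conditions above can be sketched as a standalone predicate. This is not the library's code: dict_text_sketch is a hypothetical name, and the TEXT_MIN value used here is an assumption, since the real BoldUpperCase.TEXT_MIN is not documented on this page.

```python
TEXT_MIN = 4  # hypothetical threshold; the real BoldUpperCase.TEXT_MIN may differ


def dict_text_sketch(d):
    """Apply the three documented title conditions to d['text']."""
    text = d["text"]
    # Condition 1: every alphabetic character is uppercase.
    all_upper = all(ch.isupper() for ch in text if ch.isalpha())
    # Condition 2: no run of 4 or more consecutive spaces.
    no_wide_gap = " " * 4 not in text
    # Condition 3: the text is longer than the minimum length.
    long_enough = len(text) > TEXT_MIN
    return all_upper and no_wide_gap and long_enough
```

For instance, dict_text_sketch({"text": "PODER EXECUTIVO"}) passes all three checks, while mixed-case text, text with a wide run of spaces, or very short text is rejected.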

Title Extractor ¶

Extracts titles and subtitles.

class dodfminer.extract.pure.utils.title_extractor.BBox(bbox)¶

property bbox¶: Alias for field number 0

class dodfminer.extract.pure.utils.title_extractor.Box(x0, y0, x1, y1)¶

property x0¶: Alias for field number 0

property x1¶: Alias for field number 2

property y0¶: Alias for field number 1

property y1¶: Alias for field number 3
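
The "Alias for field number N" entries are the docstrings that collections.namedtuple generates, which suggests Box is a named tuple. A minimal sketch under that assumption follows; box_size is a hypothetical helper that is not part of the module.

```python
from collections import namedtuple

# Assumed definition, matching the documented field order.
Box = namedtuple("Box", "x0 y0 x1 y1")


def box_size(box):
    """Return (width, height) of a Box."""
    return box.x1 - box.x0, box.y1 - box.y0


box = Box(40.0, 50.0, 290.0, 80.0)
size = box_size(box)
```

Fields can be read by name (box.x0) or by position (box[0]), which is what "alias for field number 0" refers to.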

class dodfminer.extract.pure.utils.title_extractor.ExtractorTitleSubtitle(path)[source]¶

Use this class as follows:

>> path = "path_to_pdf"
>> extractor = ExtractorTitleSubtitle(path)
>> # To extract titles
>> titles = extractor.titles
>> # To extract subtitles
>> subtitles = extractor.subtitles
>> # To dump titles and subtitles to a JSON file
>> json_path = "valid_file_name"
>> extractor.dump_json(json_path)

dump_json(path)[source]¶

Writes the JSON representation of the extracted titles and subtitles to the file specified by path.

Dumps the titles and subtitles according to the hierarchy found in the document.

The output file must be specified; it is suffixed with “.json” if it is not already.

Parameters

path – string containing the path to the .json file where the dump will be written. It is suffixed with “.json” if it is not already.
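
The suffix rule described for path can be sketched with the standard library. The helper name dump_with_json_suffix and the choice of indent and encoding are assumptions for illustration; the real method may behave differently in those details.

```python
import json
from pathlib import Path


def dump_with_json_suffix(data, path):
    """Write data as JSON to path, appending '.json' if the suffix is missing."""
    target = Path(path)
    if target.suffix != ".json":
        target = target.with_name(target.name + ".json")
    target.write_text(
        json.dumps(data, ensure_ascii=False, indent=4), encoding="utf-8"
    )
    return target
```

Called with "valid_file_name", this sketch writes to valid_file_name.json; called with a path that already ends in .json, it leaves the name unchanged.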

property json¶

All titles with their associated subtitles.

All subtitles under the same title are at the same level. Deprecated: prefer titles_subtitles or titles_subtitles_hierarchy.

reset()[source]¶: Sets cache to False and resets other internal attributes. Use it when the internal state has been modified by the user.

property subtitles¶

All subtitles extracted from the file specified by self._path.

Returns: List[TextTypeBboxPageTuple], each with its type attribute equal to _TYPE_SUBTITLE

property titles¶

All titles extracted from the file specified by self._path.

Returns: List[TextTypeBboxPageTuple], each with its type attribute equal to _TYPE_TITLE

property titles_subtitles¶: A list with titles and subtitles, sorted according to their reading order.

property titles_subtitles_hierarchy: TitlesSubtitles(titles=<class 'str'>, subtitles=typing.List[str])¶

All titles and subtitles extracted from the file specified by self._path, hierarchically organized.

Returns: the titles and their respective subtitles
Return type: List[TitlesSubtitles(str, List[str])]
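
To illustrate the shape of this return value, the sketch below groups a reading-order list of (text, type) pairs into title and subtitle records. It is an assumption about how such a hierarchy could be built, not the library's algorithm; build_hierarchy and the two type constants are hypothetical.

```python
from collections import namedtuple

TitlesSubtitles = namedtuple("TitlesSubtitles", "titles subtitles")

# Hypothetical stand-ins for the module's _TYPE_TITLE and _TYPE_SUBTITLE.
TYPE_TITLE, TYPE_SUBTITLE = "title", "subtitle"


def build_hierarchy(elements):
    """Group (text, type) pairs, given in reading order, under their titles."""
    hierarchy = []
    for text, kind in elements:
        if kind == TYPE_TITLE:
            hierarchy.append(TitlesSubtitles(text, []))
        elif hierarchy:
            # A subtitle belongs to the most recent title; a subtitle that
            # appears before any title is dropped in this sketch.
            hierarchy[-1].subtitles.append(text)
    return hierarchy


reading_order = [
    ("PODER EXECUTIVO", TYPE_TITLE),
    ("SECRETARIA DE ESTADO DE FAZENDA", TYPE_TITLE),
    ("SUBSECRETARIA DA RECEITA", TYPE_SUBTITLE),
]
result = build_hierarchy(reading_order)
```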

class dodfminer.extract.pure.utils.title_extractor.TextTypeBboxPageTuple(text, type, bbox, page)¶

property bbox¶: Alias for field number 2

property page¶: Alias for field number 3

property text¶: Alias for field number 0

property type¶: Alias for field number 1

class dodfminer.extract.pure.utils.title_extractor.TitlesSubtitles(titles, subtitles)¶

property subtitles¶: Alias for field number 1

property titles¶: Alias for field number 0

dodfminer.extract.pure.utils.title_extractor.extract_titles_subtitles(path)[source]¶

Extracts titles and subtitles from a DODF PDF.

Parameters: path – str indicating the path of the PDF to have its content extracted.
Returns: List[TextTypeBboxPageTuple] containing all titles and subtitles.

dodfminer.extract.pure.utils.title_extractor.gen_hierarchy_base(dir_path='.', folder='hierarchy', indent=4, forced=False)[source]¶

Generates a JSON base from all PDFs immediately under the dir_path directory.

The hierarchy files are generated under the dir_path directory.

Parameters

dir_path – path to the folder containing the PDFs
folder – name of the folder where the hierarchy files are generated
indent – number of spaces used for indentation
forced – proceed even if the folder already exists

Returns

List[Dict[str, List[Dict[str, List[Dict[str, str]]]]]], e.g.:

[
    {
        "22012019": [
            {
                "PODER EXECUTIVO": []
            },
            {
                "SECRETARIA DE ESTADO DE FAZENDA, PLANEJAMENTO, ORÇAMENTO E GESTÃO": [
                    {
                        "SUBSECRETARIA DA RECEITA": ""
                    }
                ]
            }
        ]
    }
]

In case of an error while trying to create the folder, returns None.

dodfminer.extract.pure.utils.title_extractor.gen_title_base(dir_path='.', base_name='titles', indent=4, forced=False)[source]¶

Generates a titles base from all PDFs immediately under the dir_path directory. The base is generated under the dir_path directory.

Parameters

dir_path – path to the directory; base_name will contain all titles from the PDFs under dir_path
base_name – titles’ base file name
indent – number of spaces used for indentation

Returns

dict containing “titles” as key and a list of titles, the same as stored at base_name[.json]

dodfminer.extract.pure.utils.title_extractor.group_by_column(elements, width)[source]¶

Groups elements by their columns. The grouping assumes they are on the same page and in a 2-column layout.

Essentially a “groupby” where the key is the page number of each span.

Parameters: elements – Iterable[TextTypeBboxPageTuple], sorted by page number, to be grouped.
Returns: A dict with the spans of each page, keyed by page number.

dodfminer.extract.pure.utils.title_extractor.group_by_page(elements)[source]¶

Groups elements by page number.

Essentially a “groupby” where the key is the page number of each span.

Parameters: elements – Iterable[TextTypeBboxPageTuple], sorted by page number, to be grouped.
Returns: A dict with the spans of each page, keyed by page number.
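
The “groupby” description maps naturally onto itertools.groupby, which only merges adjacent items and therefore needs the input sorted by page, as the parameter description requires. The sketch below is an illustration under that reading; group_by_page_sketch is a hypothetical name and the namedtuple definition is assumed from the documented fields.

```python
from collections import namedtuple
from itertools import groupby

# Assumed definition, matching the documented field order.
TextTypeBboxPageTuple = namedtuple("TextTypeBboxPageTuple", "text type bbox page")


def group_by_page_sketch(elements):
    """Map each page number to the list of elements found on that page."""
    return {
        page: list(group)
        for page, group in groupby(elements, key=lambda e: e.page)
    }


# Invented sample, already sorted by page number.
elements = [
    TextTypeBboxPageTuple("A", "title", None, 0),
    TextTypeBboxPageTuple("B", "subtitle", None, 0),
    TextTypeBboxPageTuple("C", "title", None, 1),
]
grouped = group_by_page_sketch(elements)
```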

dodfminer.extract.pure.utils.title_extractor.invert_text_type_bbox_page_tuple(text_type_bbox_page_tuple)[source]¶

Swaps the type between _TYPE_TITLE and _TYPE_SUBTITLE.

Parameters: text_type_bbox_page_tuple – instance of TextTypeBboxPageTuple.
Returns: copy of text_type_bbox_page_tuple with its type field swapped.
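
Since named tuples are immutable, returning a copy with one field changed is typically done with _replace. The sketch below assumes TextTypeBboxPageTuple is a collections.namedtuple and uses hypothetical values for the two type constants; invert_sketch is not the library's function.

```python
from collections import namedtuple

TextTypeBboxPageTuple = namedtuple("TextTypeBboxPageTuple", "text type bbox page")

# Hypothetical stand-ins for the module's _TYPE_TITLE and _TYPE_SUBTITLE.
TYPE_TITLE, TYPE_SUBTITLE = "title", "subtitle"


def invert_sketch(element):
    """Return a copy of element with its type flipped; the input is untouched."""
    new_type = TYPE_SUBTITLE if element.type == TYPE_TITLE else TYPE_TITLE
    return element._replace(type=new_type)


original = TextTypeBboxPageTuple("PODER EXECUTIVO", TYPE_TITLE, None, 0)
inverted = invert_sketch(original)
```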

dodfminer.extract.pure.utils.title_extractor.load_blocks_list(path)[source]¶

Loads the list of per-page block lists from the specified file.

Parameters: path – string with the path to the DODF PDF file
Returns: A list with one element per page, each element being the list of blocks on that page.

dodfminer.extract.pure.utils.title_extractor.sort_2column(elements, width_lis)[source]¶

Sorts an iterable of TextTypeBboxPageTuple.

Sorts a sequence of TextTypeBboxPageTuple objects, assuming a full 2-column layout over them.

Parameters: elements – Iterable[TextTypeBboxPageTuple]
Returns: dictionary mapping each page number to its elements sorted by column (assuming there are always 2 columns per page)

dodfminer.extract.pure.utils.title_extractor.sort_by_column(elements, width)[source]¶

Sorts list elements by columns.

Parameters

elements – Iterable[TextTypeBboxPageTuple].
width – the page width (the context in which all list elements were originally).

Returns

List[TextTypeBboxPageTuple] containing the list elements sorted according to:

columns

position on column

Assumes a 2-column page layout. All elements in the left column are placed before any element in the right one. Within each column, reading order is expected to be preserved.
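
A minimal sketch of this ordering rule is a stable partition around the middle of the page. Deciding the column by comparing bbox.x0 with width / 2 is an assumption made here for illustration, as are the name sort_by_column_sketch and the namedtuple definitions; the real function may use a different criterion.

```python
from collections import namedtuple

# Assumed definitions, matching the documented field orders.
Box = namedtuple("Box", "x0 y0 x1 y1")
TextTypeBboxPageTuple = namedtuple("TextTypeBboxPageTuple", "text type bbox page")


def sort_by_column_sketch(elements, width):
    """Place left-column elements first, keeping the input order in each column."""
    middle = width / 2
    elements = list(elements)
    left = [e for e in elements if e.bbox.x0 < middle]
    right = [e for e in elements if e.bbox.x0 >= middle]
    return left + right


# Invented sample on a page 600 units wide, given in top-to-bottom order.
items = [
    TextTypeBboxPageTuple("R1", "title", Box(320.0, 40.0, 580.0, 60.0), 0),
    TextTypeBboxPageTuple("L1", "title", Box(40.0, 50.0, 290.0, 70.0), 0),
    TextTypeBboxPageTuple("L2", "subtitle", Box(40.0, 200.0, 290.0, 220.0), 0),
]
ordered = sort_by_column_sketch(items, 600.0)
```

Because list comprehensions preserve input order, the reading order inside each column is kept without a second sort.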

dodfminer.extract.pure.utils.title_extractor._extract_bold_upper_page(page)[source]¶

Extracts page content that is both bold and uppercase.

Parameters: page – fitz.fitz.Page object to have its bold content extracted.
Returns: A list containing all bold (and simultaneously upper) content at the page.
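
As a rough illustration, the sketch below walks a dictionary shaped like the output of PyMuPDF's page.get_text("dict") (blocks, then lines, then spans) and keeps bold uppercase spans. Testing bit 4 of the span flags for bold is an assumption; the library is documented above as comparing against BoldUpperCase.BOLD_FLAGS instead. The name bold_upper_spans and the sample data are invented.

```python
BOLD_BIT = 1 << 4  # PyMuPDF sets this bit in a span's "flags" for bold fonts


def bold_upper_spans(page_dict):
    """Collect the text of spans that are bold and entirely uppercase."""
    found = []
    for block in page_dict.get("blocks", []):
        # Image blocks carry no "lines" key, so they are skipped here.
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                text = span["text"].strip()
                if span["flags"] & BOLD_BIT and text.isupper():
                    found.append(text)
    return found


# Invented page dictionary with one text block and one image block.
page_dict = {
    "blocks": [
        {
            "type": 0,
            "lines": [
                {
                    "spans": [
                        {"text": "PODER EXECUTIVO", "flags": 20},
                        {"text": "Bold but mixed case", "flags": 16},
                        {"text": "UPPER BUT NOT BOLD", "flags": 4},
                    ]
                }
            ],
        },
        {"type": 1},
    ]
}
spans = bold_upper_spans(page_dict)
```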

dodfminer.extract.pure.utils.title_extractor._extract_bold_upper_pdf(doc)[source]¶

Extracts bold content from a DODF PDF.

Parameters: doc – DODF PDF file returned by fitz.open
Returns: a list of lists of bold span text

dodfminer.extract.pure.utils.title_extractor._get_titles_subtitles(elements, width_lis)[source]¶

Extracts titles and subtitles from a list. WARNING: Based on font size and heuristics.

Parameters: elements – a list of dicts, all of them having the keys: size -> float, text -> str, bbox -> Box, page -> int
Returns: TitlesSubtitles(List[TextTypeBboxPageTuple], List[TextTypeBboxPageTuple]).

dodfminer.extract.pure.utils.title_extractor._get_titles_subtitles_smart(doc, width_lis)[source]¶

Extracts titles and subtitles. Makes use of heuristics.

Wraps _get_titles_subtitles, removing most impurities (spans that aren’t titles/subtitles).

Parameters

doc – DODF pdf file returned by fitz.open

Returns

TitlesSubtitles(List[TextTypeBboxPageTuple], List[TextTypeBboxPageTuple]).
