Pure Utils
Warning
This documentation needs improvements by the code’s author.
Box Extractor
Functions to extract boxes from text.
- dodfminer.extract.pure.utils.box_extractor.compare_blocks(block1, block2)[source]¶
- Implements a comparison heuristic between blocks.
Blocks that are in the uppermost and leftmost positions should be inserted before the other block in comparison.
- Parameters
block1 – a block tuple to be compared.
block2 – a block tuple to be compared to.
- Returns
Int
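As a rough illustration of the heuristic described above, the sketch below orders blocks top to bottom and then left to right. It is not the library's implementation: the name `compare_blocks_sketch`, the tuple layout (x0, y0, x1, y1, text, block number, block type), the top-left coordinate origin, and the tie-breaking rule are all assumptions.

```python
from functools import cmp_to_key


def compare_blocks_sketch(block1, block2):
    """Hypothetical comparator: negative when block1 should come first."""
    x0_a, y0_a = block1[0], block1[1]
    x0_b, y0_b = block2[0], block2[1]
    # Assumes a top-left origin, so a smaller y0 means higher on the page.
    if y0_a != y0_b:
        return -1 if y0_a < y0_b else 1
    # Same vertical position: the leftmost block goes first.
    if x0_a != x0_b:
        return -1 if x0_a < x0_b else 1
    return 0


# Hand-written sample blocks, deliberately out of reading order.
blocks = [
    (300.0, 50.0, 500.0, 80.0, "right header", 1, 0),
    (40.0, 120.0, 280.0, 200.0, "body", 2, 0),
    (40.0, 50.0, 280.0, 80.0, "left header", 0, 0),
]

ordered = sorted(blocks, key=cmp_to_key(compare_blocks_sketch))
ordered_texts = [block[4] for block in ordered]
```

A comparator returning an int fits `functools.cmp_to_key`, which is one plausible way such a function could be plugged into a sort.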
- dodfminer.extract.pure.utils.box_extractor.draw_doc_text_boxes(doc: fitz.Document, doc_boxes, save_path=None)[source]¶
- Draws rectangles around the extracted text blocks.
As a result, a PDF file is saved with one rectangle shape added for each extracted block.
- Parameters
doc – an opened fitz document
doc_boxes – the list of blocks on a document, separated by pages
save_path – a custom path for saving the result pdf
- Returns
None
- dodfminer.extract.pure.utils.box_extractor.get_doc_img_boxes(doc: fitz.Document)[source]¶
Returns a list of lists of bounding boxes of extracted images.
- Parameters
doc – an opened fitz document
- Returns
- List[List[Rect(float, float, float, float)]]. Each Rect represents
an image bounding box.
- dodfminer.extract.pure.utils.box_extractor.get_doc_text_boxes(doc: fitz.Document)[source]¶
Returns a list of lists of extracted text blocks.
- Parameters
doc – an opened fitz document.
- Returns
List[List[tuple(float, float, float, float, str, int, int)]]
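The nested return type above can be consumed page by page. The sketch below uses hand-written sample data rather than a real document. It assumes the seven tuple fields are (x0, y0, x1, y1, text, block number, block type) and that a block type of 0 marks a text block, which matches the shape of PyMuPDF block tuples but is not confirmed by this page. `page_texts` is a hypothetical helper.

```python
# Sample data in the documented shape: one inner list per page.
doc_boxes = [
    [
        (40.0, 50.0, 280.0, 80.0, "SEÇÃO I", 0, 0),
        (40.0, 90.0, 280.0, 300.0, "image placeholder", 1, 1),
    ],
    [
        (40.0, 50.0, 280.0, 400.0, "PODER EXECUTIVO", 0, 0),
    ],
]


def page_texts(doc_boxes):
    """Return, for each page, the text of blocks whose type is 0 (assumed: text)."""
    result = []
    for page_blocks in doc_boxes:
        texts = []
        for x0, y0, x1, y1, text, block_no, block_type in page_blocks:
            if block_type == 0:
                texts.append(text)
        result.append(texts)
    return result


texts_per_page = page_texts(doc_boxes)
```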
- dodfminer.extract.pure.utils.box_extractor.get_doc_text_lines(doc: fitz.Document)[source]¶
Returns a list of lists of extracted text lines.
- Parameters
doc – an opened fitz document.
- Returns
List[List[tuple(float, float, float, str)]]
- dodfminer.extract.pure.utils.box_extractor.sort_blocks(page_blocks)[source]¶
Sort blocks by their vertical and horizontal position.
- Parameters
page_blocks – a list of blocks within a page.
- Returns
List[tuple(float, float, float, float, str, int, int)]
- dodfminer.extract.pure.utils.box_extractor._extract_page_lines_content(page)[source]¶
Extracts page lines.
- Parameters
page – fitz.fitz.Page object to have its lines extracted.
- Returns
List[tuple(float, float, float, float, str)]. A list containing the content of each line on the page, along with its bounding box.
- dodfminer.extract.pure.utils.box_extractor._get_doc_img(doc: fitz.Document)[source]¶
Returns list of list of image items.
Note
This function is not intended to be used by end users, only internally. Image items are described at:
https://pymupdf.readthedocs.io/en/latest/page/#Page.getImageBbox
- Parameters
doc – an opened fitz document
- Returns
List[List[tuple(int, int, int, int, str, str, str, str, int)]] (xref, smask, width, height, bpc, colorspace, alt. colorspace, filter, invoker)
Title Filter
Find titles using a Filter.
- class dodfminer.extract.pure.utils.title_filter.BoldUpperCase[source]¶
Filter functions useful for bold and upper case text.
Note
This class is static and should not be instantiated.
Title Extractor
Extract Title and Subtitles.
- class dodfminer.extract.pure.utils.title_extractor.BBox(bbox)¶
- property bbox¶
Alias for field number 0
- class dodfminer.extract.pure.utils.title_extractor.Box(x0, y0, x1, y1)¶
- property x0¶
Alias for field number 0
- property x1¶
Alias for field number 2
- property y0¶
Alias for field number 1
- property y1¶
Alias for field number 3
- class dodfminer.extract.pure.utils.title_extractor.ExtractorTitleSubtitle(path)[source]¶
Use this class as follows:
    >>> path = "path_to_pdf"
    >>> extractor = ExtractorTitleSubtitle(path)
    >>> # To extract titles
    >>> titles = extractor.titles
    >>> # To extract subtitles
    >>> subtitles = extractor.subtitles
    >>> # To dump titles and subtitles to a JSON file
    >>> json_path = "valid_file_name"
    >>> extractor.dump_json(json_path)
- dump_json(path)[source]¶
Writes the JSON representation of the extracted titles and subtitles to the file specified by path.
Dumps the titles and subtitles according to the hierarchy verified on the document.
The output file must be specified; the ".json" suffix is appended if it is missing.
- Parameters
path – string containing the path to the .json file where the dump will be done. It is suffixed with ".json" if it is not already.
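The suffix rule described for `path` can be pictured with the short sketch below. `with_json_suffix` is a hypothetical helper written for illustration. The library may handle the suffix differently, for example with respect to letter case.

```python
def with_json_suffix(path):
    """Hypothetical helper: append ".json" unless the path already ends with it."""
    return path if path.endswith(".json") else path + ".json"


plain = with_json_suffix("valid_file_name")
already_suffixed = with_json_suffix("valid_file_name.json")
```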
- property json¶
All titles with their associated subtitles.
All subtitles under the same title are at the same level. Deprecated: prefer titles_subtitles or titles_subtitles_hierarchy.
- reset()[source]¶
Sets cache to False and resets the other internal attributes. Use it when the internal state has been modified by the user for some reason.
- property subtitles¶
All subtitles extracted from the file specified by self._path.
- Returns
List[TextTypeBboxPageTuple], each with its type attribute equal to _TYPE_SUBTITLE
- property titles¶
All titles extracted from the file specified by self._path.
- Returns
List[TextTypeBboxPageTuple], each with its type attribute equal to _TYPE_TITLE
- property titles_subtitles¶
A list with titles and subtitles, sorted according to their reading order.
- property titles_subtitles_hierarchy: TitlesSubtitles(titles=<class 'str'>, subtitles=typing.List[str])¶
All titles and subtitles extracted from the file specified by self._path, hierarchically organized.
- Returns
the titles and their respective subtitles
- Return type
List[TitlesSubtitles(str, List[str])]
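One way to picture the hierarchical return type is to fold a reading-order list of titles and subtitles into `TitlesSubtitles(str, List[str])` entries. The sketch below is only an illustration under assumptions: the namedtuples are local stand-ins for the library's, the type constant values are invented, and `build_hierarchy` is a hypothetical function, not the library's algorithm.

```python
from collections import namedtuple

# Local stand-ins; the real constants _TYPE_TITLE and _TYPE_SUBTITLE may differ.
TextTypeBboxPageTuple = namedtuple("TextTypeBboxPageTuple", "text type bbox page")
TitlesSubtitles = namedtuple("TitlesSubtitles", "titles subtitles")
TYPE_TITLE, TYPE_SUBTITLE = "title", "subtitle"


def build_hierarchy(elements):
    """Attach each subtitle to the most recent title, following reading order."""
    hierarchy = []
    for element in elements:
        if element.type == TYPE_TITLE:
            hierarchy.append(TitlesSubtitles(element.text, []))
        elif hierarchy:
            # Subtitles that appear before any title are dropped in this sketch.
            hierarchy[-1].subtitles.append(element.text)
    return hierarchy


elements = [
    TextTypeBboxPageTuple("PODER EXECUTIVO", TYPE_TITLE, None, 0),
    TextTypeBboxPageTuple("SECRETARIA DE ESTADO DE FAZENDA", TYPE_TITLE, None, 0),
    TextTypeBboxPageTuple("SUBSECRETARIA DA RECEITA", TYPE_SUBTITLE, None, 0),
]
hierarchy = build_hierarchy(elements)
```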
- class dodfminer.extract.pure.utils.title_extractor.TextTypeBboxPageTuple(text, type, bbox, page)¶
- property bbox¶
Alias for field number 2
- property page¶
Alias for field number 3
- property text¶
Alias for field number 0
- property type¶
Alias for field number 1
- class dodfminer.extract.pure.utils.title_extractor.TitlesSubtitles(titles, subtitles)¶
- property subtitles¶
Alias for field number 1
- property titles¶
Alias for field number 0
- dodfminer.extract.pure.utils.title_extractor.extract_titles_subtitles(path)[source]¶
Extracts titles and subtitles from DODF pdf.
- Parameters
path – str indicating the path for the pdf to have its content extracted.
- Returns
List[TextTypeBboxPageTuple] containing all titles and subtitles.
- dodfminer.extract.pure.utils.title_extractor.gen_hierarchy_base(dir_path='.', folder='hierarchy', indent=4, forced=False)[source]¶
Generates a JSON base from all PDFs immediately under the dir_path directory.
The hierarchy files are generated under the dir_path directory.
- Parameters
dir_path – path to the folder containing the PDFs
base_name – titles’ base file name
forced – proceed even if the folder base_name already exists
indent – how many spaces will be used for indentation
- Returns
List[Dict[str, List[Dict[str, List[Dict[str, str]]]]]], e.g.:
    [
        {
            "22012019": [
                {
                    "PODER EXECUTIVO": []
                },
                {
                    "SECRETARIA DE ESTADO DE FAZENDA, PLANEJAMENTO, ORÇAMENTO E GESTÃO": [
                        {
                            "SUBSECRETARIA DA RECEITA": ""
                        }
                    ]
                }
            ]
        }
    ]
In case of an error while trying to create the base_name folder, returns None.
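To make the nesting of the returned structure concrete, the sketch below walks the example value and flattens it into rows. `flatten` is a hypothetical helper, and the sample is copied from the example above with its brackets balanced. Nothing here calls the library.

```python
hierarchy_base = [
    {
        "22012019": [
            {"PODER EXECUTIVO": []},
            {
                "SECRETARIA DE ESTADO DE FAZENDA, PLANEJAMENTO, ORÇAMENTO E GESTÃO": [
                    {"SUBSECRETARIA DA RECEITA": ""}
                ]
            },
        ]
    }
]


def flatten(base):
    """Yield (file_key, title, subtitle) rows; subtitle is None when a title has none."""
    for file_entry in base:
        for file_key, titles in file_entry.items():
            for title_entry in titles:
                for title, subtitles in title_entry.items():
                    if not subtitles:
                        yield (file_key, title, None)
                    for subtitle_entry in subtitles:
                        # Each subtitle entry is a dict keyed by the subtitle text.
                        for subtitle in subtitle_entry:
                            yield (file_key, title, subtitle)


rows = list(flatten(hierarchy_base))
```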
- dodfminer.extract.pure.utils.title_extractor.gen_title_base(dir_path='.', base_name='titles', indent=4, forced=False)[source]¶
Generates a titles base from all PDFs immediately under the dir_path directory. The base is generated under the dir_path directory.
- Parameters
dir_path – path so that base_name will contain all titles from PDFs under dir_path
base_name – titles’ base file name
indent – how many spaces will be used for indentation
- Returns
dict containing “titles” as key and a list of titles, the same stored at base_name[.json]
- dodfminer.extract.pure.utils.title_extractor.group_by_column(elements, width)[source]¶
Groups elements by their columns. The sorting assumes they are on the same page and in a 2-column layout.
Essentially a “groupby” where the key is the page number of each span.
- Parameters
elements – Iterable[TextTypeBboxPageTuple] sorted by its page number to be grouped.
- Returns
A dict with spans of each page, being keys the page numbers.
- dodfminer.extract.pure.utils.title_extractor.group_by_page(elements)[source]¶
Groups elements by page number.
Essentially a “groupby” where the key is the page number of each span.
- Parameters
elements – Iterable[TextTypeBboxPageTuple] sorted by its page number to be grouped.
- Returns
A dict whose keys are the page numbers and whose values are the spans on each page.
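The "groupby" description can be sketched with `itertools.groupby`, which only merges adjacent items and is therefore why the input needs to be sorted by page number first. The namedtuple below is a local stand-in and `group_by_page_sketch` is a hypothetical name. This illustrates the idea and is not the library's code.

```python
from collections import namedtuple
from itertools import groupby

# Local stand-in for the library's namedtuple.
TextTypeBboxPageTuple = namedtuple("TextTypeBboxPageTuple", "text type bbox page")


def group_by_page_sketch(elements):
    """Map page number -> list of elements; elements must be sorted by page."""
    return {
        page: list(group)
        for page, group in groupby(elements, key=lambda element: element.page)
    }


elements = [
    TextTypeBboxPageTuple("SEÇÃO I", "title", None, 0),
    TextTypeBboxPageTuple("PODER EXECUTIVO", "title", None, 0),
    TextTypeBboxPageTuple("SEÇÃO II", "title", None, 1),
]
grouped = group_by_page_sketch(elements)
```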
- dodfminer.extract.pure.utils.title_extractor.invert_text_type_bbox_page_tuple(text_type_bbox_page_tuple)[source]¶
Reverses the type between _TYPE_TITLE and _TYPE_SUBTITLE.
- Parameters
text_type_bbox_page_tuple – instance of TextTypeBboxPageTuple.
- Returns
copy of text_type_bbox_page_tuple with its type field reversed.
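Because the tuple classes on this page expose fields as "Alias for field number N", they behave like namedtuples, and a copy with one field changed can be made with `_replace`. The sketch below assumes that, uses invented values for the type constants, and names the function `invert_sketch` to make clear it is not the library's own.

```python
from collections import namedtuple

# Local stand-ins; the real _TYPE_TITLE and _TYPE_SUBTITLE values are not shown here.
TextTypeBboxPageTuple = namedtuple("TextTypeBboxPageTuple", "text type bbox page")
TYPE_TITLE, TYPE_SUBTITLE = "title", "subtitle"


def invert_sketch(element):
    """Return a copy of element with its type flipped between title and subtitle."""
    new_type = TYPE_SUBTITLE if element.type == TYPE_TITLE else TYPE_TITLE
    # _replace builds a new tuple; the original stays unchanged.
    return element._replace(type=new_type)


original = TextTypeBboxPageTuple("PODER EXECUTIVO", TYPE_TITLE, None, 0)
inverted = invert_sketch(original)
```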
- dodfminer.extract.pure.utils.title_extractor.load_blocks_list(path)[source]¶
Loads the list of per-page block lists from the specified file.
- Parameters
path – string with the path to the DODF pdf file
- Returns
A list with one element per page, each element being the list of that page's blocks.
- dodfminer.extract.pure.utils.title_extractor.sort_2column(elements, width_lis)[source]¶
Sorts an iterable of TextTypeBboxPageTuple.
Sorts a sequence of TextTypeBboxPageTuple objects, assuming a full 2-column layout over them.
- Parameters
elements – Iterable[TextTypeBboxPageTuple]
- Returns
dictionary mapping each page number to its elements sorted by column (assuming there are always 2 columns per page)
- dodfminer.extract.pure.utils.title_extractor.sort_by_column(elements, width)[source]¶
Sorts list elements by columns.
- Parameters
elements – Iterable[TextTypeBboxPageTuple].
width – the page width (the context in which all list elements were originally).
- Returns
List[TextTypeBboxPageTuple] containing the list elements sorted according to:
columns
position on column
Assumes a 2-column page layout. All elements in the left column are placed before any element in the right one. Inside each column, reading order is expected to be kept.
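The ordering described above amounts to a stable partition: left-column elements first, right-column elements second, with the incoming order preserved inside each column. The sketch below assumes an element belongs to the left column when its bbox.x0 lies left of the page midline. That threshold, the local namedtuples, and the name `sort_by_column_sketch` are assumptions, not the library's code.

```python
from collections import namedtuple

# Local stand-ins for the library's namedtuples.
Box = namedtuple("Box", "x0 y0 x1 y1")
TextTypeBboxPageTuple = namedtuple("TextTypeBboxPageTuple", "text type bbox page")


def sort_by_column_sketch(elements, width):
    """Left-column elements first, then right-column ones; order within a column is kept."""
    middle = width / 2  # assumed column boundary
    left = [e for e in elements if e.bbox.x0 < middle]
    right = [e for e in elements if e.bbox.x0 >= middle]
    return left + right


elements = [
    TextTypeBboxPageTuple("A", "title", Box(40, 50, 280, 70), 0),
    TextTypeBboxPageTuple("B", "title", Box(320, 60, 560, 80), 0),
    TextTypeBboxPageTuple("C", "subtitle", Box(40, 200, 280, 220), 0),
    TextTypeBboxPageTuple("D", "subtitle", Box(320, 300, 560, 320), 0),
]
sorted_texts = [e.text for e in sort_by_column_sketch(elements, width=600)]
```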
- dodfminer.extract.pure.utils.title_extractor._extract_bold_upper_page(page)[source]¶
Extracts page content that is both bold and uppercase.
- Parameters
page – fitz.fitz.Page object to have its bold content extracted.
- Returns
A list containing all content on the page that is both bold and uppercase.
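A filter of this kind can be sketched over a hand-written dictionary shaped like the output of PyMuPDF's `page.get_text("dict")`, where bit 4 of a span's `flags` marks bold text. Whether the library reads the page this way is an assumption, and `bold_upper_spans` is a hypothetical helper operating on sample data only.

```python
BOLD_FLAG = 1 << 4  # bit 4 of a PyMuPDF span's "flags" field marks bold text

# Sample data: one text block with three spans and one image block.
page_dict = {
    "blocks": [
        {
            "type": 0,
            "lines": [
                {
                    "spans": [
                        {"text": "SECRETARIA DE ESTADO DE SAÚDE", "flags": 20, "size": 9.0},
                        {"text": "Portaria nº 1", "flags": 16, "size": 8.0},
                        {"text": "TEXTO COMUM", "flags": 4, "size": 8.0},
                    ]
                }
            ],
        },
        {"type": 1},  # image block: carries no lines
    ]
}


def bold_upper_spans(page_dict):
    """Collect spans that are bold and whose text is entirely uppercase."""
    spans = []
    for block in page_dict["blocks"]:
        if block.get("type") != 0:
            continue  # skip non-text blocks
        for line in block["lines"]:
            for span in line["spans"]:
                text = span["text"].strip()
                if span["flags"] & BOLD_FLAG and text.isupper():
                    spans.append(span)
    return spans


titles_like = [span["text"] for span in bold_upper_spans(page_dict)]
```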
- dodfminer.extract.pure.utils.title_extractor._extract_bold_upper_pdf(doc)[source]¶
Extracts bold content from DODF pdf.
- Parameters
doc – DODF pdf file returned by fitz.open
- Returns
A list of lists of bold span text.
- dodfminer.extract.pure.utils.title_extractor._get_titles_subtitles(elements, width_lis)[source]¶
Extracts titles and subtitles from a list. Warning: based on font size and heuristics.
- Parameters
titles_subtitles – a list of dicts, each having the keys: size (float), text (str), bbox (Box), page (int)
- Returns
TitlesSubtitles[List[TextTypeBboxPageTuple], List[TextTypeBboxPageTuple]].
- dodfminer.extract.pure.utils.title_extractor._get_titles_subtitles_smart(doc, width_lis)[source]¶
Extracts titles and subtitles. Makes use of heuristics.
Wraps _get_titles_subtitles, removing most of the impurities (spans that aren’t titles/subtitles).
- Parameters
doc – DODF pdf file returned by fitz.open
- Returns
- TitlesSubtitles(List[TextTypeBboxPageTuple],
List[TextTypeBboxPageTuple]).