Pure Core
Extract content from DODFs and export to JSON.
Contains the class ContentExtractor, which has two public functions available to extract the DODF to JSON.
Usage example:
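from dodfminer.extract.pure.core import ContentExtractor
file = "dodf.pdf"  # hypothetical path to a single DODF PDF
folder = "./"  # folder containing the DODF PDFs (the documented default)
pdf_text = ContentExtractor.extract_text(file)
ContentExtractor.extract_to_txt(folder)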
Extract Class
- class dodfminer.extract.pure.core.ContentExtractor
Extract content from DODFs and export to JSON.
Extracts content from DODF files with the support of the title and subtitle databases, which run on MuPDF, and the Tesseract OCR library. All content is exported to a JSON file whose keys are DODF titles or subtitles and whose values are the corresponding content.
Note
This class cannot be instantiated; its functionality is exposed through class methods.
- classmethod extract_structure(file, single=False, norm='NFKD')
Extract boxes of text with their respective titles.
- Parameters
file – The DODF file to extract titles from.
single – Output content in a single file in the file directory.
norm – Type of normalization applied to the text.
- Returns
A dictionary with the blocks organized by title.
Example:
{ "Title": [ [ x0, y0, x1, y1, "Text" ] ], ... }
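A short sketch of how a dictionary with this shape could be consumed. The sample data and the helper text_by_title are hypothetical, written only to mirror the documented layout; they are not part of the library.

```python
# Hypothetical sample shaped like the documented return value:
# {"Title": [[x0, y0, x1, y1, "Text"], ...], ...}
structure = {
    "SECRETARIA DE ESTADO DE SAUDE": [
        [56.7, 100.0, 300.2, 112.5, "PORTARIA No 1"],
        [56.7, 115.0, 300.2, 140.0, "O SECRETARIO resolve:"],
    ],
    "SECRETARIA DE ESTADO DE EDUCACAO": [
        [56.7, 200.0, 300.2, 212.5, "ORDEM DE SERVICO No 2"],
    ],
}


def text_by_title(structure, sep=" "):
    """Join the text of every block under each title, dropping coordinates."""
    return {
        title: sep.join(block[4] for block in blocks)
        for title, blocks in structure.items()
    }


flattened = text_by_title(structure)
for title, text in flattened.items():
    print(f"{title}: {text}")
```

The text is always the fifth element of a block, so index 4 is enough to separate it from the bounding-box coordinates.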
- classmethod extract_text(file, single=False, block=False, is_json=True, sep=' ', norm='NFKD')
Extract blocks of text from the file.
- Parameters
file – The DODF to extract text from.
single – output content in a single file in the file directory.
block – Extract the text as a list of text blocks.
is_json – If True, the list of text blocks is written to a JSON file.
sep – The separator character between each block of text.
norm – Type of normalization applied to the text.
Note
To learn more about each normalization form accepted by Python's unicodedata.normalize function, see the unicodedata module documentation.
- Returns
These are the outcomes for each parameter combination.
- When block=True and single=True:
If is_json=True, the method saves a JSON file containing the text blocks of the DODF file. If is_json=False, the text from the whole PDF is saved as a string in a .txt file.
- When block=True and single=False:
The method returns an array containing text blocks.
Each array in the list has five values: the first four are the coordinates of the box from which the text was extracted (x0, y0, x1, y1), and the last is the text from the box.
Example:
(127.77680206298828, 194.2507781982422, 684.0039672851562, 211.97523498535156, "ANO XLVI EDICAO EXTRA No- 4 BRASILIA - DF")
- When block=False and single=True:
The text from the whole PDF is saved in a .txt file as a normalized string.
- When block=False and single=False:
The method returns a normalized string containing the text from the whole PDF.
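To illustrate how the block form (block=True) relates to the plain-string form (block=False), here is a minimal sketch that orders blocks by position and joins their text with a separator. The helper names and the second block are hypothetical, and the ordering rule (top to bottom, then left to right) is an assumption rather than the library's documented behavior.

```python
def blocks_in_reading_order(blocks):
    """Sort blocks top to bottom (y0), then left to right (x0)."""
    return sorted(blocks, key=lambda b: (b[1], b[0]))


def join_blocks(blocks, sep=" "):
    """Concatenate the text (fifth value) of each block, in reading order."""
    return sep.join(b[4] for b in blocks_in_reading_order(blocks))


blocks = [
    # Hypothetical block placed lower on the page.
    (56.0, 230.0, 300.0, 245.0, "SECAO I"),
    # Block taken from the example above.
    (127.77680206298828, 194.2507781982422, 684.0039672851562,
     211.97523498535156, "ANO XLVI EDICAO EXTRA No- 4 BRASILIA - DF"),
]

text = join_blocks(blocks)
print(text)
```

The sep argument here plays the same role as the sep parameter described above: it is the string placed between consecutive blocks.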
- classmethod extract_to_json(folder='./', titles_with_boxes=False, norm='NFKD')
Extract information from DODF to JSON.
- Parameters
folder – The folder containing the PDFs to be extracted.
titles_with_boxes – If True, the method builds a dict containing a list of tuples (similar to extract_structure). Otherwise, the method structures a list of tuples (similar to extract_text).
norm – Type of normalization applied to the text.
- Returns
For each PDF file in data/DODFs, extract information from the PDF and output it to a JSON file.
- classmethod extract_to_txt(folder='./', norm='NFKD')
Extract information from DODF to a .txt file.
For each PDF file in data/DODFs, the method extracts information from the PDF and writes it to the .txt file.
- Parameters
folder – The folder containing the PDFs to be extracted.
norm – Type of normalization applied to the text.
Extractor Private Members
These methods are not meant to be called directly, but they are listed here in case a programmer using the extract library needs more information.
Text Preprocessing
- classmethod ContentExtractor._normalize_text(text, form='NFKD')
This method is used for text normalization.
- Parameters
text – The text to be normalized.
form – Type of normalization applied to the text.
- Returns
A string with the normalized text.
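As an illustration of what the default NFKD form does, here is a minimal sketch that assumes the normalization is performed with the standard library's unicodedata.normalize. The function normalize_text is a hypothetical stand-in, not the private method itself, and the accent-stripping step is an extra shown only for comparison.

```python
import unicodedata


def normalize_text(text, form="NFKD"):
    """Apply a Unicode normalization form (NFC, NFKC, NFD or NFKD)."""
    return unicodedata.normalize(form, text)


# "EDIÇÃO" written with escapes so the source stays encoding-independent.
sample = "EDI\u00c7\u00c3O"

# NFKD splits each accented letter into a base letter plus a combining mark.
decomposed = normalize_text(sample)

# Optional extra step (an assumption, not documented here): drop the marks.
ascii_like = "".join(c for c in decomposed if not unicodedata.combining(c))

# NFKD also replaces compatibility characters such as the "fi" ligature.
ligature = normalize_text("\ufb01")

print(len(sample), len(decomposed), ascii_like, ligature)
```

Compatibility forms (NFKC, NFKD) are useful for PDF text because extracted ligatures and full-width characters are mapped to their plain equivalents.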
- classmethod ContentExtractor._extract_titles(file)
Extract titles and subtitles from the DODF.
- Parameters
file – The DODF to extract the titles from.
- Returns
An object of type ExtractorTitleSubtitle, which has the following attributes:
titles: all titles from the PDF. subtitle: all subtitles from the PDF.
- Raises
Exception – error in extracting titles from PDF.
Check Existence
- classmethod ContentExtractor._get_pdfs_list(folder)
Get DODFs list from the path.
- Parameters
folder – The folder containing the PDFs to be extracted.
- Returns
A list of paths to the DODF PDFs.
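A rough sketch of how such a listing could be produced with pathlib. The function list_pdfs is hypothetical, and the recursive search is an assumption; the private method may traverse the folder differently.

```python
import tempfile
from pathlib import Path


def list_pdfs(folder):
    """Return the sorted paths of every .pdf file found under folder."""
    return sorted(
        str(p) for p in Path(folder).rglob("*.pdf") if p.is_file()
    )


# Demonstration on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "2019").mkdir()
    (root / "2019" / "a.pdf").touch()
    (root / "b.pdf").touch()
    (root / "notes.txt").touch()  # ignored: not a PDF
    found_names = [Path(p).name for p in list_pdfs(root)]

print(found_names)
```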
Directory Handling
- classmethod ContentExtractor._struct_subfolders(path, json_f, folder)
Creates a directory for the JSON files.
This method structures the folder tree for the allocation of the files the code is currently dealing with.
- Parameters
path – The path to the extracted file.
json_f (boolean) – If True, the file will be extracted to a JSON. Otherwise, it will be extracted to a .txt.
folder – The folder containing the PDFs to be extracted.
- Raises
FileExistsError – The folder being created is already there.
- Returns
The path created for the JSON to be saved.
- classmethod ContentExtractor._create_single_folder(path)
Create a single folder given the directory path.
This method creates the folder, detects that the folder already exists, or raises an error if the folder cannot be created.
- Parameters
path – The path to be created.
- Raises
OSError – Error creating the directory.
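The three outcomes described for this method (folder created, folder already present, error raised) can be sketched as follows. The function create_single_folder and its boolean return value are hypothetical; the private method's actual return value is not documented here.

```python
import os
import tempfile


def create_single_folder(path):
    """Create one directory; return True if created, False if it existed.

    Any other failure (for example a missing parent) propagates as OSError.
    """
    try:
        os.mkdir(path)
        return True
    except FileExistsError:
        # Only treat it as "already there" when it really is a directory.
        if not os.path.isdir(path):
            raise
        return False


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "json")
    first = create_single_folder(target)   # directory is created
    second = create_single_folder(target)  # directory already exists

    try:
        # The parent "missing" does not exist, so os.mkdir fails.
        create_single_folder(os.path.join(tmp, "missing", "child"))
        raised_oserror = False
    except OSError:
        raised_oserror = True

print(first, second, raised_oserror)
```

FileExistsError is a subclass of OSError, so it has to be caught before any broader OSError handling.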