Pure Core

Extract content from DODFs and export to JSON.

Contains the class ContentExtractor, which has the public functions available to extract DODF content to JSON.

Usage example:

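from dodfminer.extract.pure.core import ContentExtractor

file = "path/to/dodf.pdf"  # placeholder: path to a single DODF PDF
folder = "path/to/dodfs/"  # placeholder: folder containing DODF PDFs

pdf_text = ContentExtractor.extract_text(file)
ContentExtractor.extract_to_txt(folder)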

Extract Class

class dodfminer.extract.pure.core.ContentExtractor

Extract content from DODFs and export to JSON.

Extracts content from DODF files with the support of the title and subtitle databases (which run using MuPDF) and the Tesseract OCR library. All the content is exported to a JSON file whose keys are DODF titles or subtitles and whose values are the corresponding content.

Note

This class is not constructible; it cannot be instantiated.

classmethod extract_structure(file, single=False, norm='NFKD')

Extract boxes of text with their respective titles.

Parameters
  • file – The DODF file to extract titles from.

  • single – Output content in a single file in the file directory.

  • norm – Type of normalization applied to the text.

Returns

A dictionary with the blocks organized by title.

Example:

{
    "Title": [
        [
            x0,
            y0,
            x1,
            y1,
            "Text"
        ]
    ],
    ...
}
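
As a minimal sketch of how a dictionary of this shape might be consumed, the snippet below walks a hand-written sample and collects the text of each block per title. The sample data and the helper name texts_by_title are assumptions made for illustration; they are not part of dodfminer.

```python
def texts_by_title(structure):
    """Map each title to the list of text strings found in its blocks."""
    return {
        title: [block[4] for block in blocks]  # index 4 holds the text
        for title, blocks in structure.items()
    }


# Hand-written sample shaped like the example above (not real DODF data).
sample = {
    "SECAO I": [
        [10.0, 20.0, 200.0, 40.0, "First block"],
        [10.0, 50.0, 200.0, 70.0, "Second block"],
    ],
    "SECAO II": [],
}

result = texts_by_title(sample)
print(result)
```

Indexing by position keeps the sketch close to the documented layout, where the first four values are coordinates and the fifth is the text.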

classmethod extract_text(file, single=False, block=False, is_json=True, sep=' ', norm='NFKD')

Extract blocks of text from the file.

Parameters
  • file – The DODF file to extract text from.

  • single – Output content in a single file in the file directory.

  • block – Extract the text as a list of text blocks.

  • is_json – The list of text blocks is written as a JSON file.

  • sep – The separator character between each block of text.

  • norm – Type of normalization applied to the text.

Note

To learn more about each type of normalization used by the unicodedata.normalize function, see the Python documentation for the unicodedata module.
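
To make the effect of the norm parameter more concrete, the following sketch applies the four standard Unicode normalization forms to a short sample string using Python's unicodedata module. The sample string and the variable names are illustrative assumptions; how dodfminer post-processes the normalized text is not shown here.

```python
import unicodedata

text = "EDIÇÃO Nº 4"  # illustrative sample, not taken from a real DODF

# Apply each standard normalization form to the same input.
forms = {
    form: unicodedata.normalize(form, text)
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}

for form, normalized in forms.items():
    # ascii() makes combining marks visible as escape sequences.
    print(form, len(normalized), ascii(normalized))
```

The decomposed forms (NFD, NFKD) split accented letters into a base letter plus a combining mark, and the compatibility forms (NFKC, NFKD) additionally replace characters such as the ordinal indicator with a plain equivalent.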

Returns

These are the outcomes for each parameter combination.

When block=True and single=True:

If is_json=True, the method saves a JSON file containing the text blocks in the DODF file. If is_json=False, the text from the whole PDF is saved as a string in a .txt file.

When block=True and single=False:

The method returns an array containing text blocks.

Each element of the array has 5 values: the first four are the coordinates of the box from which the text was extracted (x0, y0, x1, y1), and the last is the text from the box.

Example:

(127.77680206298828,
194.2507781982422,
684.0039672851562,
211.97523498535156,
"ANO XLVI EDICAO EXTRA No- 4 BRASILIA - DF")

When block=False and single=True:

The text from the whole PDF is saved in a .txt file as a normalized string.

When block=False and single=False:

The method returns a normalized string containing the text from the whole PDF.
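
The relation between the block output and the plain-string output can be sketched as follows. This is only an illustration, assuming the string form is roughly the block texts joined by sep; the library's actual implementation may differ. The helper join_blocks and the second sample block are hypothetical.

```python
def join_blocks(blocks, sep=" "):
    """Join the text (fifth value) of each block with the given separator."""
    return sep.join(block[4] for block in blocks)


blocks = [
    (127.77680206298828, 194.2507781982422, 684.0039672851562,
     211.97523498535156, "ANO XLVI EDICAO EXTRA No- 4 BRASILIA - DF"),
    (50.0, 220.0, 300.0, 240.0, "SUMARIO"),  # made-up block for illustration
]

# Each block unpacks into four coordinates followed by the text.
x0, y0, x1, y1, text = blocks[0]

joined = join_blocks(blocks, sep="\n")
print(joined)
```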

classmethod extract_to_json(folder='./', titles_with_boxes=False, norm='NFKD')

Extract information from DODF to JSON.

Parameters
  • folder – The folder containing the PDFs to be extracted.

  • titles_with_boxes – If True, the method builds a dict containing a list of tuples (similar to extract_structure). Otherwise, the method structures a list of tuples (similar to extract_text).

  • norm – Type of normalization applied to the text.

Returns

For each PDF file in data/DODFs, extract information from the PDF and output it to a JSON file.
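
As a rough sketch of the final step, the snippet below serializes a title-keyed dictionary to a JSON file named after the source PDF, using only the standard library. The helper write_json, the output location, and the file-naming rule are assumptions for illustration; the layout dodfminer actually produces may differ.

```python
import json
import tempfile
from pathlib import Path


def write_json(content, pdf_path, out_dir):
    """Write content to <out_dir>/<pdf stem>.json and return the new path."""
    out_path = Path(out_dir) / (Path(pdf_path).stem + ".json")
    with open(out_path, "w", encoding="utf-8") as handle:
        # ensure_ascii=False keeps accented characters readable in the file.
        json.dump(content, handle, ensure_ascii=False, indent=2)
    return out_path


content = {"SECAO I": ["First block", "Second block"]}  # made-up content

with tempfile.TemporaryDirectory() as tmp:
    path = write_json(content, "DODF 001 01-01-2019.pdf", tmp)
    name = path.name
    loaded = json.loads(path.read_text(encoding="utf-8"))

print(name, loaded)
```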

classmethod extract_to_txt(folder='./', norm='NFKD')

Extract information from DODF to a .txt file.

For each PDF file in data/DODFs, the method extracts information from the PDF and writes it to a .txt file.

Parameters
  • folder – The folder containing the PDFs to be extracted.

  • norm – Type of normalization applied to the text.

Extractor Private Members

These methods are not meant to be accessed directly, but they are listed here in case a programmer using the extract library needs more information.

Text Preprocessing

classmethod ContentExtractor._normalize_text(text, form='NFKD')

This method is used for text normalization.

Parameters
  • text – The text to be normalized.

  • form – Type of normalization applied to the text.

Returns

A string with the normalized text.

classmethod ContentExtractor._extract_titles(file)

Extract titles and subtitles from the DODF.

Parameters

file – The DODF to extract the titles from.

Returns

An object of type ExtractorTitleSubtitle, which has the attributes:

titles: all titles from the PDF.

subtitle: all subtitles from the PDF.

Raises

Exception – error in extracting titles from PDF.

Check Existence

classmethod ContentExtractor._get_pdfs_list(folder)

Get DODFs list from the path.

Parameters

folder – The folder containing the PDFs to be extracted.

Returns

A list of paths to the DODF PDFs.
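
One way to collect PDF paths from a folder is sketched below with pathlib. This is not the library's code: the helper list_pdfs is hypothetical, and whether the real method searches subfolders recursively is an assumption made here.

```python
import tempfile
from pathlib import Path


def list_pdfs(folder):
    """Return sorted paths of every .pdf file found under folder, recursively."""
    return sorted(str(p) for p in Path(folder).rglob("*.pdf"))


# Build a throwaway folder tree to exercise the helper.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "sub").mkdir()
    (Path(tmp) / "a.pdf").touch()
    (Path(tmp) / "sub" / "b.pdf").touch()
    (Path(tmp) / "notes.txt").touch()  # ignored: not a PDF
    found = [Path(p).name for p in list_pdfs(tmp)]

print(found)
```

The same pattern, with a different glob such as "*.json" or "*.txt", would cover the sibling listing methods.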

classmethod ContentExtractor._get_json_list(folder)

Get list of existing JSONs from the path.

Parameters

folder – The folder containing the PDFs to be extracted.

Returns

A list of all existing JSONs.

classmethod ContentExtractor._get_txt_list(folder)

Get list of existing .txt files from the path.

Parameters

folder – The folder containing the PDFs to be extracted.

Returns

A list of all existing .txt files.

Directory Handling

classmethod ContentExtractor._struct_subfolders(path, json_f, folder)

Creates a directory for the JSON files.

This method structures the folder tree for the allocation of the files the code is currently dealing with.

Parameters
  • path – The path to the extracted file.

  • json_f (boolean) – If True, the file will be extracted to a JSON. Otherwise, it will be extracted to a .txt.

  • folder – The folder containing the PDFs to be extracted.

Raises

FileExistsError – The folder being created is already there.

Returns

The path created for the JSON to be saved.

classmethod ContentExtractor._create_single_folder(path)

Create a single folder given the directory path.

This function might create a folder, observe if the folder already exists, or raise an error if the folder cannot be created.

Parameters

path – The path to be created.

Raises

OSError – Error creating the directory.
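
The three outcomes described above (folder created, folder already present, creation failed) can be sketched with os.mkdir. The helper create_single_folder below and its return values are assumptions for illustration only; the library's method may report these cases differently, for example by logging.

```python
import os
import tempfile


def create_single_folder(path):
    """Create path; report if it already exists; let other OS errors propagate."""
    try:
        os.mkdir(path)
        return "created"
    except FileExistsError:
        # The folder is already there, so there is nothing to do.
        return "exists"
    # Any other OSError (missing parent, no permission) is raised to the caller.


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "results")
    first = create_single_folder(target)
    second = create_single_folder(target)

print(first, second)
```

Catching FileExistsError separately works because it is a subclass of OSError, so the "already exists" case can be treated as harmless while genuine failures still surface.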

Others

classmethod ContentExtractor._log(msg)

Print message from within the ContentExtractor class.

Parameters

msg – String with message that should be printed out.