Welcome to DODFMiner’s documentation!

_images/miner.svg

Introduction

Official publications such as the Diário Oficial do Distrito Federal (DODF) are sources of information on all official government acts. Although these documents are rich in knowledge, analysing them manually is a complex and unfeasible task for specialists, considering the growing volume of documents that results from the high frequency of publications in the Distrito Federal Government's (GDF) communication vehicle.

This scenario is appropriate for employing computational techniques based on text mining and information visualization in order to discover implicit and relevant knowledge in large textual data sets. These computational techniques expect data in a structured format. However, since DODF editions are originally published in an unstructured format and in natural language, techniques are required to prepare the data and make the adaptations necessary for their application.

DODFMiner

With all that in mind, DODFMiner is the software being developed to extract data from PDF documents referring to the publications of the Official Gazette of the Federal District, Brazil.

Installation

DODFMiner is currently only supported on Linux and OSX. It may be possible to install on Windows, though this hasn’t been extensively tested.

Requirements

  • Python3

  • MuPDF

Installing MuPDF

MuPDF is the main engine used to parse PDF files on DODFMiner. Its installation is essential for DODFMiner to work properly.

macOS

On macOS, use brew to install the library:

$ brew install mupdf

Debian Linux (Ubuntu)

On Ubuntu, or other Debian Linux distro, use the following commands:

$ add-apt-repository ppa:ubuntuhandbook1/apps
$ apt-get update
$ apt-get install mupdf mupdf-tools

DODFMiner Installation Methods

We support two methods of installation: the Library install (recommended) and a Docker install.

Library Install

From The Python Package Index (PyPI):

pip install dodfminer

From Github:

git clone https://github.com/UnB-KnEDLe/DODFMiner.git
cd dodfminer
pip install -e .

Docker Install

Since this project has several dependencies outside Python libraries, a Dockerfile and a Compose file are provided to facilitate correct execution. The Dockerfile contains instructions on how the image is built, while the Compose file contains instructions on how to run the image.

The container created by the Dockerfile image uses a DATA_PATH environment variable as the location to save the downloaded DODF PDFs and the extracted JSONs. This variable needs to be set before the execution.

To build and execute the image, docker and docker-compose need to be correctly installed:

  1. Install Docker

  2. Install Docker Compose

After the installation, the first thing docker needs is an image. To create the image, run the following command in the root of the project:

$ docker-compose build

This can take a while to finish.

Now, with the image created, docker-compose can generate instances (containers) of this image to run specific tasks.

_images/dodfminer-docker.jpg
$ export DATA_PATH=/path/to/save/files/
$ sudo -E docker-compose run dodfminer -sd 01/19 -ed 01/19

This command executes the download task, where -sd is the start date and -ed is the end date, representing the interval in which the DODFs will be downloaded.

Other arguments can be found by executing the command:

$ export DATA_PATH=/path/to/save/files/
$ sudo -E docker-compose run dodfminer --help

Note

1. If your user is already in the _docker_ group you can execute without _sudo_; otherwise, the -E argument is needed for _sudo_ to use the environment variables declared in the login _bash_.

2. The container will not work if DATA_PATH is not defined in the environment.

Using DODFMiner

Command-Line Usage

Considering the module has been installed using pip, you should be able to use DODFMiner as a command-line program. To check if the installation was successful, run:

$ dodfminer --help

A help screen of the program should appear. The helper should show two positional arguments: downloader and extract. Each of those arguments can be considered a subprogram and works independently; you can choose the one you desire using:

$ dodfminer downloader --help
$ dodfminer extract --help

Depending on which module you choose the execution parameters will change.

Downloader Module

The downloader module is responsible for downloading DODF PDFs and JSONs from the website.

If you want to download the PDFs, it allows you to choose the start and end date of the files you want to download. On the other hand, if you want to download the JSONs, you can specify from which URL your download will be done.

Also, you can choose where to save the downloaded files. Following is the list of available parameters, their descriptions, and their default values.

Note

This module relies on an internet connection and can fail if the internet is not working properly. Also, the execution might take a while if there is a huge amount of PDFs to download.

Parameters Table

Argument           Description                                  Default
-sp --save_path    Folder to output the downloaded DODFs        ./
-sd --start_date   Input the date in either mm/yyyy or mm-yyyy  01/2019
-ed --end_date     Input the date in either mm/yyyy or mm-yyyy  01/2019
-f --file_type     File type to download                        pdf
-url               URL to download the JSON file from           dodf

Usage Example:

$ dodfminer downloader -sd 01/2003 -ed 05/2004
$ dodfminer downloader -f json

Note

If you want to download a JSON file, the start date and end date will be ignored. The only downloaded file will be the currently available JSON at the URL.

Extractor Module

The extractor module is responsible for extracting information from DODF JSONs and PDFs and saving it in a desirable format.

The extraction can produce pure text content, where a DODF PDF is converted to TXT or JSON. Additionally, the extraction can be done in a polished way, where the acts and their properties are extracted from the DODF into CSV format.

Pure extraction

Given a -t flag, it allows you to choose the output format among three options: blocks of text, pure text in .txt format, and text separated by titles:

  • Blocks of Text: Outputs a JSON file with the extracted text blocks.

  • Pure Text: Outputs a .txt file with the raw text from the PDF.

  • Blocks of Text with Titles: Outputs a JSON file with the extracted text blocks indexed by titles.

Polished Extraction

Using the -a or --act flag, you can extract the DODF in a polished way. Using -a alone will extract all types of acts in the DODF. Additionally, if desired, the flag can be followed by a list of the specific act types you want to extract. The extraction is done using the backend specified by the -b flag, which can be either regex or ner.

Available Act Types:

  • aposentadoria

  • reversoes

  • nomeacao

  • exoneracao

  • abono

  • retificacoes

  • substituicao

  • cessoes

  • sem_efeito_aposentadoria

  • efetivos_nome

  • efetivos_exo

  • contrato_convenio

  • aditamento

  • licitacao

  • suspensao

  • anulacao_revogacao

Parameters Table

Following is the list of available parameters, their descriptions, and their default values.

Argument                 Description                          Default
-i --input_folder        Path to the PDF/JSON folder          ./
-s --single-file         Path to a single PDF/JSON            None
-t --type-of-extraction  Type of text extraction              None
-a --act                 List of acts to be extracted to CSV  all
-b --backend             Which backend will extract the acts  regex

Usage Example:

$ dodfminer extract -i path/to/pdf/folder -t with-titles
$ dodfminer extract -s path/to/dodf.pdf -t pure-text
$ dodfminer extract -i path/to/json/folder -a anulacao_revogacao
$ dodfminer extract -s path/to/dodf.pdf -a nomeacao
$ dodfminer extract -s path/to/dodf.pdf -a nomeacao cessoes -b ner

Note

It's important to notice that if the -t and -a options are used together, the -t option takes priority and -a will not execute.

Note

The DODFMiner act extraction needs the text data from the DODFs to correctly extract the acts; therefore, the -a option first generates .txt files before the act extraction.

Library Usage

DODFMiner was also created with library use in mind: users can install DODFMiner and call its modules and functions in their own Python scripts. Following are some of the imports you might want to do while using it as a library:

from dodfminer import acts
from dodfminer import Downloader
from dodfminer import ActsExtractor
from dodfminer import ContentExtractor

The details of using the DODFMiner modules and functions are described in this documentation, in the following sections.

Architecture’s Document

Python is surprisingly flexible when it comes to structuring your applications. On the one hand, this flexibility is great: it allows different use cases to use structures that are necessary for those use cases. On the other hand, though, it can be very confusing to the new developer.

Document Overview

Topic                         Description
Introduction                  Describes the purpose of this document.
Architectural Representation  Describes the software architecture for a better understanding of its structure and functioning, and shows how it is represented.
Goals and Constraints         Describes the requirements and objectives of the software, and whether they have any impact on the architecture.
Logical View                  Describes the relevant parts of the design model related to the architecture.
References                    Relevant links and literature for the development of this document.

Introduction


Objective

This document aims to provide an overview of the architecture of the DODFMiner Library: it contains pertinent information about the architecture model adopted, such as diagrams illustrating use cases, the package diagram, and other resources.

Scope

Through this document, the reader will be able to understand the functioning of the DODFMiner Library, as well as the approach used in its development. In this way, it will be possible to have a broad understanding of its architecture.

Definitions, Acronyms and Abbreviations

Acronym Definition
CLI Command Line Interface
DODF Diário Oficial do Distrito Federal

Revision History

Date        Version  Description        Author
29/06/2020  1.0      Document creation  Renato Nobre
16/07/2020  1.1.0    Document creation  Renato Nobre & Khalil Casten

Architectural Representation


The main point to understand in this architecture is that DODFMiner is a library and a CLI application simultaneously. DODFMiner can be integrated into another project or used standalone in a shell environment.

Being a library requires a certain amount of complexity. In larger applications, you may have one or more internal packages that provide specific functionality to a larger library you are packaging. This application follows that pattern: mining PDF documents implies many subpackages with specific functionality that, working together, fulfill a greater purpose.

Relationship Diagram

_images/app.svg

Subpackages Structure

This application follows the basic structure of a Python library with multiple subpackages. It uses the common concept of core and helper files.

The core file is the main file in a package or subpackage; it contains the class with the main package execution. The helper file contains supporting functions for the package.

In summary, the project structure looks as follows:

dodfminer
├── __version__.py
├── cli.py
├── downloader
│   └── core.py
├── extract
│   ├── polished
│      ├── acts
│         ├── abono.py
│         ├── aposentadoria.py
│         ├── base.py
│         ├── cessoes.py
│         ├── exoneracao.py
│         ├── models/
│         ├── nomeacao.py
│         ├── reversoes.py
│         ├── sem_efeito_aposentadoria.py
│         └── substituicao.py
│      ├── backend
│         ├── ner.py
│         └── regex.py
│      ├── core.py
│      └── helper.py
│   └── pure
│       ├── core.py
│       └── utils
│           ├── box_extractor.py
│           ├── title_extractor.py
│           └── title_filter.py
└── run.py

Technologies

Following are some of the most essential technologies used with the DODFMiner application:

  1. MuPDF

    MuPDF is a free and open-source software framework written in C that implements a PDF, XPS, and EPUB parsing and rendering engine. It is used primarily to render pages into bitmaps, but also provides support for other operations such as searching and listing the table of contents and hyperlinks.

  2. BeautifulSoup

    Beautiful Soup is a Python package for parsing HTML and XML documents. It creates a parse tree for parsed pages that can be used to extract data from HTML, which is useful for web scraping.

  3. Pandas

    Pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. It is free software released under the three-clause BSD license.

  4. Site do DODF

    Website where all of the DODFs are downloaded from.

Goals and Constraints


Non-functional Requirements

  • Be a library available via pip on The Python Package Index (PyPI)

  • Work as a standalone command line application, installed globally without needing file execution

  • Support continuous deployment and continuous integration

  • The DODFMiner should be able to:

    • Download DODFs from the website

    • Extract pdf files to .txt and .json formats

    • Extract images and tables from the DODF

    • Extract DODF's Acts and their properties to a dataframe or other desirable format

General Constraints

  • Have tested support for Mac and Linux users.

  • Have a docker installation method

  • Be open-source

  • Don’t use a database library

Technological Constraints

  • Python: Language used for development

  • MuPDF: Tool used for PDF extraction

  • BeautifulSoup: Library used for webscraping

  • Pandas: Library used for data handling and creation of dataframes

  • DODF Website: Website in which the DODFs are downloaded from

Logical View


Overview

DODFMiner is a library and CLI application made with the Python language, using MuPDF, BeautifulSoup, Pandas, and many other Python libraries. The purpose of DODFMiner is to be a library and tool that fulfills the whole process of extraction of the official diary of the Federal District, Brazil.

Package Diagram

_images/pacotes.svg

Class Diagram

References


Amika Architecture

Python Layouts

Downloader Core

Download DODFs from the Buriti Website and save them in the proper directory.

Downloads monthly PDFs of DODFs.

Usage example:

downloader = Downloader()
downloader.pull(start_date, end_date)

Downloader Class

class dodfminer.downloader.core.Downloader(save_path='./')[source]

Responsible for the download of the DODFs Pdfs.

Parameters

save_path (str) – Path to save the downloads.

_download_path

Folder in which the downloads will be stored.

_prog_bar

Indicates whether the download should display a progress bar.

pull(start_date, end_date)[source]

Download the DODF PDFs.

All DODFs are downloaded from start_date to end_date inclusively. The PDFs are saved in a folder called "data" inside the project folder.

Parameters
  • start_date (str) – The start date in format mm/yyyy.

  • end_date (str) – The end date in format mm/yyyy.

Note

The name and path of the save folder are hard-coded and cannot be changed.

pull_json(JSON_URL)[source]

Download the DODF JSON file available on the current day.

The file is saved either in the path provided or in the default ‘dodf’ directory.

Note

There is no way of downloading JSON files from past days because they are not provided.

Downloader Private Methods

None of these methods are meant to be accessed directly, but they are listed here in case the programmer using the downloader library needs more information.

Path Handling

Methods that handle the creation of the paths to the downloaded DODFs.

Downloader._create_single_folder(path)[source]

Create a single folder given the directory path.

This function might create a folder, observe that the folder already exists, or raise an error if the folder cannot be created.

Parameters

path (str) – The path to be created

Raises

OSError – Error creating the directory.
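
The behavior described above (create the folder, tolerate an existing one, or raise an error) can be sketched with Python's standard library. This is an illustration under those assumptions, not the library's actual code:

```python
import os

def create_single_folder(path):
    # Create the folder if it is missing; silently accept an existing one,
    # and let any other failure surface as an OSError.
    try:
        os.makedirs(path, exist_ok=True)
    except OSError as error:
        raise OSError(f"Error creating the directory {path}") from error
```

Calling it twice with the same path is safe, which is what prevents errors on repeated downloads to the same folder.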

Downloader._create_download_folder()[source]

Create Downloaded DODFs Structures.

Downloader._make_month_path(year, actual_date)[source]

Create and return the folder for the year and month being downloaded.

Parameters
  • year (int) – The year respective to the folder.

  • actual_date (datetime) – The date to which the downloaded DODF corresponds.

Returns

The path to the actual month in which the download is being made.

URL Making

Methods that construct a URL to further make the download request.

Web Requests

Methods that handle the download request and its execution.

Downloader._fail_request_message(url, error)[source]

Log error messages in download.

Parameters
  • url (str) – The failing url to the website.

  • error (str) – The kind of error happening.

Downloader._download_pdf(url, path)[source]

Download the DODF PDF.

Note

Might be time consuming depending on bandwidth.

Parameters
  • url (str) – The pdf url.

  • path (str) – The path to save the pdf.

Raises

RequestException – Error in case the request to download fails.

Others

Other methods for the downloader library.

classmethod Downloader._string_to_date(date)[source]

Convert a date string to a datetime object.

Parameters

date (str) – The date to be converted, in string format.

Returns

The date, previously formatted as a string, now as a datetime object.

Raises

Exception – the date passed through the CLI is in the wrong format.
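
A possible sketch of this conversion, assuming the mm/yyyy and mm-yyyy formats accepted by the CLI (illustrative only; the library's implementation may differ):

```python
from datetime import datetime

def string_to_date(date):
    # The CLI accepts both mm/yyyy and mm-yyyy; try each format in turn.
    for fmt in ("%m/%Y", "%m-%Y"):
        try:
            return datetime.strptime(date, fmt)
        except ValueError:
            continue
    raise Exception(f"date {date!r} is in the wrong format; expected mm/yyyy or mm-yyyy")
```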

Downloader._file_exist(path)[source]

Check if a file exists.

Prevents redownloads.

Parameters

path (str) – The path where the file might be.

Returns

Boolean indicating whether the file really exists.

Downloader._log(message)[source]

Logs a message following the downloader pattern.

Parameters

message (str) – The message to be logged.

Pure Core

Extract content from DODFs and export to JSON.

Contains the class ContentExtractor, which has two public functions available to extract the DODF to JSON.

Usage example:

from dodfminer.extract.pure.core import ContentExtractor

pdf_text = ContentExtractor.extract_text(file)
ContentExtractor.extract_to_txt(folder)

Extract Class

class dodfminer.extract.pure.core.ContentExtractor[source]

Extract content from DODFs and export to JSON.

Extracts content from DODF files with the support of the title and subtitle databases (which run using MuPDF) and the Tesseract OCR library. All the content is exported to a JSON file, in which the keys are DODF titles or subtitles and the values are the corresponding content.

Note

This class is not constructable; it cannot generate objects.

classmethod extract_structure(file, single=False, norm='NFKD')[source]

Extract boxes of text with their respective titles.

Parameters
  • file – The DODF file to extract titles from.

  • single – Output content in a single file in the file directory.

  • norm – Type of normalization applied to the text.

Returns

A dictionary with the blocks organized by title.

Example:

{
    "Title": [
        [
            x0,
            y0,
            x1,
            y1,
            "Text"
        ]
    ],
    ...
}

classmethod extract_text(file, single=False, block=False, is_json=True, sep=' ', norm='NFKD')[source]

Extract blocks of text from the file.

Parameters
  • file – The DODF to extract titles from.

  • single – Output content in a single file in the file directory.

  • block – Extract the text as a list of text blocks.

  • is_json – If True, the list of text blocks is written as a JSON file.

  • sep – The separator character between each block of text.

  • norm – Type of normalization applied to the text.

Note

To learn more about each type of normalization used, see the documentation of the unicodedata.normalize method.

Returns

These are the outcomes for each parameter combination.

When block=True and single=True:

If is_json=True, the method saves a JSON file containing the text blocks in the DODF file. However, if is_json=False, the text from the whole PDF is saved as a string in a .txt file.

When block=True and single=False:

The method returns an array containing text blocks.

Each array in the list has 5 values: the first four are the coordinates of the box from which the text was extracted (x0, y0, x1, y1), while the last is the text from the box.

Example:

(127.77680206298828,
194.2507781982422,
684.0039672851562,
211.97523498535156,
"ANO XLVI EDICAO EXTRA No- 4 BRASILIA - DF")

When block=False and single=True:

The text from the whole PDF is saved in a .txt file as a normalized string.

When block=False and single=False:

The method returns a normalized string containing the text from the whole PDF.
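
The parameter combinations above can be summarized in a small helper. The function below is purely illustrative and not part of the library:

```python
def extract_text_outcome(block, single, is_json=True):
    # Map each parameter combination of extract_text to the outcome
    # described above (illustrative summary, not library code).
    if block and single:
        return "JSON file of text blocks" if is_json else ".txt file with the raw text"
    if block:
        return "list of (x0, y0, x1, y1, text) blocks"
    if single:
        return ".txt file with the normalized text"
    return "normalized string with the whole text"
```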

classmethod extract_to_json(folder='./', titles_with_boxes=False, norm='NFKD')[source]

Extract information from DODF to JSON.

Parameters
  • folder – The folder containing the PDFs to be extracted.

  • titles_with_boxes – If True, the method builds a dict containing a list of tuples (similar to extract_structure). Otherwise, the method structures a list of tuples (similar to extract_text).

  • norm – Type of normalization applied to the text.

Returns

For each PDF file in data/DODFs, extracts information from the PDF and outputs it to a JSON file.

classmethod extract_to_txt(folder='./', norm='NFKD')[source]

Extract information from DODF to a .txt file.

For each PDF file in data/DODFs, the method extracts information from the PDF and writes it to the .txt file.

Parameters
  • folder – The folder containing the PDFs to be extracted.

  • norm – Type of normalization applied to the text.

Extractor Private Members

None of these methods are meant to be accessed directly, but they are listed here in case the programmer using the extract library needs more information.

Text Preprocessing

classmethod ContentExtractor._normalize_text(text, form='NFKD')[source]

This method is used for text normalization.

Parameters
  • text – The text to be normalized.

  • form – Type of normalization applied to the text.

Returns

A string with the normalized text.

classmethod ContentExtractor._extract_titles(file)[source]

Extract titles and subtitles from the DODF.

Parameters

file – The DODF from which to extract the titles.

Returns

An object of type ExtractorTitleSubtitle, which has the attributes:

titles: all titles from the PDF. subtitles: all subtitles from the PDF.

Raises

Exception – error in extracting titles from PDF.

Check Existence

classmethod ContentExtractor._get_pdfs_list(folder)[source]

Get the list of DODF PDFs from the path.

Parameters

folder – The folder containing the PDFs to be extracted.

Returns

A list of the DODFs' PDF paths.

classmethod ContentExtractor._get_json_list(folder)[source]

Get the list of existing JSONs from the path.

Parameters

folder – The folder containing the PDFs to be extracted.

Returns

A list of all existing JSONs.

classmethod ContentExtractor._get_txt_list(folder)[source]

Get the list of existing .txt files from the path.

Parameters

folder – The folder containing the PDFs to be extracted.

Returns

A list of all existing .txt files.

Directory Handling

classmethod ContentExtractor._struct_subfolders(path, json_f, folder)[source]

Creates a directory for the JSON files.

This method structures the folder tree for the allocation of the files the code is currently dealing with.

Parameters
  • path – The path to the extracted file.

  • json_f (boolean) – If True, the file will be extracted to a JSON. Otherwise, it will be extracted to a .txt.

  • folder – The folder containing the PDFs to be extracted.

Raises

FileExistsError – The folder being created already exists.

Returns

The path created for the JSON to be saved.

classmethod ContentExtractor._create_single_folder(path)[source]

Create a single folder given the directory path.

This function might create a folder, observe if the folder already exists, or raise an error if the folder cannot be created.

Parameters

path – The path to be created.

Raises

OSError – Error creating the directory.

Others

classmethod ContentExtractor._log(msg)[source]

Print message from within the ContentExtractor class.

Parameters

msg – String with message that should be printed out.

Pure Utils

Warning

This documentation needs improvements by the code's author.

Box Extractor

Functions to extract boxes from text.

dodfminer.extract.pure.utils.box_extractor.draw_doc_text_boxes(doc: fitz.Document, doc_boxes, save_path=None)[source]
Draws rectangles around the extracted text blocks.

As a result, a PDF file with the rectangle shapes added, representing the extracted blocks, is saved.

Parameters
  • doc – an opened fitz document

  • doc_boxes – the list of blocks on a document, separated by pages

  • save_path – a custom path for saving the resulting pdf

Returns

None

dodfminer.extract.pure.utils.box_extractor.get_doc_img_boxes(doc: fitz.Document)[source]

Returns a list of lists of bounding boxes of the extracted images.

Parameters

doc – an opened fitz document

Returns

List[List[Rect(float, float, float, float)]]. Each Rect represents an image bounding box.

dodfminer.extract.pure.utils.box_extractor.get_doc_text_boxes(doc: fitz.Document)[source]

Returns a list of lists of extracted text blocks.

Parameters

doc – an opened fitz document.

Returns

List[List[tuple(float, float, float, float, str, int, int)]]

dodfminer.extract.pure.utils.box_extractor.get_doc_text_lines(doc: fitz.Document)[source]

Returns a list of lists of extracted text lines.

Parameters

doc – an opened fitz document.

Returns

List[List[tuple(float, float, float, str)]]

dodfminer.extract.pure.utils.box_extractor._extract_page_lines_content(page)[source]

Extracts page lines.

Parameters

page – fitz.fitz.Page object to have its bold content extracted.

Returns

List[tuple(float, float, float, float, str)]. A list containing the content of the lines on the page, along with their bounding boxes.

dodfminer.extract.pure.utils.box_extractor._get_doc_img(doc: fitz.Document)[source]

Returns a list of lists of image items.

Note

This function is not intended to be used by final users, but internally. Image items are described at:

https://pymupdf.readthedocs.io/en/latest/page/#Page.getImageBbox

Parameters

doc – an opened fitz document

Returns

List[List[tuple(int, int, int, int, str, str, str, str, int)]] (xref, smask, width, height, bpc, colorspace, alt. colorspace, filter, invoker)

Title Filter

Find titles using a Filter.

class dodfminer.extract.pure.utils.title_filter.BoldUpperCase[source]

Filter functions useful for bold and upper case text.

Note

This class is static and should not be instantiated.

classmethod dict_bold(data)[source]

Check if the text is bold.

Evaluates to True if data['flags'] matches the following condition:

  • is one of the values in BoldUpperCase.BOLD_FLAGS

classmethod dict_text(data)[source]

Check if the text is a title.

Evaluates to True if data['text'] matches the following conditions:

  • all letters are uppercase;

  • does not contain 4 or more consecutive spaces;

  • has a length greater than BoldUpperCase.TEXT_MIN.

Returns

Boolean indicating whether the text is a title.
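
The conditions above can be sketched as follows. The function name is hypothetical and TEXT_MIN is an assumed threshold (the real value lives in BoldUpperCase.TEXT_MIN):

```python
import re

TEXT_MIN = 4  # assumed threshold; the real value lives in BoldUpperCase.TEXT_MIN

def looks_like_title(text):
    # The three conditions listed above: all letters uppercase,
    # no run of 4+ consecutive spaces, and a minimum length.
    return (
        text == text.upper()
        and re.search(r" {4,}", text) is None
        and len(text) > TEXT_MIN
    )
```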

Title Extractor

Extract Title and Subtitles.

class dodfminer.extract.pure.utils.title_extractor.BBox(bbox)
property bbox

Alias for field number 0

class dodfminer.extract.pure.utils.title_extractor.Box(x0, y0, x1, y1)
property x0

Alias for field number 0

property x1

Alias for field number 2

property y0

Alias for field number 1

property y1

Alias for field number 3

class dodfminer.extract.pure.utils.title_extractor.ExtractorTitleSubtitle(path)[source]

Use this class as follows:

path = "path_to_pdf"
extractor = ExtractorTitleSubtitle(path)
# To extract titles
titles = extractor.titles
# To extract subtitles
subtitles = extractor.subtitles
# To dump titles and subtitles on a JSON file
json_path = "valid_file_name"
extractor.dump_json(json_path)

dump_json(path)[source]

Writes to the file specified by path the JSON representation of the titles and subtitles extracted.

Dumps the titles and subtitles according to the hierarchy verified in the document.

The output file should be specified and will be suffixed with ".json" if it's not.

Parameters
  • path – string containing the path to the .json file where the dump will be done. It is suffixed with ".json" if it's not.
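
The suffixing behavior can be illustrated with a small helper (hypothetical name, not part of the library):

```python
def ensure_json_suffix(path):
    # dump_json appends ".json" when the given path lacks it.
    return path if path.endswith(".json") else path + ".json"
```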

property json

All titles with their subtitles associated.

All subtitles under the same title are at the same level. Deprecated: prefer titles_subtitles or titles_subtitles_hierarchy.

reset()[source]

Sets cache to False and resets other internal attributes. Use when, for some reason, the internal state was somehow modified by the user.

property subtitles

All subtitles extracted from the file specified by self._path.

Returns

List[TextTypeBboxPageTuple], each of which has its type attribute equal to _TYPE_SUBTITLE.

property titles

All titles extracted from the file specified by self._path.

Returns

List[TextTypeBboxPageTuple], each of which has its type attribute equal to _TYPE_TITLE.

property titles_subtitles

A list with titles and subtitles, sorted according to their reading order.

property titles_subtitles_hierarchy: TitlesSubtitles(titles=<class 'str'>, subtitles=typing.List[str])

All titles and subtitles extracted from the file specified by self._path, hierarchically organized.

Returns

the titles and its respectively subtitles

Return type

List[TitlesSubtitles(str, List[str])]

class dodfminer.extract.pure.utils.title_extractor.TextTypeBboxPageTuple(text, type, bbox, page)
property bbox

Alias for field number 2

property page

Alias for field number 3

property text

Alias for field number 0

property type

Alias for field number 1

class dodfminer.extract.pure.utils.title_extractor.TitlesSubtitles(titles, subtitles)
property subtitles

Alias for field number 1

property titles

Alias for field number 0

dodfminer.extract.pure.utils.title_extractor.extract_titles_subtitles(path)[source]

Extracts titles and subtitles from DODF pdf.

Parameters

path – str indicating the path for the pdf to have its content extracted.

Returns

List[TextTypeBboxPageTuple] containing all titles and subtitles.

dodfminer.extract.pure.utils.title_extractor.gen_hierarchy_base(dir_path='.', folder='hierarchy', indent=4, forced=False)[source]

Generates JSON base from all PDFs immediately under the dir_path directory. The hierarchy files are generated under the dir_path directory.

Parameters
  • dir_path – path to the folder containing PDFs

  • base_name – titles' base file name

  • forced – proceed even if the folder base_name already exists

  • indent – how many spaces will be used for indent

Returns

List[Dict[str, List[Dict[str, List[Dict[str, str]]]]]], e.g.:

[
    {
        "22012019": [
            {
                "PODER EXECUTIVO": []
            },
            {
                "SECRETARIA DE ESTADO DE FAZENDA, PLANEJAMENTO, ORÇAMENTO E GESTÃO": [
                    {
                        "SUBSECRETARIA DA RECEITA": ""
                    }
                ]
            }
        ]
    }
]

In case of error trying to create the base_name folder, returns None.

dodfminer.extract.pure.utils.title_extractor.gen_title_base(dir_path='.', base_name='titles', indent=4, forced=False)[source]

Generates a titles base from all PDFs immediately under the dir_path directory. The base is generated under the dir_path directory.

Parameters
  • dir_path – path to the folder containing the PDFs

  • base_name – titles’ base file name

  • indent – how many spaces will be used for indentation

  • forced – proceed even if base_name already exists

Returns

dict containing “titles” as key and a list of titles as value, the same stored at base_name[.json]

dodfminer.extract.pure.utils.title_extractor.group_by_column(elements, width)[source]

Groups elements by their columns. The sorting assumes they are on the same page and in a 2-column layout.

Essentially a “groupby” where the key is the page number of each span.

Parameters

elements – Iterable[TextTypeBboxPageTuple] sorted by its page number to be grouped.

Returns

A dict with the spans of each page, the keys being the page numbers.

dodfminer.extract.pure.utils.title_extractor.group_by_page(elements)[source]

Groups elements by page number.

Essentially a “groupby” where the key is the page number of each span.

Parameters

elements – Iterable[TextTypeBboxPageTuple] sorted by its page number to be grouped.

Returns

A dict with the spans of each page, the keys being the page numbers.
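The grouping can be sketched with itertools.groupby; Span below is a hypothetical stand-in for TextTypeBboxPageTuple, used only for illustration:

```python
import itertools
from collections import namedtuple

# Hypothetical stand-in for TextTypeBboxPageTuple, for illustration only.
Span = namedtuple("Span", ["text", "type", "bbox", "page"])

def group_spans_by_page(elements):
    """Essentially a "groupby" keyed on the page number of each span."""
    ordered = sorted(elements, key=lambda el: el.page)
    return {page: list(spans)
            for page, spans in itertools.groupby(ordered, key=lambda el: el.page)}

spans = [
    Span("PODER EXECUTIVO", "title", (56, 120, 300, 135), 1),
    Span("SECRETARIA DE ESTADO", "subtitle", (56, 200, 300, 215), 1),
    Span("ANEXO", "title", (56, 120, 300, 135), 2),
]
pages = group_spans_by_page(spans)  # one list of spans per page number
```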

dodfminer.extract.pure.utils.title_extractor.invert_text_type_bbox_page_tuple(text_type_bbox_page_tuple)[source]

Reverses the type between _TYPE_TITLE and _TYPE_SUBTITLE.

Parameters

textTypeBboxPageTuple – instance of TextTypeBboxPageTuple.

Returns

copy of textTypeBboxPageTuple with its type field reversed.

dodfminer.extract.pure.utils.title_extractor.load_blocks_list(path)[source]

Loads a list of block lists from the specified file.

Parameters

path – string with path to DODF pdf file

Returns

A list of page blocks, each element being a list with the blocks of the corresponding page.

dodfminer.extract.pure.utils.title_extractor.sort_2column(elements, width_lis)[source]

Sorts TextTypeBboxPageTuple iterable.

Sorts sequence of TextTypeBboxPageTuple objects, assuming a full 2-columns layout over them.

Parameters

elements – Iterable[TextTypeBboxPageTuple]

Returns

dictionary mapping page number to its elements sorted by column (assuming there are always 2 columns per page)

dodfminer.extract.pure.utils.title_extractor.sort_by_column(elements, width)[source]

Sorts list elements by columns.

Parameters
  • elements – Iterable[TextTypeBboxPageTuple].

  • width – the page width (the context in which all list elements were originally).

Returns

List[TextTypeBboxPageTuple] containing the list elements sorted according to:

  1. columns

  2. position on column

Assumes a 2-column page layout. All elements in the left column are placed before any element in the right one. Within each column, reading order is expected to be kept.
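The two-step ordering described above can be sketched as follows; Span is a hypothetical stand-in for TextTypeBboxPageTuple, with bbox given as (x0, y0, x1, y1):

```python
from collections import namedtuple

# Hypothetical stand-in for TextTypeBboxPageTuple; bbox is (x0, y0, x1, y1).
Span = namedtuple("Span", ["text", "type", "bbox", "page"])

def sort_spans_by_column(elements, width):
    """Left column first, then top-to-bottom (and left-to-right) inside it."""
    mid = width / 2
    def key(span):
        x0, y0 = span.bbox[0], span.bbox[1]
        column = 0 if x0 < mid else 1  # 0 = left column, 1 = right column
        return (column, y0, x0)
    return sorted(elements, key=key)

spans = [
    Span("right-top", "title", (320, 50, 560, 65), 1),
    Span("left-bottom", "title", (30, 700, 280, 715), 1),
    Span("left-top", "title", (30, 50, 280, 65), 1),
]
ordered = sort_spans_by_column(spans, width=600)
# reading order: left-top, left-bottom, right-top
```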

dodfminer.extract.pure.utils.title_extractor._extract_bold_upper_page(page)[source]

Extracts page content which has a bold font and is uppercase.

Parameters

page – fitz.fitz.Page object to have its bold content extracted.

Returns

A list containing all bold (and simultaneously uppercase) content on the page.

dodfminer.extract.pure.utils.title_extractor._extract_bold_upper_pdf(doc)[source]

Extracts bold content from DODF pdf.

Parameters

doc – DODF pdf file returned by fitz.open

Returns

a list of lists of bold span texts

dodfminer.extract.pure.utils.title_extractor._get_titles_subtitles(elements, width_lis)[source]

Extracts titles and subtitles from a list. WARNING: based on font sizes and heuristics.

Parameters

titles_subtitles – a list of dicts, each of them having the keys: size (float), text (str), bbox (Box) and page (int)

Returns

TitlesSubtitles[List[TextTypeBboxPageTuple], List[TextTypeBboxPageTuple]].

dodfminer.extract.pure.utils.title_extractor._get_titles_subtitles_smart(doc, width_lis)[source]

Extracts titles and subtitles. Makes use of heuristics.

Wraps _get_titles_subtitles, removing most impurities (spans which are not titles/subtitles).

Parameters

doc – DODF pdf file returned by fitz.open

Returns

TitlesSubtitles(List[TextTypeBboxPageTuple],

List[TextTypeBboxPageTuple]).


Polished Core

The Act Extractor Class

Returning Objects

The methods in this section return objects or vectors of objects.

Returning Dataframes

The methods in this section return dataframes or vectors of dataframes.

Polished Helper

Acts

Acts are always built as a child class of the base class Atos. Following are the base class structure and a guide for implementing your own act. Also, a list of implemented and missing acts is presented.

Base Class

Implementing new acts

The Acts base class is built in a way that makes the implementation of new acts easy. A programmer seeking to help in the development of new acts need not worry about anything besides the regex or NER itself.

Mainly, the following functions need to be overridden in the child class.

Regex Methods

In case you want to extract through regex, the following functions need to be written:

ActRegex._rule_for_inst()[source]

Rule for extraction of the act

Warning

Must return a regex rule that finds an act in two parts, containing a head and a body, where only the body will be used to search for properties.

Raises

NotImplementedError – Child class needs to overwrite this method.

ActRegex._prop_rules()[source]

Rules for extraction of the properties.

Must return a dictionary of regex rules, where the key is the property type and the value is the rule.

Raises

NotImplementedError – Child class needs to overwrite this method

Additionally, if the programmer wishes to change the regex flags for their class, they can override the following function in the child class:

classmethod ActRegex._regex_flags()[source]

Flags for the regex search
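As an illustration of the two pieces a child class must provide, here is a self-contained sketch; the act name, rules and sample text below are made up and much simpler than the library’s real ones:

```python
import re

# Illustrative only: a made-up act with much simpler rules than real ones.
# Head/body rule: group 1 is the head, group 2 the body.
_INST_RULE = r"(NOMEAR)((?:.|\n)*?)(?=NOMEAR|$)"
# Property rules: key is the property name, value is its regex.
_PROP_RULES = {
    "nome": r"o servidor ([A-ZÀ-Ž ]+),",
    "matricula": r"matr[ií]cula ([\d.-]+)",
}

def extract_props(body):
    """Applies each property rule to the act body."""
    props = {}
    for name, rule in _PROP_RULES.items():
        match = re.search(rule, body)
        props[name] = match.group(1) if match else None
    return props

sample = "NOMEAR o servidor JOAO DA SILVA, matricula 123.456-7, para exercer..."
head_body = re.search(_INST_RULE, sample)   # splits head from body
props = extract_props(head_body.group(2))   # properties come from the body only
```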

NER Methods

If NER will be used, you shall add a trained model to the acts/models folder. Also, the following method should be overridden in your act:

Change the Core File

After all functions have been implemented, the programmer needs to do a minor change in the core file. The following must be added:

from dodfminer.extract.polished.acts.act_file_name import NewActClass
_acts_ids["new_act_name"] = NewActClass

Base Class Mechanisms

None of these functions is accessed directly, but they are listed here in case the programmer implementing an act needs more information.

Implemented Acts and Properties

  • Abono
    • Nome

    • Matricula

    • Cargo_efetivo

    • Classe

    • Padrao

    • Quadro

    • Fundamento_legal

    • Orgao

    • Processo_sei

    • Vigencia

    • Matricula_siape

    • Cargo

    • Lotacao

  • Aposentadoria
    • Ato

    • Processo

    • Nome_ato

    • Cod_matricula_ato

    • Cargo

    • Classe

    • Padrao

    • Quadro

    • Fund_legal

    • Empresa_ato

  • Exoneração Efetivos
    • Nome

    • Matricula

    • Cargo_efetivo

    • Classe

    • Padrao

    • Carreira

    • Quadro

    • Processo_sei

    • Vigencia

    • A_pedido_ou_nao

    • Motivo

    • Fundamento_legal

    • Orgao

    • Simbolo

    • Hierarquia_lotacao

    • Cargo_comissionado

  • Exoneração Comissionados
    • Nome

    • Matricula

    • Simbolo

    • Cargo_comissionado

    • Hierarquia_lotacao

    • Orgao

    • Vigencia

    • Carreir

    • Fundamento_legal

    • A_pedido_ou_nao

    • Cargo_efetivo

    • Matricula_siape

    • Motivo

  • Nomeação Efetivos
    • Edital_normativo

    • Data_edital_normativo

    • Numero_dodf_edital_normativo

    • Data_dodf_edital_normativo

    • Edital_resultado_final

    • Data_edital_resultado_final

    • Numero_dodf_resultado_final

    • Data_dodf_resultado_final

    • Cargo

    • Especialidade

    • Carreira

    • Orgao

    • Candidato

    • Classe

    • Quadro

    • Candidato_pne

    • Padrao

  • Nomeação Comissionados
    • Edital_normativo

    • Data_edital_normativo

    • Numero_dodf_edital_normativo

    • Data_dodf_edital_normativo

    • Edital_resultado_final

    • Data_edital_resultado_final

    • Numero_dodf_resultado_final

    • Data_dodf_resultado_final

    • Cargo

    • Especialidade

    • Carreira

    • Orgao

    • Candidato

    • Classe

    • Quadro

    • Candidato_pne

    • Padrao

  • Retificações de Aposentadoria
    • Tipo do Ato,

    • Tipo de Documento

    • Número do documento

    • Data do documento

    • Número do DODF

    • Data do DODF

    • Página do DODF

    • Nome do Servidor

    • Matrícula

    • Cargo

    • Classe

    • Padrao

    • Matricula SIAPE

    • Informação Errada

    • Informação Corrigida

  • Reversões
    • Processo_sei

    • Nome

    • Matricula

    • Cargo_efetivo

    • Classe

    • Padrao

    • Quadro

    • Fundamento_legal

    • Orgao

    • Vigencia

  • Substituições
    • Nome_substituto

    • Cargo_substituto

    • Matricula_substituto

    • Nome_substituido

    • Matricula_substituido

    • Simbolo_substitut

    • Cargo_objeto_substituicao

    • Simbolo_objeto_substituicao

    • Hierarquia_lotacao

    • Orgao

    • Data_inicial

    • Data_final

    • Matricula_siape

    • Motivo

  • Cessões
    • nome

    • matricula

    • cargo_efetivo

    • classe

    • padrao

    • orgao_cedente

    • orgao_cessionario

    • onus

    • fundamento legal

    • processo_SEI

    • vigencia

    • matricula_SIAPE

    • cargo_orgao_cessionario

    • simbolo

    • hierarquia_lotaca

  • Tornar sem efeito Aposentadoria
    • tipo_documento

    • numero_documento

    • data_documento

    • numero_dodf

    • data_dodf

    • pagina_dodf

    • nome

    • matricula

    • matricula_SIAPE

    • cargo_efetivo

    • classe

    • padrao

    • quadro

    • orgao

    • processo_SE

Regex Backend

Regex backend for act and property extraction.

This module contains the ActRegex class, which has all that is necessary to extract an act and its properties using regex rules.

class dodfminer.extract.polished.backend.regex.ActRegex[source]

Act Regex Class.

This class encapsulates all functions and attributes related to the regex extraction process.

Note

This class is one of the parent classes of the base act class.

_flags

All the regex flags which will be used in extraction.

_rules

The regex rules for proprieties extraction.

_inst_rule

The regex rule for act extraction.

_find_prop_value(rule, act)[source]

Finds a single property in a single act.

Parameters
  • rule (str) – The regex rule to search for.

  • act (str) – The act to apply the rule.

Returns

The found property, or NaN in case nothing is found.
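This behavior can be sketched independently of the library (the rules and the act text below are illustrative, not the library’s actual patterns):

```python
import math
import re

def find_prop_value(rule, act):
    """Returns the first capture group of rule in act, or NaN when absent."""
    match = re.search(rule, act)
    return match.group(1) if match else math.nan

act = "EXTRATO DE CONTRATO Processo: 00146-0000000457/2021-01"
found = find_prop_value(r"Processo: ([\d./-]+)", act)   # the process number
missing = find_prop_value(r"Vigencia: (\d+)", act)      # NaN: field not present
```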

_prop_rules()[source]

Rules for extraction of the properties.

Must return a dictionary of regex rules, where the key is the property type and the value is the rule.

Raises

NotImplementedError – Child class needs to overwrite this method

classmethod _regex_flags()[source]

Flags for the regex search

_regex_props(act_raw) dict[source]

Creates an act dict with all its properties.

Parameters

act_raw (str) – The raw text of a single act.

Returns

The act and its properties in dictionary format.

_rule_for_inst()[source]

Rule for extraction of the act

Warning

Must return a regex rule that finds an act in two parts, containing a head and a body, where only the body will be used to search for properties.

Raises

NotImplementedError – Child class needs to overwrite this method.

NER Backend

Acknowledgements

We gratefully acknowledge the contributions of the many people who helped get this project off of the ground, including people who beta tested the software, gave feedback on the material, improved dependencies of DODFMiner code in service of this release, or otherwise supported the project. Given the number of people who were involved at various points, this list of names may not be exhaustive. (If you think you should have been listed here, please do not hesitate to reach out.)

In no particular order, thank you Khalil Carsten, Renato Nobre, Isaque Alves, Leonardo Maffei, João Zarbiélli, Felipe Almeida, Davi Alves, Fabrício Braz, Thiago Faleiros and Nilton Correia.

We are also grateful to the University of Brasília, TCDF and Finatec (Fundação de Empreendimentos Científicos e Tecnológicos) for the partnership, and to FAPDF (Fundação de Apoio à Pesquisa do Distrito Federal) for the funding.

About the KneDLE Team

_images/knedle.svg

The project “KnEDLe - Knowledge Extraction from Documents of Legal content” is a partnership among FAPDF (Fundação de Apoio à Pesquisa do Distrito Federal), UnB (the University of Brasília) and Finatec (Fundação de Empreendimentos Científicos e Tecnológicos), sponsored by FAPDF. This project was proposed in order to employ official publications as a research object and to extract knowledge. The objective is to develop intelligent tools for extracting structured information from such publications, aiming to facilitate the search and retrieval of information, increasing government transparency and facilitating audit tasks and detecting problems related to the use of public resources.

Check our website

JSON Acts Extraction Tutorial

This tutorial is meant to help with the process of manually extracting acts from section 3 of the DODF JSON files. These are the types of acts extracted from section 3:

  • Contrato / Convênio

  • Aditamento

  • Licitação

  • Anulação / Revogação

  • Suspensão

Requirements: in this tutorial, it is assumed that you have already installed the DODFMiner requirements and that you have DODF JSON files; in case you don’t, this is where you can find them.

The first step to do is importing the DODFMiner ActsExtractor class in order to extract the acts from a JSON file:

from dodfminer.extract.polished.core import ActsExtractor

Each of the 5 types of acts has its own class that manages the whole extraction process from the JSON file.

There are two ways to do so: extracting all acts of a specific type or extracting all acts at once. The default model of extraction used is CRF, but you may use your own trained model.

Extracting a Specific Type of Act

The get_act_obj method will be used to extract one type of act from the JSON file.

ActsExtractor.get_act_obj(ato_id="ID", file="PATH/TO/JSON/FILE.json")
  • Parameters:

    • ato_id (string) - Act ID restricted to the following keys:

      • aditamento

      • licitacao

      • suspensao

      • anulacao_revogacao

      • contrato_convenio

    • file (string): Path of the JSON file.

    • pipeline (object): Scikit-learn pipeline object (optional).

  • Returns:

    • An object of the desired act, already with extracted information.

Using a Specific Model with Scikit-learn Pipeline

If you’re not familiar with Scikit-learn Pipeline, you can learn more.

If you want to use a different model you can do so by passing a scikit-learn pipeline object as a parameter of the get_act_obj method. There are a few things you have to do:

  • Specify the pipeline parameter when calling the method. Ex: get_act_obj(pipeline=pipeline_obj).

  • Set an element in your pipeline with key pre-processing which will be responsible for pre-processing and tokenization. This process has to be called by the method Pipeline['pre-processing'].transform(X) where X is a list with the input data.

  • The model that extends the BaseEstimator class must return its output in IOB tag format.

If these requirements are not followed, the generated dataframe will not be correct.
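The IOB tag format mentioned in the last requirement can be illustrated with a small, self-contained sketch; the tokens, tags and helper below are hypothetical and only show how B-/I-/O tags delimit entities:

```python
# Hypothetical IOB-tagged output: each token gets a B- (begin), I- (inside)
# or O (outside) tag; contiguous B-/I- tokens form one entity.
tokens = ["EXTRATO", "DE", "CONTRATO", "Processo", ":", "00146-00055194/2022-84"]
tags = ["O", "O", "O", "O", "O", "B-PROCESSO"]

def entities_from_iob(tokens, tags):
    """Collects (entity_type, text) pairs from IOB-tagged tokens."""
    entities, current_type, current = [], None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):                     # a new entity begins
            if current_type:
                entities.append((current_type, " ".join(current)))
            current_type, current = tag[2:], [token]
        elif tag.startswith("I-") and current_type:  # entity continues
            current.append(token)
        else:                                        # outside any entity
            if current_type:
                entities.append((current_type, " ".join(current)))
            current_type, current = None, []
    if current_type:
        entities.append((current_type, " ".join(current)))
    return entities

entities = entities_from_iob(tokens, tags)
```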

Here’s an example step-by-step:

# 1. Create the pipeline as required.
pipeline_obj = Pipeline([('pre-processing', Processing()),
                         ('feature-extraction', FeatureExtractor()),
                         ('model', Model())])

# 2. Pre-process and tokenize the input data.
pipeline_obj['pre-processing'].transform(X)

# 3. Train the model.
pipeline_obj.fit(X, y)

# 4. Call the extraction method.
result = get_act_obj("aditamento", "PATH/TO/JSON/FILE.json", pipeline=pipeline_obj)

# 5. Access the extracted data.
dataframe = result.df

Extracting All Acts

In order to extract all acts at once, you have to use the get_all_obj method.

ActsExtractor.get_all_obj(file="PATH/TO/JSON/FILE.json")
  • Parameters:

    • file (string) - Path to JSON file.

  • Returns:

    • Dictionary containing the class objects correspondent to each type of act found.

Returned object details

If you extract all acts at once, the returned object will be a dictionary with a key for each type of act. The value of each key is the respective act object.

You can access them by the following keys:

  • aditamento

  • licitacao

  • suspensao

  • anulacao_revogacao

  • contrato_convenio

In case you extract only one type of act, the respective act object will be returned. The act objects have a pandas dataframe attribute df containing all acts extracted and their entities.

Here’s an example of accessing the dataframe of contrato_convenio:

df = d['contrato_convenio'].df

Aditamento

These are the entities captured in aditamento acts:

  • numero_dodf

  • titulo

  • text

  • NUM_ADITIVO

  • CONTRATANTE

  • OBJ_ADITIVO

  • PROCESSO

  • NUM_AJUSTE

  • DATA_ESCRITO

Here’s an example of the acts within the dataframe:

numero_dodf titulo text NUM_ADITIVO CONTRATANTE OBJ_ADITIVO PROCESSO NUM_AJUSTE DATA_ESCRITO
233 I TERMO ADITIVO AO CONTRATO BRB 011/2022 I TERMO ADITIVO AO CONTRATO BRB 011/2022 Contr... I TERMO ADITIVO [BRB, BRB, BRB] prorrogação 12 meses até 19/01/2024 1.096/2021 06/2021 19/12/2022
233 EXTRATO DO 1º TERMO ADITIVO AO CONTRATO Nº 19/... EXTRATO DO 1º TERMO ADITIVO AO CONTRATO Nº 19/... 1º TERMO ADITIVO [SEEDF, SEEDF] a ) Alterar a razão social da Contratada , de ... 19/2022 19/2022 14/12/2022

Licitação

These are the entities captured in licitacao acts:

  • numero_dodf

  • titulo

  • text

  • MODALIDADE_LICITACAO

  • NUM_LICITACAO

  • ORGAO_LICITANTE

  • OBJ_LICITACAO

  • VALOR_ESTIMADO

  • SISTEMA_COMPRAS

  • PROCESSO

  • DATA_ABERTURA

  • CODIGO_SISTEMA_COMPRAS

numero_dodf titulo text MODALIDADE_LICITACAO NUM_LICITACAO ORGAO_LICITANTE OBJ_LICITACAO VALOR_ESTIMADO SISTEMA_COMPRAS PROCESSO DATA_ABERTURA CODIGO_SISTEMA_COMPRAS
233 AVISO DE ABERTURA DE LICITAÇÃO AVISO DE ABERTURA DE LICITAÇÃO PREGÃO ELETRÔNI... PREGÃO ELETRÔNICO 26/2022 Fundação Hemocentro de Brasília aquisição de Materiais Médico-Hospitalares par... 6.686,20 www.gov.br/compras 35/2022 22/12/2022 925008
233 AVISO DE LICITAÇÃO AVISO DE LICITAÇÃO PREGÃO ELETRÔNICO PE Nº 274.. PREGÃO ELETRÔNICO [274/2022, 00092-00055194.2022-84] Caesb Aquisição de materiais de ferro fundido para r... 3.835.600,90 [https:/www.gov.br/compras/pt-br, https:/... 21.205.100.020-2 04/01/2023 974200

Suspensão

These are the entities captured in suspensao acts:

  • numero_dodf

  • titulo

  • text

  • PROCESSO

  • DATA_ESCRITO

  • OBJ_ADITIVO

Here’s an example of the acts in a dataframe:

numero_dodf titulo text PROCESSO OBJ_ADITIVO CONTRATANTE NUM_AJUSTE NUM_ADITIVO DATA_ESCRITO
215 AVISO DE SUSPENSÃO AVISO DE SUSPENSÃO PREGÃO ELETRÔNICO POR SRP N... 00055-00045741/2020-54 a suspensão da licitação supracitada , a qual ... Secretaria de Estado de Saúde do Distrito Federal 14/2021 03/2021 11 de novembro de 2021
118 AVISO DE SUSPENSÃO DO PREGÃO ELETRÔNICO Nº 49/... AVISO DE SUSPENSÃO DO PREGÃO ELETRÔNICO Nº 49/... 00050-00002711/2021-75 a suspensão de realização do PE nº 049/2022 BRB 21/2022 58/2021 07/10/2021

Anulação e Revogação

These are the entities captured in anulacao_revogacao acts:

  • numero_dodf

  • titulo

  • text

  • IDENTIFICACAO_OCORRENCIA

  • MODALIDADE_LICITACAO

Here’s an example of the acts in a dataframe:

numero_dodf titulo text IDENTIFICACAO_OCORRENCIA MODALIDADE_LICITACAO
160 AVISO DE REVOGAÇÃO DE LICITAÇÃO AVISO DE REVOGAÇÃO DE LICITAÇÃO O Presidente d... REVOGAÇÃO Concorrência
38 AVISO DE REVOGAÇÃO DE LICITAÇÃO AVISO DE REVOGAÇÃO DE LICITAÇÃO A Caesb torna ... Caesb Licitação Fechada

Contrato/Convênio

These are the entities captured in contrato_convenio acts:

  • numero_dodf

  • titulo

  • text

  • PROCESSO

  • NUM_AJUSTE

  • CONTRATANTE_ou_CONCEDENTE

  • CONTRATADA_ou_CONVENENTE

  • CNPJ_CONTRATADA_ou_CONVENENTE

  • OBJ_AJUSTE

  • VALOR

  • CODIGO_UO

  • FONTE_RECURSO

  • NATUREZA_DESPESA

  • NOTA_EMPENHO

  • VIGENCIA

  • DATA_ASSINATURA

  • PROGRAMA_TRABALHO

  • NOME_RESPONSAVEL

  • CNPJ_CONTRATANTE_ou_CONCEDENTE

Here’s an example of the acts in a dataframe:

numero_dodf titulo text PROCESSO NUM_AJUSTE CONTRATANTE_ou_CONCEDENTE CONTRATADA_ou_CONVENENTE CNPJ_CONTRATADA_ou_CONVENENTE OBJ_AJUSTE VALOR CODIGO_UO FONTE_RECURSO NATUREZA_DESPESA NOTA_EMPENHO VIGENCIA DATA_ASSINATURA PROGRAMA_TRABALHO NOME_RESPONSAVEL CNPJ_CONTRATANTE_ou_CONCEDENTE
38 EXTRATO DE CONTRATO EXTRATO DE CONTRATO Contrato nº 9441. Assinatu... 00146-0000000457/2021-01 37/2021 CAESB L2A UNIAO LTDA 90.180.605/0001-02 Fornecimento de acesso à sistema informatizado... [23.722,14, 23.722,14] 22.202 11.101.000.000-3 4.4.90.51 2021NE00764 12 ( doze ) e 12 ( doze ) mês ( es ) , respect... 21/02/2022 28.845.0903.00NR.0053 Bruno Costa Nunes 23.791.169/0001-02
38 EXTRATO DE CONTRATO Nº 045723/2022 EXTRATO DE CONTRATO Nº 045723/2022 Processo: 0... 00366-00000136/2022-11 045723/2022 ADMINISTRAÇÃO REGIONAL DE VICENTE PIRES/RA-VP ... OURO GÁS 27.983.951/0001-84 Aquisição de gás liquefeito de petróleo , boti... 1.991,20 09133 100 339030 2022NE00016 O contrato terá vigência do contrato será a pa... 21/02/2022 04.122.6001.8517.0095 THIAGO H. M. DOS SANTOS 16.615.705/0001-53

Obtaining JSON Files

If you do not have any JSON files to extract data from, you can find them on this page. In your web browser, just right-click on the page, click “save as” and select the JSON file type.

The page is updated every day with the DODF of the day. Unfortunately, there is no database of previous DODFs available.