Polished Helper

Polished extraction helper functions.

Functions in this file can be used inside or outside the ActsExtractor class. Their purpose is to make some tasks easier for the user, such as creating text files, searching through files, and printing dataframes.

Usage Example:

from dodfminer.extract.polished import helper
helper.print_dataframe(df)

Functions

dodfminer.extract.polished.helper.build_act_txt(acts, name, save_path='./results/')[source]

Create a text file on disk for an act type.

Note

This function might save data to disk in text format.

Parameters
  • acts ([str]) – List of all acts to save in the text file.

  • name (str) – Name of the output file.

  • save_path (str) – Path to save the text file.
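
As a rough illustration of what saving acts to a text file can look like, the sketch below writes a list of act strings to a file under save_path. It is not the library's implementation: the function name save_acts_as_txt, the .txt suffix, and the blank line placed between acts are all assumptions made for this example.

```python
from pathlib import Path


def save_acts_as_txt(acts, name, save_path="./results/"):
    """Hypothetical stand-in for build_act_txt: write acts to a text file."""
    out_dir = Path(save_path)
    # Create the output folder if it does not exist yet.
    out_dir.mkdir(parents=True, exist_ok=True)
    out_file = out_dir / f"{name}.txt"  # assumption: ".txt" suffix
    # Assumption: acts are separated by one blank line.
    out_file.write_text("\n\n".join(acts), encoding="utf-8")
    return out_file
```

Returning the path of the written file is a choice made for this sketch so that callers can check the result; the documented function does not list a return value.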

dodfminer.extract.polished.helper.committee_classification(all_acts, path, types, backend)[source]

Uses committee classification to find act types.

Parameters
  • all_acts (DataFrame) – Dataframe with acts text and regex type.

  • path (str) – Folder where the DODF files are located.

  • types ([str]) – Types of the acts; see the core class for the available types.

  • backend (str) – Backend used to extract the acts, one of {regex, ner}.

Returns

None

dodfminer.extract.polished.helper.extract_multiple(files, act_type, backend, txt_out=False, txt_path='./results')[source]

Extract an act type from multiple DODF files into a single DataFrame.

Note

This function might save data to disk in text format, if txt_out is True.

Parameters
  • files ([str]) – List of DODF file paths.

  • act_type (str) – Type of the act; see the core class for the available types.

  • backend (str) – Backend used to extract the acts, one of {regex, ner}.

  • txt_out (bool) – Whether the acts should also be saved to text files.

  • txt_path (str) – Path where the text files are saved.

Returns

A dataframe containing all instances of the desired act in the files set.

dodfminer.extract.polished.helper.extract_multiple_acts(path, types, backend)[source]

Extract multiple act types from multiple DODF files into CSV files named after each act.

Parameters
  • path (str) – Folder where the DODF files are located.

  • types ([str]) – Types of the acts; see the core class for the available types.

  • backend (str) – Backend used to extract the acts, one of {regex, ner}.

Returns

None
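
To make the idea of "act named CSVs" concrete, the sketch below writes one CSV file per act type using only the standard library. It is an illustration under stated assumptions, not the library's code: the helper write_act_csvs, the act type labels, the column names, and the rule that a type with no rows produces no file are all hypothetical.

```python
import csv
from pathlib import Path


def write_act_csvs(rows_by_type, out_dir="."):
    """Hypothetical helper: write one CSV per act type.

    rows_by_type maps an act type name to a list of dict rows.
    All rows of one type are assumed to share the same keys.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for act_type, rows in rows_by_type.items():
        if not rows:
            continue  # assumption: nothing extracted means no file
        csv_path = out / f"{act_type}.csv"
        with csv_path.open("w", newline="", encoding="utf-8") as handle:
            writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
            writer.writeheader()
            writer.writerows(rows)
        written.append(csv_path)
    return written
```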

dodfminer.extract.polished.helper.extract_multiple_acts_parallel(path: str, types: List[str], backend: str, processes=4)[source]

Extract multiple act types from multiple DODF files into CSV files named after each act, in parallel.

Parameters
  • path (str) – Folder where the DODF files are located.

  • types ([str]) – Types of the acts; see the core class for the available types.

  • backend (str) – Backend used to extract the acts, one of {regex, ner}.

  • processes (int) – Number of processes to use; defaults to 4.

Returns

None

dodfminer.extract.polished.helper.extract_multiple_acts_with_committee(path, types, backend)[source]

Extract multiple act types from multiple DODF files into CSV files named after each act. Uses committee_classification to find the act types.

Parameters
  • path (str) – Folder where the DODF files are located.

  • types ([str]) – Types of the acts; see the core class for the available types.

  • backend (str) – Backend used to extract the acts, one of {regex, ner}.

Returns

None

dodfminer.extract.polished.helper.extract_single(file, act_type, backend)[source]

Extract an act type from a single DODF file into a single DataFrame.

Note

This function might save data to disk in text format, if txt_out is True.

Parameters
  • file (str) – Path to the DODF file.

  • act_type (str) – Type of the act; see the core class for the available types.

  • backend (str) – Backend used to extract the acts, one of {regex, ner}.

Returns

A tuple containing, respectively, a dataframe with all instances of the desired act, including the texts found, and a list of the segmented text blocks.

dodfminer.extract.polished.helper.get_files_path(path, file_type)[source]

Get the paths of all files inside a folder.

Works with nested folders.

Parameters

path – Folder to search for files.

Returns

A list of strings with the file paths.
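
A recursive search of this kind can be sketched with os.walk from the standard library. The example below is an assumption about the general approach, not the library's implementation: the name find_files is hypothetical, and file_type is assumed to be a file extension such as "pdf", which the documentation above does not confirm.

```python
import os


def find_files(path, file_type):
    """Hypothetical stand-in for get_files_path using os.walk."""
    # Assumption: file_type is an extension, with or without a leading dot.
    suffix = "." + file_type.lower().lstrip(".")
    matches = []
    # os.walk visits the folder and every nested subfolder.
    for root, _dirs, files in os.walk(path):
        for file_name in files:
            if file_name.lower().endswith(suffix):
                matches.append(os.path.join(root, file_name))
    return sorted(matches)
```

Sorting the result is a choice made here to keep the output deterministic across platforms.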

dodfminer.extract.polished.helper.print_dataframe(data_frame)[source]

Style a DataFrame.

Parameters

data_frame – The dataframe to be styled.

Returns

The styled dataframe

dodfminer.extract.polished.helper.run_extract_simple_wrap(file: str, act_type: str, backend: str) -> Tuple[str, pandas.DataFrame][source]

Run a single extraction.

dodfminer.extract.polished.helper.run_thread_wrap(files: list, act_type: str, backend: str, all_acts: Queue) -> None[source]

Run multiple extractions
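
The all_acts parameter is typed as a Queue, which suggests that each worker pushes its results onto a shared queue for the caller to collect. The sketch below shows that general pattern with the standard library only. Everything in it is hypothetical: fake_extract stands in for a real extraction, and worker and run_in_threads are names invented for this example rather than part of the library.

```python
import threading
from queue import Queue


def fake_extract(file, act_type, backend):
    """Placeholder for a real extraction; returns a small summary string."""
    return f"{act_type}:{backend}:{file}"


def worker(files, act_type, backend, results):
    """Run one extraction per file and push each result onto the queue."""
    for file in files:
        results.put(fake_extract(file, act_type, backend))


def run_in_threads(files, act_type, backend, n_threads=2):
    """Split the files across threads and collect results from a queue."""
    results = Queue()
    # Give each thread every n-th file so the slices are roughly equal.
    chunks = [files[i::n_threads] for i in range(n_threads)]
    threads = [
        threading.Thread(target=worker, args=(chunk, act_type, backend, results))
        for chunk in chunks
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    # All workers have finished, so the queue can be drained safely.
    collected = []
    while not results.empty():
        collected.append(results.get())
    return collected
```

Results arrive in whatever order the threads finish, so callers that need a stable order should sort them afterwards.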

dodfminer.extract.polished.helper.run_thread_wrap_multiple(files: list, act_type: str, backend: str) -> Tuple[str, pandas.DataFrame][source]

Run multiple extractions

dodfminer.extract.polished.helper.xml_multiple(path, backend)[source]