API Reference

doc2data.pdf

Reading PDF files and creating collections.

PDFCollection

Creates a collection from pdf files that are stored in a directory.

This class serves multiple purposes. First, it parses pdf files from a directory to create a collection. Second, it provides overview information about the collection on page level. Third, it gives access to the contents of individual pdfs through a dictionary of PDFFile objects.

The PDFCollection class does not store any files itself. Instead, it only keeps track of the files present in the target folder via a dictionary. Therefore, once a collection is created, files should not be removed from the source folder.

Examples:

>>> from doc2data.pdf import PDFCollection
>>> pdf_collection = PDFCollection('path_to_files') # create collection
>>> pdf_collection.parse_files() # populate collection
>>> print(pdf_collection.overview) # inspect collection

Attributes:

Name	Type	Description
`path_to_files`		Path to source directory containing the pdf files.
`pdfs`		A dictionary containing a PDFFile object for each file.
`ignored_files`		A list of file names that could not be processed.
`overview`		A Pandas DataFrame containing summary information on page level.

count_source_files

count_source_files()

Counts files in the source directory.
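
As a rough illustration of what counting files in a source directory can look like, the sketch below uses `pathlib`. The helper name `count_files` and the optional `suffix` filter are assumptions made for this example; the reference above does not say whether `count_source_files` filters by extension.

```python
from pathlib import Path


def count_files(directory, suffix=None):
    """Count regular files in `directory` (hypothetical helper, not doc2data code)."""
    files = [p for p in Path(directory).iterdir() if p.is_file()]
    if suffix is not None:
        # Optional, assumed filter: keep only files with the given extension.
        files = [p for p in files if p.suffix.lower() == suffix.lower()]
    return len(files)
```

Comparing such a count with the number of entries in `pdfs` plus `ignored_files` is one way to check that every source file was accounted for.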

load `staticmethod`

load(file_path)

Loads serialized collection.

Parameters:

Name	Type	Description	Default
`file_path`		File path from where to load the collection	required

Returns:

Type	Description
	A PDFCollection object.

parse_files

parse_files(use_multiprocessing=True)

Populates the collection with PDFFile objects based on source directory.

Additionally, information on the content is extracted and recorded on page level.

Parameters:

Name	Type	Description	Default
`use_multiprocessing`		Use multiprocessing to speed up population using all available cores	`True`

parse_single_file

parse_single_file(file_name)

Instantiates PDFFile object and parses pages.

save

save(file_path, overwrite=False)

Serializes collection to a file with pickle.

Path directories are created if they do not exist.

Parameters:

Name	Type	Description	Default
`file_path`		File path where to save the collection.	required
`overwrite`		Set to True if an existing collection should be overwritten.	`False`
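
The behaviour described for `save` and `load` (pickle serialization, creation of missing directories, an explicit `overwrite` flag) can be sketched with the standard library alone. The function names `save_object` and `load_object` are hypothetical, and raising `FileExistsError` when the target exists is an assumption; the reference does not state what doc2data does in that case.

```python
import pickle
from pathlib import Path


def save_object(obj, file_path, overwrite=False):
    """Pickle `obj` to `file_path`, creating parent directories if needed."""
    path = Path(file_path)
    if path.exists() and not overwrite:
        # Assumed behaviour: refuse to replace an existing file by default.
        raise FileExistsError(f"{path} exists; pass overwrite=True to replace it.")
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(obj, f)


def load_object(file_path):
    """Load a pickled object. Only unpickle files from a trusted source."""
    with Path(file_path).open("rb") as f:
        return pickle.load(f)
```

Because the collection only tracks file paths, a reloaded collection presumably still depends on the source files being present at their original location.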

PDFFile

Class representing an individual pdf file.

Attributes:

Name	Type	Description
`path_to_pdf`		Path to pdf file.
`file_name`		File name.
`n_pages`		Number of pages in the pdf file.
`parsed_pages`		List of Page objects providing an interface to each page.
`loaded_successfully`		Boolean indicating if pdf file could be processed.

open_fitz_document

open_fitz_document()

Opens the pdf file as fitz object.

parse_pages

parse_pages()

Iterates over pages and instantiates Page objects.

Raises:

Type	Description
`AssertionError`	If the file does not have a .pdf extension.
`RuntimeError`	If pymupdf fails to open the file.

Page

Class representing individual pages from a pdf file.

This class allows accessing the contents of a page using the pymupdf package. It also contains additional attributes that describe the page content. All attribute values are obtained through the pymupdf interface and therefore depend on its accuracy.

Attributes:

Name	Type	Description
`pt_height`		Height of the page in points (1 point = 1/72 inch).
`pt_width`		Width of the page in points (1 point = 1/72 inch).
`number`		Page number within the pdf file.
`content_type`		Type of information contained in the page. One of: `text` (pure text), `images` (one or more images), `mixed` (a combination of text and images), `recovered_image` (pymupdf did not detect any content, so the page is rendered as an image).
`n_tokens`		Number of tokens identified by pymupdf. Roughly corresponds to words.
`n_images`		Number of images identified by pymupdf.
`pcnt_chars_corrupted`		Proportion of characters that could not be correctly decoded (replacement characters).
`path_to_pdf`		Path to pdf file containing the page.
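
The four `content_type` values can be read as a function of the token and image counts. The sketch below makes that reading explicit. It is an assumption that the library derives `content_type` from `n_tokens` and `n_images` in exactly this way, and `classify_content` is a hypothetical name.

```python
def classify_content(n_tokens, n_images):
    """Map token and image counts to a content type (illustrative only)."""
    if n_tokens > 0 and n_images > 0:
        return "mixed"
    if n_tokens > 0:
        return "text"
    if n_images > 0:
        return "images"
    # Nothing detected: the page would be rendered as an image instead.
    return "recovered_image"
```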

read_contents

read_contents(types, force_rgb=None, dpi=None)

Read page contents via the pymupdf interface.

This opens the pdf file with pymupdf and extracts the requested content. Multiple content types can be provided simultaneously.

Parameters:

Name	Type	Description	Default
`types`		String or list of strings indicating which contents should be returned. Possible content types: `tokens` (tokens with bounding boxes; a token roughly corresponds to a word), `text` (string containing all tokens in the reading order recovered by pymupdf), `images` (images with bounding boxes), `page_image` (entire page as one image), `raw_dict` (raw output from pymupdf).	required
`force_rgb`		Only if 'page_image' in types: Converts to RGB and adds white background.	`None`
`dpi`		Only if 'page_image' in types: Resolution to use when converting pdf to image.	`None`

Returns:

Type	Description
	A dictionary with the requested contents. If a single type was requested, its content is returned directly instead of a dictionary.

Examples:

>>> from doc2data.pdf import PDFFile
>>> pdf_file = PDFFile('path_to_pdf')
>>> pdf_file.parse_pages()
>>> page = pdf_file.parsed_pages[0]
>>> page.read_contents('page_image')
>>> page.read_contents(['tokens', 'images'])

Raises:

Type	Description
`ValueError`	If one or more requested types are not recognized.
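
The argument handling described for `read_contents` (a string or a list of strings, a `ValueError` for unknown types, a bare value when only one type is requested) can be sketched without pymupdf. Here `select_contents` and the `available` dictionary are hypothetical stand-ins for the real extraction step, and treating a one-element list like a single string is an assumption.

```python
KNOWN_TYPES = ("tokens", "text", "images", "page_image", "raw_dict")


def select_contents(available, types):
    """Pick the requested entries from `available` (illustrative only)."""
    # Accept either one type name or a list of type names.
    requested = [types] if isinstance(types, str) else list(types)
    unknown = [t for t in requested if t not in KNOWN_TYPES]
    if unknown:
        raise ValueError(f"Unrecognized content types: {unknown}")
    result = {t: available[t] for t in requested}
    # Assumed: a single requested type is returned without the dictionary.
    if len(requested) == 1:
        return result[requested[0]]
    return result
```

Callers that request a variable number of types may want to wrap a single result back into a dictionary so that downstream code handles one shape only.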

show_page_image

show_page_image(dpi=None, show_bboxes=False, bbox_color='red')

Wrapper for Page.read_contents to show page as image.

Parameters:

Name	Description	Default
`dpi`	Resolution to use when converting pdf to image.	`None`
`show_bboxes`	Boolean indicating whether to draw bounding boxes around the tokens and images.	`False`
`bbox_color`	Color of the bounding boxes.	`'red'`

Returns:

Type	Description
	A PIL image of the document page.
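
To get a feel for the `dpi` parameter, the sketch below estimates the pixel size of a rendered page from its size in points, using the 1 point = 1/72 inch relation given for `pt_width` and `pt_height`. The helper `rendered_size` is hypothetical, and the exact rounding applied by pymupdf may differ.

```python
POINTS_PER_INCH = 72


def rendered_size(pt_width, pt_height, dpi):
    """Approximate (width, height) in pixels of a page rendered at `dpi`."""
    width_px = round(pt_width * dpi / POINTS_PER_INCH)
    height_px = round(pt_height * dpi / POINTS_PER_INCH)
    return width_px, height_px


# US Letter is 612 x 792 points, i.e. 8.5 x 11 inches.
print(rendered_size(612, 792, 150))  # (1275, 1650)
```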

doc2data.utils

Utilities.

convert_to_rgb

convert_to_rgb(image, add_white_background=True)

Converts a PIL Image from RGBA to RGB.

A white background is added by default to prevent transparent pixels from being rendered black.
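
Why the white background matters can be seen on a single pixel. The sketch below is plain Python rather than the library's PIL-based code, and both function names are hypothetical. It assumes, as is common, that a fully transparent pixel stores black in its color channels.

```python
def drop_alpha(rgba):
    """Discard the alpha channel without compositing."""
    return tuple(rgba[:3])


def composite_over_white(rgba):
    """Blend an RGBA pixel (0-255 channels) over a white background."""
    r, g, b, a = rgba
    alpha = a / 255
    return tuple(round(c * alpha + 255 * (1 - alpha)) for c in (r, g, b))


transparent = (0, 0, 0, 0)
print(drop_alpha(transparent))            # (0, 0, 0), rendered black
print(composite_over_white(transparent))  # (255, 255, 255), rendered white
```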

denormalize_bbox

denormalize_bbox(bounding_box, width, height)

Calculates absolute coordinates of a bounding box.

get_pcnt_chars_corrupted

get_pcnt_chars_corrupted(page)

Calculates proportion of replacement characters on page.
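
A minimal sketch of this calculation on a plain string is shown below. It assumes that "replacement characters" means U+FFFD and that every character counts toward the total. The real function takes a `page` argument, whereas the hypothetical `proportion_corrupted` works on text that has already been extracted.

```python
REPLACEMENT_CHAR = "\ufffd"  # Unicode replacement character


def proportion_corrupted(text):
    """Share of characters in `text` that are U+FFFD (0.0 for empty text)."""
    if not text:
        return 0.0
    return text.count(REPLACEMENT_CHAR) / len(text)
```

A high value usually suggests that the embedded text layer is unreliable and that the page may be a candidate for OCR on the rendered image.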

load_image

load_image(file_path, target_size=None, to_array=True, force_rgb=True)

Loads image from file.

normalize_bbox

normalize_bbox(bounding_box, width, height)

Calculates relative coordinates of a bounding box.
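
The relationship between `normalize_bbox` and `denormalize_bbox` can be sketched as a pair of inverse functions. The box layout `(x0, y0, x1, y1)` and the division by page width and height are assumptions; the reference does not document the bounding box format. The names `to_relative` and `to_absolute` are hypothetical.

```python
def to_relative(bounding_box, width, height):
    """Scale absolute coordinates to the 0-1 range (assumed x0, y0, x1, y1)."""
    x0, y0, x1, y1 = bounding_box
    return (x0 / width, y0 / height, x1 / width, y1 / height)


def to_absolute(bounding_box, width, height):
    """Scale relative coordinates back to absolute units."""
    x0, y0, x1, y1 = bounding_box
    return (x0 * width, y0 * height, x1 * width, y1 * height)
```

Relative coordinates stay valid when a page is rendered at a different `dpi`, which is the usual reason for storing boxes in normalized form.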
