API Reference

doc2data.pdf

Reading PDF files and creating collections.

PDFCollection

Creates a collection from pdf files that are stored in a directory.

This class serves multiple purposes: First, it parses pdf files from a directory to create a collection. Second, it provides overview information about the collection at page level. Third, it gives access to the contents of individual pdfs through a dictionary of PDFFile objects.

The PDFCollection class does not store any files itself. Instead, it only keeps track of the files present in the target folder via a dictionary. Therefore, once a collection is created, files should not be removed from the source folder.

Examples:

>>> from doc2data.pdf import PDFCollection
>>> pdf_collection = PDFCollection('path_to_files') # create collection
>>> pdf_collection.parse_files() # populate collection
>>> print(pdf_collection.overview) # inspect collection

Attributes:

Name Type Description
path_to_files

Path to source directory containing the pdf files.

pdfs

A dictionary containing a PDFFile object for each file.

ignored_files

A list of file names that could not be processed.

overview

A Pandas DataFrame containing summary information on page level.

count_source_files

count_source_files()

Counts files in the source directory.

load staticmethod

load(file_path)

Loads serialized collection.

Parameters:

Name Type Description Default
file_path

File path from which to load the collection.

required

Returns:

Type Description

A PDFCollection object.

parse_files

parse_files(use_multiprocessing=True)

Populates the collection with PDFFile objects based on source directory.

Additionally, information on the content is extracted and recorded on page level.

Parameters:

Name Type Description Default
use_multiprocessing

Use multiprocessing across all available cores to speed up population.

True

parse_single_file

parse_single_file(file_name)

Instantiates PDFFile object and parses pages.

save

save(file_path, overwrite=False)

Serializes collection to a file with pickle.

Path directories are created if they do not exist.

Parameters:

Name Type Description Default
file_path

File path where to save the collection.

required
overwrite

Set to True if an existing collection should be overwritten.

False
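The save and load behavior described above can be sketched with the standard library alone. This is a minimal illustration, not the library's implementation: `save_object` and `load_object` are hypothetical helpers, a plain dictionary stands in for a collection, and raising `FileExistsError` when `overwrite` is False is an assumption (the reference does not say how that case is signalled).

```python
import pickle
import tempfile
from pathlib import Path


def save_object(obj, file_path, overwrite=False):
    """Pickle obj to file_path, creating parent directories as needed."""
    path = Path(file_path)
    if path.exists() and not overwrite:
        # Assumption: the real method may signal this case differently.
        raise FileExistsError(f"{path} exists; pass overwrite=True to replace it.")
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as handle:
        pickle.dump(obj, handle)


def load_object(file_path):
    """Load a pickled object from file_path."""
    with Path(file_path).open("rb") as handle:
        return pickle.load(handle)


# Round trip with a plain dictionary standing in for a collection.
with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / "nested" / "collection.pkl"
    save_object({"a.pdf": 3}, target)  # "nested" is created on the fly
    restored = load_object(target)
    try:
        save_object({"b.pdf": 1}, target)  # second save without overwrite
        refused = False
    except FileExistsError:
        refused = True
```

Because pickle can execute arbitrary code on load, a serialized collection should only be loaded from a trusted source.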

PDFFile

Class representing an individual pdf file.

Attributes:

Name Type Description
path_to_pdf

Path to pdf file.

file_name

File name.

n_pages

Number of pages in the pdf file.

parsed_pages

List of Page objects providing an interface to each page.

loaded_successfully

Boolean indicating if pdf file could be processed.

open_fitz_document

open_fitz_document()

Opens the pdf file as a fitz object.

parse_pages

parse_pages()

Iterates over pages and instantiates Page objects.

Raises:

Type Description
AssertionError

If the file does not have a .pdf extension.

RuntimeError

If pymupdf fails to open the file.

Page

Class representing individual pages from a pdf file.

This class allows accessing the contents of a page using the pymupdf package. It also contains additional attributes that describe the page content. All attribute values are obtained through the pymupdf interface and therefore depend on its accuracy.

Attributes:

Name Type Description
pt_height

Height of the page in points (1 point = 1/72 inch).

pt_width

Width of the page in points (1 point = 1/72 inch).

number

Page number within the pdf file.

content_type

Type of information contained in the page. One of:

  • text: Pure text.
  • images: One or more images.
  • mixed: A combination of text and images.
  • recovered_image: If pymupdf does not detect any content, the page is rendered as an image.
n_tokens

Number of tokens identified by pymupdf. Roughly corresponds to words.

n_images

Number of images identified by pymupdf.

pcnt_chars_corrupted

Proportion of characters that could not be correctly decoded.

path_to_pdf

Path to pdf file containing the page.
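The relation between `n_tokens`, `n_images`, and `content_type` can be sketched as a small decision rule. This is an assumption drawn from the descriptions above, not the library's actual logic; `classify_content` is a hypothetical helper.

```python
def classify_content(n_tokens, n_images):
    """Map token and image counts to one of the four content types."""
    if n_tokens > 0 and n_images > 0:
        return "mixed"
    if n_tokens > 0:
        return "text"
    if n_images > 0:
        return "images"
    # Nothing detected: the page would be rendered as an image instead.
    return "recovered_image"


labels = [
    classify_content(120, 0),
    classify_content(0, 2),
    classify_content(120, 2),
    classify_content(0, 0),
]
```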

read_contents

read_contents(types, force_rgb=None, dpi=None)

Read page contents via the pymupdf interface.

This opens the pdf file with pymupdf and extracts the requested content. Multiple content types can be provided simultaneously.

Parameters:

Name Type Description Default
types

String or list of strings indicating which contents should be returned. Possible content types are:

  • tokens: Tokens with bounding boxes. A token roughly corresponds to a word.
  • text: String containing all tokens in the reading order recovered by pymupdf.
  • images: Images with bounding boxes.
  • page_image: Entire page as one image.
  • raw_dict: Raw output from pymupdf.
required
force_rgb

Only if 'page_image' in types: Converts to RGB and adds white background.

None
dpi

Only if 'page_image' in types: Resolution to use when converting pdf to image.

None

Returns:

Type Description

A dictionary with the requested contents. If a single type was requested, it is returned directly.

Examples:

>>> from doc2data.pdf import PDFFile
>>> pdf_file = PDFFile('path_to_pdf')
>>> pdf_file.parse_pages()
>>> page = pdf_file.parsed_pages[0]
>>> page.read_contents('page_image')
>>> page.read_contents(['tokens', 'images'])

Raises:

Type Description
ValueError

If one or more requested types are not recognized.
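The calling convention of `read_contents` (a string or a list of strings, a direct return for a single type, a `ValueError` for unknown types) can be sketched without pymupdf. Here `select_contents` and the `extractors` mapping are hypothetical stand-ins, and treating a one-element list the same as a single string is an assumption.

```python
KNOWN_TYPES = ("tokens", "text", "images", "page_image", "raw_dict")


def select_contents(types, extractors):
    """Return the requested contents, unwrapped if only one type is asked for."""
    requested = [types] if isinstance(types, str) else list(types)
    unknown = [name for name in requested if name not in KNOWN_TYPES]
    if unknown:
        raise ValueError(f"Unrecognized content types: {unknown}")
    contents = {name: extractors[name]() for name in requested}
    if len(requested) == 1:
        return contents[requested[0]]
    return contents


# Placeholder extractors; a real page would call into pymupdf here.
extractors = {name: (lambda name=name: f"<{name}>") for name in KNOWN_TYPES}

single = select_contents("text", extractors)
several = select_contents(["tokens", "images"], extractors)
try:
    select_contents(["text", "tables"], extractors)
    rejected = False
except ValueError:
    rejected = True
```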

show_page_image

show_page_image(dpi=None, show_bboxes=False, bbox_color='red')

Wrapper for Page.read_contents to show page as image.

Parameters:

Name Type Description Default
dpi

Resolution to use when converting pdf to image.

None
show_bboxes

Boolean indicating whether to draw bounding boxes around the tokens and images.

False
bbox_color

Color of the bounding boxes.

'red'

Returns:

Type Description

A PIL image of the document page.
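Since page dimensions are stored in points (1 point = 1/72 inch), the pixel size of a rendered page follows from the chosen `dpi`. The arithmetic can be sketched as follows; `points_to_pixels` is a hypothetical helper, and the exact rounding applied by pymupdf may differ.

```python
def points_to_pixels(points, dpi):
    """Convert a length in points to pixels at the given resolution."""
    return round(points * dpi / 72)


# An A4 page measures roughly 595 by 842 points.
a4_width_pt, a4_height_pt = 595, 842
size_at_72 = (points_to_pixels(a4_width_pt, 72), points_to_pixels(a4_height_pt, 72))
size_at_150 = (points_to_pixels(a4_width_pt, 150), points_to_pixels(a4_height_pt, 150))
```

At 72 dpi one point maps to one pixel, so the image has the same numeric size as the page.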

doc2data.utils

Utilities.

convert_to_rgb

convert_to_rgb(image, add_white_background=True)

Converts a PIL Image from RGBA to RGB.

A white background is added by default to prevent transparent pixels from being rendered black.
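Why the white background matters can be shown on a single pixel. Fully transparent pixels often store black in their color channels, so simply dropping the alpha channel turns them black, whereas blending over white does not. The sketch below uses plain tuples instead of PIL images; `flatten_pixel` is a hypothetical helper, not part of the library.

```python
def flatten_pixel(rgba, background=(255, 255, 255)):
    """Blend one RGBA pixel over an opaque background color."""
    red, green, blue, alpha = rgba
    return tuple(
        (channel * alpha + back * (255 - alpha)) // 255
        for channel, back in zip((red, green, blue), background)
    )


transparent = (0, 0, 0, 0)
dropped_alpha = transparent[:3]          # naive conversion: black
flattened = flatten_pixel(transparent)   # blended over white
half_red = flatten_pixel((255, 0, 0, 128))
```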

denormalize_bbox

denormalize_bbox(bounding_box, width, height)

Calculates absolute coordinates of a bounding box.

get_pcnt_chars_corrupted

get_pcnt_chars_corrupted(page)

Calculates proportion of replacement characters on page.
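The idea behind this measure can be sketched on a plain string. The sketch assumes that "replacement characters" refers to the Unicode replacement character U+FFFD; `proportion_corrupted` is a hypothetical helper that takes text rather than a page object.

```python
REPLACEMENT_CHAR = "\ufffd"  # assumption: the marker for undecodable characters


def proportion_corrupted(text):
    """Share of characters in text that are replacement characters."""
    if not text:
        return 0.0
    return text.count(REPLACEMENT_CHAR) / len(text)


clean = proportion_corrupted("invoice")
damaged = proportion_corrupted("ab\ufffd\ufffd")
empty = proportion_corrupted("")
```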

load_image

load_image(file_path, target_size=None, to_array=True, force_rgb=True)

Loads image from file.

normalize_bbox

normalize_bbox(bounding_box, width, height)

Calculates relative coordinates of a bounding box.
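The relation between relative and absolute coordinates can be sketched as a round trip. The box layout `(x0, y0, x1, y1)` and the helper names `to_relative` and `to_absolute` are assumptions made for illustration; the library's own functions may use a different layout.

```python
def to_relative(bounding_box, width, height):
    """Scale absolute coordinates into the range 0 to 1."""
    x0, y0, x1, y1 = bounding_box
    return (x0 / width, y0 / height, x1 / width, y1 / height)


def to_absolute(bounding_box, width, height):
    """Scale relative coordinates back to the given width and height."""
    x0, y0, x1, y1 = bounding_box
    return (x0 * width, y0 * height, x1 * width, y1 * height)


box = (150, 200, 300, 400)
relative = to_relative(box, width=600, height=800)
restored = to_absolute(relative, width=600, height=800)
```

Relative coordinates stay valid when a page is rendered at a different resolution, which is one reason to store boxes in that form.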