API Reference
doc2data.pdf
Reading PDF files and creating collections.
PDFCollection
Creates a collection from pdf files that are stored in a directory.
This class serves three purposes. First, it parses pdf files from a directory to create a collection. Second, it provides overview information about the collection at page level. Third, it gives access to the contents of individual pdfs through a dictionary of PDFFile objects.
The PDFCollection class does not store any files itself. Instead, it only keeps track of the files present in the target folder via a dictionary. Therefore, once a collection is created, files should not be removed from the source folder.
Examples:
>>> from doc2data.pdf import PDFCollection
>>> pdf_collection = PDFCollection('path_to_files') # create collection
>>> pdf_collection.parse_files() # populate collection
>>> print(pdf_collection.overview) # inspect collection
Attributes:
Name | Type | Description |
---|---|---|
path_to_files | | Path to source directory containing the pdf files. |
pdfs | | A dictionary containing a PDFFile object for each file. |
ignored_files | | A list of file names that could not be processed. |
overview | | A pandas DataFrame containing summary information on page level. |
count_source_files
count_source_files()
Counts files in the source directory.
load
staticmethod
load(file_path)
Loads serialized collection.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
file_path | | File path from which to load the collection. | required |
Returns:
Type | Description |
---|---|
A PDFCollection object. |
parse_files
parse_files(use_multiprocessing=True)
Populates the collection with PDFFile objects based on source directory.
Additionally, information on the content is extracted and recorded on page level.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
use_multiprocessing | | Use multiprocessing to speed up population using all available cores. | True |
parse_single_file
parse_single_file(file_name)
Instantiates PDFFile object and parses pages.
save
save(file_path, overwrite=False)
Serializes collection to a file with pickle.
Path directories are created if they do not exist.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
file_path | | File path where to save the collection. | required |
overwrite | | Set to True if an existing collection should be overwritten. | False |
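The save and load behaviour described above can be sketched with the standard library alone. This is a minimal illustration, not the doc2data implementation: `save_collection` and `load_collection` are hypothetical helper names, and raising `FileExistsError` when `overwrite` is False is an assumption, since the reference does not say which error is raised.

```python
import pickle
from pathlib import Path


def save_collection(obj, file_path, overwrite=False):
    """Pickle obj to file_path, creating parent directories if needed."""
    path = Path(file_path)
    # Assumption: refuse to replace an existing file unless overwrite=True.
    if path.exists() and not overwrite:
        raise FileExistsError(f"{path} exists; pass overwrite=True to replace it.")
    # Create missing directories along the path, as the description states.
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as handle:
        pickle.dump(obj, handle)


def load_collection(file_path):
    """Load a previously pickled object from file_path."""
    with Path(file_path).open("rb") as handle:
        return pickle.load(handle)
```

With doc2data itself, the corresponding calls are `pdf_collection.save(file_path)` and `PDFCollection.load(file_path)`. Because the format is pickle, only load files from sources you trust.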
PDFFile
Class representing an individual pdf file.
Attributes:
Name | Type | Description |
---|---|---|
path_to_pdf | | Path to pdf file. |
file_name | | File name. |
n_pages | | Number of pages in the pdf file. |
parsed_pages | | List of Page objects providing an interface to each page. |
loaded_successfully | | Boolean indicating if pdf file could be processed. |
open_fitz_document
open_fitz_document()
Opens the pdf file as fitz object.
parse_pages
parse_pages()
Iterates over pages and instantiates Page objects.
Raises:
Type | Description |
---|---|
AssertionError | If the file does not have a .pdf extension. |
RuntimeError | If pymupdf fails to open the file. |
Page
Class representing individual pages from a pdf file.
This class allows accessing the contents of a page using the pymupdf package. It also contains additional attributes that describe the page content. All attribute values are obtained through the pymupdf interface and therefore depend on its accuracy.
Attributes:
Name | Type | Description |
---|---|---|
pt_height | | Height of the page in points (1 point = 1/72 inch). |
pt_width | | Width of the page in points (1 point = 1/72 inch). |
number | | Page number within the pdf file. |
content_type | | Type of information contained in the page. One of: |
n_tokens | | Number of tokens identified by pymupdf. Roughly corresponds to words. |
n_images | | Number of images identified by pymupdf. |
pcnt_chars_corrupted | | Proportion of characters that could not be correctly decoded. |
path_to_pdf | | Path to pdf file containing the page. |
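Since `pt_height` and `pt_width` are given in points, converting them to inches or to pixels at a chosen resolution is a one-line calculation. The sketch below is illustrative only: `points_to_inches` and `points_to_pixels` are hypothetical helpers, not part of doc2data, and rounding to whole pixels is an assumption about how a renderer might behave.

```python
POINTS_PER_INCH = 72  # 1 point = 1/72 inch, as stated in the attribute table


def points_to_inches(points):
    """Convert a length in points to inches."""
    return points / POINTS_PER_INCH


def points_to_pixels(points, dpi):
    """Convert a length in points to pixels at the given resolution."""
    # Assumption: round to the nearest whole pixel.
    return round(points * dpi / POINTS_PER_INCH)


# US Letter is 612 x 792 points, i.e. 8.5 x 11 inches.
letter_width_in = points_to_inches(612)
letter_height_px = points_to_pixels(792, dpi=150)
```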
read_contents
read_contents(types, force_rgb=None, dpi=None)
Read page contents via the pymupdf interface.
This opens the pdf file with pymupdf and extracts the requested content. Multiple content types can be provided simultaneously.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
types | | String or list of strings indicating which contents should be returned. Possible content types are: | required |
force_rgb | | Only if 'page_image' in types: Converts to RGB and adds white background. | None |
dpi | | Only if 'page_image' in types: Resolution to use when converting pdf to image. | None |
Returns:
Type | Description |
---|---|
A dictionary with the requested contents. If a single type was requested, it is returned directly. |
Examples:
>>> from doc2data.pdf import PDFFile
>>> pdf_file = PDFFile('path_to_pdf')
>>> pdf_file.parse_pages()
>>> page = pdf_file.parsed_pages[0]
>>> page.read_contents('page_image')
>>> page.read_contents(['tokens', 'images'])
Raises:
Type | Description |
---|---|
ValueError | If one or more requested types are not recognized. |
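The return convention of `read_contents` (a dictionary for several types, the bare value for one, `ValueError` for unknown types) can be illustrated without pymupdf. The function below is a hypothetical stand-in, not the library code: `select_contents` and the `readers` mapping of type names to zero-argument callables are invented for this sketch, and unwrapping a one-element list the same way as a plain string is an assumption.

```python
def select_contents(types, readers):
    """Return the requested contents, unwrapped if only one type is requested."""
    # Accept either a single string or a list of strings.
    requested = [types] if isinstance(types, str) else list(types)

    # Reject any type that has no registered reader.
    unknown = [t for t in requested if t not in readers]
    if unknown:
        raise ValueError(f"Unrecognized content types: {unknown}")

    contents = {t: readers[t]() for t in requested}

    # Assumption: a single requested type is returned directly, not as a dict.
    if len(requested) == 1:
        return contents[requested[0]]
    return contents


# Hypothetical readers standing in for the pymupdf-backed extraction.
readers = {"tokens": lambda: ["Hello", "world"], "images": lambda: []}
tokens_only = select_contents("tokens", readers)
both = select_contents(["tokens", "images"], readers)
```

Callers that always want a dictionary can pass a list and wrap the single-type case themselves, which avoids branching on the return type.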
show_page_image
show_page_image(dpi=None, show_bboxes=False, bbox_color='red')
Wrapper for Page.read_contents to show page as image.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dpi | | Resolution to use when converting pdf to image. | None |
show_bboxes | | Boolean indicating whether to draw bounding boxes around the tokens and images. | False |
bbox_color | | Color of the bounding boxes. | 'red' |
Returns:
Type | Description |
---|---|
A PIL image of the document page. |
doc2data.utils
Utilities.
convert_to_rgb
convert_to_rgb(image, add_white_background=True)
Converts a PIL Image from RGBA to RGB.
A white background is added by default to prevent transparent pixels from being rendered black.
denormalize_bbox
denormalize_bbox(bounding_box, width, height)
Calculates absolute coordinates of a bounding box.
get_pcnt_chars_corrupted
get_pcnt_chars_corrupted(page)
Calculates proportion of replacement characters on page.
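As a rough illustration of the calculation, the sketch below computes the share of replacement characters in a plain string. It rests on two assumptions not confirmed by this reference: that "replacement characters" means the Unicode replacement character U+FFFD, and that an empty page counts as uncorrupted. The helper `proportion_replacement_chars` is hypothetical and takes text, whereas the library function takes a page.

```python
REPLACEMENT_CHAR = "\ufffd"  # Unicode replacement character (assumed marker)


def proportion_replacement_chars(text):
    """Return the fraction of characters in text that are U+FFFD."""
    if not text:
        # Assumption: treat empty text as uncorrupted instead of dividing by zero.
        return 0.0
    return text.count(REPLACEMENT_CHAR) / len(text)


sample_share = proportion_replacement_chars("ab\ufffd\ufffd")
```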
load_image
load_image(file_path, target_size=None, to_array=True, force_rgb=True)
Loads image from file.
normalize_bbox
normalize_bbox(bounding_box, width, height)
Calculates relative coordinates of a bounding box.
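The relationship between `normalize_bbox` and `denormalize_bbox` can be sketched as a pair of inverse scalings. The bounding box layout `(x0, y0, x1, y1)` is an assumption, as the reference does not specify it, and `to_relative` and `to_absolute` are hypothetical names used here to avoid implying this is the library code.

```python
def to_relative(bounding_box, width, height):
    """Scale absolute coordinates into the 0..1 range of the page."""
    x0, y0, x1, y1 = bounding_box  # assumed layout: left, top, right, bottom
    return (x0 / width, y0 / height, x1 / width, y1 / height)


def to_absolute(bounding_box, width, height):
    """Scale relative coordinates back to absolute page units."""
    x0, y0, x1, y1 = bounding_box
    return (x0 * width, y0 * height, x1 * width, y1 * height)


relative = to_relative((50, 100, 150, 200), width=200, height=400)
absolute = to_absolute(relative, width=200, height=400)
```

Relative coordinates stay valid when a page is rendered at a different resolution, which is the usual reason to store boxes in normalized form.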