How-to Guides

doc2data is currently structured along two top-level modules and the subpackage experimental .

The top-level modules contain functionality to create PDF collections and to access contents of individual files. The modules are:

Module Description
pdf Reading PDF files and creating collections
utilities Utilities

The subpackage doc2data.experimental contains modules for feature creation and model training:

Module Description
preprocessing Feature extractors for images, tokens and embeddings
ocr Wrapper for the dotTR OCR package
base_processors Data pipelines for model training with TensorFlow & PyTorch
task_processes Task-specific processors
trainers Generic model training
utils Additional utilities

Colab Notebooks

The following notebooks showcase typical applications of the above modules. The starting point is a number of PDF files with annotations. The output is a neural network for a specific document processing task.

Colab Notebook Description
Open In Colab Document classification Train a custom Keras model to classify pages as images on a subset of the RVL-CDIP dataset.
Open In Colab Token classification Fine-tune the LayoutXLM model in PyTorch to classify each token on a document page.