How-to Guides
doc2data is currently organized into two top-level modules and the subpackage experimental.
The top-level modules contain functionality to create PDF collections and to access contents of individual files. The modules are:
Module | Description |
---|---|
Reading PDF files and creating collections | |
utilities | Utilities |
The subpackage doc2data.experimental contains modules for feature creation and model training:
Module | Description |
---|---|
preprocessing | Feature extractors for images, tokens and embeddings |
ocr | Wrapper for the docTR OCR package |
base_processors | Data pipelines for model training with TensorFlow & PyTorch |
task_processes | Task-specific processors |
trainers | Generic model training |
utils | Additional utilities |
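
Because the experimental modules wrap third-party packages (docTR, TensorFlow, PyTorch), it can help to check which of them are importable in a given environment before starting. The following is a minimal sketch using only the Python standard library. The dotted module paths are an assumption: they are formed by joining doc2data.experimental with the module names from the table above, and the helper name `importable` is hypothetical, not part of doc2data.

```python
import importlib.util


def importable(module_names):
    """Return a dict mapping each dotted module name to True/False.

    Note: for a dotted name, find_spec imports the parent packages,
    so this can be slow if those packages pull in heavy dependencies.
    """
    status = {}
    for name in module_names:
        try:
            status[name] = importlib.util.find_spec(name) is not None
        except ImportError:
            # Raised when a parent package is missing or fails to import
            # (ModuleNotFoundError is a subclass of ImportError).
            status[name] = False
    return status


# Assumed paths: "doc2data.experimental." + module names from the table.
EXPERIMENTAL_MODULES = [
    "doc2data.experimental.preprocessing",
    "doc2data.experimental.ocr",
    "doc2data.experimental.base_processors",
    "doc2data.experimental.task_processes",
    "doc2data.experimental.trainers",
    "doc2data.experimental.utils",
]

if __name__ == "__main__":
    for name, ok in importable(EXPERIMENTAL_MODULES).items():
        print(f"{name}: {'available' if ok else 'not importable'}")
```

A module reported as not importable usually points to a missing optional dependency rather than a problem with doc2data itself.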
Colab Notebooks
The following notebooks showcase typical applications of the above modules. Each one starts from a set of annotated PDF files and ends with a trained neural network for a specific document processing task.