Introduction to Padhana
The Padhana framework is designed to enable you to work with PDF documents (and other types) in a formal way. By combining a simple document format (based on a node hierarchy) and then a set of parsers and document analysis tools we parse and then structure/annotate document content to enable rich interactions.
Source code: https://github.com/hohonu/padhana
For local development:
Install Anaconda 3 or greater, then run:
conda env create -f conda.yml --force
Activate the padhana Conda environment with the command:
conda activate padhana
If you want to use the Tesseract Parser then you will need to install Tesseract. See https://github.com/tesseract-ocr/tesseract/wiki for more information.
Documents are used as the formal representations for source files or binary objects, and should contain features (such as text, layout, visual elements) that can then be used in Padhana to try and enable rich interaction with the document for re-structuring it, labelling, training, etc.
Connectors take a source and provide a list of file locations (implementing the iterator pattern).
Stores allow you to store your parsed documents in a format that allows for quick and easy access.