Inspired by, e.g.:

- Tika: [Resources for Advanced Document Processing](https://cwiki.apache.org/confluence/display/TIKA/Resources+for+Advanced+Document+Processing)
- [Docling](https://pypi.org/project/docling/)
- [GROBID](https://grobid.readthedocs.io/)
- [Working with batches of PDF files](https://programminghistorian.org/en/lessons/working-with-batches-of-pdf-files)
- [Classifying all of the pdfs on the internet](https://snats.xyz/pages/articles/classifying_a_bunch_of_pdfs.html) (for a given value of 'all' ;-) )