Transforms complex documents like PDFs into LLM-ready Markdown/JSON for your agentic workflows.
Updated Oct 25, 2025 · Python
The official repo for “Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting”, ACL, 2025.
A system for agentic LLM-powered data processing and ETL
Read and extract text and other content from PDFs in C# (port of PDFBox)
A collection of original, innovative ideas and algorithms towards Advanced Literate Machinery. This project is maintained by the OCR Team in the Language Technology Lab, Tongyi Lab, Alibaba Group.
An on-premises, OCR-free unstructured data extraction, markdown conversion and benchmarking toolkit. (https://idp-leaderboard.org/)
A curated list of resources on the Document Understanding (DU) topic
Open-source platform for extracting structured data from documents using AI.
This repository provides training and test code, a dataset, detection and recognition annotations, an evaluation script, an annotation tool, and a ranking.
Dedoc is a library (service) for automated document parsing and conversion to a uniform format. It automatically extracts content, logical structure, tables, and meta information from textual electronic documents. (Parse document; Document content extraction; Logical structure extraction; PDF parser; Scanned document parser; DOCX parser; HTML parser
Code for the paper "PICK: Processing Key Information Extraction from Documents using Improved Graph Learning-Convolutional Networks" (ICPR 2020)
AssemblyLine 4: File triage and malware analysis
Official PyTorch implementation of LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding (ACL 2022)
Pandora is an analysis framework to discover if a file is suspicious and conveniently show the results
A package for parsing PDFs and analyzing their content using LLMs.
RObust document image BINarization
YOLO models trained on DocLayNet - power your document intelligence with layout analysis
Powerful web application that combines Streamlit, LangChain, and Pinecone to simplify document analysis. Powered by OpenAI's GPT-3, RAG enables dynamic, interactive document conversations, making it ideal for efficient document retrieval and summarization.
Local adaptive image binarization
Document Visual Question Answering