A modern, PyQt6-powered desktop app for extracting, labeling, and exporting structured data from documents (PDFs, images) using AI models such as T5 and LayoutLM. Includes a no-code trainer and dataset builder for fine-tuning your own models.
- Document Extraction:
- Extracts key fields (invoice number, date, total, etc.) from PDFs and images using LayoutLM and T5 models.
- Supports both pre-trained models and your own fine-tuned models (auto-discovered from `models/`).
- Beautiful UI:
- Modern dark theme, responsive layout, and real-time logs.
- Export:
- Save results as JSON, Excel, or SQLite with one click.
- Trainer GUI:
- Fine-tune T5 models on your own data (no code needed).
- Build datasets from PDFs/images with a simple labeling dialog.
- Post-processing:
- Smart field correction and validation for cleaner results.
- Clone the repo:
git clone https://github.com/foudhilriahi/AIDocumentExtractorPro
cd AIDocumentExtractorPro
- Create a virtual environment:
python -m venv myenv
myenv\Scripts\activate       # On Windows
# source myenv/bin/activate  # On Linux/Mac
- Install dependencies:
pip install -r requirements.txt
- Run the app:
python main.py
- (Optional) Train your own model:
python trainer.py
- Extraction:
- Select a document, choose a model, and click "Start Extraction".
- See logs and results in the UI.
- Trainer:
- Build a dataset from your own PDFs/images (label each sample in the dialog).
- Train a new model and it will appear in the main app automatically.
main.py # Main app UI
trainer.py # Trainer and dataset builder
ai_extractor.py # Extraction logic
export_manager.py # Export logic
requirements.txt # Dependencies
models/ # Your models (auto-discovered)
logs/ # Logs
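The layout above says models placed in `models/` are auto-discovered. The repo's actual discovery logic is not shown here, so the following is only a minimal sketch of how such a scan could work. The function name `discover_models` is hypothetical, and the rule "a model folder is any subfolder containing a `config.json`" is an assumption based on how Hugging Face `save_pretrained` lays out a saved model.

```python
from pathlib import Path


def discover_models(models_dir="models"):
    """Return sorted names of subfolders that look like saved models.

    Hypothetical helper, not taken from the repo.
    """
    root = Path(models_dir)
    if not root.is_dir():
        # No models folder yet: nothing to offer in the UI.
        return []
    return sorted(
        child.name
        for child in root.iterdir()
        # Assumption: a usable model folder contains a config.json file.
        if child.is_dir() and (child / "config.json").is_file()
    )
```

A model picker could call `discover_models()` at startup and again after training finishes, so newly trained models show up without a restart.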
- Each dataset is a folder containing `train.json` and `val.json` (or just a single JSON file for small sets).
- Each JSON file is a list of samples like:
{
  "input": "text or OCR text here",
  "output": "structured JSON string here"
}
You can adapt this to your real dataset as needed.
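To make the dataset format concrete, here is a small sketch that loads one dataset file and checks that every sample has string `input` and `output` keys. The helper name `load_samples` is hypothetical and not part of the repo; only the JSON shape comes from the description above.

```python
import json
from pathlib import Path


def load_samples(json_path):
    """Load a dataset file and validate the 'input'/'output' sample shape.

    Hypothetical helper, not taken from the repo.
    """
    samples = json.loads(Path(json_path).read_text(encoding="utf-8"))
    if not isinstance(samples, list):
        raise ValueError(f"{json_path}: expected a JSON list of samples")
    for i, sample in enumerate(samples):
        for key in ("input", "output"):
            # Each sample must be an object with both keys holding strings.
            if not isinstance(sample, dict) or not isinstance(sample.get(key), str):
                raise ValueError(f"{json_path}: sample {i} needs a string '{key}'")
    return samples
```

Since `output` is a JSON string rather than a nested object, one way to build a sample is `{"input": ocr_text, "output": json.dumps(fields)}`, where `ocr_text` and `fields` are your own values.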
- INVOICE_RECEIPT_ID
- INVOICE_RECEIPT_DATE
- DUE_DATE
- TOTAL
- TAX
- SUBTOTAL
- VENDOR_NAME
- VENDOR_ADDRESS
- SHIP_TO_NAME
- SHIP_TO_ADDRESS
- BILL_TO_NAME
- BILL_TO_ADDRESS
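The field names above can drive a simple post-processing pass. The sketch below is an illustration, not the app's actual correction logic: it parses a model's JSON string, keeps only the listed fields, and strips currency symbols from amounts. The function name `clean_fields` is hypothetical, and treating commas as thousands separators is an assumption that would not suit locales using a decimal comma.

```python
import json
import re

FIELDS = [
    "INVOICE_RECEIPT_ID", "INVOICE_RECEIPT_DATE", "DUE_DATE", "TOTAL", "TAX",
    "SUBTOTAL", "VENDOR_NAME", "VENDOR_ADDRESS", "SHIP_TO_NAME",
    "SHIP_TO_ADDRESS", "BILL_TO_NAME", "BILL_TO_ADDRESS",
]
AMOUNT_FIELDS = {"TOTAL", "TAX", "SUBTOTAL"}


def clean_fields(raw_output):
    """Parse a model's JSON string, keep known fields, normalize amounts.

    Hypothetical helper, not taken from the repo.
    """
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return {}
    if not isinstance(data, dict):
        return {}
    cleaned = {}
    for name in FIELDS:
        value = data.get(name)
        if value is None:
            continue
        text = str(value).strip()
        if name in AMOUNT_FIELDS:
            # Assumption: commas are thousands separators; keep digits, dot, minus.
            text = re.sub(r"[^\d.\-]", "", text)
        if text:
            cleaned[name] = text
    return cleaned
```

Unknown keys are dropped and malformed output yields an empty dict, so downstream export code only ever sees the supported fields.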
- Display clean AI extraction results from flan-t5-base, flan-t5-small, and layoutlm-invoices
- Save to .json, .xlsx, .db
- Show key:value results in a rich readable text viewer
- Handle file uploads and GPU/CPU logs smartly
- Modern UI
- No ugly logs, clean messages only
- Future-ready for training, doc type detection, etc.
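As an illustration of saving key:value results to `.json` and `.db`, here is a stdlib-only sketch. It is not the code in `export_manager.py`: the function names and the two-column `extractions` table are assumptions, and the `.xlsx` export is left out because it needs a third-party library.

```python
import json
import sqlite3


def export_json(fields, path):
    """Write the extracted fields to a JSON file."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(fields, fh, indent=2, ensure_ascii=False)


def export_sqlite(fields, path):
    """Append the extracted fields to a SQLite file, one row per field."""
    conn = sqlite3.connect(path)
    try:
        with conn:  # commits on success, rolls back on error
            # Hypothetical schema: the real app may store results differently.
            conn.execute(
                "CREATE TABLE IF NOT EXISTS extractions (field TEXT, value TEXT)"
            )
            conn.executemany(
                "INSERT INTO extractions (field, value) VALUES (?, ?)",
                list(fields.items()),
            )
    finally:
        conn.close()
```

Both functions take a plain dict such as `{"TOTAL": "42.00"}`, so the same result object can feed either exporter.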
- Provides a real TrainingThread class
- Trains LayoutLMv3 or FLAN-style models using Hugging Face’s transformers
- Automatically processes FUNSD/SROIE-style annotated datasets
- Emits progress and final result signals
- Future-ready for GUI integration (e.g. phase 3 training window in app_gui.py)
To train:
- Select your dataset folder (with FUNSD/SROIE-style JSON + image, or your own format)
- Choose a model: LayoutLMv3 or a custom T5 (FLAN training is planned for later)
- Training runs in a background thread and updates the GUI
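The background-training pattern described above can be sketched without a GUI. This example swaps in `threading.Thread` and plain callbacks as stand-ins for PyQt6's `QThread` and signals so that it runs anywhere. It is not the repo's `TrainingThread`, and `train_one_epoch` is a placeholder callable, not real model training.

```python
import threading


class TrainingWorker(threading.Thread):
    """Runs a training loop off the main thread and reports via callbacks.

    Hypothetical stand-in for a QThread with progress/finished signals.
    """

    def __init__(self, epochs, train_one_epoch, on_progress, on_finished):
        super().__init__(daemon=True)
        self.epochs = epochs
        self.train_one_epoch = train_one_epoch  # placeholder: epoch -> loss
        self.on_progress = on_progress          # called with percent done
        self.on_finished = on_finished          # called with the final loss

    def run(self):
        loss = None
        for epoch in range(1, self.epochs + 1):
            loss = self.train_one_epoch(epoch)
            self.on_progress(int(100 * epoch / self.epochs))
        self.on_finished(loss)
```

Create the worker with your own callables, then call `start()`; the main thread stays free while epochs run. In the PyQt6 app, signals would take the place of the callbacks, because widgets should only be updated from the GUI thread.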
- Python 3.9+
- PyQt6
- transformers, datasets, evaluate, torch, easyocr, pytesseract, pdf2image, pillow
Pull requests and issues are welcome! Please open an issue for bugs or feature requests.
MIT
- Hugging Face Transformers
- PyQt6
- LayoutLM, T5, and all open-source model authors
Made with ❤️ by [FOUDHIL_RIAHI]