AI Document Extractor Pro

A modern, PyQt6-powered desktop app for extracting, labeling, and exporting structured data from documents (PDFs and images) using AI models such as T5 and LayoutLM. It includes a no-code trainer and dataset builder for fine-tuning your own models.

✨ Features

Document Extraction:
- Extracts key fields (invoice number, date, total, etc.) from PDFs and images using LayoutLM and T5 models.
- Supports both pre-trained and your own fine-tuned models (auto-discovered from models/).
Beautiful UI:
- Modern dark theme, responsive layout, and real-time logs.
Export:
- Save results as JSON, Excel, or SQLite with one click.
Trainer GUI:
- Fine-tune T5 models on your own data (no code needed).
- Build datasets from PDFs/images with a simple labeling dialog.
Post-processing:
- Smart field correction and validation for cleaner results.
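
To make the post-processing idea concrete, here is a minimal sketch of field correction and validation. The helper names (`normalize_amount`, `normalize_date`, `clean_fields`), the accepted date formats, and the consistency check are assumptions for illustration only; they are not taken from the app's code.

```python
import re
from datetime import datetime

# Hypothetical helpers for illustration; not taken from ai_extractor.py.
AMOUNT_FIELDS = ("SUBTOTAL", "TAX", "TOTAL")
DATE_FIELDS = ("INVOICE_RECEIPT_DATE", "DUE_DATE")
# Assumed input formats; day-first is tried before month-first.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%m/%d/%Y", "%d-%m-%Y")


def normalize_amount(raw):
    """Return the amount as a float, or None if it cannot be parsed."""
    cleaned = re.sub(r"[^\d.,-]", "", raw)  # drop currency symbols and spaces
    if re.search(r",\d{2}$", cleaned):
        # A trailing ",dd" is treated as a decimal comma (e.g. "1.234,50").
        cleaned = cleaned.replace(".", "").replace(",", ".")
    else:
        # Otherwise commas are treated as thousands separators.
        cleaned = cleaned.replace(",", "")
    try:
        return float(cleaned)
    except ValueError:
        return None


def normalize_date(raw):
    """Return the date in ISO format (YYYY-MM-DD), or None if no format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def clean_fields(fields):
    """Normalize known fields and collect human-readable problems."""
    cleaned, problems = dict(fields), []
    for name in AMOUNT_FIELDS:
        if name in fields:
            cleaned[name] = normalize_amount(fields[name])
            if cleaned[name] is None:
                problems.append(f"{name}: could not parse amount")
    for name in DATE_FIELDS:
        if name in fields:
            cleaned[name] = normalize_date(fields[name])
            if cleaned[name] is None:
                problems.append(f"{name}: could not parse date")
    subtotal, tax, total = (cleaned.get(n) for n in AMOUNT_FIELDS)
    if all(isinstance(v, float) for v in (subtotal, tax, total)):
        if abs(subtotal + tax - total) > 0.01:
            problems.append("TOTAL does not equal SUBTOTAL + TAX")
    return cleaned, problems
```

Returning the problems as a list, rather than raising, lets a UI show the cleaned values and the warnings side by side.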

🚀 Quick Start

Clone the repo:

git clone https://github.com/foudhilriahi/AIDocumentExtractorPro
cd AIDocumentExtractorPro

Create a virtual environment:

python -m venv myenv
myenv\Scripts\activate  # On Windows
# source myenv/bin/activate  # On Linux/Mac

Install dependencies:
```
pip install -r requirements.txt
```
Run the app:
```
python main.py
```
(Optional) Train your own model:
```
python trainer.py
```

🧠 How It Works

Extraction:
- Select a document, choose a model, and click "Start Extraction".
- See logs and results in the UI.
Trainer:
- Build a dataset from your own PDFs/images (label each sample in the dialog).
- Train a new model and it will appear in the main app automatically.
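
One way the automatic discovery of trained models could work is sketched below. It assumes each model lives in its own subfolder of models/ and contains a config.json file, which is what Hugging Face's `save_pretrained` writes. The function name `discover_models` is hypothetical and the app's real logic may differ.

```python
from pathlib import Path


def discover_models(models_dir="models"):
    """Return the names of subfolders that look like saved Hugging Face models."""
    root = Path(models_dir)
    if not root.is_dir():
        return []  # no models folder yet, so nothing to list
    return sorted(
        child.name
        for child in root.iterdir()
        # Assumption: a folder counts as a model if it has a config.json.
        if child.is_dir() and (child / "config.json").is_file()
    )
```

Calling a function like this each time the model selector is refreshed would be enough for a newly trained model to show up without restarting the app.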

📁 Project Structure

main.py                # Main app UI
trainer.py             # Trainer and dataset builder
ai_extractor.py        # Extraction logic
export_manager.py      # Export logic
requirements.txt       # Dependencies
models/                # Your models (auto-discovered)
logs/                  # Logs

📦 Dataset Format

Each dataset is a folder containing train.json and val.json (or a single JSON file for small sets).
Each JSON file is a list of samples like:

{
  "input": "text or OCR text here",
  "output": "structured JSON string here"
}

You can adapt this to your real dataset as needed.
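
The following sketch shows how a sample in this format could be built and how a dataset file could be checked before training. The helper names `make_sample` and `load_samples` are hypothetical, and the check that "output" parses as JSON is an assumption based on the description "structured JSON string".

```python
import json
from pathlib import Path


def make_sample(text, fields):
    """Build one sample; the fields dict is stored as a JSON string."""
    return {"input": text, "output": json.dumps(fields)}


def load_samples(json_path):
    """Load one dataset file and check that every sample is well formed."""
    samples = json.loads(Path(json_path).read_text(encoding="utf-8"))
    if not isinstance(samples, list):
        raise ValueError(f"{json_path}: expected a JSON list of samples")
    for i, sample in enumerate(samples):
        if not isinstance(sample, dict):
            raise ValueError(f"{json_path}: sample {i} is not an object")
        for key in ("input", "output"):
            if not isinstance(sample.get(key), str):
                raise ValueError(f"{json_path}: sample {i} needs a string '{key}'")
        # Assumption: "output" must itself be valid JSON.
        # json.JSONDecodeError is a subclass of ValueError.
        json.loads(sample["output"])
    return samples
```

Validating the files up front gives a clear error message before a long training run starts.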

🏷️ Supported Fields (for extraction/training)

INVOICE_RECEIPT_ID
INVOICE_RECEIPT_DATE
DUE_DATE
TOTAL
TAX
SUBTOTAL
VENDOR_NAME
VENDOR_ADDRESS
SHIP_TO_NAME
SHIP_TO_ADDRESS
BILL_TO_NAME
BILL_TO_ADDRESS
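
As a sketch of how these field names might be used in code, the snippet below parses a model's JSON output and keeps only the supported fields. The name `parse_model_output` and the choice to fill missing fields with `None` are assumptions for illustration.

```python
import json

SUPPORTED_FIELDS = (
    "INVOICE_RECEIPT_ID", "INVOICE_RECEIPT_DATE", "DUE_DATE",
    "TOTAL", "TAX", "SUBTOTAL",
    "VENDOR_NAME", "VENDOR_ADDRESS",
    "SHIP_TO_NAME", "SHIP_TO_ADDRESS",
    "BILL_TO_NAME", "BILL_TO_ADDRESS",
)


def parse_model_output(text):
    """Parse a JSON string and return a dict with exactly the supported fields."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        data = {}  # the model produced something that is not JSON
    if not isinstance(data, dict):
        data = {}
    # Unknown keys are dropped; missing fields become None.
    return {name: data.get(name) for name in SUPPORTED_FIELDS}
```

Returning a fixed set of keys keeps the export step simple, because every result has the same columns.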

✅ Capabilities Checklist

Display clean AI extraction results from flan-t5-base, flan-t5-small, and layoutlm-invoices
Save to .json, .xlsx, .db
Show key:value results in a rich readable text viewer
Handle file uploads and GPU/CPU logs smartly
Modern UI
Clean, readable log messages only
Ready to extend with training, document type detection, etc.
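
The export step can be sketched with the standard library alone for the .json and .db targets. The function names, the `extractions` table, and its columns are assumptions, not the actual contents of export_manager.py. Excel export is left out because it needs a third-party package.

```python
import json
import sqlite3


def export_json(fields, path):
    """Write the extracted fields to a UTF-8 JSON file."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(fields, fh, indent=2, ensure_ascii=False)


def export_sqlite(fields, path, document="unknown"):
    """Store one row per field in a hypothetical 'extractions' table."""
    conn = sqlite3.connect(path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "CREATE TABLE IF NOT EXISTS extractions "
                "(document TEXT, field TEXT, value TEXT)"
            )
            conn.executemany(
                "INSERT INTO extractions VALUES (?, ?, ?)",
                [
                    (document, key, None if value is None else str(value))
                    for key, value in fields.items()
                ],
            )
    finally:
        conn.close()
```

Storing one row per field, instead of one column per field, means the table does not need to change when new fields are added.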

🏋️‍♂️ Training Workflow

Provides a TrainingThread class
Trains LayoutLMv3 or FLAN-style models using Hugging Face’s transformers
Automatically processes FUNSD/SROIE-style annotated datasets
Emits progress and final-result signals
Ready for GUI integration (e.g. the phase 3 training window in app_gui.py)

To train:

Select your dataset folder (FUNSD/SROIE-style JSON plus images, or your own format)
Choose a model: LayoutLMv3 or a custom T5 (FLAN training is planned as a later bonus)
Start training; it runs in a background thread and updates the GUI as it progresses
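
The background-training pattern described above can be sketched without PyQt6. This is a stand-in, not the app's TrainingThread: it uses the standard library's `threading.Thread` with plain callbacks where the real class is described as emitting Qt signals. The class name `TrainingWorker`, the callback names, and the `train_step` function are all assumptions.

```python
import threading


class TrainingWorker(threading.Thread):
    """Runs a training function in the background and reports progress."""

    def __init__(self, train_step, epochs, on_progress, on_finished):
        super().__init__(daemon=True)
        self.train_step = train_step    # callable(epoch) -> loss as a float
        self.epochs = epochs
        self.on_progress = on_progress  # called as on_progress(percent, message)
        self.on_finished = on_finished  # called as on_finished(success, message)

    def run(self):
        try:
            for epoch in range(1, self.epochs + 1):
                loss = self.train_step(epoch)
                percent = int(100 * epoch / self.epochs)
                self.on_progress(percent, f"epoch {epoch}/{self.epochs} loss={loss:.3f}")
            self.on_finished(True, "training complete")
        except Exception as exc:
            # Report failures to the caller instead of dying silently.
            self.on_finished(False, str(exc))


if __name__ == "__main__":
    worker = TrainingWorker(
        train_step=lambda epoch: 1.0 / epoch,  # fake loss, for demonstration only
        epochs=4,
        on_progress=lambda percent, message: print(percent, message),
        on_finished=lambda ok, message: print("done:", ok, message),
    )
    worker.start()
    worker.join()
```

In a Qt application the callbacks would be replaced by signals, because widgets should only be updated from the GUI thread.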

🛠️ Requirements

Python 3.9+
PyQt6
transformers, datasets, evaluate, torch, easyocr, pytesseract, pdf2image, pillow
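
A quick way to check whether the packages above are installed, without importing them, is sketched here. The import names are assumptions based on each package's usual module name (for example, pillow is imported as `PIL`), and `missing_packages` is a hypothetical helper.

```python
import sys
from importlib.util import find_spec

# Usual import names for the listed packages (assumed, not read from requirements.txt).
REQUIRED_MODULES = [
    "PyQt6", "transformers", "datasets", "evaluate", "torch",
    "easyocr", "pytesseract", "pdf2image", "PIL",
]


def missing_packages(modules=REQUIRED_MODULES):
    """Return the top-level module names that cannot be found."""
    return [name for name in modules if find_spec(name) is None]


if __name__ == "__main__":
    if sys.version_info < (3, 9):
        print("Python 3.9+ is required")
    for name in missing_packages():
        print(f"missing: {name}")
```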

🤝 Contributing

Pull requests and issues are welcome! Please open an issue for bugs or feature requests.

📜 License

MIT

🙏 Credits

Hugging Face Transformers
PyQt6
LayoutLM, T5, and all open-source model authors

Made with ❤️ by [FOUDHIL_RIAHI]
