Offline Intelligent Document Parser. Extracts structured data from invoices, receipts, and scanned documents using OCR + AI models (FLAN-T5, LayoutLM). Works offline. Exports to JSON, Excel, and SQLite.

foudhilriahi/AIDocumentExtractorPro

AI Document Extractor Pro

A modern, PyQt6-powered desktop app for extracting, labeling, and exporting structured data from documents (PDFs and images) using AI models such as T5 and LayoutLM. Includes a no-code trainer and dataset builder for fine-tuning your own models.


✨ Features

  • Document Extraction:
    • Extracts key fields (invoice number, date, total, etc.) from PDFs and images using LayoutLM and T5 models.
    • Supports both pre-trained and your own fine-tuned models (auto-discovered from models/).
  • Beautiful UI:
    • Modern dark theme, responsive layout, and real-time logs.
  • Export:
    • Save results as JSON, Excel, or SQLite with one click.
  • Trainer GUI:
    • Fine-tune T5 models on your own data (no code needed).
    • Build datasets from PDFs/images with a simple labeling dialog.
  • Post-processing:
    • Smart field correction and validation for cleaner results.

🚀 Quick Start

  1. Clone the repo:
    git clone https://github.com/foudhilriahi/AIDocumentExtractorPro
    cd AIDocumentExtractorPro
  2. Create a virtual environment:
    python -m venv myenv
    myenv\Scripts\activate  # On Windows
    # source myenv/bin/activate  # On Linux/Mac
  3. Install dependencies:
    pip install -r requirements.txt
  4. Run the app:
    python main.py
  5. (Optional) Train your own model:
    python trainer.py

🧠 How It Works

  • Extraction:
    • Select a document, choose a model, and click "Start Extraction".
    • See logs and results in the UI.
  • Trainer:
    • Build a dataset from your own PDFs/images (label each sample in the dialog).
    • Train a new model and it will appear in the main app automatically.
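
The auto-discovery step can be pictured with a short sketch. This is not the app's actual code: the function name discover_models is hypothetical, and it assumes that a usable model folder is any subfolder of models/ containing a config.json file, which is what Hugging Face's save_pretrained() writes.

```python
from pathlib import Path


def discover_models(models_dir: str = "models") -> list[str]:
    """Return the names of subfolders that look like saved models."""
    root = Path(models_dir)
    if not root.is_dir():
        return []
    # Assumption: a folder counts as a model if it holds a config.json.
    return sorted(
        p.name
        for p in root.iterdir()
        if p.is_dir() and (p / "config.json").is_file()
    )


if __name__ == "__main__":
    for name in discover_models():
        print(name)
```

A list built this way could feed the model picker in the UI, so a newly trained model shows up the next time the folder is scanned.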

📁 Project Structure

main.py                # Main app UI
trainer.py             # Trainer and dataset builder
ai_extractor.py        # Extraction logic
export_manager.py      # Export logic
requirements.txt       # Dependencies
models/                # Your models (auto-discovered)
logs/                  # Logs

📦 Dataset Format

  • Each dataset is a folder containing a train.json and val.json (or just a single JSON for small sets).
  • Each JSON file is a list of samples, where each sample looks like:
{
  "input": "text or OCR text here",
  "output": "structured JSON string here"
}

You can adapt this to your real dataset as needed.
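
As an illustration of this format, the sketch below loads a dataset file and checks that every sample has string "input" and "output" values and that "output" is itself valid JSON. The function name load_samples and the sample values in the usage note are made up for the example and are not part of the project.

```python
import json
from pathlib import Path


def load_samples(json_path):
    """Load a dataset file and validate the input/output structure."""
    samples = json.loads(Path(json_path).read_text(encoding="utf-8"))
    if not isinstance(samples, list):
        raise ValueError("dataset file must contain a JSON list of samples")
    for i, sample in enumerate(samples):
        if not isinstance(sample, dict):
            raise ValueError(f"sample {i}: expected a JSON object")
        for key in ("input", "output"):
            if not isinstance(sample.get(key), str):
                raise ValueError(f"sample {i}: missing or non-string '{key}'")
        # "output" is a JSON string, so it should parse on its own.
        # json.JSONDecodeError is a subclass of ValueError.
        json.loads(sample["output"])
    return samples
```

A sample can be built with json.dumps so that the "output" string is always well formed, for example {"input": "Invoice No. 1042 Total 118.00", "output": json.dumps({"INVOICE_RECEIPT_ID": "1042", "TOTAL": "118.00"})}.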


🏷️ Supported Fields (for extraction/training)

  • INVOICE_RECEIPT_ID
  • INVOICE_RECEIPT_DATE
  • DUE_DATE
  • TOTAL
  • TAX
  • SUBTOTAL
  • VENDOR_NAME
  • VENDOR_ADDRESS
  • SHIP_TO_NAME
  • SHIP_TO_ADDRESS
  • BILL_TO_NAME
  • BILL_TO_ADDRESS
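
One way to use this list is to filter a model's raw output down to the supported names. The sketch below assumes the model returns a JSON object as a string (matching the "output" format used for training). The helper keep_supported is hypothetical and is not the project's post-processing code.

```python
import json

SUPPORTED_FIELDS = (
    "INVOICE_RECEIPT_ID", "INVOICE_RECEIPT_DATE", "DUE_DATE",
    "TOTAL", "TAX", "SUBTOTAL",
    "VENDOR_NAME", "VENDOR_ADDRESS",
    "SHIP_TO_NAME", "SHIP_TO_ADDRESS",
    "BILL_TO_NAME", "BILL_TO_ADDRESS",
)


def keep_supported(raw_output: str) -> dict:
    """Parse a JSON string and keep only non-empty supported fields."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return {}  # the model produced something that is not JSON
    if not isinstance(data, dict):
        return {}
    return {
        key: str(value).strip()
        for key, value in data.items()
        if key in SUPPORTED_FIELDS and value not in (None, "")
    }
```

Unknown keys and empty values are dropped, and malformed output yields an empty dict rather than an exception, which keeps the UI code simple.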

✅ Capabilities Checklist

  • Display clean AI extraction results from flan-t5-base, flan-t5-small, and layoutlm-invoices
  • Save to .json, .xlsx, .db
  • Show key:value results in a rich readable text viewer
  • Handle file uploads and GPU/CPU logs smartly
  • Modern UI
  • No ugly logs, clean messages only
  • Future-ready for training, doc type detection, etc.
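
The .json and .db exports can be sketched with the standard library alone. This is an assumption-laden illustration, not the contents of export_manager.py: the function names, the table name extractions, and its columns are invented here. Excel export is left out because it needs a third-party library such as openpyxl.

```python
import json
import sqlite3


def export_json(result: dict, path: str) -> None:
    """Write the extracted fields to a UTF-8 JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(result, f, ensure_ascii=False, indent=2)


def export_sqlite(result: dict, path: str, source: str) -> None:
    """Store one row per extracted field in a SQLite database."""
    conn = sqlite3.connect(path)
    try:
        with conn:  # commits on success, rolls back on error
            # Assumption: this schema is made up for the sketch.
            conn.execute(
                "CREATE TABLE IF NOT EXISTS extractions "
                "(source TEXT, field TEXT, value TEXT)"
            )
            conn.executemany(
                "INSERT INTO extractions VALUES (?, ?, ?)",
                [(source, key, str(value)) for key, value in result.items()],
            )
    finally:
        conn.close()
```

Storing one row per field keeps the schema stable even if the set of extracted fields changes between documents.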

🏋️‍♂️ Training Workflow

  • Provides a real TrainingThread class
  • Trains LayoutLMv3 or FLAN-style models using Hugging Face’s transformers
  • Automatically processes FUNSD/SROIE-style annotated datasets
  • Emits progress and final result signals
  • Future-ready for GUI integration (e.g. phase 3 training window in app_gui.py)

To train:

  • Select your dataset folder (with FUNSD/SROIE-style JSON + image, or your own format)
  • Choose a model: LayoutLMv3 or a custom T5 (FLAN training is planned for later)
  • Training runs in a background thread and updates the GUI
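
The background-thread pattern can be shown without any GUI toolkit. The sketch below is a stand-in, not the project's TrainingThread: it uses the standard library's threading.Thread with plain callbacks where a PyQt6 QThread would emit signals, and the class name TrainingWorker and the sleep-based "epoch" are placeholders.

```python
import threading
import time


class TrainingWorker(threading.Thread):
    """Runs a (fake) training loop off the main thread and reports back."""

    def __init__(self, epochs, on_progress, on_finished):
        super().__init__(daemon=True)
        self.epochs = epochs
        self.on_progress = on_progress  # called with a percentage (int)
        self.on_finished = on_finished  # called with (success, message)

    def run(self):
        success, message = True, "training complete"
        try:
            for epoch in range(1, self.epochs + 1):
                time.sleep(0.01)  # placeholder for one real training epoch
                self.on_progress(int(100 * epoch / self.epochs))
        except Exception as exc:  # report failures instead of dying silently
            success, message = False, str(exc)
        self.on_finished(success, message)


if __name__ == "__main__":
    worker = TrainingWorker(
        epochs=4,
        on_progress=lambda pct: print(f"progress: {pct}%"),
        on_finished=lambda ok, msg: print("done:", ok, msg),
    )
    worker.start()
    worker.join()
```

In a Qt application the callbacks here run on the worker thread, so they should not touch widgets directly. Emitting signals from a QThread instead lets Qt deliver the updates to the GUI thread.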

🛠️ Requirements

  • Python 3.9+
  • PyQt6
  • transformers, datasets, evaluate, torch, easyocr, pytesseract, pdf2image, pillow

🤝 Contributing

Pull requests and issues are welcome! Please open an issue for bugs or feature requests.


📜 License

MIT


🙏 Credits

  • Hugging Face Transformers
  • PyQt6
  • LayoutLM, T5, and all open-source model authors

Made with ❤️ by [FOUDHIL_RIAHI]
