Yashsonaar/LayoutLMv3-Fine-Tuning

🧾 LayoutLMv3 Fine-Tuning for Invoice Extraction

Welcome to the LayoutLMv3 Fine-Tuning project! 🚀 This project focuses on extracting structured data from invoices and PDFs using LayoutLMv3, PaddleOCR, and Label Studio. The system extracts key fields like invoice number, date, vendor GSTIN, PAN, product description, rate, quantity, and amount.

📄 Overview

This repository contains code to fine-tune LayoutLMv3 for extracting fields from invoice PDFs. It handles a range of invoice layouts, including scanned documents, using PaddleOCR for text extraction and Label Studio for annotation.

🔧 Key Features

  • Fine-tunes LayoutLMv3 to extract structured data from PDFs.
  • Integrates PaddleOCR for text extraction from scanned PDFs.
  • Uses Label Studio annotations to create labeled datasets.
  • Handles multiple invoice formats with different label placements.
  • Extracts multiple fields:
    • 📅 Invoice Number and Date
    • 🏢 Vendor GSTIN and PAN
    • 📦 Product Description
    • 💲 Rate, Quantity, and Amount
    • 📜 HSN Codes
    • 📬 Client Address
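
One common way to frame field extraction for a token-classification model is a BIO tagging scheme, where each field gets a "begin" and an "inside" label. The sketch below builds such a label map from the fields listed above. The field names and the choice of BIO tagging are assumptions for illustration; the repository's actual label set may differ.

```python
# Hypothetical field names derived from the feature list above;
# the project's real label names are not stated in this README.
FIELDS = [
    "INVOICE_NUMBER", "INVOICE_DATE", "VENDOR_GSTIN", "PAN",
    "DESCRIPTION", "RATE", "QUANTITY", "AMOUNT", "HSN", "ADDRESS",
]


def build_label_maps(fields):
    """Return (label2id, id2label) for a BIO tagging scheme."""
    labels = ["O"]  # "O" marks tokens that belong to no field
    for name in fields:
        labels.append(f"B-{name}")  # first token of a field value
        labels.append(f"I-{name}")  # continuation tokens of the same value
    label2id = {label: idx for idx, label in enumerate(labels)}
    id2label = {idx: label for label, idx in label2id.items()}
    return label2id, id2label


label2id, id2label = build_label_maps(FIELDS)
print(len(label2id))  # 21: one "O" label plus two labels per field
```

Maps of this shape are what token-classification heads typically take as `label2id` and `id2label` configuration.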

🛠️ Installation

  1. Clone the repository:
 git clone https://github.com/Yashsonaar/LayoutLMv3-Fine-Tuning.git
 cd LayoutLMv3-Fine-Tuning
  2. Install dependencies:
 pip install -r requirements.txt
  3. Download pre-trained LayoutLMv3 weights:
 mkdir input && cd input
 wget <layoutlmv3-pretrained-model-url>
  4. Install PaddleOCR for text extraction:
 pip install paddleocr
  5. Install Label Studio for annotation:
 pip install label-studio

🚀 Usage

  1. Annotate PDFs with Label Studio to generate JSON datasets.
  2. Fine-tune LayoutLMv3 using the provided src/main.py.
  3. Extract text from scanned PDFs using PaddleOCR.
  4. Run the model to extract fields from invoices:
 python src/main.py
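
As a rough illustration of step 1, the sketch below reads a Label Studio JSON export and converts each labeled rectangle into the 0-1000 box scale that LayoutLMv3 expects. It assumes a rectangle-labels annotation setup, where Label Studio stores `x`, `y`, `width`, and `height` as percentages of the page. The function names, the label name, and the sample export are hypothetical, and this is not necessarily how the project's loader works.

```python
import json


def percent_to_box(value):
    """Convert a Label Studio rectangle (percent of page) to a 0-1000 box."""
    corners = (
        value["x"],
        value["y"],
        value["x"] + value["width"],
        value["y"] + value["height"],
    )
    # Percent * 10 gives the 0-1000 scale; clamp to guard against overshoot.
    return [max(0, min(1000, round(c * 10))) for c in corners]


def regions_from_export(tasks):
    """Collect labeled rectangles from a list of exported Label Studio tasks."""
    regions = []
    for task in tasks:
        for annotation in task.get("annotations", []):
            for item in annotation.get("result", []):
                if item.get("type") != "rectanglelabels":
                    continue  # skip results that are not labeled rectangles
                value = item["value"]
                for label in value.get("rectanglelabels", []):
                    regions.append({"label": label, "box": percent_to_box(value)})
    return regions


# Tiny hand-written export for illustration only, not real project data.
EXAMPLE_EXPORT = json.loads("""
[{"id": 1,
  "annotations": [{"result": [
    {"type": "rectanglelabels",
     "value": {"x": 10.0, "y": 5.0, "width": 20.0, "height": 2.5,
               "rectanglelabels": ["INVOICE_NUMBER"]}}
  ]}]}]
""")

print(regions_from_export(EXAMPLE_EXPORT))
# [{'label': 'INVOICE_NUMBER', 'box': [100, 50, 300, 75]}]
```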

📈 Model Training

  • The model is fine-tuned on annotated data created using Label Studio.
  • Text from scanned documents is extracted using PaddleOCR.
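
To connect the two points above, OCR words need bounding boxes on LayoutLMv3's 0-1000 scale and a label taken from the annotations. The sketch below assumes the OCR output has already been reduced to `(quad, text)` pairs, with each quad a four-point polygon in pixels (the shape PaddleOCR 2.x commonly returns), and that annotated regions are dictionaries with a `label` and a 0-1000 `box`. Matching by box centre is one simple heuristic, and all function names here are hypothetical rather than taken from the repository.

```python
def quad_to_box(quad, page_width, page_height):
    """Convert a four-point OCR polygon to [x0, y0, x1, y1] on a 0-1000 scale."""
    xs = [point[0] for point in quad]
    ys = [point[1] for point in quad]
    return [
        int(1000 * min(xs) / page_width),
        int(1000 * min(ys) / page_height),
        int(1000 * max(xs) / page_width),
        int(1000 * max(ys) / page_height),
    ]


def label_for_box(box, regions):
    """Return the label of the first region containing the box centre, else "O"."""
    cx = (box[0] + box[2]) / 2
    cy = (box[1] + box[3]) / 2
    for region in regions:
        x0, y0, x1, y1 = region["box"]
        if x0 <= cx <= x1 and y0 <= cy <= y1:
            return region["label"]
    return "O"


def build_example(ocr_items, regions, page_width, page_height):
    """Pair each OCR word with a normalized box and a region label."""
    words, boxes, labels = [], [], []
    for quad, text in ocr_items:
        box = quad_to_box(quad, page_width, page_height)
        words.append(text)
        boxes.append(box)
        labels.append(label_for_box(box, regions))
    return {"words": words, "boxes": boxes, "labels": labels}


# Illustrative values only: two OCR words on a 1000 x 2000 pixel page.
ocr_items = [
    ([[100, 50], [300, 50], [300, 100], [100, 100]], "12345"),
    ([[500, 1000], [600, 1000], [600, 1040], [500, 1040]], "Thanks"),
]
regions = [{"label": "INVOICE_NUMBER", "box": [90, 20, 400, 60]}]

example = build_example(ocr_items, regions, page_width=1000, page_height=2000)
print(example["boxes"])   # [[100, 25, 300, 50], [500, 500, 600, 520]]
print(example["labels"])  # ['INVOICE_NUMBER', 'O']
```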

📊 Output Example

Invoice No: 12345
Date: 25-12-2023
Vendor GSTIN: 29ABCDE1234F1Z5
PAN: ABCDE1234F
Description: Office Chair
HSN: 940330
Rate: 5000
Quantity: 10
Amount: 50000
Address: 123, Business Street, City
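
Extracted fields like these can be sanity-checked after inference. The sketch below parses the `Key: Value` lines and runs three checks: the PAN follows the five-letter, four-digit, one-letter layout, the GSTIN follows its 15-character layout and embeds the PAN in characters 3 to 12, and Amount equals Rate times Quantity. It assumes the PAN and GSTIN belong to the same vendor and checks layout only, not the GSTIN check character. The function names are hypothetical and not part of the repository.

```python
import re
from decimal import Decimal

PAN_RE = re.compile(r"[A-Z]{5}[0-9]{4}[A-Z]")
# State code (2 digits) + PAN (10 chars) + entity code + "Z" + check character.
GSTIN_RE = re.compile(r"[0-9]{2}[A-Z]{5}[0-9]{4}[A-Z][1-9A-Z]Z[0-9A-Z]")

SAMPLE = """Invoice No: 12345
Date: 25-12-2023
Vendor GSTIN: 29ABCDE1234F1Z5
PAN: ABCDE1234F
Description: Office Chair
HSN: 940330
Rate: 5000
Quantity: 10
Amount: 50000
Address: 123, Business Street, City"""


def parse_output(text):
    """Turn "Key: Value" lines into a dictionary, splitting on the first colon."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def find_problems(fields):
    """Return a list of human-readable consistency problems (empty if none)."""
    problems = []
    pan = fields.get("PAN", "")
    gstin = fields.get("Vendor GSTIN", "")
    if not PAN_RE.fullmatch(pan):
        problems.append("PAN does not match the expected layout")
    if not GSTIN_RE.fullmatch(gstin):
        problems.append("GSTIN does not match the 15-character layout")
    elif gstin[2:12] != pan:
        problems.append("GSTIN does not embed the extracted PAN")
    try:
        expected = Decimal(fields["Rate"]) * Decimal(fields["Quantity"])
        if expected != Decimal(fields["Amount"]):
            problems.append("Amount differs from Rate x Quantity")
    except (KeyError, ArithmeticError):
        problems.append("Rate, Quantity, or Amount is missing or not numeric")
    return problems


fields = parse_output(SAMPLE)
print(find_problems(fields))  # [] for the sample output above
```

Checks of this kind are cheap to run on every prediction and can flag OCR misreads, such as a digit swapped inside the GSTIN, before the data reaches downstream systems.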

🧩 File Structure

|-- src/
|   |-- main.py
|   |-- engine.py
|   |-- loader.py
|   |-- trainer.py
|-- input/
|-- README.md

🎯 Future Improvements

  • Add support for multilingual invoices.
  • Enhance OCR accuracy using advanced PaddleOCR models.
  • Expand extraction to receipts and other documents.

🤝 Contributing

Feel free to open issues or pull requests. Let's make this project better together! 🎉
