🧾 LayoutLMv3 Fine-Tuning for Invoice Extraction

Welcome to the LayoutLMv3 Fine-Tuning project! 🚀 This project extracts structured data from invoice PDFs, including scanned ones, using LayoutLMv3, PaddleOCR, and Label Studio. The system extracts key fields such as invoice number, date, vendor GSTIN, PAN, product description, HSN code, rate, quantity, amount, and client address.

📄 Overview

This repository contains code to fine-tune LayoutLMv3 for extracting fields from invoice PDFs. It supports various formats, including scanned documents, by leveraging PaddleOCR for text extraction and Label Studio for annotation.

🔧 Key Features

- Fine-tunes LayoutLMv3 to extract structured data from PDFs.
- Integrates PaddleOCR for text extraction from scanned PDFs.
- Uses Label Studio for annotation to create labeled datasets.
- Handles multiple invoice formats with different label placements.
- Extracts multiple fields:
  - 📅 Invoice Number and Date
  - 🏢 Vendor GSTIN and PAN
  - 📦 Product Description
  - 💲 Rate, Quantity, and Amount
  - 📜 HSN Codes
  - 📬 Client Address
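
The fields above are typically turned into token-classification labels before fine-tuning. The following is a minimal sketch of one way to do that with a BIO tagging scheme. The field names and the `build_label_maps` helper are hypothetical, not taken from this repository; adapt them to the labels used in your own annotations.

```python
# Hypothetical field names; replace them with the labels used in your annotations.
FIELDS = [
    "INVOICE_NUMBER", "INVOICE_DATE", "VENDOR_GSTIN", "VENDOR_PAN",
    "DESCRIPTION", "RATE", "QUANTITY", "AMOUNT", "HSN", "CLIENT_ADDRESS",
]


def build_label_maps(fields):
    """Return (label2id, id2label) for a BIO tagging scheme over `fields`."""
    labels = ["O"]  # "O" marks tokens that belong to no field
    for field in fields:
        labels.append(f"B-{field}")  # first token of a field value
        labels.append(f"I-{field}")  # continuation tokens of the same value
    label2id = {label: idx for idx, label in enumerate(labels)}
    id2label = {idx: label for label, idx in label2id.items()}
    return label2id, id2label


label2id, id2label = build_label_maps(FIELDS)
print(f"{len(label2id)} labels, e.g. {id2label[1]} and {id2label[2]}")
```

Token-classification models generally accept mappings like these so that predictions can be decoded back into field names.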

🛠️ Installation

Clone the repository:

 git clone https://github.com/YashSonar/LayoutLMv3-Invoice-Extraction.git
 cd LayoutLMv3-Invoice-Extraction

Install dependencies:

 pip install -r requirements.txt

Download pre-trained LayoutLMv3 weights:

 mkdir input && cd input
 wget <layoutlmv3-pretrained-model-url>

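Install PaddleOCR for text extraction (PaddleOCR also needs the PaddlePaddle framework, installed separately as paddlepaddle if requirements.txt does not already include it):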

 pip install paddleocr

Install Label Studio for annotation:

 pip install label-studio

🚀 Usage

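- Annotate PDFs with Label Studio and export the annotations as JSON datasets.
- Fine-tune LayoutLMv3 using the provided main.py.
- Extract text from scanned PDFs using PaddleOCR.
- Run the model to extract fields from invoices (main.py is listed under src/ in the file structure, so run the command from that directory):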

 python main.py
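
LayoutLM-family models expect each bounding box as integers scaled to a 0-1000 range, while OCR engines report pixel coordinates. The sketch below shows one way to bridge the two. It assumes the OCR result for a page is a list of `[polygon, (text, confidence)]` entries, which is the shape older PaddleOCR 2.x releases return; newer releases may differ, so check the output of your installed version. The helper names are hypothetical, and each OCR text segment is treated as one "word".

```python
def polygon_to_box(points):
    """Collapse a 4-point polygon [[x, y], ...] into (x_min, y_min, x_max, y_max)."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return min(xs), min(ys), max(xs), max(ys)


def normalize_box(box, width, height):
    """Scale a pixel box to the 0-1000 integer range and clamp it to that range."""
    x0, y0, x1, y1 = box
    scaled = [
        int(1000 * x0 / width),
        int(1000 * y0 / height),
        int(1000 * x1 / width),
        int(1000 * y1 / height),
    ]
    return [max(0, min(1000, v)) for v in scaled]


def ocr_lines_to_words_and_boxes(ocr_lines, width, height):
    """Turn assumed [polygon, (text, confidence)] entries into parallel lists."""
    words, boxes = [], []
    for polygon, (text, _confidence) in ocr_lines:
        text = text.strip()
        if not text:
            continue  # skip empty detections
        words.append(text)
        boxes.append(normalize_box(polygon_to_box(polygon), width, height))
    return words, boxes


# Hand-written sample in the assumed OCR shape, for a 1000 x 500 pixel page.
sample = [
    ([[100, 50], [300, 50], [300, 80], [100, 80]], ("Invoice No: 12345", 0.98)),
    ([[100, 100], [260, 100], [260, 130], [100, 130]], ("Date: 25-12-2023", 0.97)),
]
words, boxes = ocr_lines_to_words_and_boxes(sample, width=1000, height=500)
print(words)
print(boxes)
```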

📈 Model Training

The model is fine-tuned on annotated data created using Label Studio.
Text from scanned documents is extracted using PaddleOCR.
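
Combining the two sources means giving each OCR text box the label of the Label Studio region it falls in. The sketch below illustrates one possible approach, not necessarily the one used by the scripts in this repository. It assumes each region is a dict with `x`, `y`, `width`, and `height` expressed as percentages of the image size plus a `rectanglelabels` list, and that OCR boxes are already scaled to 0-1000. The function names and the center-point rule are assumptions made for illustration.

```python
def rect_percent_to_box(value):
    """Convert a rectangle in percent units to a 0-1000 box (1 percent = 10 units)."""
    x0 = value["x"] * 10
    y0 = value["y"] * 10
    x1 = (value["x"] + value["width"]) * 10
    y1 = (value["y"] + value["height"]) * 10
    return [x0, y0, x1, y1]


def label_words(boxes, regions):
    """Give each box the label of the first region containing its center, else 'O'."""
    labels = []
    for x0, y0, x1, y1 in boxes:
        cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
        label = "O"
        for region in regions:
            rx0, ry0, rx1, ry1 = rect_percent_to_box(region)
            if rx0 <= cx <= rx1 and ry0 <= cy <= ry1:
                label = region["rectanglelabels"][0]
                break
        labels.append(label)
    return labels


# One annotated region (percent units) and two OCR boxes (0-1000 units).
regions = [
    {"x": 5.0, "y": 8.0, "width": 30.0, "height": 10.0,
     "rectanglelabels": ["INVOICE_NUMBER"]},
]
boxes = [[100, 100, 300, 160], [100, 200, 260, 260]]
labels = label_words(boxes, regions)
print(labels)
```

Matching on the box center is a simple rule that tolerates small misalignments between OCR boxes and hand-drawn regions; an overlap-ratio threshold is a common alternative when regions sit close together.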

📊 Output Example

Invoice No: 12345
Date: 25-12-2023
Vendor GSTIN: 29ABCDE1234F1Z5
PAN: ABCDE1234F
Description: Office Chair
HSN: 940330
Rate: 5000
Quantity: 10
Amount: 50000
Address: 123, Business Street, City
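
Extracted records like the one above can be sanity-checked before they are used downstream. The sketch below is a hypothetical post-processing step, not part of this repository. It relies on the GSTIN layout (2-digit state code, the 10-character PAN, an entity character, `Z`, and a check character) and assumes that the PAN belongs to the same vendor and that the record holds a single line item with no tax included in the amount.

```python
import re

PAN_RE = re.compile(r"[A-Z]{5}[0-9]{4}[A-Z]")
GSTIN_RE = re.compile(r"[0-9]{2}[A-Z]{5}[0-9]{4}[A-Z][0-9A-Z]Z[0-9A-Z]")


def check_invoice(fields):
    """Return a list of human-readable problems found in an extracted record."""
    problems = []
    pan = fields.get("PAN", "")
    gstin = fields.get("Vendor GSTIN", "")
    if not PAN_RE.fullmatch(pan):
        problems.append("PAN does not match the AAAAA9999A pattern")
    if not GSTIN_RE.fullmatch(gstin):
        problems.append("GSTIN does not match the 15-character pattern")
    elif gstin[2:12] != pan:
        # Characters 3-12 of a GSTIN hold the taxpayer's PAN.
        problems.append("GSTIN characters 3-12 differ from the PAN")
    try:
        rate = float(fields["Rate"])
        quantity = float(fields["Quantity"])
        amount = float(fields["Amount"])
    except (KeyError, ValueError):
        problems.append("Rate, Quantity, or Amount is missing or not numeric")
    else:
        # Assumes a single line item with no tax folded into Amount.
        if abs(rate * quantity - amount) > 0.01:
            problems.append("Amount is not Rate x Quantity")
    return problems


record = {
    "Invoice No": "12345",
    "Vendor GSTIN": "29ABCDE1234F1Z5",
    "PAN": "ABCDE1234F",
    "Rate": "5000",
    "Quantity": "10",
    "Amount": "50000",
}
print(check_invoice(record))
```

A non-empty result is a cue to route the invoice to manual review, since OCR misreads tend to show up as exactly these kinds of mismatches.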

🧩 File Structure

|-- src/
|   |-- main.py
|   |-- engine.py
|   |-- loader.py
|   |-- trainer.py
|-- input/
|-- README.md

🎯 Future Improvements

Add support for multilingual invoices.
Enhance OCR accuracy using advanced PaddleOCR models.
Expand extraction to receipts and other documents.

🤝 Contributing

Feel free to open issues or pull requests. Let's make this project better together! 🎉

