Welcome to the LayoutLMv3 Fine-Tuning project! This project extracts structured data from invoice PDFs using LayoutLMv3, PaddleOCR, and Label Studio. It extracts key fields such as invoice number, date, vendor GSTIN, PAN, product description, rate, quantity, and amount.
This repository contains code to fine-tune LayoutLMv3 for extracting fields from invoice PDFs. It supports various formats, including scanned documents, by leveraging PaddleOCR for text extraction and Label Studio for annotation.
- Fine-tunes LayoutLMv3 to extract structured data from PDFs.
- Integrates PaddleOCR for text extraction from scanned PDFs.
- Uses Label Studio annotation to create labeled datasets.
- Handles multiple invoice formats with different label placements.
- Extracts multiple fields:
  - Invoice Number and Date
  - Vendor GSTIN and PAN
  - Product Description
  - Rate, Quantity, and Amount
  - HSN Codes
  - Client Address
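The field list above maps naturally onto a label set for token classification. The following is a minimal sketch, assuming the model is fine-tuned as a token classifier with BIO tags, which this README does not confirm. The label names are hypothetical and may not match the labels used in the Label Studio project.

```python
# Hypothetical label names derived from the field list above; the
# repository's actual Label Studio labels may differ.
FIELDS = [
    "INVOICE_NUMBER", "INVOICE_DATE", "VENDOR_GSTIN", "PAN",
    "DESCRIPTION", "RATE", "QUANTITY", "AMOUNT", "HSN", "CLIENT_ADDRESS",
]


def build_label_maps(fields):
    """Return (label2id, id2label) for a BIO tagging scheme."""
    labels = ["O"]  # "O" marks tokens that belong to no field
    for field in fields:
        labels.append(f"B-{field}")  # first token of a field value
        labels.append(f"I-{field}")  # continuation tokens of that value
    label2id = {label: idx for idx, label in enumerate(labels)}
    id2label = {idx: label for label, idx in label2id.items()}
    return label2id, id2label


label2id, id2label = build_label_maps(FIELDS)
print(len(label2id))  # 21: one "O" label plus B-/I- tags for 10 fields
```

Keeping both mappings lets the training code encode labels as integers and the inference code turn predictions back into field names.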
- Clone the repository:
git clone https://github.com/YashSonar/LayoutLMv3-Invoice-Extraction.git
- Install dependencies:
pip install -r requirements.txt
- Download pre-trained LayoutLMv3 weights:
mkdir input && cd input
wget <layoutlmv3-pretrained-model-url>
- Install PaddleOCR for text extraction:
pip install paddleocr
- Install Label Studio for annotation:
pip install label-studio
- Annotate PDFs with Label Studio to generate JSON datasets.
- Fine-tune LayoutLMv3 using the provided main.py.
- Extract text from scanned PDFs using PaddleOCR.
- Run the model to extract fields from invoices:
python main.py
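The annotation step produces JSON that has to be turned into labeled boxes before training. The sketch below assumes a Label Studio JSON export in which each task carries `annotations`, each annotation a `result` list, and each rectangle region stores `x`, `y`, `width`, and `height` as percentages of the page. The helper names `rect_to_box` and `iter_regions` are hypothetical and are not taken from this repository.

```python
def rect_to_box(value):
    """Map a rectangle in percent units to an [x0, y0, x1, y1] box on a 0-1000 scale."""
    x0, y0 = value["x"], value["y"]
    x1, y1 = x0 + value["width"], y0 + value["height"]
    # Percent (0-100) times 10 gives the 0-1000 scale; clamp to stay in range.
    return [max(0, min(1000, round(10 * v))) for v in (x0, y0, x1, y1)]


def iter_regions(task):
    """Yield (label, box) pairs for every rectangle region in one exported task."""
    for annotation in task.get("annotations", []):
        for region in annotation.get("result", []):
            if region.get("type") != "rectanglelabels":
                continue  # skip relations, text areas, and other region types
            value = region["value"]
            for label in value.get("rectanglelabels", []):
                yield label, rect_to_box(value)


# Made-up task for illustration only.
task = {
    "annotations": [{
        "result": [{
            "type": "rectanglelabels",
            "value": {"x": 10.0, "y": 5.0, "width": 20.0, "height": 2.5,
                      "rectanglelabels": ["INVOICE_NUMBER"]},
        }]
    }]
}
print(list(iter_regions(task)))  # [('INVOICE_NUMBER', [100, 50, 300, 75])]
```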
- The model is fine-tuned on annotated data created using Label Studio.
- Text from scanned documents is extracted using PaddleOCR.
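LayoutLMv3 expects each word's bounding box on a 0-1000 scale, while OCR engines typically report pixel coordinates. The sketch below assumes each OCR detection is a four-point polygon in pixels, as PaddleOCR commonly returns, and that the page size is known. The function name `normalize_quad` is hypothetical and is not part of this repository.

```python
def normalize_quad(quad, page_width, page_height):
    """Convert a 4-point OCR polygon to an [x0, y0, x1, y1] box on a 0-1000 scale."""
    xs = [point[0] for point in quad]
    ys = [point[1] for point in quad]

    def scale(value, size):
        # Clamp so boxes that spill slightly outside the page stay in range.
        return max(0, min(1000, int(1000 * value / size)))

    return [
        scale(min(xs), page_width),
        scale(min(ys), page_height),
        scale(max(xs), page_width),
        scale(max(ys), page_height),
    ]


# Made-up detection on a 1000x2000 pixel page.
quad = [[100, 200], [300, 200], [300, 260], [100, 260]]
print(normalize_quad(quad, 1000, 2000))  # [100, 100, 300, 130]
```

Taking the min and max of the corner points also handles slightly rotated text, at the cost of a somewhat looser box.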
Invoice No: 12345
Date: 25-12-2023
Vendor GSTIN: 29ABCDE1234F1Z5
PAN: ABCDE1234F
Description: Office Chair
HSN: 940330
Rate: 5000
Quantity: 10
Amount: 50000
Address: 123, Business Street, City
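Extracted fields like these can be sanity-checked before they are used downstream. The sketch below assumes the output is plain `Key: value` text as in the sample above, and that the PAN belongs to the same vendor as the GSTIN, in which case characters 3 to 12 of the GSTIN equal the PAN. The function names are hypothetical and not part of this repository.

```python
import re
from decimal import Decimal, InvalidOperation

SAMPLE = """\
Invoice No: 12345
Date: 25-12-2023
Vendor GSTIN: 29ABCDE1234F1Z5
PAN: ABCDE1234F
Description: Office Chair
HSN: 940330
Rate: 5000
Quantity: 10
Amount: 50000
Address: 123, Business Street, City
"""

PAN_RE = re.compile(r"[A-Z]{5}[0-9]{4}[A-Z]")
# State code, embedded PAN, entity code, the letter Z, check character.
GSTIN_RE = re.compile(r"[0-9]{2}[A-Z]{5}[0-9]{4}[A-Z][0-9A-Z]Z[0-9A-Z]")


def parse_fields(text):
    """Split 'Key: value' lines into a dict, using only the first colon."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def check_invoice(fields):
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    pan = fields.get("PAN", "")
    gstin = fields.get("Vendor GSTIN", "")
    if not PAN_RE.fullmatch(pan):
        problems.append("PAN has an unexpected format")
    if not GSTIN_RE.fullmatch(gstin):
        problems.append("GSTIN has an unexpected format")
    elif gstin[2:12] != pan:
        problems.append("GSTIN does not embed the PAN")
    try:
        rate = Decimal(fields["Rate"])
        quantity = Decimal(fields["Quantity"])
        amount = Decimal(fields["Amount"])
        if rate * quantity != amount:
            problems.append("Amount is not Rate x Quantity")
    except (KeyError, InvalidOperation):
        problems.append("Rate, Quantity, or Amount is missing or not numeric")
    return problems


print(check_invoice(parse_fields(SAMPLE)))  # []
```

Checks like these catch common OCR slips, such as a misread digit in the amount, without needing another model pass.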
|-- src/
| |-- main.py
| |-- engine.py
| |-- loader.py
| |-- trainer.py
|-- input/
|-- README.md
- Add support for multilingual invoices.
- Enhance OCR accuracy using advanced PaddleOCR models.
- Expand extraction to receipts and other documents.
Feel free to open issues or pull requests. Let's make this project better together!