📄 Document Classification & Information Extraction System

Built with: Python, TensorFlow, Scikit-learn, Transformers, Flask, Pandas, NumPy, OpenCV, Pillow, spaCy, SQLAlchemy, EasyOCR, Joblib

This is an end-to-end Deep Learning (CNN) project that automatically classifies documents from images and extracts key information. The system identifies document types such as Certificates, ID Cards, Invoices, and Resumes, then pulls relevant text fields using OCR and Regex.

The entire workflow is served via a Flask API, and every prediction is logged into a SQLite database for tracking.


✨ Features

  • Multi-class Document Classification: A Deep Learning (CNN) model to classify images into 4 document categories.
  • Optical Character Recognition (OCR): Uses Tesseract OCR to extract all text from the document image.
  • Key Information Extraction: Employs Regex to parse and extract specific fields like names, dates, amounts, and ID numbers from the OCR text.
  • REST API: A Flask-based API to serve the model and provide predictions on-the-fly.
  • Prediction Logging: Automatically logs every prediction request and result into a SQLite database.
  • Structured MLOps Pipelines: Separate, modular pipelines for model training and prediction.

📸 Demo / Screenshot

Here is a sample response from the prediction API when an ID card image is sent:


πŸ› οΈ Tech Stack

  • Backend & API: Python, Flask
  • ML/DL Framework: TensorFlow, Keras, Scikit-learn
  • Data Processing: Pandas, NumPy
  • Image Processing: OpenCV
  • OCR Engine: Pytesseract
  • Database: SQLite
  • Packaging: Setuptools

βš™οΈ Setup and Installation

Follow these steps to set up and run the project locally.

  1. Clone the repository:

    # Replace with your repository URL
    git clone https://github.com/your-username/DocumentClassification_extraction.git
    cd DocumentClassification_extraction
  2. Create and activate a virtual environment:

    # It is recommended to use Python 3.10 or higher
    python -m venv venv
    # On Windows
    .\venv\Scripts\activate
    # On macOS/Linux: source venv/bin/activate
  3. Install Tesseract OCR: This is a crucial external dependency.

    • Download and install it from the official Tesseract documentation.
    • After installation, you must update the Tesseract path in the ocr/ocr_engine.py file:
      # ocr/ocr_engine.py
      pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe' # <-- Update this path
  4. Install the required Python packages: The setup.py file is configured to install all dependencies from requirements.txt.

    pip install -r requirements.txt
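As an alternative to hardcoding the path in step 3, the Tesseract binary location could be resolved from an environment variable. This is a sketch, not existing project code; the TESSERACT_CMD variable name is our assumption:

```python
import os

def tesseract_path() -> str:
    # Hypothetical helper: prefer a TESSERACT_CMD environment variable,
    # falling back to the common Windows install location.
    return os.environ.get(
        "TESSERACT_CMD",
        r"C:\Program Files\Tesseract-OCR\tesseract.exe",
    )

# ocr/ocr_engine.py could then set:
# pytesseract.pytesseract.tesseract_cmd = tesseract_path()
```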

🚀 Usage

The project has three main functionalities: training the model, running local predictions, and serving the API.

  1. Train the Model: This script handles data loading, preprocessing, model training, and saves the final model artifacts (classifier_model.h5 and label_encoder.pkl) into the model/ directory.

    python src/DocumentClassification_extraction/pipelines/training_pipeline.py
  2. Run a Prediction Locally: This script uses the trained model to predict a single document's type and extracts its information, then logs the result to logs.db.

    python src/DocumentClassification_extraction/pipelines/prediction_pipeline.py
  3. Run the Flask API: This command starts a local server to handle prediction requests via HTTP.

    python api/main.py

    The API will be available at http://127.0.0.1:5000.

    • Health Check: GET /health
    • Prediction: POST /predict (sends an image file)

🌐 API Usage Example (cURL)

You can use a tool like Postman or cURL to send a POST request with an image to the /predict endpoint.

# Replace the file path with the actual path to your image
curl -X POST -F "file=@C:\path\to\your\document\image.jpg" http://127.0.0.1:5000/predict

Sample JSON Response:

{
  "document_type": "ID_Card",
  "extracted_fields": {
    "id_number": "ABC12345",
    "name": "Suman Jaiswal"
  },
  "text": "Suman Jaiswal\nID: ABC12345\n..."
}
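The same call can be made from Python. Below is a minimal client sketch using the requests library; the predict helper and the placeholder image path are ours, not part of the repository:

```python
import requests

API_URL = "http://127.0.0.1:5000/predict"

def predict(image_path: str, url: str = API_URL) -> dict:
    """POST an image file to the /predict endpoint and return the parsed JSON."""
    with open(image_path, "rb") as f:
        response = requests.post(url, files={"file": f})
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    # Replace with the actual path to your image
    result = predict(r"C:\path\to\your\document\image.jpg")
    print(result["document_type"], result["extracted_fields"])
```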

πŸ—„οΈ Logging

Every prediction made through prediction_pipeline.py is logged to the logs.db SQLite database.

To view the logs:

  1. Open a terminal in the project's root directory.
  2. Run the following commands:
    sqlite3 logs.db
    .tables
    SELECT * FROM predictions;
    .exit
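The same logs can also be read programmatically with Python's built-in sqlite3 module; fetch_logs is a hypothetical helper, not part of the repository:

```python
import sqlite3

def fetch_logs(db_path: str = "logs.db", table: str = "predictions"):
    """Return every row from the predictions table, or [] if it doesn't exist yet."""
    conn = sqlite3.connect(db_path)
    try:
        exists = conn.execute(
            "SELECT name FROM sqlite_master WHERE type='table' AND name=?",
            (table,),
        ).fetchone()
        return conn.execute(f"SELECT * FROM {table}").fetchall() if exists else []
    finally:
        conn.close()
```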

📂 Project Structure

DOCUMENTCLASSIFICATION&EXTRACTION/
│
├── api/
│   └── main.py
├── data/
│   ├── certificates/
│   ├── id_cards/
│   ├── invoices/
│   └── resumes/
├── model/
│   ├── classifier_model.h5
│   ├── field_extraction.py
│   └── label_encoder.pkl
├── ocr/
│   └── ocr_engine.py
├── src/
│   └── DocumentClassification_extraction/
│       ├── components/
│       ├── pipelines/
│       │   ├── training_pipeline.py
│       │   └── prediction_pipeline.py
│       ├── exception.py
│       └── logger.py
├── utils/
│   ├── preprocessing.py
│   └── logger.py
│
├── requirements.txt
├── setup.py
└── README.md

