This is an end-to-end deep learning (CNN) project that classifies document images and extracts key information from them. The system identifies four document types (Certificates, ID Cards, Invoices, and Resumes) and then pulls the relevant text fields using OCR and regular expressions.
The entire workflow is served via a Flask API, and every prediction is logged into a SQLite database for tracking.
- Multi-class Document Classification: A Deep Learning (CNN) model to classify images into 4 document categories.
- Optical Character Recognition (OCR): Uses Tesseract OCR to extract all text from the document image.
- Key Information Extraction: Employs Regex to parse and extract specific fields like names, dates, amounts, and ID numbers from the OCR text.
- REST API: A Flask-based API to serve the model and provide predictions on-the-fly.
- Prediction Logging: Automatically logs every prediction request and result into a SQLite database.
- Structured MLOps Pipelines: Separate, modular pipelines for model training and prediction.
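The key information extraction step can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `FIELD_PATTERNS` table, the `extract_fields` helper, and the patterns themselves are assumptions, and the real patterns in `model/field_extraction.py` may differ. Names are left out because they rarely follow a fixed textual pattern.

```python
import re

# Hypothetical patterns for illustration only; the project's real
# patterns may be stricter or cover more layouts.
FIELD_PATTERNS = {
    # "ID: ABC12345", "ID No: ABC12345", "ID Number # ABC12345"
    "id_number": re.compile(r"\bID\s*(?:No\.?|Number)?\s*[:#]\s*([A-Z0-9]{6,})\b"),
    # Numeric dates such as 12/05/2023 or 1-2-24
    "date": re.compile(r"\b(\d{1,2}[/-]\d{1,2}[/-]\d{2,4})\b"),
    # "Total: $1,250.00" or "Amount 300"
    "amount": re.compile(
        r"(?:Total|Amount)\s*[:\-]?\s*\$?\s*([\d,]+(?:\.\d{2})?)", re.IGNORECASE
    ),
}


def extract_fields(text):
    """Return the first match of each known pattern found in the OCR text."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        if match:
            fields[name] = match.group(1)
    return fields


if __name__ == "__main__":
    sample = "Suman Jaiswal\nID: ABC12345\nDate: 12/05/2023\nTotal: $1,250.00"
    print(extract_fields(sample))
```

Keeping the patterns in one dictionary makes it straightforward to add or tune a field without touching the extraction loop.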
[Image: sample response from the prediction API when an ID card image is sent]
- Backend & API: Python, Flask
- ML/DL Framework: TensorFlow, Keras, Scikit-learn
- Data Processing: Pandas, NumPy
- Image Processing: OpenCV
- OCR Engine: Tesseract (via Pytesseract)
- Database: SQLite
- Packaging: Setuptools
Follow these steps to set up and run the project locally.
- Clone the repository:
    # Replace with your repository URL
    git clone https://github.com/your-username/DocumentClassification_extraction.git
    cd DocumentClassification_extraction
- Create and activate a virtual environment:
    # It is recommended to use Python 3.10 or higher
    python -m venv venv
    .\venv\Scripts\activate
- Install Tesseract OCR: This is a crucial external dependency.
  - Download and install it from the official Tesseract documentation.
  - After installation, you must update the Tesseract path in the ocr/ocr_engine.py file:
      # ocr/ocr_engine.py
      pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'  # <-- Update this path
- Install the required Python packages: The setup.py file is configured to install all dependencies from requirements.txt.
    pip install -r requirements.txt
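If editing the hard-coded Tesseract path by hand is inconvenient, one possible alternative is to resolve it at runtime. The sketch below is an assumption, not part of the project: the `resolve_tesseract_cmd` helper and the `TESSERACT_CMD` environment variable are hypothetical names. It checks the environment variable first, then the system `PATH`, and finally falls back to the default Windows location quoted above.

```python
import os
import shutil

# Default Windows install location quoted in the setup step above.
DEFAULT_WINDOWS_PATH = r"C:\Program Files\Tesseract-OCR\tesseract.exe"


def resolve_tesseract_cmd(env_var="TESSERACT_CMD", fallback=DEFAULT_WINDOWS_PATH):
    """Pick a Tesseract executable: env var, then PATH lookup, then fallback."""
    from_env = os.environ.get(env_var)  # hypothetical override variable
    if from_env:
        return from_env
    on_path = shutil.which("tesseract")  # None when not found on PATH
    if on_path:
        return on_path
    return fallback


# In ocr/ocr_engine.py this could replace the hard-coded assignment:
# pytesseract.pytesseract.tesseract_cmd = resolve_tesseract_cmd()
```

This keeps machine-specific paths out of the source file, at the cost of one extra helper.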
The project has three main functionalities: training the model, running local predictions, and serving the API.
- Train the Model: This script handles data loading, preprocessing, and model training, then saves the final model artifacts (classifier_model.h5 and label_encoder.pkl) into the model/ directory.
    python src/DocumentClassification_extraction/pipelines/training_pipeline.py
- Run a Prediction Locally: This script uses the trained model to predict a single document's type, extracts its information, and logs the result to logs.db.
    python src/DocumentClassification_extraction/pipelines/prediction_pipeline.py
- Run the Flask API: This command starts a local server to handle prediction requests via HTTP.
    python api/main.py
  The API will be available at http://127.0.0.1:5000.
  - Health Check: GET /health
  - Prediction: POST /predict (sends an image file)
You can use a tool like Postman or cURL to send a POST request with an image to the /predict endpoint.
# Replace the file path with the actual path to your image
curl -X POST -F "file=@C:\path\to\your\document\image.jpg" http://127.0.0.1:5000/predict
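The same request can also be sent from Python. The following is a minimal sketch using only the standard library; the `build_multipart` and `predict` helpers are hypothetical names, and the form field name `file` is taken from the cURL command above. It assumes the API is running locally and returns JSON.

```python
import json
import mimetypes
import urllib.request
import uuid
from pathlib import Path


def build_multipart(field_name, filename, payload):
    """Encode one file as a multipart/form-data body; returns (body, content_type)."""
    boundary = uuid.uuid4().hex
    mime = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; filename="{filename}"\r\n'
        f"Content-Type: {mime}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


def predict(image_path, url="http://127.0.0.1:5000/predict"):
    """POST an image to the /predict endpoint and return the decoded JSON."""
    path = Path(image_path)
    body, content_type = build_multipart("file", path.name, path.read_bytes())
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Example (requires the API to be running):
# print(predict(r"C:\path\to\your\document\image.jpg"))
```

If a third-party HTTP client is already installed in your environment, it would shorten this considerably; the standard-library version is shown only to avoid an extra dependency.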
Sample JSON Response:
{
  "document_type": "ID_Card",
  "extracted_fields": {
    "id_number": "ABC12345",
    "name": "Suman Jaiswal"
  },
  "text": "Suman Jaiswal\nID: ABC12345\n..."
}
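A client consuming this response might map it onto a small typed structure. The sketch below is illustrative only: the `Prediction` dataclass and `parse_prediction` helper are hypothetical, and the three keys are taken from the sample response above. It assumes `extracted_fields` and `text` may be absent for some documents.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Prediction:
    """Typed view of the /predict response (keys as in the sample above)."""
    document_type: str
    text: str = ""
    extracted_fields: dict = field(default_factory=dict)


def parse_prediction(raw_json):
    """Decode the JSON response; document_type is required, the rest optional."""
    data = json.loads(raw_json)
    return Prediction(
        document_type=data["document_type"],
        text=data.get("text", ""),
        extracted_fields=data.get("extracted_fields", {}),
    )


if __name__ == "__main__":
    raw = (
        '{"document_type": "ID_Card", '
        '"extracted_fields": {"id_number": "ABC12345", "name": "Suman Jaiswal"}, '
        '"text": "Suman Jaiswal\\nID: ABC12345\\n..."}'
    )
    prediction = parse_prediction(raw)
    print(prediction.document_type, prediction.extracted_fields)
```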
Every prediction made through prediction_pipeline.py is logged in the logs.db SQLite database.
To view the logs:
- Open a terminal in the project's root directory.
- Run the following commands:
    sqlite3 logs.db
    .tables
    SELECT * FROM predictions;
    .exit
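For readers who want to see how such logging might work, or who prefer to read the logs from Python, here is a rough sketch using the standard `sqlite3` module. Only the table name `predictions` comes from the commands above; the column layout and the `init_db`, `log_prediction`, and `fetch_predictions` helpers are assumptions, so the project's real schema may differ.

```python
import json
import sqlite3
from datetime import datetime, timezone


def init_db(conn):
    """Create the predictions table if needed (hypothetical column layout)."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS predictions (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               created_at TEXT NOT NULL,
               document_type TEXT NOT NULL,
               extracted_fields TEXT NOT NULL
           )"""
    )


def log_prediction(conn, document_type, extracted_fields):
    """Insert one prediction; the fields dict is stored as a JSON string."""
    conn.execute(
        "INSERT INTO predictions (created_at, document_type, extracted_fields) "
        "VALUES (?, ?, ?)",
        (
            datetime.now(timezone.utc).isoformat(),
            document_type,
            json.dumps(extracted_fields),
        ),
    )
    conn.commit()


def fetch_predictions(conn):
    """Return (document_type, fields) pairs in insertion order."""
    rows = conn.execute(
        "SELECT document_type, extracted_fields FROM predictions ORDER BY id"
    ).fetchall()
    return [(doc_type, json.loads(fields)) for doc_type, fields in rows]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # use "logs.db" for a file on disk
    init_db(conn)
    log_prediction(conn, "ID_Card", {"id_number": "ABC12345"})
    print(fetch_predictions(conn))
```

Parameterized queries (the `?` placeholders) are used so that OCR text containing quotes cannot break the insert statement.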
Folder structure:
DOCUMENTCLASSIFICATION&EXTRACTION/
│
├── api/
│   └── main.py
├── data/
│   ├── certificates/
│   ├── id_cards/
│   ├── invoices/
│   └── resumes/
├── model/
│   ├── classifier_model.h5
│   ├── field_extraction.py
│   └── label_encoder.pkl
├── ocr/
│   └── ocr_engine.py
├── src/
│   └── DocumentClassification_extraction/
│       ├── components/
│       ├── pipelines/
│       │   ├── training_pipeline.py
│       │   └── prediction_pipeline.py
│       ├── exception.py
│       └── logger.py
├── utils/
│   ├── preprocessing.py
│   └── logger.py
│
├── requirements.txt
├── setup.py
└── README.md