This project extracts handwritten Kannada text from PDF images using PyMuPDF and Tesseract OCR. It processes images for better contrast using OpenCV, improving text recognition accuracy. The script also filters out unwanted patterns like URLs and digits, ensuring clean output. This tool is ideal for digitizing Kannada handwritten documents.
This repository contains Python code for extracting handwritten Kannada text from images embedded in PDF files. The code uses PyMuPDF to extract the images and Tesseract OCR to recognize the text. The images are pre-processed to enhance contrast for better OCR results.
- Extracts images from PDFs.
- Pre-processes images to enhance contrast and improve OCR accuracy.
- Extracts handwritten Kannada text using Tesseract OCR.
- Filters out unwanted patterns (emails, URLs, IP addresses) from the extracted text.
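The steps listed above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual script: the function names `clean_text` and `extract_kannada_text`, the regular expressions, and the use of CLAHE for contrast enhancement are all assumptions. It also assumes the Tesseract Kannada language data (`kan`) is installed.

```python
import re

# Assumed patterns for the "unwanted" text named in the feature list.
# ASCII-only character classes keep adjacent Kannada text untouched.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")
IPV4_RE = re.compile(r"(?<!\d)(?:\d{1,3}\.){3}\d{1,3}(?!\d)")


def clean_text(text):
    """Remove URLs, emails and IPv4 addresses, then tidy whitespace."""
    for pattern in (URL_RE, EMAIL_RE, IPV4_RE):
        text = pattern.sub(" ", text)
    # Collapse runs of whitespace on each line and drop empty lines.
    lines = (" ".join(line.split()) for line in text.splitlines())
    return "\n".join(line for line in lines if line)


def extract_kannada_text(pdf_path, lang="kan"):
    """Extract embedded images from a PDF, enhance contrast, and run OCR."""
    # Imported here so clean_text stays usable without the OCR stack.
    import cv2
    import fitz  # PyMuPDF
    import numpy as np  # installed as a dependency of opencv-python
    import pytesseract

    # CLAHE is one possible contrast-enhancement step (an assumption here).
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    chunks = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            for image_info in page.get_images(full=True):
                xref = image_info[0]  # first field is the image's xref number
                data = doc.extract_image(xref)["image"]  # raw image bytes
                gray = cv2.imdecode(
                    np.frombuffer(data, np.uint8), cv2.IMREAD_GRAYSCALE
                )
                if gray is None:  # skip image formats OpenCV cannot decode
                    continue
                enhanced = clahe.apply(gray)
                chunks.append(pytesseract.image_to_string(enhanced, lang=lang))
    return clean_text("\n".join(chunks))
```

Calling `extract_kannada_text("scan.pdf")` (a hypothetical file name) would return the filtered text for every image found in the PDF. Keeping `clean_text` separate makes the filtering step easy to adjust or test on its own.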
Before you begin, ensure you have the following installed:
- Python 3.x
- Tesseract-OCR: download and install it.
  - Set the pytesseract.pytesseract.tesseract_cmd path correctly in the code.
- Required Python libraries:
  - pytesseract
  - Pillow
  - PyMuPDF (fitz)
  - opencv-python
  - re (part of the Python standard library; no installation needed)
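One way to set the Tesseract path is sketched below. The helper names `find_tesseract` and `configure_pytesseract` are hypothetical, and the fallback location is only an assumed Windows install path; adjust it to wherever Tesseract is installed on your machine.

```python
import shutil
from pathlib import Path

# Assumed fallback: a common Windows install location, not a guaranteed one.
DEFAULT_WINDOWS_PATH = r"C:\Program Files\Tesseract-OCR\tesseract.exe"


def find_tesseract(fallback=DEFAULT_WINDOWS_PATH):
    """Return the path to the Tesseract executable, or None if not found."""
    on_path = shutil.which("tesseract")  # look on the system PATH first
    if on_path:
        return on_path
    if Path(fallback).is_file():
        return fallback
    return None


def configure_pytesseract(fallback=DEFAULT_WINDOWS_PATH):
    """Point pytesseract at the located executable and return its path."""
    import pytesseract  # imported lazily so find_tesseract works without it

    cmd = find_tesseract(fallback)
    if cmd is None:
        raise FileNotFoundError(
            "Tesseract executable not found; install it or pass its path."
        )
    pytesseract.pytesseract.tesseract_cmd = cmd
    return cmd
```

Calling `configure_pytesseract()` once at the start of the script, before any OCR call, should be enough. If Tesseract is already on the system PATH, pytesseract can usually find it without this step.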
- The quality of the extracted text depends on the quality of the scanned images.
- Accuracy is higher for clean, clear scans.
- Problems may arise if the images are not clear.
To install the required libraries, run the following command:
pip install pytesseract Pillow PyMuPDF opencv-python