GitHub - icvasu/OCR-RenAIssance: OCR model for Renaissance-era scanned text transcription using CNN-RNN/Transformers.

OCR-RenAIssance

An OCR pipeline for extracting the main text from Renaissance-era scanned documents while ignoring embellishments. The pipeline converts PDF pages to images with pdf2image and then recognizes the text with EasyOCR, which is based on a CRNN architecture.

Introduction

The OCR-RenAIssance project focuses on developing an Optical Character Recognition system for Renaissance-era documents. The goal is to accurately extract the main textual content while filtering out any embellishments. This is crucial for historical document digitization and analysis.

Features

PDF to Image Conversion: Utilizes pdf2image to convert scanned documents into images.
Text Extraction: Implements EasyOCR for efficient text recognition.
Evaluation Metrics: Calculates Character Error Rate (CER) and Word Error Rate (WER) to assess OCR accuracy.
Clean Output: Outputs results in a well-structured text file with clear page delineations.
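The first two features can be sketched as follows. This is a minimal illustration, not the code in main.py: the function names `transcribe_pages` and `ocr_pdf`, the `"es"` language code, and the 300 DPI setting are assumptions chosen for the example. The calls `convert_from_path`, `easyocr.Reader`, and `readtext(..., detail=0)` are the libraries' documented entry points.

```python
def transcribe_pages(images, read_text):
    """Apply read_text to each page image and join the recognized fragments.

    read_text is any callable that returns a list of strings for one image.
    """
    return [" ".join(read_text(image)) for image in images]


def ocr_pdf(pdf_path, languages=("es",), dpi=300):
    """Convert a PDF to page images and OCR each page (hypothetical helper)."""
    # Third-party imports are kept local so transcribe_pages can be used
    # and tested without pdf2image, EasyOCR, or Poppler installed.
    import numpy as np
    import easyocr
    from pdf2image import convert_from_path

    reader = easyocr.Reader(list(languages))       # loads the recognition models
    images = convert_from_path(pdf_path, dpi=dpi)  # one PIL image per page
    # detail=0 makes readtext return plain strings instead of boxes and scores.
    return transcribe_pages(
        images, lambda image: reader.readtext(np.array(image), detail=0)
    )
```

Passing the reader in as a callable keeps the page loop independent of EasyOCR, so a different recognizer could be swapped in without changing the rest of the sketch.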

Installation

1. Clone the Repository:

git clone https://github.com/icvasu/OCR-RenAIssance.git
cd OCR-RenAIssance

2. Create a Virtual Environment:

python -m venv venv
source venv/bin/activate   # On Windows: venv\Scripts\activate

3. Install Dependencies:

pip install -r requirements.txt

4. Install Poppler:

Windows Users: Download Poppler from Poppler for Windows, extract it (e.g., to C:\poppler), and add C:\poppler\Library\bin to your system PATH.

Mac/Linux Users: Install via Homebrew (brew install poppler) or your package manager.

Usage

1. Prepare Your PDF:

Place your Renaissance PDF in the project folder.

Update the pdf_path variable in main.py with the correct path.

2. Run the OCR Script:

python main.py

3. Check the Output:

The OCR results will be saved in ocr_output.txt inside the repository.

Outputs are also organized in the outputs/ folder.
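Because pdf2image relies on the Poppler command-line tools from the installation steps, a quick check before running the script can save a confusing error. The sketch below is an optional helper, not part of the repository; the name `missing_poppler_tools` is hypothetical. It looks for `pdftoppm` and `pdfinfo`, the Poppler binaries that pdf2image calls.

```python
import shutil

# Poppler executables that pdf2image invokes under the hood.
POPPLER_TOOLS = ("pdftoppm", "pdfinfo")


def missing_poppler_tools(which=shutil.which):
    """Return the Poppler tools that cannot be found on PATH."""
    return [tool for tool in POPPLER_TOOLS if which(tool) is None]


if __name__ == "__main__":
    missing = missing_poppler_tools()
    if missing:
        print("Poppler tools not found on PATH:", ", ".join(missing))
    else:
        print("Poppler looks available.")
```

If any tool is reported missing, revisit the Poppler installation step and confirm that its bin directory is on PATH.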

Output

The ocr_output.txt file includes the extracted text from each page, clearly separated by page numbers.
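One way to produce page-separated output is shown below. The exact separator used by main.py is not documented here, so the `--- Page N ---` header and the function names `format_pages` and `write_output` are assumptions made for illustration.

```python
def format_pages(page_texts):
    """Join per-page text into one string with a header line before each page."""
    sections = []
    for number, text in enumerate(page_texts, start=1):
        # Assumed separator format; adjust to match the real output file.
        sections.append(f"--- Page {number} ---\n{text.strip()}\n")
    return "\n".join(sections)


def write_output(page_texts, path="ocr_output.txt"):
    """Write the formatted pages as UTF-8 so accented characters survive."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(format_pages(page_texts))
```

Writing with an explicit UTF-8 encoding matters for historical Spanish text, where accented and special characters are common.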

Evaluation Metrics

To assess the performance of the OCR model, the following metrics are used:

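Character Error Rate (CER): The number of character-level edits (substitutions, deletions, and insertions) needed to turn the OCR output into the reference transcription, divided by the number of characters in the reference. Lower is better.

Word Error Rate (WER): The same ratio computed over words instead of characters. Lower is better.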

These metrics help in fine-tuning the OCR pipeline and ensuring reliable performance.
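Both metrics reduce to an edit distance normalized by the reference length. The sketch below is a plain-Python illustration of the standard definitions; the repository's notebook may compute them differently (for example with a library), and the function names here are chosen for the example.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences (strings or word lists)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def cer(reference, hypothesis):
    """Character Error Rate: character edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word Error Rate: word edits divided by reference word count."""
    reference_words = reference.split()
    if not reference_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(reference_words, hypothesis.split()) / len(reference_words)
```

For example, `cer("abcd", "abcf")` is 0.25 (one substitution over four characters), and `wer("en el nombre de dios", "en el nonbre de dios")` is 0.2 (one wrong word out of five). Both values can exceed 1.0 when the OCR output contains many insertions.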

