📸AAI-521 Computer Vision Final Project - University of San Diego, School of Engineering - Masters of Applied Artificial Intelligence
In an age dominated by digital transformation, extracting actionable insights from unstructured visual data is a significant challenge across industries. Optical Character Recognition (OCR) serves as a powerful bridge between visual content and textual information, enabling applications like document digitization, automated data processing, and accessibility enhancements for individuals with disabilities.
Our project, developed for AAI-521: Computer Vision, explores the creation of a robust and versatile OCR pipeline. By leveraging state-of-the-art models—EasyOCR and TrOCR—we tackle diverse real-world challenges such as multilingual text, arbitrary orientations, and noisy image conditions. This pipeline provides a scalable and efficient solution, showcasing the potential of OCR technology to redefine data extraction. 🌟
Here’s what you’ll find in this repository:
- 📄 Final Report
  - A detailed technical report outlining our methods, experiments, results, and conclusions.
  - File: Technical Report
- 💻 Codebase
- 🎥 Presentation Video
  - A concise 10-12 minute video summarizing our work, methods, and findings.
  - YouTube Link
- 📂 Organized Files
  - /final: Contains the polished deliverables, including the final report, presentation, and base code.
  - /processed: Files used for preprocessing and experimental work.
  - /images: Images used in our notebook "Model_EasyOCR-and-TrOCR_Complete.ipynb".
  - requirement.txt: Library and dependency packages.
  - Model_EasyOCR-and-TrOCR_Complete.ipynb: The primary notebook for running the OCR pipeline.
This project addresses real-world challenges in text recognition by:
- Implementing Advanced OCR Models: Comparing EasyOCR and TrOCR for multilingual and complex text recognition.
- Designing a Text Extraction Pipeline: Developing preprocessing steps to optimize input quality and enhance model accuracy.
- Optimizing Performance: Fine-tuning models and evaluating their scalability for practical applications like document digitization and accessibility tools.
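As a rough illustration of the preprocessing idea, the sketch below implements two of the steps this README names later (grayscale conversion and binarization) in plain Python, using Otsu's method to pick the threshold. It is not the notebook's code: the function names are hypothetical, images are represented as nested lists purely to keep the example dependency-free, and deskewing is left out. In practice the same two steps are usually done with OpenCV (`cv2.cvtColor` and `cv2.threshold` with `cv2.THRESH_OTSU`).

```python
from typing import List, Tuple

Gray = List[List[int]]
RGB = List[List[Tuple[int, int, int]]]


def to_grayscale(rgb: RGB) -> Gray:
    """Convert RGB pixels to 0-255 luminance using the ITU-R BT.601 weights."""
    return [
        [int(round(0.299 * r + 0.587 * g + 0.114 * b)) for (r, g, b) in row]
        for row in rgb
    ]


def otsu_threshold(gray: Gray) -> int:
    """Return the threshold that maximizes between-class variance (Otsu's method)."""
    hist = [0] * 256
    for row in gray:
        for value in row:
            hist[value] += 1

    total = sum(hist)
    sum_all = sum(i * count for i, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    weight_bg, sum_bg = 0, 0
    for t in range(256):
        weight_bg += hist[t]
        sum_bg += t * hist[t]
        if weight_bg == 0:
            continue  # no background pixels yet
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break  # every pixel is already in the background class
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        between_var = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t


def binarize(gray: Gray, threshold: int) -> Gray:
    """Map pixels above the threshold to white (255) and the rest to black (0)."""
    return [[255 if value > threshold else 0 for value in row] for row in gray]
```

A typical call chain would be `gray = to_grayscale(pixels)` followed by `binarize(gray, otsu_threshold(gray))`, with the result passed on to the OCR model.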
We utilized the TextOCR Dataset, available on Kaggle: 🌐TextOCR: Text Extraction from Images Dataset
- Source: TextVQA images annotated for OCR tasks.
- Key Features:
  - 900,000+ word-level annotations.
  - Annotated images with varied resolutions, orientations, and multilingual content.
  - JSON labels for structured parsing.
- Challenges Addressed:
  - Diverse image qualities and text layouts.
  - Multilingual and curved text.
  - Noise from shadows, clutter, and distortions.
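To show what "JSON labels for structured parsing" can look like in practice, here is a minimal sketch that groups word-level annotations by image. The key names (`anns`, `image_id`, `utf8_string`) are assumptions about the annotation file layout and should be checked against the JSON you actually download; the function names are hypothetical and not taken from the notebook.

```python
import json
from collections import defaultdict
from typing import Dict, List


def load_annotations(path: str) -> dict:
    """Load the annotation JSON from disk."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def words_by_image(annotations: dict) -> Dict[str, List[str]]:
    """Group annotated words by the image they belong to.

    Assumes a top-level "anns" mapping whose values carry "image_id"
    and "utf8_string" fields (an assumption, not verified here).
    """
    grouped: Dict[str, List[str]] = defaultdict(list)
    for ann in annotations.get("anns", {}).values():
        text = ann.get("utf8_string", "")
        if text:  # skip empty transcriptions
            grouped[ann["image_id"]].append(text)
    return dict(grouped)
```

The grouped words can then serve as reference transcriptions when scoring model output per image.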
- Clone the repository: Run the following command in your terminal to clone the repository to your local machine:
  git clone https://github.com/oxayavongsa/aai-521-computer-vision-final.git
- Install required dependencies:
  pip install -r requirements.txt
- Open the main notebook: Run locally or open directly in Google Colab via the provided badge.
- Run the pipeline: Preprocess images, train models, and evaluate results directly within the notebook.
- Python: Programming language for data processing and modeling.
- EasyOCR: Lightweight OCR model for simple layouts.
- TrOCR: Transformer-based VisionEncoderDecoder model for complex and multilingual scenarios.
- JiWER library: Evaluation metrics for WER (word error rate) and CER (character error rate).
- Hugging Face Transformers: Framework for implementing TrOCR.
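For orientation, the sketch below shows one common way to call the two models listed above. It is not the notebook's code: the function names, the `min_confidence` cutoff, and the `microsoft/trocr-base-printed` checkpoint are assumptions chosen for illustration. The heavy imports sit inside the functions so the module can be loaded without them installed. Note that TrOCR expects a cropped image of a single text line, so a full scene image would first need a text detector.

```python
from typing import Iterable, Sequence, Tuple


def join_easyocr_results(
    results: Iterable[Tuple[Sequence, str, float]],
    min_confidence: float = 0.3,  # hypothetical cutoff, tune per dataset
) -> str:
    """Join EasyOCR (bbox, text, confidence) tuples into one string."""
    return " ".join(text for _bbox, text, conf in results if conf >= min_confidence)


def read_with_easyocr(image_path: str, languages: Sequence[str] = ("en",)) -> str:
    """Run EasyOCR detection and recognition on a whole image."""
    import easyocr  # imported lazily: loads detection and recognition models

    reader = easyocr.Reader(list(languages))
    return join_easyocr_results(reader.readtext(image_path))


def read_with_trocr(
    image_path: str,
    checkpoint: str = "microsoft/trocr-base-printed",  # assumed checkpoint
) -> str:
    """Recognize a single cropped text line with TrOCR."""
    from PIL import Image
    from transformers import TrOCRProcessor, VisionEncoderDecoderModel

    processor = TrOCRProcessor.from_pretrained(checkpoint)
    model = VisionEncoderDecoderModel.from_pretrained(checkpoint)

    image = Image.open(image_path).convert("RGB")
    pixel_values = processor(images=image, return_tensors="pt").pixel_values
    generated_ids = model.generate(pixel_values)
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
```

When processing many images, create the EasyOCR reader and the TrOCR processor and model once and reuse them, since loading them dominates the cost of a single call.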
- TrOCR demonstrated superior accuracy, achieving:
  - WER: 1.00%
  - CER: 0.99%
- EasyOCR offered faster inference but struggled with complex layouts:
  - WER: 3.10%
  - CER: 3.87%
- Advanced preprocessing techniques (grayscale conversion, binarization, and deskewing) significantly improved OCR performance.
- Both models excel in different scenarios:
  - EasyOCR is ideal for simple, structured text.
  - TrOCR is better suited for noisy, multilingual, or distorted text.
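For readers unfamiliar with the two metrics reported above, the sketch below spells out their usual definitions: the edit distance between prediction and reference, divided by the reference length, counted in words for WER and in characters for CER. The project itself uses the jiwer library (`jiwer.wer` and `jiwer.cer`); this dependency-free version is only an illustration and may differ from jiwer in normalization details.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance over reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Both functions return a fraction, so a reported WER of 1.00% corresponds to a return value of 0.01.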
Outhai Xayavongsa (Ms. Thai) (Team Leader)
- Created task lists, organized meetings, and managed team coordination.
- Implemented and validated the TrOCR model.
- Integrated and completed the EasyOCR and TrOCR models.
- Evaluated model performance and fine-tuned metrics.
- Created pipelines and documented processes.
Jay Patel (Team Member)
- Selected dataset and assessed quality.
- Cleaned and preprocessed the dataset.
- Experimented with various modeling methods.
Daniel Shifrin (Team Member)
- Performed EDA and extracted dataset features.
- Assisted with EasyOCR implementation and optimization.
✨ Future Work
To further enhance OCR performance:
- Address challenges with low-light and low-resolution images.
- Explore additional models like PaddleOCR or fine-tune existing ones.
- Implement real-time OCR pipelines for large-scale applications.
📫 Contact
For inquiries or collaboration opportunities, feel free to reach out to the Team Leader:
- Outhai Xayavongsa (Ms. Thai): LinkedIn Profile