📘 VisTalk-X: Unified OCR + Reasoning + Audio Narration

VisTalk-X is a production-grade multimodal OCR system designed to:

  • Extract text from natural or scanned images
  • Correct outputs using a lightweight language model
  • Optionally narrate the content for blind and low-vision users

🔁 Combines computer vision, NLP, and accessibility in one robust pipeline.


📂 Folder Structure

VisTalk-X/
├── configs/                 # YAML configs for training
├── vistext/                 # All model code
│   ├── models/              # Vision + Language modules
│   ├── data/                # Dataset loading + preprocessing
│   ├── engine/              # Trainer, losses, metrics
│   └── export/              # ONNX / TensorRT export
├── scripts/                 # Shell scripts to run training/inference
├── dataset/                 # Downloaded datasets go here
├── examples/                # 📸 Example images + output JSON
├── requirements.txt
└── README.md

🖼️ Example

Input Image:

(example input image; see examples/sample_invoice.jpg)

Predicted Text:

INVOICE NO: 78932-A
DATE: 12 June 2023
TOTAL: ₹4,899.00
Thank you for shopping with us.

Corrected Text (via LM):

Invoice No: 78932-A  
Date: 12 June 2023  
Total: ₹4,899.00  
Thank you for shopping with us.

Audio Narration Output:

✔️ Spoken using pyttsx3 or edge-tts

🔧 Setup Instructions

1. Clone and enter project

git clone https://github.com/avinash064/VisTalk-X.git
cd VisTalk-X

2. Create environment

conda create -n vistalkx python=3.10 -y
conda activate vistalkx

3. Install dependencies

pip install -r requirements.txt

4. Install PyTorch + MMCV

pip install torch==2.5.1+cu121 torchvision==0.20.1+cu121 --index-url https://download.pytorch.org/whl/cu121
pip install mmcv -f https://download.openmmlab.com/mmcv/dist/cu121/torch2.5/index.html

📥 Load & Prepare Datasets

Place datasets inside the dataset/ folder like so:

dataset/
├── totaltext/
│   ├── imgs/
│   └── annotations.json
├── synthtext/
│   ├── images/
│   └── gt.json

Preprocessing is handled via vistext/data/__init__.py.
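For reference, the sketch below shows how a Total-Text-style split placed under dataset/ could be read into (image, text) pairs. The load_totaltext helper and the annotation keys ("file_name", "text") are assumptions for illustration only; the pipeline's actual schema lives in vistext/data/__init__.py.

# Minimal loading sketch; key names are assumed, not the project's exact schema.
import json
from pathlib import Path

def load_totaltext(root="dataset/totaltext"):
    root = Path(root)
    with open(root / "annotations.json", encoding="utf-8") as f:
        annotations = json.load(f)
    return [
        {"image": root / "imgs" / item["file_name"], "text": item["text"]}
        for item in annotations
    ]

samples = load_totaltext()
print(f"Loaded {len(samples)} samples")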


🚀 Run Inference

python scripts/demo_tts.py \
  --image examples/sample_invoice.jpg \
  --tts yes
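
The pipeline can also be driven from Python rather than the CLI. The sketch below assumes a VisTalkXPipeline class, its import path, and its arguments purely for illustration; the actual entry point is scripts/demo_tts.py.

# Hypothetical programmatic use; class name and arguments are assumptions.
from vistext import VisTalkXPipeline  # assumed import path

pipeline = VisTalkXPipeline(checkpoint="weights/joint_best.pt")
result = pipeline("examples/sample_invoice.jpg", correct=True, narrate=False)
print(result["raw_text"])        # OCR output before LM correction
print(result["corrected_text"])  # output after T5/BART correction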

🏋️‍♂️ Training Commands

🔧 Pretrain SSL module (optional)

bash scripts/pretrain_ssl.sh

🔍 Train detection (DBNet++)

bash scripts/train_det.sh

🔁 Train full pipeline (detection + recognition + correction)

bash scripts/train_joint.sh

💬 Speech Narration Support

Enable text-to-speech with either:

🔊 Offline Voice:

pip install pyttsx3

🌐 Realistic Online Voice (Edge TTS):

pip install edge-tts
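
Either backend can narrate the corrected text with a few lines of Python. Both snippets below use only the libraries' public APIs; the voice name, sample text, and output path are illustrative choices, not project defaults.

# Offline narration with pyttsx3
import pyttsx3

engine = pyttsx3.init()
engine.say("Invoice No: 78932-A. Total: 4899 rupees.")
engine.runAndWait()

# Online narration with edge-tts (writes an MP3 file)
import asyncio
import edge_tts

async def narrate(text, path="narration.mp3", voice="en-US-AriaNeural"):
    await edge_tts.Communicate(text, voice).save(path)

asyncio.run(narrate("Invoice No: 78932-A. Total: 4899 rupees."))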

⚙️ Features

  • Deformable OCR detection
  • SVTR++ recognizer (ViT)
  • T5/BART-based correction
  • Self-supervised MAE pretraining
  • Active learning module
  • Text-to-speech for blind and low-vision users
  • Export to ONNX/TensorRT (see the export sketch below)
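
Export can follow the standard torch.onnx.export route. The sketch below is only illustrative: the build_recognizer helper, checkpoint layout, and input resolution are assumptions, and the project's real export code lives in vistext/export/.

# Hedged ONNX export sketch; helper names and shapes are assumed.
import torch
from vistext.models import build_recognizer  # assumed helper

model = build_recognizer()
model.load_state_dict(torch.load("weights/joint_best.pt", map_location="cpu"))
model.eval()

dummy = torch.randn(1, 3, 640, 640)  # assumed input resolution
torch.onnx.export(
    model, dummy, "vistalkx.onnx",
    input_names=["image"], output_names=["logits"],
    dynamic_axes={"image": {0: "batch"}},
    opset_version=17,
)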

📚 Requirements Summary

torch==2.5.1+cu121
torchvision==0.20.1+cu121
mmcv==2.1.0
transformers>=4.39.0
datasets>=2.18.0
albumentations>=1.3
pyttsx3  # or edge-tts

🧪 Evaluation (Example)

python scripts/eval_pipeline.py \
  --dataset totaltext \
  --checkpoint weights/joint_best.pt
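
Recognition quality is commonly summarized with a character-level accuracy derived from edit distance. The sketch below is a generic illustration of that metric, not necessarily what eval_pipeline.py reports.

# Character accuracy from Levenshtein distance (generic illustration).
def edit_distance(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def char_accuracy(pred: str, gt: str) -> float:
    return 1.0 - edit_distance(pred, gt) / max(len(gt), 1)

print(char_accuracy("INVOICE NO: 78932-A", "Invoice No: 78932-A"))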

🔭 Roadmap

  • Whisper-based voice input (voice → OCR)
  • Gradio-based Web App
  • Indic language support
  • Few-shot LM correction via LoRA

🧾 Citation

@misc{VisTalkX2025,
  author = {Avinash Kashyap},
  title = {VisTalk-X: Unified OCR pipeline with reasoning and accessibility},
  year = 2025,
  url = {https://github.com/avinash064/VisTalk-X}
}

🙌 Acknowledgements

Inspired by:

  • DBNet++, SVTR++, MAE
  • HuggingFace Transformers
  • PyTorch, MMCV, Albumentations
