VisTalk-X is a production-grade multimodal OCR system designed to:
- Extract text from natural scene images or scanned documents
- Correct outputs using a lightweight language model
- Optionally narrate the content for blind and low-vision users
🔁 Combines computer vision, NLP, and accessibility in one robust pipeline.
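To make the data flow concrete, here is a rough stand-in for the three stages built from off-the-shelf libraries (pytesseract, transformers, pyttsx3) rather than VisTalk-X's own detector/recognizer — it illustrates the pipeline shape, not the repo's actual implementation:

```python
# Stand-in pipeline sketch: pytesseract / t5-small / pyttsx3 substitute for
# VisTalk-X's own models. Requires the tesseract binary on PATH.
import pytesseract
import pyttsx3
from PIL import Image
from transformers import pipeline

# 1. Vision: extract raw text from the image.
raw_text = pytesseract.image_to_string(Image.open("examples/sample_invoice.jpg"))

# 2. NLP: clean up OCR noise with a lightweight seq2seq LM
#    (placeholder model and prompt, not the repo's checkpoint).
corrector = pipeline("text2text-generation", model="t5-small")
clean_text = corrector("correct: " + raw_text)[0]["generated_text"]

# 3. Accessibility: narrate the corrected text aloud.
engine = pyttsx3.init()
engine.say(clean_text)
engine.runAndWait()
```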
Repository layout:

```
VisTalk-X/
├── configs/            # YAML configs for training
├── vistext/            # All model code
│   ├── models/         # Vision + Language modules
│   ├── data/           # Dataset loading + preprocessing
│   ├── engine/         # Trainer, losses, metrics
│   └── export/         # ONNX / TensorRT export
├── scripts/            # Shell scripts to run training/inference
├── dataset/            # Downloaded datasets go here
├── examples/           # 📸 Example images + output JSON
├── requirements.txt
└── README.md
```
Input Image: `examples/sample_invoice.jpg`

Predicted Text:

```
INVOICE NO: 78932-A
DATE: 12 June 2023
TOTAL: ₹4,899.00
Thank you for shopping with us.
```
Corrected Text (via LM):

```
Invoice No: 78932-A
Date: 12 June 2023
Total: ₹4,899.00
Thank you for shopping with us.
```
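The correction step can be reproduced with any HuggingFace seq2seq checkpoint; the sketch below uses `t5-small` and a `"correct: "` prompt as placeholders, since the README doesn't name the actual correction checkpoint:

```python
# Minimal sketch of LM-based OCR correction. "t5-small" and the prompt
# are placeholders, not the checkpoint/prompt VisTalk-X actually uses.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "t5-small"  # placeholder correction model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

raw = "INVOICE NO: 78932-A\nDATE: 12 June 2023\nTOTAL: ₹4,899.00"
inputs = tokenizer("correct: " + raw, return_tensors="pt", truncation=True)
out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```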
Audio Narration Output:
✔️ Spoken using pyttsx3 or edge-tts
Installation:

```bash
git clone https://github.com/avinash064/VisTalk-X.git
cd VisTalk-X

conda create -n vistalkx python=3.10 -y
conda activate vistalkx

pip install -r requirements.txt

# CUDA 12.1 builds (torch 2.5.1 pairs with torchvision 0.20.1)
pip install torch==2.5.1+cu121 torchvision==0.20.1+cu121 --index-url https://download.pytorch.org/whl/cu121

# mmcv 2.x is published as "mmcv"; the old "mmcv-full" name only covers the 1.x series
pip install mmcv -f https://download.openmmlab.com/mmcv/dist/cu121/torch2.5/index.html
```
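A quick import check confirms the CUDA builds resolved as expected:

```python
# Sanity-check the GPU install.
import torch
import torchvision

print(torch.__version__, torchvision.__version__)  # expect 2.5.1+cu121 / 0.20.1+cu121
print("CUDA available:", torch.cuda.is_available())
```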
Place datasets inside the `dataset/` folder like so:

```
dataset/
├── totaltext/
│   ├── imgs/
│   └── annotations.json
└── synthtext/
    ├── images/
    └── gt.json
```

Preprocessing is handled in `vistext/data/__init__.py`.
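For reference, an illustrative loader for the Total-Text layout above might look like the following; the `"image"` key is an assumption about the `annotations.json` schema, and the real preprocessing lives in `vistext/data/__init__.py`:

```python
# Illustrative only: assumes annotations.json is a list of dicts with an
# "image" filename key. Check vistext/data/__init__.py for the real schema.
import json
from pathlib import Path
from PIL import Image

def load_totaltext(root="dataset/totaltext"):
    """Yield (PIL image, annotation dict) pairs."""
    root = Path(root)
    annotations = json.loads((root / "annotations.json").read_text())
    for ann in annotations:
        yield Image.open(root / "imgs" / ann["image"]).convert("RGB"), ann
```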
Run the end-to-end demo with narration:

```bash
python scripts/demo_tts.py \
    --image examples/sample_invoice.jpg \
    --tts yes
```
Training runs in three stages:

```bash
# Stage 1: self-supervised MAE pretraining
bash scripts/pretrain_ssl.sh
# Stage 2: train the text detector
bash scripts/train_det.sh
# Stage 3: joint end-to-end training
bash scripts/train_joint.sh
```
Enable text-to-speech with either:

🔊 Offline voice:

```bash
pip install pyttsx3
```

🌐 Realistic online voice (Edge TTS):

```bash
pip install edge-tts
```
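Both engines take only a few lines to drive. pyttsx3's blocking API is the simplest; edge-tts is async and writes an audio file (the voice name below is one of the stock Edge voices):

```python
# Offline narration (no network needed).
import pyttsx3

engine = pyttsx3.init()
engine.setProperty("rate", 160)  # words per minute
engine.say("Thank you for shopping with us.")
engine.runAndWait()

# Online narration with a neural Edge voice (async API).
import asyncio
import edge_tts

async def narrate(text, out="narration.mp3"):
    await edge_tts.Communicate(text, voice="en-US-AriaNeural").save(out)

asyncio.run(narrate("Thank you for shopping with us."))
```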
| Feature | Status |
|---|---|
| Deformable OCR detection | ✅ |
| SVTR++ recognizer (ViT) | ✅ |
| T5/BART-based correction | ✅ |
| Self-supervised MAE pretraining | ✅ |
| Active learning module | ✅ |
| Text-to-speech for blind users | ✅ |
| Export to ONNX/TensorRT | ✅ |
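The `Export to ONNX/TensorRT` row corresponds to `vistext/export/`; in general, a PyTorch-to-ONNX export of any of the modules follows the shape below (a tiny stand-in module and an assumed 640×640 input replace the real model here):

```python
# Generic ONNX export sketch. The Sequential module and 640x640 input
# shape are stand-ins; load the trained VisTalk-X model instead.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU()).eval()
dummy = torch.randn(1, 3, 640, 640)

torch.onnx.export(
    model, dummy, "vistalkx.onnx",
    opset_version=17,
    input_names=["image"], output_names=["features"],
    dynamic_axes={"image": {0: "batch"}},  # allow variable batch size
)
```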
Key pins in `requirements.txt`:

```
torch==2.5.1+cu121
torchvision==0.20.1+cu121
mmcv==2.1.0
transformers>=4.39.0
datasets>=2.18.0
albumentations>=1.3
pyttsx3  # or edge-tts
```
Evaluate the full pipeline:

```bash
python scripts/eval_pipeline.py \
    --dataset totaltext \
    --checkpoint weights/joint_best.pt
```
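For spot-checking eval outputs by hand, a generic character error rate (CER) helper is handy; note this is a plain Levenshtein-based CER, not necessarily the exact metric `eval_pipeline.py` reports:

```python
# Levenshtein-based character error rate, computed with a rolling 1-D DP row.
def cer(pred: str, ref: str) -> float:
    """Edit distance between pred and ref, normalized by len(ref)."""
    m, n = len(pred), len(ref)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,       # delete pred[i-1]
                        dp[j - 1] + 1,   # insert ref[j-1]
                        prev + (pred[i - 1] != ref[j - 1]))  # substitute
            prev = cur
    return dp[n] / max(n, 1)

print(cer("Invoce No: 78932-A", "Invoice No: 78932-A"))  # ≈ 0.05
```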
Planned:
- Whisper-based voice input (voice → OCR)
- Gradio-based web app
- Indic language support
- Few-shot LM correction via LoRA
If you use VisTalk-X in your work, please cite:

```bibtex
@misc{VisTalkX2025,
  author = {Avinash Kashyap},
  title  = {VisTalk-X: Unified OCR pipeline with reasoning and accessibility},
  year   = {2025},
  url    = {https://github.com/avinash064/VisTalk-X}
}
```
Inspired by:
- DBNet++, SVTR++, MAE
- HuggingFace Transformers
- PyTorch, MMCV, Albumentations