VisTalk-X is a production-grade multimodal OCR system designed to:
- Extract text from natural scene images or scanned documents
- Correct outputs using a lightweight language model
- Optionally narrate the content for blind and low-vision users
🔁 Combines computer vision, NLP, and accessibility in one robust pipeline.
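To make the data flow concrete, here is a rough stand-in for the three stages built from off-the-shelf libraries (pytesseract, transformers, pyttsx3) rather than VisTalk-X's own detector/recognizer — it illustrates the pipeline shape, not the repo's actual implementation:

```python
# Stand-in pipeline sketch: pytesseract / t5-small / pyttsx3 substitute for
# VisTalk-X's own models. Requires the tesseract binary on PATH.
import pytesseract
import pyttsx3
from PIL import Image
from transformers import pipeline

# 1. Vision: extract raw text from the image.
raw_text = pytesseract.image_to_string(Image.open("examples/sample_invoice.jpg"))

# 2. NLP: clean up OCR noise with a lightweight seq2seq LM
#    (placeholder model and prompt, not the repo's checkpoint).
corrector = pipeline("text2text-generation", model="t5-small")
clean_text = corrector("correct: " + raw_text)[0]["generated_text"]

# 3. Accessibility: narrate the corrected text aloud.
engine = pyttsx3.init()
engine.say(clean_text)
engine.runAndWait()
```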
Repository layout:

```
VisTalk-X/
├── configs/            # YAML configs for training
├── vistext/            # All model code
│   ├── models/         # Vision + Language modules
│   ├── data/           # Dataset loading + preprocessing
│   ├── engine/         # Trainer, losses, metrics
│   └── export/         # ONNX / TensorRT export
├── scripts/            # Shell scripts to run training/inference
├── dataset/            # Downloaded datasets go here
├── examples/           # 📸 Example images + output JSON
├── requirements.txt
└── README.md
```
Input Image: `examples/sample_invoice.jpg`

Predicted Text:

```
INVOICE NO: 78932-A
DATE: 12 June 2023
TOTAL: ₹4,899.00
Thank you for shopping with us.
```
Corrected Text (via LM):

```
Invoice No: 78932-A
Date: 12 June 2023
Total: ₹4,899.00
Thank you for shopping with us.
```
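The correction step can be reproduced with any HuggingFace seq2seq checkpoint; the sketch below uses `t5-small` and a `"correct: "` prompt as placeholders, since the README doesn't name the actual correction checkpoint:

```python
# Minimal sketch of LM-based OCR correction. "t5-small" and the prompt
# are placeholders, not the checkpoint/prompt VisTalk-X actually uses.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "t5-small"  # placeholder correction model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

raw = "INVOICE NO: 78932-A\nDATE: 12 June 2023\nTOTAL: ₹4,899.00"
inputs = tokenizer("correct: " + raw, return_tensors="pt", truncation=True)
out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```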
Audio Narration Output:
✔️ Spoken using pyttsx3 or edge-tts
Installation:

```bash
git clone https://github.com/avinash064/VisTalk-X.git
cd VisTalk-X

conda create -n vistalkx python=3.10 -y
conda activate vistalkx

pip install -r requirements.txt

# CUDA 12.1 builds (torch 2.5.1 pairs with torchvision 0.20.1)
pip install torch==2.5.1+cu121 torchvision==0.20.1+cu121 --index-url https://download.pytorch.org/whl/cu121

# mmcv 2.x is published as "mmcv"; the old "mmcv-full" name only covers the 1.x series
pip install mmcv -f https://download.openmmlab.com/mmcv/dist/cu121/torch2.5/index.html
```
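A quick import check confirms the CUDA builds resolved as expected:

```python
# Sanity-check the GPU install.
import torch
import torchvision

print(torch.__version__, torchvision.__version__)  # expect 2.5.1+cu121 / 0.20.1+cu121
print("CUDA available:", torch.cuda.is_available())
```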
Place datasets inside the `dataset/` folder like so:

```
dataset/
├── totaltext/
│   ├── imgs/
│   └── annotations.json
└── synthtext/
    ├── images/
    └── gt.json
```

Preprocessing is handled in `vistext/data/__init__.py`.
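For reference, an illustrative loader for the Total-Text layout above might look like the following; the `"image"` key is an assumption about the `annotations.json` schema, and the real preprocessing lives in `vistext/data/__init__.py`:

```python
# Illustrative only: assumes annotations.json is a list of dicts with an
# "image" filename key. Check vistext/data/__init__.py for the real schema.
import json
from pathlib import Path
from PIL import Image

def load_totaltext(root="dataset/totaltext"):
    """Yield (PIL image, annotation dict) pairs."""
    root = Path(root)
    annotations = json.loads((root / "annotations.json").read_text())
    for ann in annotations:
        yield Image.open(root / "imgs" / ann["image"]).convert("RGB"), ann
```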
Run the end-to-end demo with narration:

```bash
python scripts/demo_tts.py \
    --image examples/sample_invoice.jpg \
    --tts yes
```
Training runs in three stages:

```bash
# Stage 1: self-supervised MAE pretraining
bash scripts/pretrain_ssl.sh
# Stage 2: train the text detector
bash scripts/train_det.sh
# Stage 3: joint end-to-end training
bash scripts/train_joint.sh
```
Enable text-to-speech with either:

🔊 Offline voice:

```bash
pip install pyttsx3
```

🌐 Realistic online voice (Edge TTS):

```bash
pip install edge-tts
```
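Both engines take only a few lines to drive. pyttsx3's blocking API is the simplest; edge-tts is async and writes an audio file (the voice name below is one of the stock Edge voices):

```python
# Offline narration (no network needed).
import pyttsx3

engine = pyttsx3.init()
engine.setProperty("rate", 160)  # words per minute
engine.say("Thank you for shopping with us.")
engine.runAndWait()

# Online narration with a neural Edge voice (async API).
import asyncio
import edge_tts

async def narrate(text, out="narration.mp3"):
    await edge_tts.Communicate(text, voice="en-US-AriaNeural").save(out)

asyncio.run(narrate("Thank you for shopping with us."))
```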
| Feature | Status |
|---|---|
| Deformable OCR detection | ✅ |
| SVTR++ recognizer (ViT) | ✅ |
| T5/BART-based correction | ✅ |
| Self-supervised MAE pretraining | ✅ |
| Active learning module | ✅ |
| Text-to-speech for blind users | ✅ |
| Export to ONNX/TensorRT | ✅ |
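The `Export to ONNX/TensorRT` row corresponds to `vistext/export/`; in general, a PyTorch-to-ONNX export of any of the modules follows the shape below (a tiny stand-in module and an assumed 640×640 input replace the real model here):

```python
# Generic ONNX export sketch. The Sequential module and 640x640 input
# shape are stand-ins; load the trained VisTalk-X model instead.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU()).eval()
dummy = torch.randn(1, 3, 640, 640)

torch.onnx.export(
    model, dummy, "vistalkx.onnx",
    opset_version=17,
    input_names=["image"], output_names=["features"],
    dynamic_axes={"image": {0: "batch"}},  # allow variable batch size
)
```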
Key pins in `requirements.txt`:

```
torch==2.5.1+cu121
torchvision==0.20.1+cu121
mmcv==2.1.0
transformers>=4.39.0
datasets>=2.18.0
albumentations>=1.3
pyttsx3  # or edge-tts
```
Evaluate the full pipeline:

```bash
python scripts/eval_pipeline.py \
    --dataset totaltext \
    --checkpoint weights/joint_best.pt
```
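For spot-checking eval outputs by hand, a generic character error rate (CER) helper is handy; note this is a plain Levenshtein-based CER, not necessarily the exact metric `eval_pipeline.py` reports:

```python
# Levenshtein-based character error rate, computed with a rolling 1-D DP row.
def cer(pred: str, ref: str) -> float:
    """Edit distance between pred and ref, normalized by len(ref)."""
    m, n = len(pred), len(ref)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,       # delete pred[i-1]
                        dp[j - 1] + 1,   # insert ref[j-1]
                        prev + (pred[i - 1] != ref[j - 1]))  # substitute
            prev = cur
    return dp[n] / max(n, 1)

print(cer("Invoce No: 78932-A", "Invoice No: 78932-A"))  # ≈ 0.05
```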
Planned:
- Whisper-based voice input (voice → OCR)
- Gradio-based web app
- Indic language support
- Few-shot LM correction via LoRA
If you use VisTalk-X in your work, please cite:

```bibtex
@misc{VisTalkX2025,
  author = {Avinash Kashyap},
  title  = {VisTalk-X: Unified OCR pipeline with reasoning and accessibility},
  year   = {2025},
  url    = {https://github.com/avinash064/VisTalk-X}
}
```
Inspired by:
- DBNet++, SVTR++, MAE
- HuggingFace Transformers
- PyTorch, MMCV, Albumentations