A comprehensive multi-modal AI application that integrates four specialized vision-language models for advanced image and video analysis. VisionScope-R2 offers OCR capabilities, spatial reasoning, handwriting recognition, and structural video captioning through an intuitive Gradio interface.
- Multi-Model Architecture: Four specialized models optimized for different vision tasks
- Image Analysis: Advanced OCR, handwriting recognition, and spatial reasoning
- Video Processing: Structural video captioning and scene analysis
- Real-time Streaming: Progressive response generation for immediate feedback
- Advanced Controls: Fine-tunable parameters for optimal performance
- Comprehensive Examples: Pre-loaded examples for quick testing
SkyCaptioner-V1:
- Purpose: Structural video captioning with specialized sub-expert models
- Best for: High-quality video descriptions and scene understanding
- Model: `Skywork/SkyCaptioner-V1`
- Architecture: Qwen2.5-VL based
SpaceThinker-3B:
- Purpose: Enhanced spatial reasoning and multimodal thinking
- Best for: Distance estimation, spatial relationships, and geometric analysis
- Model: `remyxai/SpaceThinker-Qwen2.5VL-3B`
- Architecture: Qwen2.5-VL, 3B parameters
CoreOCR-7B:
- Purpose: Document-level optical character recognition
- Best for: Long-context document understanding and text extraction
- Model: `prithivMLmods/coreOCR-7B-050325-preview`
- Architecture: Qwen2-VL 7B based
Imgscope-OCR-2B:
- Purpose: Specialized handwriting and mathematical content recognition
- Best for: Messy handwriting and mathematical equations with LaTeX formatting
- Model: `prithivMLmods/Imgscope-OCR-2B-0527`
- Architecture: Qwen2-VL 2B Instruct
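For reference, a minimal sketch of loading one of these checkpoints with the transformers library, assuming the upstream Qwen2-VL integration (the app's actual loading code in app.py may differ):

```python
# Minimal loading sketch (assumption: transformers >= 4.45 with Qwen2-VL support).
import torch
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model_id = "prithivMLmods/Imgscope-OCR-2B-0527"
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,   # mixed precision, as noted under Performance
    device_map="auto",           # uses the GPU when available, else falls back to CPU
)
```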
- Python 3.8+
- CUDA-compatible GPU (recommended)
- At least 16GB RAM
- 25GB+ free disk space for models
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install transformers
pip install gradio
pip install spaces
pip install opencv-python
pip install pillow
pip install numpy
pip install requests
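Alternatively, since the repository ships a requirements.txt (see Project Structure below), `pip install -r requirements.txt` should install the same set of dependencies.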
git clone https://github.com/PRITHIVSAKTHIUR/DocScope-R1.git
cd DocScope-R1
python app.py
The application will start and provide a local URL (typically http://127.0.0.1:7860) to access the web interface.
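If the default port is already in use, Gradio's launch() accepts overrides. A minimal illustration (the stand-in interface below is hypothetical; in practice you would edit the launch call in app.py):

```python
import gradio as gr

# Stand-in interface for illustration only; app.py defines the real one.
demo = gr.Interface(fn=lambda s: s, inputs="text", outputs="text")
demo.launch(server_name="0.0.0.0", server_port=7861)  # pick a free port if 7860 is busy
```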
- Select the "Image Inference" tab
- Enter your query in the text box
- Upload an image
- Choose your preferred model based on your task
- Adjust advanced parameters if needed
- Click "Submit"
Example Use Cases:
- Handwriting Recognition: "Type out the messy handwriting as accurately as you can"
- Object Counting: "Count the number of birds and explain the scene in detail"
- Spatial Reasoning: "How far is the goal from the penalty taker in this image?"
- Distance Estimation: "Approximately how many meters apart are the chair and bookshelf?"
- Complex Scene Analysis: "How far is the man in the red hat from the pallet of boxes in feet?"
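Under the hood, each query becomes a chat-style prompt. A rough sketch of the same flow done programmatically, continuing with the `processor` and `model` from the loading sketch above (the message format is the standard Qwen2-VL pattern; app.py may differ in detail):

```python
from PIL import Image

image = Image.open("images/1.jpg")  # handwriting sample shipped with the repo
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Type out the messy handwriting as accurately as you can"},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=512)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```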
- Select the "Video Inference" tab
- Enter your query describing what you want to analyze
- Upload a video file
- Select the appropriate model (SkyCaptioner-V1 recommended for videos)
- Configure generation parameters
- Click "Submit"
Example Use Cases:
- Movie Scene Analysis: "Give the highlights of the movie scene video"
- Advertisement Analysis: "Explain the advertisement in detail"
- Action Recognition: "Describe the sequence of events in the video"
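For orientation, a sketch of how sampled frames and their timestamps could be interleaved into a single prompt. This assumes a `downsample_video` helper like the one sketched under the video processing pipeline below, and a video-capable checkpoint (such as Skywork/SkyCaptioner-V1) loaded in place of the OCR model; the exact message format in app.py may differ:

```python
frames = downsample_video("videos/1.mp4")  # list of (PIL.Image, timestamp) pairs

content = [{"type": "text", "text": "Give the highlights of the movie scene video"}]
for _, ts in frames:
    content.append({"type": "text", "text": f"Frame at {ts}s:"})
    content.append({"type": "image"})

messages = [{"role": "user", "content": content}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(
    text=[prompt],
    images=[img for img, _ in frames],
    return_tensors="pt",
).to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=1024)
```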
| Task Type | Recommended Model | Use Case |
|---|---|---|
| Handwritten Text | Imgscope-OCR-2B | Messy handwriting, math equations |
| Document OCR | CoreOCR-7B | Clean text, documents, long context |
| Spatial Analysis | SpaceThinker-3B | Distance, positioning, geometry |
| Video Content | SkyCaptioner-V1 | Scene description, video analysis |
- Max New Tokens (1-2048): Maximum length of generated response
- Temperature (0.1-4.0): Controls creativity and randomness
- Top-p (0.05-1.0): Nucleus sampling for diverse outputs
- Top-k (1-1000): Vocabulary limitation per generation step
- Repetition Penalty (1.0-2.0): Prevents repetitive content
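These sliders map onto standard transformers generation arguments. Roughly (the values below are illustrative defaults, not the app's):

```python
output_ids = model.generate(
    **inputs,
    max_new_tokens=1024,       # Max New Tokens
    do_sample=True,            # required for temperature/top-p/top-k to take effect
    temperature=0.7,           # Temperature
    top_p=0.9,                 # Top-p (nucleus sampling)
    top_k=50,                  # Top-k
    repetition_penalty=1.1,    # Repetition Penalty
)
```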
- `MAX_INPUT_TOKEN_LENGTH`: Maximum input context length (default: 4096)
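If this limit is read from the environment, as is common in similar Gradio apps (an assumption, not confirmed from app.py), it would look like:

```python
import os

# Assumption: the limit is overridable via an environment variable of the same name.
MAX_INPUT_TOKEN_LENGTH = int(os.getenv("MAX_INPUT_TOKEN_LENGTH", "4096"))
```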
Videos are automatically processed through the following steps (a frame-sampling sketch follows the list):
- Frame extraction (10 evenly spaced frames)
- Timestamp annotation for each frame
- Sequential processing with context preservation
- Comprehensive scene understanding across temporal dimension
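A minimal sketch of the sampling step with OpenCV; the function name and the even-spacing arithmetic are assumptions based on the list above:

```python
import cv2
from PIL import Image

def downsample_video(path, num_frames=10):
    """Extract num_frames evenly spaced frames plus their timestamps in seconds."""
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # fall back if FPS metadata is missing
    frames = []
    for i in range(num_frames):
        idx = int(i * (total - 1) / max(num_frames - 1, 1))
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if not ok:
            continue
        # OpenCV decodes to BGR; convert to RGB for PIL / the model processor.
        frames.append((Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)),
                       round(idx / fps, 2)))
    cap.release()
    return frames
```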
- Mixed Precision: All models use float16 for memory efficiency
- GPU Acceleration: CUDA optimization with automatic fallback to CPU
- Streaming Generation: Real-time text streaming for immediate feedback
- Memory Management: Efficient GPU memory utilization across multiple models
- Single model loading at startup to reduce initialization time
- Automatic device detection and optimal resource allocation
- Streaming responses for better user experience (sketched after this list)
- Smart buffer management to prevent token overflow
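The streaming bullets above correspond to the standard transformers streaming pattern. A sketch, continuing with the `processor`, `model`, and `inputs` from earlier sketches:

```python
from threading import Thread
from transformers import TextIteratorStreamer

streamer = TextIteratorStreamer(
    processor.tokenizer, skip_prompt=True, skip_special_tokens=True
)
# Run generation in a background thread so tokens can be consumed as they arrive.
thread = Thread(target=model.generate,
                kwargs={**inputs, "streamer": streamer, "max_new_tokens": 512})
thread.start()
for new_text in streamer:
    print(new_text, end="", flush=True)  # a Gradio handler would yield this progressively
thread.join()
```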
Minimum:
- GPU: 12GB VRAM (RTX 3060 or equivalent)
- RAM: 16GB system memory
- Storage: 30GB free space (SSD recommended)
- CPU: Multi-core processor (Intel i5/AMD Ryzen 5 or better)

Recommended:
- GPU: 16GB+ VRAM (RTX 4080 or better)
- RAM: 32GB system memory
- Storage: SSD with 50GB free space
- CPU: High-performance processor (Intel i7/AMD Ryzen 7 or better)
VisionScope-R2/
├── app.py # Main application file
├── README.md # This documentation
├── requirements.txt # Python dependencies
├── images/ # Example images
│ ├── 1.jpg # Handwriting sample
│ ├── 2.jpeg # Bird counting example
│ ├── 3.png # Sports field analysis
│ ├── 4.png # Indoor scene measurement
│ └── 5.jpg # Distance estimation
└── videos/ # Example videos
├── 1.mp4 # Movie scene
└── 2.mp4 # Advertisement sample
Multimodal Understanding:
- Simultaneous processing of text and visual information
- Context-aware responses based on image content
- Cross-modal reasoning capabilities

OCR Capabilities:
- Handwritten text recognition with high accuracy
- Mathematical equation parsing with LaTeX output
- Document structure understanding
- Multi-language text support

Spatial Reasoning:
- Distance estimation between objects
- Geometric relationship analysis
- 3D scene understanding from 2D images
- Perspective-aware measurements
CUDA Out of Memory
- Reduce max_new_tokens to 512 or lower
- Use smaller models (Imgscope-OCR-2B, SpaceThinker-3B)
- Enable CPU inference mode
- Close other GPU-intensive applications
Model Loading Errors
- Verify internet connection for initial downloads
- Check Hugging Face Hub access
- Ensure sufficient disk space (30GB+)
- Clear Hugging Face cache if corrupted
Poor OCR Performance
- Use CoreOCR-7B for clean document text
- Use Imgscope-OCR-2B for handwritten content
- Ensure image resolution is adequate (minimum 300 DPI recommended)
- Check image quality and contrast
Video Processing Issues
- Supported formats: MP4, AVI, MOV, MKV
- Maximum recommended video length: 5 minutes
- Ensure video file is not corrupted
- Check available system memory during processing
- Model Selection: Choose the smallest suitable model for your task
- Image Preprocessing: Resize large images before upload (see the sketch after this list)
- Batch Processing: Process multiple similar images with the same model
- Memory Management: Restart application periodically for long sessions
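For the preprocessing tip, a small sketch; the 1536px cap is an arbitrary illustrative choice, not a value from the app:

```python
from PIL import Image

def shrink_for_upload(path, max_side=1536):
    """Downscale very large images instead of uploading them at full resolution."""
    img = Image.open(path).convert("RGB")
    img.thumbnail((max_side, max_side))  # resizes in place, preserving aspect ratio
    return img
```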
The application can be extended with API endpoints for programmatic access:
# Example API call structure
response = generate_image(
    model_name="SpaceThinker-3B",
    text="How far apart are these objects?",
    image=your_image,
    max_new_tokens=512
)
We welcome contributions! Please follow these guidelines:
1. Fork the repository
2. Create a feature branch: `git checkout -b feature-name`
3. Make your changes with proper testing
4. Update documentation as needed
5. Submit a pull request with a detailed description
- Follow PEP 8 style guidelines
- Add docstrings to new functions
- Include error handling for edge cases
- Test with multiple model configurations
- Additional Models: Integration of more specialized vision models
- Batch Processing: Support for multiple image/video processing
- API Endpoints: RESTful API for external integrations
- Model Quantization: Support for INT8 quantization for faster inference
- Cloud Integration: Support for cloud-based model hosting
This project is licensed under the MIT License. See the LICENSE file for details.
- Skywork Team: For the SkyCaptioner-V1 model
- RemyxAI: For the SpaceThinker spatial reasoning model
- Qwen Team: For the foundational architecture
- Hugging Face: For the transformers library and model hosting
- Gradio Team: For the user interface framework
For questions, issues, or collaborations:
- GitHub Issues: Open an issue for bug reports or feature requests
- Discussions: Use GitHub Discussions for general questions
- Email: Contact the maintainer through GitHub profile
Note: This application requires significant computational resources. Ensure your system meets the minimum requirements and has adequate cooling for extended usage.