A comprehensive multi-modal AI application that integrates four specialized vision-language models for advanced image and video analysis. VisionScope-R2 offers OCR capabilities, spatial reasoning, handwriting recognition, and structural video captioning through an intuitive Gradio interface.
- Multi-Model Architecture: Four specialized models optimized for different vision tasks
- Image Analysis: Advanced OCR, handwriting recognition, and spatial reasoning
- Video Processing: Structural video captioning and scene analysis
- Real-time Streaming: Progressive response generation for immediate feedback
- Advanced Controls: Fine-tunable parameters for optimal performance
- Comprehensive Examples: Pre-loaded examples for quick testing
SkyCaptioner-V1:
- Purpose: Structural video captioning with specialized sub-expert models
- Best for: High-quality video descriptions and scene understanding
- Model: `Skywork/SkyCaptioner-V1`
- Architecture: Qwen2.5-VL based
SpaceThinker-3B:
- Purpose: Enhanced spatial reasoning and multimodal thinking
- Best for: Distance estimation, spatial relationships, and geometric analysis
- Model: `remyxai/SpaceThinker-Qwen2.5VL-3B`
- Architecture: Qwen2.5-VL, 3B parameters
CoreOCR-7B:
- Purpose: Document-level optical character recognition
- Best for: Long-context document understanding and text extraction
- Model: `prithivMLmods/coreOCR-7B-050325-preview`
- Architecture: Qwen2-VL 7B based
Imgscope-OCR-2B:
- Purpose: Specialized handwriting and mathematical content recognition
- Best for: Messy handwriting and mathematical equations with LaTeX formatting
- Model: `prithivMLmods/Imgscope-OCR-2B-0527`
- Architecture: Qwen2-VL 2B Instruct
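For reference, a minimal sketch of loading one of these checkpoints with the transformers library, assuming the upstream Qwen2-VL integration (the app's actual loading code in app.py may differ):

```python
# Minimal loading sketch (assumption: transformers >= 4.45 with Qwen2-VL support).
import torch
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model_id = "prithivMLmods/Imgscope-OCR-2B-0527"
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,   # mixed precision, as noted under Performance
    device_map="auto",           # uses the GPU when available, else falls back to CPU
)
```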
- Python 3.8+
- CUDA-compatible GPU (recommended)
- At least 16GB RAM
- 25GB+ free disk space for models
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install transformers
pip install gradio
pip install spaces
pip install opencv-python
pip install pillow
pip install numpy
pip install requests
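Alternatively, since the repository ships a requirements.txt (see Project Structure below), `pip install -r requirements.txt` should install the same set of dependencies.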
git clone https://github.com/PRITHIVSAKTHIUR/DocScope-R1.git
cd DocScope-R1
python app.py
The application will start and provide a local URL (typically http://127.0.0.1:7860) to access the web interface.
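If the default port is already in use, Gradio's launch() accepts overrides. A minimal illustration (the stand-in interface below is hypothetical; in practice you would edit the launch call in app.py):

```python
import gradio as gr

# Stand-in interface for illustration only; app.py defines the real one.
demo = gr.Interface(fn=lambda s: s, inputs="text", outputs="text")
demo.launch(server_name="0.0.0.0", server_port=7861)  # pick a free port if 7860 is busy
```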
- Select the "Image Inference" tab
- Enter your query in the text box
- Upload an image
- Choose your preferred model based on your task
- Adjust advanced parameters if needed
- Click "Submit"
Example Use Cases:
- Handwriting Recognition: "Type out the messy handwriting as accurately as you can"
- Object Counting: "Count the number of birds and explain the scene in detail"
- Spatial Reasoning: "How far is the goal from the penalty taker in this image?"
- Distance Estimation: "Approximately how many meters apart are the chair and bookshelf?"
- Complex Scene Analysis: "How far is the man in the red hat from the pallet of boxes in feet?"
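Under the hood, each query becomes a chat-style prompt. A rough sketch of the same flow done programmatically, continuing with the `processor` and `model` from the loading sketch above (the message format is the standard Qwen2-VL pattern; app.py may differ in detail):

```python
from PIL import Image

image = Image.open("images/1.jpg")  # handwriting sample shipped with the repo
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Type out the messy handwriting as accurately as you can"},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=512)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```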
- Select the "Video Inference" tab
- Enter your query describing what you want to analyze
- Upload a video file
- Select the appropriate model (SkyCaptioner-V1 recommended for videos)
- Configure generation parameters
- Click "Submit"
Example Use Cases:
- Movie Scene Analysis: "Give the highlights of the movie scene video"
- Advertisement Analysis: "Explain the advertisement in detail"
- Action Recognition: "Describe the sequence of events in the video"
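For orientation, a sketch of how sampled frames and their timestamps could be interleaved into a single prompt. This assumes a `downsample_video` helper like the one sketched under the video processing pipeline below, and a video-capable checkpoint (such as Skywork/SkyCaptioner-V1) loaded in place of the OCR model; the exact message format in app.py may differ:

```python
frames = downsample_video("videos/1.mp4")  # list of (PIL.Image, timestamp) pairs

content = [{"type": "text", "text": "Give the highlights of the movie scene video"}]
for _, ts in frames:
    content.append({"type": "text", "text": f"Frame at {ts}s:"})
    content.append({"type": "image"})

messages = [{"role": "user", "content": content}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(
    text=[prompt],
    images=[img for img, _ in frames],
    return_tensors="pt",
).to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=1024)
```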
| Task Type | Recommended Model | Use Case |
|---|---|---|
| Handwritten Text | Imgscope-OCR-2B | Messy handwriting, math equations |
| Document OCR | CoreOCR-7B | Clean text, documents, long context |
| Spatial Analysis | SpaceThinker-3B | Distance, positioning, geometry |
| Video Content | SkyCaptioner-V1 | Scene description, video analysis |
- Max New Tokens (1-2048): Maximum length of generated response
- Temperature (0.1-4.0): Controls creativity and randomness
- Top-p (0.05-1.0): Nucleus sampling for diverse outputs
- Top-k (1-1000): Vocabulary limitation per generation step
- Repetition Penalty (1.0-2.0): Prevents repetitive content
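These sliders map onto standard transformers generation arguments. Roughly (the values below are illustrative defaults, not the app's):

```python
output_ids = model.generate(
    **inputs,
    max_new_tokens=1024,       # Max New Tokens
    do_sample=True,            # required for temperature/top-p/top-k to take effect
    temperature=0.7,           # Temperature
    top_p=0.9,                 # Top-p (nucleus sampling)
    top_k=50,                  # Top-k
    repetition_penalty=1.1,    # Repetition Penalty
)
```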
- `MAX_INPUT_TOKEN_LENGTH`: Maximum input context length (default: 4096)
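If this limit is read from the environment, as is common in similar Gradio apps (an assumption, not confirmed from app.py), it would look like:

```python
import os

# Assumption: the limit is overridable via an environment variable of the same name.
MAX_INPUT_TOKEN_LENGTH = int(os.getenv("MAX_INPUT_TOKEN_LENGTH", "4096"))
```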
Videos are automatically processed through the following steps (a frame-sampling sketch follows the list):
- Frame extraction (10 evenly spaced frames)
- Timestamp annotation for each frame
- Sequential processing with context preservation
- Comprehensive scene understanding across temporal dimension
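A minimal sketch of the sampling step with OpenCV; the function name and the even-spacing arithmetic are assumptions based on the list above:

```python
import cv2
from PIL import Image

def downsample_video(path, num_frames=10):
    """Extract num_frames evenly spaced frames plus their timestamps in seconds."""
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # fall back if FPS metadata is missing
    frames = []
    for i in range(num_frames):
        idx = int(i * (total - 1) / max(num_frames - 1, 1))
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if not ok:
            continue
        # OpenCV decodes to BGR; convert to RGB for PIL / the model processor.
        frames.append((Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)),
                       round(idx / fps, 2)))
    cap.release()
    return frames
```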
- Mixed Precision: All models use float16 for memory efficiency
- GPU Acceleration: CUDA optimization with automatic fallback to CPU
- Streaming Generation: Real-time text streaming for immediate feedback
- Memory Management: Efficient GPU memory utilization across multiple models
- Single model loading at startup to reduce initialization time
- Automatic device detection and optimal resource allocation
- Streaming responses for better user experience (sketched after this list)
- Smart buffer management to prevent token overflow
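The streaming bullets above correspond to the standard transformers streaming pattern. A sketch, continuing with the `processor`, `model`, and `inputs` from earlier sketches:

```python
from threading import Thread
from transformers import TextIteratorStreamer

streamer = TextIteratorStreamer(
    processor.tokenizer, skip_prompt=True, skip_special_tokens=True
)
# Run generation in a background thread so tokens can be consumed as they arrive.
thread = Thread(target=model.generate,
                kwargs={**inputs, "streamer": streamer, "max_new_tokens": 512})
thread.start()
for new_text in streamer:
    print(new_text, end="", flush=True)  # a Gradio handler would yield this progressively
thread.join()
```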
Minimum:
- GPU: 12GB VRAM (RTX 3060 or equivalent)
- RAM: 16GB system memory
- Storage: 30GB free space (SSD recommended)
- CPU: Multi-core processor (Intel i5/AMD Ryzen 5 or better)

Recommended:
- GPU: 16GB+ VRAM (RTX 4080 or better)
- RAM: 32GB system memory
- Storage: SSD with 50GB free space
- CPU: High-performance processor (Intel i7/AMD Ryzen 7 or better)
VisionScope-R2/
├── app.py # Main application file
├── README.md # This documentation
├── requirements.txt # Python dependencies
├── images/ # Example images
│ ├── 1.jpg # Handwriting sample
│ ├── 2.jpeg # Bird counting example
│ ├── 3.png # Sports field analysis
│ ├── 4.png # Indoor scene measurement
│ └── 5.jpg # Distance estimation
└── videos/ # Example videos
├── 1.mp4 # Movie scene
└── 2.mp4 # Advertisement sample
Multimodal Understanding:
- Simultaneous processing of text and visual information
- Context-aware responses based on image content
- Cross-modal reasoning capabilities

OCR Capabilities:
- Handwritten text recognition with high accuracy
- Mathematical equation parsing with LaTeX output
- Document structure understanding
- Multi-language text support

Spatial Reasoning:
- Distance estimation between objects
- Geometric relationship analysis
- 3D scene understanding from 2D images
- Perspective-aware measurements
CUDA Out of Memory
- Reduce max_new_tokens to 512 or lower
- Use smaller models (Imgscope-OCR-2B, SpaceThinker-3B)
- Enable CPU inference mode
- Close other GPU-intensive applications
Model Loading Errors
- Verify internet connection for initial downloads
- Check Hugging Face Hub access
- Ensure sufficient disk space (30GB+)
- Clear Hugging Face cache if corrupted
Poor OCR Performance
- Use CoreOCR-7B for clean document text
- Use Imgscope-OCR-2B for handwritten content
- Ensure image resolution is adequate (minimum 300 DPI recommended)
- Check image quality and contrast
Video Processing Issues
- Supported formats: MP4, AVI, MOV, MKV
- Maximum recommended video length: 5 minutes
- Ensure video file is not corrupted
- Check available system memory during processing
- Model Selection: Choose the smallest suitable model for your task
- Image Preprocessing: Resize large images before upload (see the sketch after this list)
- Batch Processing: Process multiple similar images with the same model
- Memory Management: Restart application periodically for long sessions
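For the preprocessing tip, a small sketch; the 1536px cap is an arbitrary illustrative choice, not a value from the app:

```python
from PIL import Image

def shrink_for_upload(path, max_side=1536):
    """Downscale very large images instead of uploading them at full resolution."""
    img = Image.open(path).convert("RGB")
    img.thumbnail((max_side, max_side))  # resizes in place, preserving aspect ratio
    return img
```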
The application can be extended with API endpoints for programmatic access:
# Example API call structure
response = generate_image(
    model_name="SpaceThinker-3B",
    text="How far apart are these objects?",
    image=your_image,
    max_new_tokens=512
)
We welcome contributions! Please follow these guidelines:
1. Fork the repository
2. Create a feature branch: `git checkout -b feature-name`
3. Make your changes with proper testing
4. Update documentation as needed
5. Submit a pull request with a detailed description
- Follow PEP 8 style guidelines
- Add docstrings to new functions
- Include error handling for edge cases
- Test with multiple model configurations
- Additional Models: Integration of more specialized vision models
- Batch Processing: Support for multiple image/video processing
- API Endpoints: RESTful API for external integrations
- Model Quantization: Support for INT8 quantization for faster inference
- Cloud Integration: Support for cloud-based model hosting
This project is licensed under the MIT License. See the LICENSE file for details.
- Skywork Team: For the SkyCaptioner-V1 model
- RemyxAI: For the SpaceThinker spatial reasoning model
- Qwen Team: For the foundational architecture
- Hugging Face: For the transformers library and model hosting
- Gradio Team: For the user interface framework
For questions, issues, or collaborations:
- GitHub Issues: Open an issue for bug reports or feature requests
- Discussions: Use GitHub Discussions for general questions
- Email: Contact the maintainer through GitHub profile
Note: This application requires significant computational resources. Ensure your system meets the minimum requirements and has adequate cooling for extended usage.