Cosmos-x-DocScope

Cosmos-x-DocScope is a Gradio-based application that provides a single interface for three vision-language models: Cosmos-Reason1-7B, docscopeOCR-7B-050425-exp, and Captioner-Relaxed-7B. It supports both image and video inference, drawing on each model's strengths for visual reasoning, optical character recognition, and detailed image captioning.


Features

  • Image Inference: Upload an image and provide a text query to get a response from the selected model.
  • Video Inference: Upload a video and provide a text query to get a summarized understanding of the video content. The video is downsampled into key frames for analysis (see the sketch after this list).
  • Multiple Models: Choose between:
    • Cosmos-Reason1-7B: Designed for understanding physical common sense and generating embodied decisions.
    • docscopeOCR-7B-050425-exp: Optimized for document-level optical character recognition and long-context vision-language understanding.
    • Captioner-Relaxed-7B: Built with a hand-curated dataset for text-to-image models, providing significantly more detailed descriptions or captions of given images.
  • Adjustable Parameters: Fine-tune generation parameters like Max new tokens, Temperature, Top-p, Top-k, and Repetition penalty for customized output.
  • Interactive Interface: A user-friendly Gradio interface with separate tabs for image and video inputs.
  • Examples: Pre-defined examples to quickly test the image and video inference functionalities.
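
A minimal sketch of the key-frame downsampling mentioned above, using OpenCV to pull a fixed number of evenly spaced frames from the clip and convert them to PIL images. The frame count and exact sampling logic here are illustrative assumptions; the actual strategy in app.py may differ:

    import cv2
    from PIL import Image

    def downsample_video(video_path, num_frames=10):
        """Return `num_frames` evenly spaced frames from the video as PIL images."""
        cap = cv2.VideoCapture(video_path)
        total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        # Pick evenly spaced frame indices across the whole clip.
        indices = [int(i * (total - 1) / max(num_frames - 1, 1)) for i in range(num_frames)]
        frames = []
        for idx in indices:
            cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
            ok, frame = cap.read()
            if not ok:
                continue
            # OpenCV decodes frames as BGR; convert to RGB before wrapping as PIL.
            frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
        cap.release()
        return frames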

Installation

To set up and run this application, follow these steps:

  1. Clone the repository:

    git clone https://github.com/PRITHIVSAKTHIUR/Cosmos-x-DocScope.git
    cd Cosmos-x-DocScope
  2. Create a virtual environment (recommended):

    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install the required dependencies:

    pip install -r requirements.txt
    • Note: requirements.txt should include gradio, transformers, torch, Pillow, numpy, opencv-python, and spaces (the spaces package is only needed when deploying to Hugging Face Spaces).
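
A minimal requirements.txt covering those dependencies might look like the following (versions left unpinned here; pin them as needed for your environment):

    gradio
    transformers
    torch
    Pillow
    numpy
    opencv-python
    spaces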

Usage

To launch the Gradio application, run the following command from the project's root directory:

python app.py

(This assumes the main application file is named app.py.)

Once the application is running, open the provided local URL (e.g., http://127.0.0.1:7860) in your web browser.

Interface Overview

  • Query Input: Enter your text query for the model.
  • Image/Video Upload: Upload your image or video file.
  • Submit Button: Click to initiate the generation process.
  • Output: The model's generated response will appear here.
  • Select Model: Choose which of the three models you want to use.
  • Advanced options: Expand this section to adjust generation parameters (a structural sketch of the interface layout follows below).
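
The layout above can be assembled from a few Gradio Blocks components. The following is a structural sketch only, not the actual app.py: component names, slider defaults, and the placeholder generate function are illustrative. Only the image tab is shown; the video tab follows the same pattern.

    import gradio as gr

    def generate(query, image, model_choice, max_new_tokens, temperature,
                 top_p, top_k, repetition_penalty):
        # Placeholder: the real application dispatches to the selected model here.
        return f"[{model_choice}] response to: {query}"

    with gr.Blocks() as demo:
        with gr.Tab("Image Inference"):
            query = gr.Textbox(label="Query Input")
            image = gr.Image(type="pil", label="Image Upload")
        model_choice = gr.Radio(
            ["Cosmos-Reason1-7B", "docscopeOCR-7B-050425-exp", "Captioner-Relaxed-7B"],
            value="Cosmos-Reason1-7B", label="Select Model",
        )
        with gr.Accordion("Advanced options", open=False):
            max_new_tokens = gr.Slider(1, 2048, value=1024, label="Max new tokens")
            temperature = gr.Slider(0.1, 2.0, value=0.6, label="Temperature")
            top_p = gr.Slider(0.05, 1.0, value=0.9, label="Top-p")
            top_k = gr.Slider(1, 100, value=50, step=1, label="Top-k")
            repetition_penalty = gr.Slider(1.0, 2.0, value=1.2, label="Repetition penalty")
        output = gr.Textbox(label="Output")
        gr.Button("Submit").click(
            generate,
            inputs=[query, image, model_choice, max_new_tokens, temperature,
                    top_p, top_k, repetition_penalty],
            outputs=output,
        )

    demo.launch()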

Model Information

  • Cosmos-Reason1-7B:

    • Hugging Face Model Card: nvidia/Cosmos-Reason1-7B
    • Purpose: Specializes in understanding physical common sense and generating appropriate embodied decisions.
  • docscopeOCR-7B-050425-exp:

    • Hugging Face Model Card: prithivMLmods/docscopeOCR-7B-050425-exp
    • Purpose: Optimized for document-level optical character recognition (OCR) and long-context vision-language understanding, making it suitable for analyzing complex documents.
  • Captioner-Relaxed-7B:

    • Hugging Face Model Card: Ertugrul/Qwen2.5-VL-7B-Captioner-Relaxed
    • Purpose: Designed to provide significantly more detailed descriptions or captions of given images, often built with hand-curated datasets for text-to-image models.
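
The checkpoints above can also be loaded directly with transformers, independently of the Gradio app. The sketch below assumes the Qwen2.5-VL model and processor classes (as the Captioner-Relaxed base model suggests) and a recent transformers release; check each model card for exact loading instructions and hardware requirements. The generation arguments mirror the advanced options exposed in the interface, with illustrative default values.

    import torch
    from PIL import Image
    from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

    model_id = "Ertugrul/Qwen2.5-VL-7B-Captioner-Relaxed"  # or another model card listed above
    model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
        model_id, torch_dtype=torch.float16, device_map="auto"
    )
    processor = AutoProcessor.from_pretrained(model_id)

    image = Image.open("example.jpg")  # hypothetical input image
    messages = [{
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image in detail."},
        ],
    }]
    prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

    # These arguments correspond to the "Advanced options" sliders in the app.
    output_ids = model.generate(
        **inputs,
        max_new_tokens=1024,
        do_sample=True,
        temperature=0.6,
        top_p=0.9,
        top_k=50,
        repetition_penalty=1.2,
    )
    new_tokens = output_ids[:, inputs["input_ids"].shape[1]:]
    print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])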
