Qwen3-VL-Outpost is a Gradio-based web application for vision-language tasks, leveraging multiple Qwen vision-language models to process images and videos. It provides an intuitive interface for users to input queries, upload media, and generate detailed responses using advanced models like Qwen3-VL and Qwen2.5-VL.
- Image and Video Inference: Upload images or videos and input text queries to generate detailed responses.
- Multiple Model Support: Choose from the following models:
  - Qwen3-VL-4B-Instruct
  - Qwen3-VL-8B-Instruct
  - Qwen3-VL-4B-Thinking
  - Qwen2.5-VL-3B-Instruct
  - Qwen2.5-VL-7B-Instruct
- Customizable Parameters: Adjust advanced settings such as max new tokens, temperature, top-p, top-k, and repetition penalty (see the generation sketch after this list).
- Real-time Streaming: View model outputs as they are generated.
- Custom Theme: Uses a tailored SteelBlueTheme for an enhanced user interface.
- Example Inputs: Predefined examples for quick testing of image and video inference.
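As a rough illustration of how the adjustable parameters and streaming typically come together, here is a minimal sketch using Hugging Face transformers; the model ID, default values, and function name are illustrative assumptions, not the app's actual code.

```python
# Hedged sketch: how the adjustable generation parameters and real-time
# streaming could be wired with transformers. Model ID, defaults, and the
# function name are assumptions for illustration only.
from threading import Thread

from transformers import AutoModelForImageTextToText, AutoProcessor, TextIteratorStreamer

MODEL_ID = "Qwen/Qwen3-VL-4B-Instruct"  # any of the models listed above
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForImageTextToText.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)

def stream_reply(inputs, max_new_tokens=1024, temperature=0.6,
                 top_p=0.9, top_k=50, repetition_penalty=1.2):
    """Yield text chunks as the model generates them."""
    streamer = TextIteratorStreamer(
        processor.tokenizer, skip_prompt=True, skip_special_tokens=True
    )
    generation_kwargs = dict(
        **inputs,                     # preprocessed image/video + text prompt
        streamer=streamer,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        temperature=temperature,
        top_p=top_p,
        top_k=top_k,
        repetition_penalty=repetition_penalty,
    )
    Thread(target=model.generate, kwargs=generation_kwargs).start()
    for chunk in streamer:
        yield chunk  # stream partial output to the UI as it arrives
```

In an interface like this, the slider values would simply be forwarded as the corresponding generate() keyword arguments.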
To run the application locally you will need:

- Python 3.10 or higher
- Git
- CUDA-compatible GPU (recommended for optimal performance)
Clone the repository and set up a virtual environment:

git clone https://github.com/PRITHIVSAKTHIUR/Qwen3-VL-Outpost.git
cd Qwen3-VL-Outpost
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Install the required packages using:

pip install -r requirements.txt

requirements.txt includes:
git+https://github.com/huggingface/accelerate.git
git+https://github.com/huggingface/peft.git
transformers-stream-generator
transformers==4.57.1
huggingface_hub
albumentations
qwen-vl-utils
pyvips-binary
sentencepiece
opencv-python
docling-core
python-docx
torchvision
supervision
matplotlib
pdf2image
num2words
reportlab
html2text
xformers
markdown
requests
pymupdf
loguru
hf_xet
spaces
pyvips
pillow
gradio
einops
httpx
click
torch
fpdf
timm
av
Start the Gradio interface with:
python app.py

This will launch the web interface, accessible via your browser. The application supports queuing with a maximum queue size of 50.
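For context, the queue limit mentioned above is the kind of setting applied when launching a Gradio app; a minimal sketch is shown below (the variable name demo and the Blocks layout are assumptions about app.py).

```python
import gradio as gr

with gr.Blocks() as demo:  # assumption: the interface is built as a Blocks app
    ...                    # components and event handlers defined in app.py

# Cap the number of queued requests at 50, matching the note above.
demo.queue(max_size=50).launch()
```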
To use the interface:

- Select a Model: Choose one of the available Qwen models from the radio buttons.
- Upload Media: Use the image or video upload section to provide input media.
- Enter Query: Input your text query in the provided textbox.
- Adjust Settings: Optionally tweak advanced parameters like max new tokens or temperature in the accordion.
- Submit: Click the Submit button to generate a response.
- Outputs are displayed in real-time in the Raw Output Stream and as formatted Markdown.
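The workflow above maps onto standard Gradio components; the sketch below shows one plausible arrangement. Component labels, slider ranges, and the placeholder generate_response handler are assumptions for illustration, not the exact contents of app.py.

```python
# Hedged sketch of the interface described in the steps above.
# Labels, ranges, and the handler below are illustrative assumptions.
import gradio as gr

MODEL_CHOICES = [
    "Qwen3-VL-4B-Instruct", "Qwen3-VL-8B-Instruct", "Qwen3-VL-4B-Thinking",
    "Qwen2.5-VL-3B-Instruct", "Qwen2.5-VL-7B-Instruct",
]

def generate_response(model_name, image, video, query, max_new_tokens, temperature):
    # Placeholder handler: the real app would run the selected model and
    # stream its output; here we just echo the query.
    yield f"[{model_name}] {query}", f"**{model_name}**: {query}"

with gr.Blocks() as demo:
    model_choice = gr.Radio(MODEL_CHOICES, value=MODEL_CHOICES[0], label="Model")
    with gr.Row():
        image_in = gr.Image(type="pil", label="Image")
        video_in = gr.Video(label="Video")
    query = gr.Textbox(label="Query", placeholder="Explain the content in detail.")
    with gr.Accordion("Advanced options", open=False):
        max_new_tokens = gr.Slider(1, 4096, value=1024, step=1, label="Max new tokens")
        temperature = gr.Slider(0.1, 2.0, value=0.6, step=0.05, label="Temperature")
    raw_out = gr.Textbox(label="Raw Output Stream")
    md_out = gr.Markdown()
    submit = gr.Button("Submit")
    submit.click(
        generate_response,
        inputs=[model_choice, image_in, video_in, query, max_new_tokens, temperature],
        outputs=[raw_out, md_out],
    )

if __name__ == "__main__":
    demo.launch()
```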
Example prompts include:

- “Explain the content in detail.” (with an uploaded image)
- “Jsonify Data.” (for images with tabular data)
- “Explain the ad in detail.” (with an uploaded video)
- “Identify the main actions in the video.”
Project structure:

Qwen3-VL-Outpost/
│
├── app.py # Main application script containing the Gradio interface and model logic
├── images/ # Directory for example image files
├── videos/ # Directory for example video files
├── requirements.txt # List of dependencies required for the project
└── README.md # Project documentation
Notes:

- The application uses PyTorch with GPU acceleration (torch.cuda) if available; otherwise, it falls back to CPU.
- Video processing downsamples videos to a maximum of 10 frames to optimize memory usage (see the sketch after these notes).
- Ensure sufficient disk space and memory when loading large models such as Qwen3-VL-8B-Instruct.
- The application is designed to run in a browser via Gradio's web interface.
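To make the first two notes concrete, device selection and frame downsampling could look roughly like this; the function name, the even-spacing strategy, and the use of OpenCV are assumptions rather than the exact logic in app.py.

```python
# Hedged sketch of the behaviour in the notes: choose a device and sample at
# most 10 evenly spaced frames from a video. Names and strategy are assumptions.
import cv2
import numpy as np
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"

def sample_frames(video_path: str, max_frames: int = 10) -> list[Image.Image]:
    """Return up to `max_frames` evenly spaced frames as PIL images."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, max(total - 1, 0), num=min(max_frames, max(total, 1)), dtype=int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            # OpenCV returns BGR; convert to RGB before handing frames to the model.
            frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    cap.release()
    return frames
```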
Contributions are welcome! To contribute:
- Fork the repository.
- Create a new branch: `git checkout -b feature-branch`
- Make your changes and commit: `git commit -m "Add new feature"`
- Push to the branch: `git push origin feature-branch`
- Open a pull request.
This project is licensed under the Apache License 2.0. See the LICENSE file for details.