
Real-Time VLM Visual Analysis Web App

—— A web application for real-time visual analysis using Vision Language Models


Introduction

This project provides a complete web application that leverages a Vision Language Model (VLM) to perform real-time analysis of visual content. The application can capture video from a webcam, a video file, or use a static image as input. It continuously processes the visual feed, generates textual descriptions using the VLM, and displays the video stream, the model's output, and the processing latency on a responsive web interface. All major settings can be configured dynamically through the web UI.

Features

  • Multi-Source Input: Supports real-time video capture from a webcam, looping playback from a local video file, or connecting to network video streams (e.g., RTSP, HTTP, MJPEG).
  • Dynamic Web UI Configuration: Easily switch inputs, change paths/URLs, adjust max_new_tokens, and modify the prompt directly from the web page, with changes taking effect instantly.
  • Optional Real-Time Translation: The llama.cpp translation variant (realtime_vlm_app_llamacpp_translate.py) can translate the VLM's English output to Chinese in real time using a secondary LLM.
  • Web Interface: A clean, responsive web UI that displays the visual source, real-time latency, and the VLM's textual output side-by-side.
  • Asynchronous Processing: Uses multithreading to run frame capture and VLM inference in parallel, so the slow model call never blocks the video feed (see the sketch after this list).
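
In rough terms, the multi-source capture and the asynchronous split boil down to the pattern below. This is an illustrative sketch only; the variable and function names are placeholders, not the actual identifiers used in the scripts.

    import threading
    import cv2  # OpenCV, installed via requirements.txt

    latest_frame = None
    frame_lock = threading.Lock()

    def capture_loop(source=0):
        """Grab frames continuously; `source` may be a webcam index, a file path, or a stream URL."""
        global latest_frame
        cap = cv2.VideoCapture(source)
        while cap.isOpened():
            ok, frame = cap.read()
            if not ok:
                cap.set(cv2.CAP_PROP_POS_FRAMES, 0)  # loop local video files from the start
                continue
            with frame_lock:
                latest_frame = frame

    def inference_loop(run_vlm):
        """Run the (slow) VLM on whichever frame is newest, without ever blocking capture."""
        while True:
            with frame_lock:
                frame = None if latest_frame is None else latest_frame.copy()
            if frame is not None:
                print(run_vlm(frame))  # run_vlm stands in for whatever model call the app makes

    threading.Thread(target=capture_loop, daemon=True).start()

Because the inference thread only ever copies the newest frame under a lock, capture keeps running at full speed while the model is busy, which is what keeps the displayed video smooth.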

Engine Versions

This project ships four application scripts built on two inference engines (PyTorch/Transformers and llama.cpp):

  1. realtime_vlm_app.py: Uses the standard PyTorch and Hugging Face Transformers library. It's easy to set up and is ideal for environments with powerful GPUs.
  2. realtime_vlm_app_llamacpp.py: A high-performance version using llama.cpp for inference, optimized for CPU and low-resource GPU environments.
  3. realtime_vlm_app_llamacpp_translate.py: An extension of the llama.cpp version that can translate the VLM's English output to Chinese in real time using a secondary LLM (sketched after this list).
  4. mobile_vlm_app.py: A dedicated, mobile-optimized version that allows you to use your phone's camera to stream video directly to the server for analysis. It supports multiple concurrent users and features a dynamic UI.
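
For the translation variant, the secondary LLM is essentially prompted to translate each finished VLM response. A rough sketch using llama-cpp-python follows; the model filename and system prompt here are assumptions for illustration, not the exact values used by the script.

    from llama_cpp import Llama

    # Hypothetical filename inside the translator folder described in the setup steps below.
    translator = Llama(model_path="Qwen3-0.6B-GGUF/Qwen3-0.6B-Q4_K_M.gguf", n_ctx=2048, verbose=False)

    def translate_to_chinese(english_text: str) -> str:
        """Ask the small secondary LLM to translate the VLM's English output into Chinese."""
        result = translator.create_chat_completion(
            messages=[
                {"role": "system", "content": "Translate the user's text into Chinese. Reply with the translation only."},
                {"role": "user", "content": english_text},
            ],
            max_tokens=256,
        )
        return result["choices"][0]["message"]["content"]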

Repository Structure

.
├── realtime_vlm_app.py             # Main app (PyTorch version)
├── realtime_vlm_app_llamacpp.py    # High-performance app (llama.cpp version, no translation)
├── realtime_vlm_app_llamacpp_translate.py # High-performance app with real-time translation
├── mobile_vlm_app.py               # Mobile-optimized version for phone camera streaming
├── templates/
│   ├── index.html                  # HTML template for the desktop web interface
│   └── mobile_index.html           # HTML template for the mobile web interface
├── requirements.txt                # Python dependencies
└── README.md                       # This file

Usage Guide

Environment Setup

It is recommended to use a virtual environment (e.g., conda or venv).

  1. Create a virtual environment (example with conda):
    conda create -n vlm_webapp python=3.10
  2. Activate the environment:
    conda activate vlm_webapp

Installation

  1. Clone the repository:
    git clone https://github.com/stlin256/VLM_Live.git
    cd VLM_Live
  2. Install PyTorch: Visit the official PyTorch website and install a version compatible with your CUDA setup.
  3. Install dependencies: This will install all necessary packages, including llama-cpp-python.
    pip install -r requirements.txt
  4. Download Models:
    • For the PyTorch version: Download the model files from SmolVLM2-256M-Video-Instruct and place them in a directory named SmolVLM-256M-Instruct at the project root.
    • For the llama.cpp version:
      • Download the VLM GGUF model and the multimodal projector (mmproj) file from SmolVLM2-256M-Video-Instruct-GGUF. Place them in a subfolder, e.g., SmolVLM-256M-Instruct-GGUF (a loading sketch follows these steps).
      • For the translation feature, download a translator model like Qwen3-0.6B-GGUF and place it in its own folder, e.g., Qwen3-0.6B-GGUF.
  5. Prepare media files: Place any video or image files you want to use (e.g., pic.jpg, a.mp4) in the project's root directory.
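
With the models from step 4 in place, the llama.cpp versions load the GGUF weights and the mmproj file through llama-cpp-python. The sketch below is illustrative only: the filenames are placeholders, and the LLaVA-style chat handler is used purely as an example of multimodal loading; the actual scripts may wire this up differently.

    import base64
    from llama_cpp import Llama
    from llama_cpp.llama_chat_format import Llava15ChatHandler

    # Placeholder filenames; use the actual files you downloaded into the GGUF folder.
    chat_handler = Llava15ChatHandler(clip_model_path="SmolVLM-256M-Instruct-GGUF/mmproj.gguf")
    vlm = Llama(
        model_path="SmolVLM-256M-Instruct-GGUF/SmolVLM2-256M-Video-Instruct-Q8_0.gguf",
        chat_handler=chat_handler,
        n_ctx=4096,  # multimodal prompts need a roomy context window
        verbose=False,
    )

    def describe_image(path: str, prompt: str = "Describe what you see.") -> str:
        """Encode an image as a data URI and ask the VLM for a description."""
        with open(path, "rb") as f:
            data_uri = "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()
        response = vlm.create_chat_completion(
            messages=[{
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": data_uri}},
                    {"type": "text", "text": prompt},
                ],
            }],
            max_tokens=64,
        )
        return response["choices"][0]["message"]["content"]

    print(describe_image("pic.jpg"))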

Running the Application

  • To run the standard PyTorch version:
    python realtime_vlm_app.py
  • To run the high-performance llama.cpp version (no translation):
    python realtime_vlm_app_llamacpp.py
  • To run the llama.cpp version with the real-time translation feature:
    python realtime_vlm_app_llamacpp_translate.py
  • To run the mobile-optimized version:
    1. Generate SSL Certificate: To access your phone's camera, the server must run over HTTPS (a startup sketch follows these steps). Generate a self-signed certificate by running this command once in your project directory:
      openssl req -x509 -newkey rsa:4096 -nodes -out cert.pem -keyout key.pem -days 365 -subj "/CN=localhost"
    2. Run the server:
      python mobile_vlm_app.py
    3. Access on your phone: The terminal will show an https:// link. Connect your phone to the same Wi-Fi network as your computer and navigate to that URL (e.g., https://192.168.1.10:5000).
    4. Trust the Certificate: Your phone's browser will show a security warning. This is expected. Click "Advanced" -> "Proceed to [your IP] (unsafe)" to continue. This step is necessary and safe in a local development environment.
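
The HTTPS requirement from step 1 usually amounts to pointing the web server at the generated certificate pair, for example with Flask as below. This is a sketch of the startup call only; mobile_vlm_app.py may differ in detail.

    from flask import Flask

    app = Flask(__name__)

    if __name__ == "__main__":
        # Serve over HTTPS with the self-signed certificate generated by openssl,
        # and bind to 0.0.0.0 so phones on the same Wi-Fi network can reach the server.
        app.run(host="0.0.0.0", port=5000, ssl_context=("cert.pem", "key.pem"))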

After starting, the terminal will display the access link. For desktop versions, navigate to http://127.0.0.1:5000 to view the application.


Configuration

Initial default settings are defined at the top of each .py script. However, all key parameters can be adjusted dynamically via the settings panel on the web page itself.

  • Use Webcam: Check this box to switch to the webcam feed.
  • Video/Image Path: Specify the path to a local file or the URL of a network stream.
  • Max Tokens: Control the maximum length of the generated response.
  • Prompt: Change the instruction given to the VLM.
  • Translate: (only available in the llama.cpp translation version) Check this box to translate the VLM's output to Chinese.

Changes are applied instantly upon modification.
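
Applying changes instantly is typically handled by a small settings endpoint: the page posts the new values, and the worker threads pick up the shared state on their next iteration. The sketch below is illustrative; the route name and field names are assumptions, not necessarily the ones the scripts expose.

    import threading
    from flask import Flask, request, jsonify

    app = Flask(__name__)

    # Shared settings that the capture and inference threads re-read on every pass.
    settings = {
        "use_webcam": False,
        "source": "pic.jpg",
        "max_new_tokens": 64,
        "prompt": "Describe the scene.",
    }
    settings_lock = threading.Lock()

    @app.route("/update_settings", methods=["POST"])
    def update_settings():
        data = request.get_json(force=True)
        with settings_lock:
            # Accept only known keys so unexpected fields are ignored.
            settings.update({k: v for k, v in data.items() if k in settings})
        return jsonify(settings)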


Demo

[Demo screenshot]

The desktop interface displays the video feed on the left and the analysis results on the right. Below the main view, a settings panel allows for dynamic configuration of the application.

The mobile interface is optimized for phone screens, featuring a fullscreen camera view with results overlaid at the bottom. It includes a language toggle and a dynamic, animated border indicating that the AI analysis is active.



