
SmolVLM&LLM

Combining the SmolVLM and SmolVLM2-Video models with an LLM (Qwen3)



Why combine VLM and LLM?

Small-parameter vision models are well suited to edge deployment, but their limited parameter counts constrain their logical reasoning and multilingual capabilities. Combining them with a remote LLM improves overall performance.

Running the VLM on the edge and sending only its text output to a cloud-based LLM service for further reasoning also reduces network overhead, since a short text description is far smaller than raw images or video.
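
As a rough illustration of this split, the sketch below assumes the local VLM has already produced a caption string (`vlm_caption`) and that the remote Qwen3-0.6B is served behind an OpenAI-compatible endpoint; the endpoint URL, API key, and model name are placeholders, not values taken from this repository:

```python
# Minimal sketch: forward the edge VLM's text output to a remote LLM for reasoning/translation.
# base_url, api_key, and model are placeholders; point them at your own LLM service.
from openai import OpenAI

vlm_caption = "A red bicycle leaning against a brick wall."  # hypothetical output from the local VLM

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # assumed OpenAI-compatible server

response = client.chat.completions.create(
    model="Qwen/Qwen3-0.6B",
    messages=[
        {"role": "system", "content": "Reason about the scene described below and translate the result into Chinese."},
        {"role": "user", "content": vlm_caption},
    ],
)
print(response.choices[0].message.content)
```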


Repository Contents

This repository provides four basic Python scripts that integrate the SmolVLM-Instruct and SmolVLM2-Video-Instruct models with an LLM, compensating for the limitations in multilingual and logical reasoning capability that come from their small parameter counts, while offering a new approach to applying small-parameter VLMs.

  • smolVLM-LLM-basic.py provides a basic usage example of SmolVLM-256M-Instruct. You can load an image from a URL for inference (a minimal image-inference sketch follows this list).

  • SmolVLM-LLM.py demonstrates how to combine SmolVLM-256M-Instruct with a remote LLM service (Qwen3-0.6B) for image content reasoning and efficient translation (see the pipeline sketch above).

  • SmolVLM-LLM-int4.py shows how to use an int4-quantized SmolVLM-256M-Instruct together with a remote LLM service (Qwen3-0.6B) for the same tasks: image content reasoning and efficient translation (a quantized-loading sketch follows this list).

  • SmolVLM2-video-LLM.py demonstrates how to combine SmolVLM2-256M-Video-Instruct with a remote LLM service (Qwen3-0.6B) to perform video content reasoning, output cleaning, content summarization, and efficient translation (a video-inference sketch follows this list).
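
For reference, a minimal image-inference sketch along the lines of smolVLM-LLM-basic.py, following the usage shown on the SmolVLM-256M-Instruct Hugging Face model card (the image URL and prompt here are placeholders, and the exact prompts used by the scripts may differ):

```python
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image

model_id = "HuggingFaceTB/SmolVLM-256M-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Load an image from a URL (placeholder) and build a chat-style prompt.
image = load_image("https://example.com/image.jpg")
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Describe this image."},
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

inputs = processor(text=prompt, images=[image], return_tensors="pt")
generated_ids = model.generate(**inputs, max_new_tokens=200)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```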
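For the int4 variant, a sketch of 4-bit loading via bitsandbytes; whether the repository uses bitsandbytes or another quantization path is an assumption:

```python
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq, BitsAndBytesConfig

# 4-bit quantized loading; requires the bitsandbytes package and a CUDA-capable GPU.
quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceTB/SmolVLM-256M-Instruct",
    quantization_config=quant_config,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("HuggingFaceTB/SmolVLM-256M-Instruct")
```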
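And for SmolVLM2-video-LLM.py, a video-inference sketch following the SmolVLM2 model card usage. The video path is a placeholder, video decoding needs extra dependencies (e.g. num2words plus a video backend such as pyav or decord), and the resulting text would then be cleaned, summarized, and translated by the remote LLM as shown earlier:

```python
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "HuggingFaceTB/SmolVLM2-256M-Video-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id, torch_dtype=torch.bfloat16)

messages = [{"role": "user", "content": [
    {"type": "video", "path": "sample.mp4"},  # placeholder video file
    {"type": "text", "text": "Describe this video in detail."},
]}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)

generated_ids = model.generate(**inputs, max_new_tokens=200)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```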


Models Used

SmolVLM-Instruct and SmolVLM2-Video-Instruct are lightweight, efficient vision models developed by HuggingFaceTB.

Qwen3-0.6B is the 0.6B-parameter version of Alibaba's Qwen3 model, known for strong translation capability and fast response times.


