🚀 Qwen2.5VL-R1 is a project for fine-tuning Qwen2.5-VL on a synthetic video classification task, optimized for a single-GPU setup. It includes a complete pipeline for generating synthetic video data, applying augmentations, fine-tuning with LoRA, and running inference.
This project demonstrates fine-tuning a multimodal large language model (MLLM), specifically Qwen2.5-VL-3B-Instruct, for a simple video classification task (similar to Kinetics-400). The task involves classifying the direction of a moving ball in synthetic videos into one of four classes:
- Left to Right
- Right to Left
- Falling Down
- Ascending
- OS: Ubuntu 20.04
- Python: 3.11.12
- GPU: NVIDIA with more than 16 GB of VRAM (tested on an A100-SXM4-40GB) and CUDA 12.4
- Dependencies: Listed in `requirements.txt` (install with `pip install -r requirements.txt`)
Clone the repository:

```bash
git clone https://github.com/yourname/qwen2.5VL-R1.git
cd qwen2.5VL-R1
```

Create a virtual environment:

```bash
python3 -m venv .venv
source .venv/bin/activate
```

Install dependencies:

```bash
pip install -r requirements.txt
```
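After installation, you can quickly verify that PyTorch sees the GPU and the CUDA toolkit (a simple sanity check, not part of the project scripts):

```python
import torch

# Expect an NVIDIA GPU with more than 16 GB of memory and CUDA 12.x.
print(torch.cuda.is_available())
print(torch.cuda.get_device_name(0))
print(torch.version.cuda)
print(round(torch.cuda.get_device_properties(0).total_memory / 1024**3, 1), "GiB")
```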
Generate synthetic videos of a moving ball with corresponding labels (a conceptual sketch of the generator is shown at the end of this section), with optional support for:
- Video augmentations (blur and crop), applied with a default probability of 0.2
- Chain-of-Thought (CoT) reasoning generation for reasoning models (optional; without the `--cot` flag the generator runs in simple mode)
```bash
# Optional: enable Chain-of-Thought generation (requires an OpenAI API key;
# remove the --cot flag below to disable CoT generation)
export OPENAI_API_KEY=your-openai-api-key

python video_generator.py \
    --output_dir ./data/synthetic_videos \
    --num_samples 20 \
    --cot \
    --frame_size 64 \
    --video_length 30 \
    --split 0.8 \
    --augment_prob 0.2 \
    --augment blur,crop
```
- Output: saves videos in `data/synthetic_videos/videos/` and metadata in `train.json` and `val.json`.
- Note: if you do not have an OpenAI API key, you can directly download the CoT dataset.
- Video example:
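For illustration, here is a minimal sketch of what the generator does under the hood, assuming OpenCV is used to draw and encode the frames (the function name, ball size, and FPS are hypothetical; see `video_generator.py` for the actual implementation):

```python
import random

import cv2
import numpy as np

DIRECTIONS = ["Left to Right", "Right to Left", "Falling Down", "Ascending"]

def make_ball_video(path, direction, frame_size=64, video_length=30, augment_prob=0.2):
    """Render a white ball moving in `direction`, optionally blurring/cropping frames."""
    writer = cv2.VideoWriter(path, cv2.VideoWriter_fourcc(*"mp4v"), 10, (frame_size, frame_size))
    blur = random.random() < augment_prob
    crop = random.random() < augment_prob
    for t in range(video_length):
        frame = np.zeros((frame_size, frame_size, 3), dtype=np.uint8)
        pos = int(t / (video_length - 1) * (frame_size - 1))  # position along the motion axis
        if direction == "Left to Right":
            x, y = pos, frame_size // 2
        elif direction == "Right to Left":
            x, y = frame_size - 1 - pos, frame_size // 2
        elif direction == "Falling Down":
            x, y = frame_size // 2, pos
        else:  # "Ascending"
            x, y = frame_size // 2, frame_size - 1 - pos
        cv2.circle(frame, (x, y), radius=4, color=(255, 255, 255), thickness=-1)
        if blur:
            frame = cv2.GaussianBlur(frame, (5, 5), 0)
        if crop:  # crop a margin and resize back so the resolution stays constant
            m = frame_size // 8
            frame = cv2.resize(frame[m:-m, m:-m], (frame_size, frame_size))
        writer.write(frame)
    writer.release()

if __name__ == "__main__":
    make_ball_video("example.mp4", random.choice(DIRECTIONS))
```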
Once the training dataset has been generated, you can fine-tune Qwen2.5-VL-3B-Instruct using LoRA for efficiency. The pipeline leverages DeepSpeed ZeRO-2 for GPU memory optimization.
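For reference, `scripts/zero2_offload.json` is a DeepSpeed ZeRO stage-2 configuration with optimizer offloading. A typical config of this kind looks roughly like the following (a sketch of common ZeRO-2 offload settings, not necessarily the exact file shipped with the repository):

```json
{
  "bf16": { "enabled": "auto" },
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "overlap_comm": true,
    "contiguous_gradients": true
  },
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto"
}
```

Launch the fine-tuning run with: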
```bash
PYTHONPATH=src:$PYTHONPATH \
deepspeed src/training/train.py \
    --use_liger True \
    --deepspeed ./scripts/zero2_offload.json \
    --model_id Qwen/Qwen2.5-VL-3B-Instruct \
    --data_path ./data/synthetic_videos/train.json \
    --image_folder ./data/synthetic_videos/videos \
    --remove_unused_columns False \
    --freeze_vision_tower True \
    --freeze_llm True \
    --tune_merger False \
    --bf16 True \
    --lora_enable True \
    --vision_lora True \
    --lora_rank 64 \
    --lora_alpha 64 \
    --lora_dropout 0.05 \
    --num_lora_modules -1 \
    --lora_namespan_exclude "['lm_head','embed_tokens']" \
    --disable_flash_attn2 True \
    --output_dir output/video_lora \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 5 \
    --video_max_pixels 4096 \
    --fps 1 \
    --learning_rate 2e-4 \
    --weight_decay 0.0 \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --gradient_checkpointing True \
    --report_to wandb \
    --lazy_preprocess True \
    --save_strategy "steps" \
    --save_steps 5 \
    --save_total_limit 2 \
    --dataloader_num_workers 2
```
- Metrics: Training loss and accuracy are logged every step to Wandb.
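For orientation, the LoRA flags above map roughly onto a peft configuration like the one below (a sketch assuming standard peft usage; the training code builds this itself and, with `--num_lora_modules -1`, targets every eligible linear layer except the names in `--lora_namespan_exclude`):

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=64,               # --lora_rank
    lora_alpha=64,      # --lora_alpha
    lora_dropout=0.05,  # --lora_dropout
    # Hypothetical target list, for illustration only.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```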
We also provide a GRPO (Group Relative Policy Optimization) training script with reward functions for deriving a reasoning version of the model.
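The reward functions live in `src/training/rewards.py`. Conceptually, GRPO samples several completions per prompt and scores each one, for example with a format reward (does the output follow the expected reasoning/answer layout?) and an accuracy reward (does the predicted option match the label?). The sketch below is hypothetical; the tag names and parsing are assumptions, not the repository's exact implementation:

```python
import re

def format_reward(completion: str) -> float:
    """1.0 if the completion follows an (assumed) <think>...</think><answer>...</answer> layout."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.fullmatch(pattern, completion.strip(), flags=re.DOTALL) else 0.0

def accuracy_reward(completion: str, solution: str) -> float:
    """1.0 if the option letter inside <answer>...</answer> matches the ground-truth letter."""
    match = re.search(r"<answer>.*?\(([A-D])\).*?</answer>", completion, flags=re.DOTALL)
    return 1.0 if match is not None and match.group(1) == solution else 0.0

# A well-formatted, correct completion earns both rewards.
sample = "<think>The x coordinate of the ball increases over time.</think><answer>(A) Left to Right</answer>"
print(format_reward(sample), accuracy_reward(sample, "A"))  # 1.0 1.0
```

Launch GRPO training with: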
```bash
# Point --model_ckpt at the correct checkpoint produced by the SFT stage!
PYTHONPATH=src:$PYTHONPATH \
python src/training/train_grpo.py \
    --model_id Qwen/Qwen2.5-VL-3B-Instruct \
    --model_ckpt ./output/video_lora/checkpoint-19 \
    --data_path ./data/synthetic_videos/train.json \
    --image_folder ./data/synthetic_videos/videos \
    --output_dir output/grpo_video_lora \
    --per_device_train_batch_size 8 \
    --gradient_accumulation_steps 1 \
    --num_train_epochs 1 \
    --learning_rate 2e-4 \
    --disable_flash_attn2 True \
    --bf16 True \
    --fp16 False \
    --freeze_llm True \
    --freeze_vision_tower True \
    --tune_merger False \
    --lora_enable True \
    --vision_lora True \
    --lora_rank 64 \
    --lora_alpha 64 \
    --lora_dropout 0.05 \
    --num_lora_modules -1 \
    --lora_namespan_exclude "['lm_head','embed_tokens']" \
    --gradient_checkpointing True \
    --logging_steps 1 \
    --dataloader_num_workers 2
```
- Output: Checkpoints are saved in `./output/video_lora/` or `./output/grpo_video_lora/`.
- Metrics: Training loss and accuracy are logged every step to Wandb.
- Optimization: Uses bf16 precision, gradient checkpointing, and DeepSpeed ZeRO-2 for memory optimization.
Test the fine-tuned model on a video:

```bash
# Point --model_ckpt at the checkpoint produced by fine-tuning and --video_path at an existing video.
python scripts/demo.py \
    --model_ckpt ./output/video_lora/checkpoint-25 \
    --base_model Qwen/Qwen2.5-VL-3B-Instruct \
    --video_path ./data/synthetic_videos/videos/000.mp4 \
    --prompt "In which direction is the ball moving?\nOptions:\n(A) Left to Right\n(B) Right to Left\n(C) Falling Down\n(D) Ascending" \
    --fps 1.0
```
- Output: Prints the model’s prediction, including reasoning steps (if trained with CoT) and the final answer.
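Under the hood, `demo.py` essentially loads the base model, attaches the LoRA adapter, and runs the standard Qwen2.5-VL video chat pipeline. A minimal sketch, assuming the transformers, peft, and qwen-vl-utils APIs from the base model documentation (paths are examples; the actual script may differ in details):

```python
import torch
from peft import PeftModel
from qwen_vl_utils import process_vision_info
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

base = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct", torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base, "./output/video_lora/checkpoint-25")  # your checkpoint
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")

messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "./data/synthetic_videos/videos/000.mp4", "fps": 1.0},
        {"type": "text", "text": "In which direction is the ball moving?\nOptions:\n"
                                 "(A) Left to Right\n(B) Right to Left\n(C) Falling Down\n(D) Ascending"},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt"
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=128)
answer = processor.batch_decode(
    generated[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```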
- Base model documentation: Transformers - Qwen2.5-VL
  📌 Caveat: supports video inference, but not video training.
- Fine-tuning code adapted from:
  - 2U1/Qwen2-VL-Finetune
  - QwenLM/Qwen2.5-VL (provides code only for full fine-tuning, not PEFT)
- GRPO approach inspired by:
  📌 Caveat: does not support videos
```
qwen2.5VL-R1/
├── README.md
├── requirements.txt
├── video_generator.py
├── data/
│   └── synthetic_videos/
│       ├── train.json
│       ├── val.json
│       └── videos/
├── scripts/
│   ├── demo.py
│   └── zero2_offload.json
└── src/
    └── training/
        ├── __init__.py
        ├── constants.py
        ├── data.py
        ├── modality_patch.py
        ├── params.py
        ├── rewards.py
        ├── train.py
        ├── train_grpo.py
        ├── train_utils.py
        └── trainer.py
```