This repository contains a script for training Qwen2-VL and Qwen2.5-VL using only HuggingFace Transformers and Liger-Kernel.
- [Phi3-Vision Finetuning]
- [Llama3.2-Vision Finetuning]
- [Molmo Finetune]
- [Pixtral Finetune]
- [SmolVLM Finetune]
- [Gemma3 Finetune]
- [2025/05/29] 🔥Supports GRPO training.
- [2025/04/16] 🔥Supports DPO training.
- [2025/03/04] Add option for using liger kernel.
- [2025/02/18] 🔥Supports mixed-modality dataset with zero3.
- [2025/02/05] Fixed the code to handle images properly.
- [2025/02/03] Support Liger-kernel for Qwen2.5-VL.
- [2025/02/03] 🔥Supports Qwen2.5-VL.
- [2025/01/24] Add option for using DoRA.
- [2025/01/24] Fix error in LoRA training.
- [2025/01/18] 🔥Supports mixed-modality data.
- [2025/01/11] Updated 8-bit training to use ms_amp fp8 with opt_level O3.
- [2024/11/05] Add memory efficient 8-bit training.
- [2024/09/12] 🔥Now the model is trained using Liger-Kernel.
- [2024/09/11] Supports setting different learning rates for the projector and vision model.
- [2024/09/11] 🔥Supports multi-image and video training.
- Fine-tuning Qwen2-VL Series
- Deepspeed
- LoRA/QLoRA
- Full-finetuning
- Enable finetuning `vision_model` while using LoRA.
- Disable/enable Flash Attention 2
- Multi-image and video training
- Training optimized with liger kernel
- Mixed-modality dataset
- Direct Preference Optimization (DPO)
- Group Relative Policy Optimization (GRPO)
To simplify the setup process for training, you can use the provided pre-built environment.
The environment is set up in a conda env named `train`.
You can find more information about the image here.
```bash
docker pull john119/vlm
docker run --gpus all -it -v /host/path:/docker/path --name vlm --ipc=host john119/vlm /bin/bash
```
- Ubuntu 22.04
- NVIDIA Driver 550.120
- CUDA 12.4
Install the required packages using `environment.yaml`.

Using `requirements.txt`:

```bash
pip install -r requirements.txt -f https://download.pytorch.org/whl/cu124
pip install qwen-vl-utils
pip install flash-attn --no-build-isolation
```

Using `environment.yaml`:

```bash
conda env create -f environment.yaml
conda activate train
pip install qwen-vl-utils
pip install flash-attn --no-build-isolation
```
Note: You should install flash-attn after installing the other packages.
The script requires a dataset formatted according to the LLaVA specification. The dataset should be a JSON file where each entry contains information about conversations and images. Ensure that the image paths in the dataset match the provided `--image_folder`.

When using a multi-image dataset, the image tokens should all be `<image>`, and the image file names should be in a list.
Please see the examples below and format your data accordingly. (A small validation sketch follows the examples.)
Example for single image dataset
```json
[
{
"id": "000000033471",
"image": "000000033471.jpg",
"conversations": [
{
"from": "human",
"value": "<image>\nWhat are the colors of the bus in the image?"
},
{
"from": "gpt",
"value": "The bus in the image is white and red."
},
{
"from": "human",
"value": "What feature can be seen on the back of the bus?"
},
{
"from": "gpt",
"value": "The back of the bus features an advertisement."
},
{
"from": "human",
"value": "Is the bus driving down the street or pulled off to the side?"
},
{
"from": "gpt",
"value": "The bus is driving down the street, which is crowded with people and other vehicles."
}
]
}
...
]
```
Example for multi image dataset
```json
[
{
"id": "000000033471",
"image": ["000000033471.jpg", "000000033472.jpg"],
"conversations": [
{
"from": "human",
"value": "<image>\n<image>\nIs the perspective of the camera differnt?"
},
{
"from": "gpt",
"value": "Yes, It the perspective of the camera is different."
}
]
}
...
]
```
Example for video dataset
```json
[
{
"id": "sample1",
"video": "sample1.mp4",
"conversations": [
{
"from": "human",
"value": "<video>\nWhat is going on in this video?"
},
{
"from": "gpt",
"value": "A man is walking down the road."
}
]
}
...
]
```
Example for DPO dataset
```json
[
{
"id": "e4cd017f-3854-4019-9cc1-9dacacb5d424",
"image": "example_0_0.png",
"prompt": "<image>\nDescribe the key features of the image in great detail.",
"chosen": "The image features a modern, multi-colored train parked on a train track, occupying a significant portion of the scene. The train has a unique and striking design, with red, white, and blue colors, as well as yellow accents. It appears to be a high-speed train, indicating its advanced engineering and potential for fast travel.\n\nThe train is situated next to a platform, suggesting that it may be at a train station. The overall scene conveys a sense of transportation and modern technology, with the vibrant and sleek train as the main subject.",
"rejected": "The image features a modern, colorful train parked on a set of train tracks. The train has a striking design, with red, white, and blue colors as well as yellow accents. It appears to be a high-speed train, ready for departure.\n\nThe train spans a significant portion of the image, from the left to the right side. The train tracks can be seen clearly beneath the train, emphasizing its position on the railway line. The overall scene gives off an impression of a contemporary and efficient mode of transportation."
},
{
"id": "5e19e647-e5d3-4bcf-82e9-d262570743ae",
"image": "example_1_0.png",
"prompt": "<image>\nIs this bus in the USA?",
"chosen": "Yes, based on the image, it can be assumed that this bus is in the USA. The location of the bus cannot be accurately determined.",
"rejected": "No, it's not in the USA. The image does not provide specific information on where the bus is located. However, we can say that it's not in the United States."
}
...
]
```
Example for GRPO dataset
```json
[
{
"id": "06bc8a17-bb1c-4007-8c08-92c41e2628b2",
"image": "image_2.jpg",
"conversations": [
{
"from": "human",
"value": "Based on the image, which geometric method is used to determine the bearing angle, and why is it the most appropriate choice?"
},
{
"from": "gpt",
"value": "<think>Let's analyze the image step-by-step. The image shows a right-angled triangle with points B, C, and A. The angle at point B is a right angle, indicating that trigonometric functions can be applied. To find the bearing angle, we need to relate the sides of the triangle. The tangent function is suitable here because it relates the opposite side (BC) to the adjacent side (AB) in a right-angled triangle. By using the tangent function, we can calculate the angle at point A, which is the bearing angle. Therefore, the most appropriate geometric method is the use of trigonometric functions.</think>\n\n<answer>A</answer>"
}
]
}
...
]
```
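Before training, it can help to sanity-check your JSON against the folder you pass via `--image_folder`. Below is a minimal, hypothetical validation sketch for the SFT-style format (it is not part of this repo): it checks that each entry's `<image>` token count matches its image list and that every referenced file exists.

```python
import json
import os

def validate_llava_json(data_path: str, image_folder: str) -> None:
    # Hypothetical helper, not part of this repository.
    with open(data_path) as f:
        data = json.load(f)
    for entry in data:
        images = entry.get("image", [])
        if isinstance(images, str):  # single-image entries use a plain string
            images = [images]
        # every <image> token in the human turns should map to one image file
        n_tokens = sum(turn["value"].count("<image>")
                       for turn in entry.get("conversations", [])
                       if turn["from"] == "human")
        assert n_tokens == len(images), f"{entry['id']}: token/image mismatch"
        for name in images:
            path = os.path.join(image_folder, name)
            assert os.path.exists(path), f"{entry['id']}: missing {path}"

validate_llava_json("train.json", "/path/to/image_folder")
```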
Adding new domain-specific data on top of general open-source data will enhance downstream capabilities while retaining the model's foundational skills. You can also choose to fine-tune solely on the new data, depending on your requirements.
Note: DeepSpeed zero2 is faster than zero3, but it consumes more memory. Also, zero2 is usually more stable than zero3.
Tip: You can use `adamw_bnb_8bit` as the optimizer to save memory.
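For orientation, this corresponds to the standard HuggingFace `TrainingArguments` option (a minimal sketch with hypothetical values; how this repo's shell scripts wire it up may differ, and it requires `bitsandbytes` to be installed):

```python
from transformers import TrainingArguments

# Sketch: "adamw_bnb_8bit" stores optimizer states in 8-bit via bitsandbytes,
# substantially reducing optimizer memory versus full-precision AdamW.
args = TrainingArguments(
    output_dir="output/qwen2-vl",   # hypothetical path
    optim="adamw_bnb_8bit",         # requires bitsandbytes at runtime
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
)
```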
To run the training script, use the following command:
```bash
bash scripts/finetune.sh
```
Note: Liger-kernel does not work with QLoRA. You need to disable it to use QLoRA.
If you want to train only the language model with LoRA and perform full training for the vision model:
```bash
bash scripts/finetune_lora.sh
```
If you want to train both the language model and the vision model with LoRA:
```bash
bash scripts/finetune_lora_vision.sh
```
IMPORTANT: If you want to tune the `embed_token` with LoRA, you need to tune `lm_head` together.
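For illustration, in standard peft terms this means saving both modules alongside the LoRA adapters (a minimal sketch assuming the peft `LoraConfig` API, not this repo's exact wiring; the flags below handle it for you):

```python
from peft import LoraConfig

# Sketch only: if the embedding table is trained, the LM head must be
# trained with it so input and output embeddings stay consistent.
config = LoraConfig(
    r=128,
    lora_alpha=256,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    modules_to_save=["embed_tokens", "lm_head"],  # tune both together
)
```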
Training arguments
- `--deepspeed` (str): Path to DeepSpeed config file (default: "scripts/zero2.json").
- `--data_path` (str): Path to the LLaVA formatted training data (a JSON file). (Required)
- `--image_folder` (str): Path to the images folder as referenced in the LLaVA formatted training data. (Required)
- `--model_id` (str): Path to the Qwen2-VL model. (Required)
- `--use_liger` (bool): Option for using liger kernel to save memory.
- `--output_dir` (str): Output directory for model checkpoints.
- `--num_train_epochs` (int): Number of training epochs (default: 1).
- `--per_device_train_batch_size` (int): Training batch size per GPU per forwarding step.
- `--gradient_accumulation_steps` (int): Gradient accumulation steps (default: 4).
- `--freeze_vision_tower` (bool): Option to freeze vision_model (default: False).
- `--freeze_llm` (bool): Option to freeze LLM (default: False).
- `--freeze_merger` (bool): Option to freeze the merger (projector) (default: False).
- `--num_lora_modules` (int): Number of target modules to add LoRA (-1 means all layers).
- `--vision_lr` (float): Learning rate for vision_model.
- `--merger_lr` (float): Learning rate for the merger (projector).
- `--learning_rate` (float): Learning rate for the language module.
- `--bf16` (bool): Option for using bfloat16.
- `--fp16` (bool): Option for using fp16.
- `--image_min_pixels` (int): Option for minimum input tokens for images.
- `--image_max_pixels` (int): Option for maximum input tokens for images.
- `--video_min_pixels` (int): Option for minimum input tokens for videos.
- `--video_max_pixels` (int): Option for maximum input tokens for videos.
- `--image_resized_width` (int): Option for setting the width of the input image.
- `--image_resized_height` (int): Option for setting the height of the input image.
- `--video_resized_width` (int): Option for setting the width of the input video.
- `--video_resized_height` (int): Option for setting the height of the input video.
- `--lora_enable` (bool): Option for using LoRA.
- `--vision_lora` (bool): Option for including `vision_tower` in the LoRA modules. `lora_enable` should be `True` to use this option.
- `--use_dora` (bool): Option for using DoRA instead of LoRA. `lora_enable` should be `True` to use this option.
- `--lora_namespan_exclude` (str): Exclude modules with matching namespans from LoRA.
- `--max_seq_length` (int): Maximum sequence length (default: 32K).
- `--bits` (int): Quantization bits (default: 16).
- `--disable_flash_attn2` (bool): Disable Flash Attention 2.
- `--report_to` (str): Reporting tool (choices: 'tensorboard', 'wandb', 'none') (default: 'tensorboard').
- `--logging_dir` (str): Logging directory (default: "./tf-logs").
- `--lora_rank` (int): LoRA rank (default: 128).
- `--lora_alpha` (int): LoRA alpha (default: 256).
- `--lora_dropout` (float): LoRA dropout (default: 0.05).
- `--logging_steps` (int): Logging steps (default: 1).
- `--dataloader_num_workers` (int): Number of data loader workers (default: 4).
- `--dpo_loss` (str): Loss type for DPO (default: 'sigmoid').
- `--precompute_ref_log_probs` (bool): Whether to precompute the reference log probs (default: False).
- `--beta` (float): The beta value for DPO (default: 0.1).
Note: The learning rate of `vision_model` should be 5x ~ 10x smaller than that of the `language_model`.
You can train the model using a video dataset. You can also set LoRA configs and train with LoRA.
```bash
bash scripts/finetune_video.sh
```
Note: When training with video, it is handled as multi-image input, so you should adjust `max_pixels` for the maximum resolution and `fps` based on the available VRAM.
If you run out of VRAM, you can use zero3_offload instead of zero3. You could also use zero2_offload for slightly faster training.
The model supports a wide range of resolution inputs. By default, it uses the native resolution for input.
For better performance, using the native resolution or higher pixel counts is recommended; however, this takes too much memory and computation time for large images, so you can adjust the pixel budget to trade quality for resources.
The model splits the image into tokens of 28 x 28 pixels, so the pixel budget can be expressed as `token_num * 28 * 28`; you can simply change the token number in the script.
For example:
```bash
--image_min_pixels $((256 * 28 * 28))
--image_max_pixels $((1280 * 28 * 28))
--video_min_pixels $((128 * 28 * 28))
--video_max_pixels $((768 * 28 * 28))
```
Besides, you can directly set the image/video height and width to control memory usage.
```bash
--resized_height 448
--resized_width 448
```
These values will be rounded to the nearest multiple of 28.
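To make the arithmetic concrete, here is a small illustrative sketch (my own, not repo code) of how a resolution maps to visual tokens, given that one token covers a 28 x 28 patch:

```python
def visual_tokens(width: int, height: int, patch: int = 28) -> int:
    # Illustrative only: round each side to the nearest multiple of 28,
    # then count 28 x 28 patches; each patch is one visual token.
    w = round(width / patch) * patch
    h = round(height / patch) * patch
    return (w // patch) * (h // patch)

# 448 is already a multiple of 28 (28 * 16), so a 448 x 448 input
# yields about 16 * 16 = 256 visual tokens.
print(visual_tokens(448, 448))  # 256
```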
```bash
bash scripts/merge_lora.sh
```
Note: Remember to replace the paths in `finetune.sh` or `finetune_lora.sh` with your specific paths. (Also in `merge_lora.sh` when using LoRA.)
You can train the model using Direct Preference Optimization (DPO).
The process is quite similar to Supervised Fine-Tuning (SFT), and you can also apply LoRA during DPO training just like in SFT.
```bash
bash scripts/finetune_dpo.sh
```
Most of the training arguments are the same as for SFT, but a few additional arguments are added for DPO training.
Training arguments
- `--dpo_loss` (str): Loss type for DPO (default: 'sigmoid').
- `--precompute_ref_log_probs` (bool): Whether to precompute the reference log probs (default: False).
- `--beta` (float): The beta value for DPO (default: 0.1).
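To make the `sigmoid` loss and `beta` options concrete, here is a minimal sketch of the standard DPO sigmoid loss (illustrative, not this repo's implementation):

```python
import torch
import torch.nn.functional as F

def dpo_sigmoid_loss(policy_chosen_logps: torch.Tensor,
                     policy_rejected_logps: torch.Tensor,
                     ref_chosen_logps: torch.Tensor,
                     ref_rejected_logps: torch.Tensor,
                     beta: float = 0.1) -> torch.Tensor:
    # Implicit reward of each response: beta * log(pi_theta / pi_ref)
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    # -log sigmoid(beta * margin): push the policy to prefer "chosen"
    # over "rejected" relative to the frozen reference model.
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```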
You can train the model using Group Relative Policy Optimization (GRPO).
The process is quite similar to Supervised Fine-Tuning (SFT), and you can also apply LoRA during GRPO training just like in SFT.
| What | Where | Notes |
|---|---|---|
| Reward functions | `src/train/reward_funcs.py` | Add any function that ends with `_reward`. The training script picks them up automatically (see the sketch below). |
| Custom system prompts | `src/constants.py` | Append your own prompt strings here. |
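For example, a minimal reward function matching the `<think>/<answer>` format used in the GRPO dataset above might look like this (a sketch; the exact signature the trainer passes is an assumption based on common GRPO trainer conventions):

```python
# src/train/reward_funcs.py (sketch; the signature is an assumption)
import re

def format_reward(completions, **kwargs):
    # Reward 1.0 when a completion follows <think>...</think><answer>...</answer>
    pattern = re.compile(r"<think>.*?</think>\s*<answer>.*?</answer>", re.DOTALL)
    return [1.0 if pattern.search(c) else 0.0 for c in completions]
```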
You can start training using this script:
```bash
bash scripts/finetune_grpo.sh
```
Most of the training arguments are the same as for SFT, but a few additional arguments are added for GRPO training.
Training arguments
- `--temperature` (float): Generation config (default: 0.9).
- `--top_p` (float): Generation config (default: 1.0).
- `--top_k` (int): Generation config (default: 50).
- `--min_p` (float): Generation config (default: None).
- `--repetition_penalty` (float): Generation config (default: 1.0).
- `--max_completion_length` (int): Max length for the completion (default: 256).
- `--max_prompt_length` (int): Max length for the prompt (default: 512).
- `--beta` (float): KL coefficient (default: 0.04).
Note: Liger GRPO loss and vLLM back-end are not yet supported. Both will be added soon.
Note: You should use the merged weight when trained with LoRA.
- Install gradio

```bash
pip install gradio
```

- Launch app

```bash
python -m src.serve.app \
    --model-path /path/to/merged/weight
```
You can launch a Gradio-based demo with this command. You can also set other generation configs like `repetition_penalty`, `temperature`, etc.
```
Could not load library libcudnn_cnn_train.so.8. Error: /usr/local/cuda-12.1/lib/libcudnn_cnn_train.so.8: undefined symbol: _ZN5cudnn3cnn34layerNormFwd_execute_internal_implERKNS_7backend11VariantPackEP11CUstream_stRNS0_18LayerNormFwdParamsERKNS1_20NormForwardOperationEmb, version libcudnn_cnn_infer.so.8
```
You can run `unset LD_LIBRARY_PATH` to resolve this error.
You can see this issue for more details.
- Support for video data
- Add demo for multi-image and video
- Handle mixed-modality data in dataset and collator
- Support Qwen2.5-VL
- Monkey-patch liger-kernel for Qwen2.5-VL
- Update the code base to the latest transformers.
- Add DPO
- Add GRPO
- Fix GRPO liger loss to work
This project is licensed under the Apache-2.0 License. See the LICENSE file for details.
If you find this repository useful in your project, please consider giving a ⭐ and citing:
```bibtex
@misc{Qwen2-VL-Finetuning,
  author = {Yuwon Lee},
  title = {Qwen2-VL-Finetune},
  year = {2024},
  publisher = {GitHub},
  url = {https://github.com/2U1/Qwen2-VL-Finetune}
}
```
This project is based on
- LLaVA-NeXT: An amazing open-source project of LMM.
- Mipha: Open-source project of SMM with amazing capabilities.
- Qwen2-VL-7B-Instruct: Awesome pretrained MLLM based on Qwen2.
- Liger-Kernel: Collection of Triton kernels designed specifically for LLM training.
- VLM-R1: Open-source project of Reinforcement Learning with VLMs.