⚡️ This repository provides training recipes for AMD's efficient vision-language models, which are designed to improve VLM inference efficiency.
You can use the following commands to install the necessary packages.
pip3 install -r requirements.txt
pip3 install flash-attn --no-build-isolation
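If flash-attn built correctly, it should be importable from Python. As a quick, optional sanity check (just a suggestion, not part of the official setup):
python3 -c "import flash_attn; print(flash_attn.__version__)"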
You can follow LLaVA-OneVision to prepare the pre-training and fine-tuning data.
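For orientation, LLaVA-style annotation files are typically a JSON list of conversation records. The snippet below writes a purely illustrative entry (the file name, ids, and text are made up); please follow the LLaVA-OneVision instructions for the exact format and dataset mixture.
cat > sample_data.json <<'EOF'
[
  {
    "id": "sample-0001",
    "image": "sample-0001.jpg",
    "conversations": [
      {"from": "human", "value": "<image>\nDescribe this image."},
      {"from": "gpt", "value": "A caption describing the image."}
    ]
  }
]
EOF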
You can run pre-training using scripts/pretrain.sh. The mm projector is pre-trained during this stage. Please set the data_path and image_folder according to your data paths.
bash scripts/pretrain.sh
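As a rough sketch, pointing the script at your data might look like the lines below; the variable names here are assumptions, so check scripts/pretrain.sh for the actual arguments it passes to the trainer.
# Hypothetical excerpt of scripts/pretrain.sh
DATA_PATH=/path/to/pretrain_annotations.json   # pre-training annotation file
IMAGE_FOLDER=/path/to/pretrain_images          # folder containing the referenced images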
You can use the following command to start training the model.
bash scripts/train.sh
You can use the following command to start supervised fine-tuning of the model.
bash scripts/sft.sh
We provide a simple script for inference with a single image input.
python3 llava/test_generate.py --model_path ./checkpoints/ --image_path ./images/dino.png --question "Please describe this image."
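The same flags accept any local checkpoint and image; for example (the checkpoint directory, image path, and question below are placeholders):
python3 llava/test_generate.py --model_path ./checkpoints/my-sft-run --image_path ./images/my_photo.jpg --question "What objects are in this image?"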
You are welcome to download and try this model. To learn more about training, inference, and the insights behind this model, please visit the AMD Hugging Face Model Card to access the code and download the model files. Additionally, AMD has opened a dedicated cloud infrastructure with the latest GPU instances to AI developers; visit the AMD Developer Cloud for access requests and usage details. Furthermore, you can deploy advanced AI models on AMD Ryzen AI PCs and can learn more here.
For any questions, you may reach out to the AMD team at amd_ai_mkt@amd.com.