LLaVA: Large Language and Vision Assistant. From data to deployment.
Please note that this is only supported on Linux systems.
1. Clone repository
git clone https://github.com/Lornatang/llava.git
cd llava
2. Install Package
conda create -n llava python=3.11.13 -y
conda activate llava
pip3 install --upgrade pip
pip3 install "torch==2.7.*" torchvision --index-url https://download.pytorch.org/whl/cu128
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.2/flash_attn-2.8.2+cu12torch2.7cxx11abiTRUE-cp311-cp311-linux_x86_64.whl
pip3 install ./flash_attn-2.8.2+cu12torch2.7cxx11abiTRUE-cp311-cp311-linux_x86_64.whl
pip3 install -e .
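After installation, a quick sanity check confirms that PyTorch sees the GPU and that the flash-attn wheel imports cleanly. This is a minimal sketch; the printed versions will differ depending on your environment.

import torch
import flash_attn

# Expect CUDA to be available and flash-attn to import without errors.
print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("flash-attn:", flash_attn.__version__)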
Training is divided into two stages. The first stage, pre-training, achieves basic alignment of cross-modal features by mapping visual features into the language model's embedding space. The second stage builds on that alignment and fine-tunes the model end to end, so it learns to follow diverse visual-language instructions and generate appropriate responses.
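Conceptually, the two stages differ in which parameters receive gradients. The sketch below is illustrative only; the module names (vision_tower, mm_projector, language_model) are hypothetical placeholders and not this repository's API. Stage 1 trains only the vision-to-language projector, while stage 2 also unfreezes the language model.

import torch.nn as nn

def set_stage(model: nn.Module, stage: int) -> None:
    # Hypothetical submodules: vision_tower (ViT), mm_projector (MLP), language_model (LLM).
    for p in model.vision_tower.parameters():
        p.requires_grad = False            # vision encoder stays frozen in both stages
    for p in model.mm_projector.parameters():
        p.requires_grad = True             # projector is trained in both stages
    for p in model.language_model.parameters():
        p.requires_grad = (stage == 2)     # LLM is unfrozen only for instruction tuning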
The LAION-CC-SBU subset with BLIP captions (558K samples) is used for pre-training.
The official mixed instruction-tuning annotations and image sets listed below are required for fine-tuning; please download all of them.
- llava_v1_5_mix665k.json
- coco train2017
- gqa images
- ocr_vqa images
- textvqa train_val_images
- vg part1
- vg part2
Please place the downloaded files into the datasets directory according to the following layout; a quick verification sketch follows the tree.
- datasets
  - llava_pretrain
    - blip_laion_cc_sbu_558k
      - blip_laion_cc_sbu_558k.json
      - blip_laion_cc_sbu_558k_meta.json
      - images
  - llava_finetune
    - llava_v1_5_mix665k.json
    - coco
      - train2017
    - gqa
      - images
    - ocr_vqa
      - images
    - textvqa
      - train_images
    - vg
      - VG_100K
      - VG_100K_2
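A short check like the one below can confirm the layout before launching training. This is a sketch; adjust datasets_root if you keep the data elsewhere.

from pathlib import Path

datasets_root = Path("./datasets")
expected = [
    "llava_pretrain/blip_laion_cc_sbu_558k/blip_laion_cc_sbu_558k.json",
    "llava_pretrain/blip_laion_cc_sbu_558k/images",
    "llava_finetune/llava_v1_5_mix665k.json",
    "llava_finetune/coco/train2017",
    "llava_finetune/gqa/images",
    "llava_finetune/ocr_vqa/images",
    "llava_finetune/textvqa/train_images",
    "llava_finetune/vg/VG_100K",
    "llava_finetune/vg/VG_100K_2",
]
# Report anything missing relative to the layout above.
missing = [p for p in expected if not (datasets_root / p).exists()]
print("All paths present." if not missing else f"Missing: {missing}")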
You can try different combinations of vision-language architectures, as shown below.
The following examples use lmsys/vicuna-13b-v1.5 as the LLM and openai/clip-vit-large-patch14-336 as the ViT, but other LLMs and ViTs can be substituted.
hf download openai/clip-vit-large-patch14-336 --local-dir ./results/pretrained_models/openai/clip-vit-large-patch14-336
hf download lmsys/vicuna-13b-v1.5 --local-dir ./results/pretrained_models/lmsys/vicuna-13b-v1.5
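To confirm the downloads are usable, you can load both checkpoints from the local directories with transformers. This is a minimal sketch: it instantiates only the vision tower, image processor, and tokenizer, not the full 13B language model, which needs substantial memory.

from transformers import AutoTokenizer, CLIPImageProcessor, CLIPVisionModel

vit_dir = "./results/pretrained_models/openai/clip-vit-large-patch14-336"
llm_dir = "./results/pretrained_models/lmsys/vicuna-13b-v1.5"

# Vision tower: image processor + ViT weights.
image_processor = CLIPImageProcessor.from_pretrained(vit_dir)
vision_tower = CLIPVisionModel.from_pretrained(vit_dir)

# Language model: loading the tokenizer is enough to verify the download.
tokenizer = AutoTokenizer.from_pretrained(llm_dir, use_fast=False)
print("ViT hidden size:", vision_tower.config.hidden_size, "| vocab size:", len(tokenizer))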
# Stage1: Pretrain (Visual Feature Alignment).
bash ./tools/stage1_pretrain.sh
# Stage2: Full Finetune (Instruction Tuning with Image-Text Pairs)
bash ./tools/stage2_finetune.sh
- LLaVA: Provides the original implementation on which this project is based. Thanks.
- LLaVA-NeXT: Provides many methods that benefited the implementation of this project.
- Vicuna: Provides the instruction-tuned LLM used as the language backbone.
- Qwen: Provides an LLM that is easy to fine-tune and delivers excellent performance.
@misc{liu2024llavanext,
title={LLaVA-NeXT: Improved reasoning, OCR, and world knowledge},
url={https://llava-vl.github.io/blog/2024-01-30-llava-next/},
author={Liu, Haotian and Li, Chunyuan and Li, Yuheng and Li, Bo and Zhang, Yuanhan and Shen, Sheng and Lee, Yong Jae},
month={January},
year={2024}
}
@misc{liu2023improvedllava,
title={Improved Baselines with Visual Instruction Tuning},
author={Liu, Haotian and Li, Chunyuan and Li, Yuheng and Lee, Yong Jae},
publisher={arXiv:2310.03744},
year={2023},
}
@misc{liu2023llava,
title={Visual Instruction Tuning},
author={Liu, Haotian and Li, Chunyuan and Wu, Qingyang and Lee, Yong Jae},
publisher={NeurIPS},
year={2023},
}