- Multi-modal training
- Long-context training via ring attention with visual-span-aware chunking (see the sketch after this list)
- Tested on 8× H100: full-parameter finetuning of a 72B model with 64k context
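
To give a rough idea of what visual-span-aware chunking means, here is a minimal, illustrative Python sketch (not the repo's implementation): when a long sequence is split across GPUs for ring attention, chunk boundaries are shifted so they never cut through an image's token span, and chunks are then padded back to a common length. The function name, arguments, and padding scheme are assumptions for illustration only.

```python
import math

def chunk_with_visual_spans(input_ids, image_spans, world_size, pad_id=0):
    """Illustrative sketch only (not the repo's implementation).

    Splits `input_ids` into `world_size` chunks for sequence parallelism,
    nudging any chunk boundary that would fall inside an image-token span
    back to the start of that span, then right-padding chunks to equal length.
    Assumes each image span is shorter than one chunk.
    """
    seq_len = len(input_ids)
    step = math.ceil(seq_len / world_size)
    boundaries = [min(i * step, seq_len) for i in range(world_size + 1)]

    # Move boundaries that would split a visual span to the span's start,
    # so every image's tokens stay on a single rank.
    for i in range(1, world_size):
        for start, end in image_spans:  # spans are [start, end) token indices
            if start < boundaries[i] < end:
                boundaries[i] = start
                break

    chunks = [input_ids[boundaries[i]:boundaries[i + 1]] for i in range(world_size)]

    # Ring attention needs equal-length chunks on every rank, so pad to the max.
    max_len = max(len(c) for c in chunks)
    return [c + [pad_id] * (max_len - len(c)) for c in chunks]

# Example: a 16-token sequence with one image span covering tokens 5..9,
# split across 4 ranks. The boundary at position 8 moves back to 5.
chunks = chunk_with_visual_spans(list(range(16)), image_spans=[(5, 10)], world_size=4)
print(chunks)
```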
Training requires a dataset ID. Each dataset ID has its own data JSON file, and all datasets are registered in training/LMF/data/dataset_info.json. Ours is vision_agent_claude37, which corresponds to all_trajectories.json. We recommend registering your own ID and re-running convert_all_trajectories.py to regenerate the images folder and all_trajectories.json (a sketch of a registration entry is shown after the example below). Here is an example of the data format:
```json
[
  {
    "messages": [
      {
        "role": "user",
        "content": "<image>You are given a question and an image from the user. Your task is to write a python program utilizing the vision tools given to you to answer the user's question in a generalized way"
      },
      {
        "role": "assistant",
        "content": "<thinking>In this problem, I need to identify which ba\n</code>"
      }
    ],
    "images": [
      "training/images/ID"
    ]
  }
]
```
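
To register a new ID, add an entry for it to training/LMF/data/dataset_info.json. The snippet below is only a sketch: the key `your_dataset_id` is a placeholder, and the `formatting`/`columns`/`tags` fields assume a LLaMA-Factory-style sharegpt schema that matches the messages/images layout above, so compare it against the existing vision_agent_claude37 entry before copying.

```json
{
  "your_dataset_id": {
    "file_name": "all_trajectories.json",
    "formatting": "sharegpt",
    "columns": {
      "messages": "messages",
      "images": "images"
    },
    "tags": {
      "role_tag": "role",
      "content_tag": "content",
      "user_tag": "user",
      "assistant_tag": "assistant"
    }
  }
}
```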
The training images are not currently uploaded to GitHub, so you need to extract the images and formatted trajectories yourself using convert_all_trajectories.py. The script reads the trajectories folder (the location of the trajectories produced by the run_chat.py script):

```bash
python convert_all_trajectories.py
```
After that, you just need to run:

```bash
bash sft.sh
```
Here are some important hyperparameters to consider:
- `cutoff_len`: 32000 (currently, all instances whose length exceeds 32000 are strictly removed due to an overflow issue)
- `gradient_accumulation_steps * per_device_train_batch_size * n_GPUs` = effective batch size, which should be around ~64 (for example, 8 GPUs × `per_device_train_batch_size` 1 × `gradient_accumulation_steps` 8 = 64)
- `learning_rate`: 5e-6
- `cutoff_len` should be divisible by `sequence_parallel_size`
- `finetuning_type`: `full` | `lora`
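
As a quick illustration of how these settings interact, here is a small hypothetical sanity check (not part of the repo); the concrete values are assumptions and should be edited to match your sft.sh:

```python
# Hypothetical helper (not part of the repo): sanity-check the hyperparameters
# discussed above before launching `bash sft.sh`.

# Values assumed for illustration; edit to match your sft.sh.
cutoff_len = 32000
sequence_parallel_size = 8
n_gpus = 8
per_device_train_batch_size = 1
gradient_accumulation_steps = 8

# Effective batch size should land around ~64.
effective_batch_size = gradient_accumulation_steps * per_device_train_batch_size * n_gpus
print(f"effective batch size: {effective_batch_size}")
assert 32 <= effective_batch_size <= 128, "effective batch size is far from the recommended ~64"

# Ring attention / sequence parallelism splits each sequence across GPUs,
# so the (padded) sequence length must divide evenly across ranks.
assert cutoff_len % sequence_parallel_size == 0, "cutoff_len must be divisible by sequence_parallel_size"
```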