FSoft-AI4Code/SFT-Infra

Features:

  1. Multi-modal training
  2. Long-context training via ring attention with visual-span-aware chunking
  3. Tested on 8 H100 GPUs: full-parameter fine-tuning of a 72B model with 64k context

Setup training dataset

Each training set needs a dataset ID, and each dataset_id has its own data.json.

All datasets are registered in training/LMF/data/dataset_info.json. Ours is vision_agent_claude37, which corresponds to all_trajectories.json. I recommend registering your own ID and re-running convert_all_trajectories.py to generate the images folder and all_trajectories.json.
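Registering a new ID could look roughly like the sketch below. It assumes that dataset_info.json follows the LLaMA-Factory registry layout (a "sharegpt" entry with "columns" and "tags"); that layout, the helper name register_dataset, and the ID my_dataset_id are assumptions for illustration, so compare against the existing vision_agent_claude37 entry before relying on it.

```python
import json
from pathlib import Path


def register_dataset(info_path, dataset_id, file_name):
    """Add or overwrite one entry in a dataset_info.json-style registry."""
    path = Path(info_path)
    # Start from the existing registry if the file is already there.
    registry = json.loads(path.read_text()) if path.exists() else {}
    # ASSUMPTION: field names follow the LLaMA-Factory "sharegpt" layout;
    # copy the existing vision_agent_claude37 entry if it differs.
    registry[dataset_id] = {
        "file_name": file_name,
        "formatting": "sharegpt",
        "columns": {"messages": "messages", "images": "images"},
        "tags": {
            "role_tag": "role",
            "content_tag": "content",
            "user_tag": "user",
            "assistant_tag": "assistant",
        },
    }
    path.write_text(json.dumps(registry, indent=2))
    return registry[dataset_id]
```

A call such as register_dataset("training/LMF/data/dataset_info.json", "my_dataset_id", "all_trajectories.json") would then add the hypothetical ID next to the existing one.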

Here is an example of the data.json format:

[
  {
    "messages": [
      {
        "role": "user",
        "content": "<image>You are given a question and an image from the user. Your task is to write a python program utilizing the vision tools given to you to answer the user's question in a generalized way"
      },
      {
        "role": "assistant",
        "content": "<thinking>In this problem, I need to identify which ba\n</code>"
      }
    ],
    "images": [
        "training/images/ID"
    ]
  }
]
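Before training, it can help to sanity-check records of this shape. The sketch below is not part of the repository. It assumes that turns alternate between user and assistant and that each <image> placeholder corresponds to one entry in "images"; both rules are assumptions inferred from the example above, and check_record is a hypothetical helper.

```python
import json


def check_record(record):
    """Return a list of problems found in one training record."""
    problems = []
    messages = record.get("messages", [])
    images = record.get("images", [])
    if not messages:
        problems.append("no messages")
    for i, msg in enumerate(messages):
        # ASSUMPTION: turns alternate user, assistant, user, ...
        expected = "user" if i % 2 == 0 else "assistant"
        if msg.get("role") != expected:
            problems.append(f"message {i}: expected role {expected!r}")
    # ASSUMPTION: one <image> placeholder per entry in "images".
    placeholders = sum(m.get("content", "").count("<image>") for m in messages)
    if placeholders != len(images):
        problems.append(f"{placeholders} <image> tags but {len(images)} images")
    return problems


# A tiny stand-in for data.json with made-up message text.
sample = json.loads("""
[{"messages": [{"role": "user", "content": "<image>Describe the picture."},
               {"role": "assistant", "content": "A short answer."}],
  "images": ["training/images/ID"]}]
""")
for record in sample:
    print(check_record(record))  # prints [] when the record looks consistent
```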

The training images are not currently uploaded to GitHub, so you need to extract the images and formatted trajectories with convert_all_trajectories.py. The script reads the trajectories folder (the location of the trajectories produced by the run_chat.py script).

python convert_all_trajectories.py 

After that, run:

bash sft.sh

Here are some important hyperparameters to consider:

cutoff_len 32000 (currently, I strictly remove all instances whose length exceeds 32000 due to an overflow issue)
gradient_accumulation_steps * per_device_train_batch_size * n_GPUs = theoretical batch size (should be around ~64)
learning_rate 5e-6
cutoff_len should be divisible by sequence_parallel_size
finetuning_type full | lora
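
The batch-size arithmetic and the divisibility note above can be checked with a few lines of Python. This is a minimal sketch with illustrative values, not the settings in sft.sh; it reads the divisibility note as cutoff_len being a multiple of sequence_parallel_size, and both function names are hypothetical.

```python
def theoretical_batch_size(grad_accum_steps, per_device_batch_size, n_gpus):
    """Product from the note above: accumulation steps x per-device batch x GPUs."""
    return grad_accum_steps * per_device_batch_size * n_gpus


def check_sequence_parallel(cutoff_len, sequence_parallel_size):
    """True when cutoff_len splits evenly across the sequence-parallel ranks."""
    return cutoff_len % sequence_parallel_size == 0


# Illustrative values only: 8 accumulation steps, batch 1 per device, 8 GPUs.
print(theoretical_batch_size(8, 1, 8))    # 64
print(check_sequence_parallel(32000, 8))  # True
```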

About

Supervised Fine-tuning Infra for Software Research
