
[BUG] Multi-modal models training error #290


Description

@yomin-y

Bug Description

I am using Qwen2.5-VL for training, with the config set to mix_chord. The error I get is "ValueError: Transformers 4.53 is not supported".

[Screenshot: error traceback]
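For reference, a quick way to confirm the installed transformers version against the 4.53 ceiling mentioned in the error. This is only a sketch: the "< 4.53" bound is my reading of the error message, not a range confirmed by the maintainers.

```python
# Minimal sketch: compare the installed transformers version against the
# 4.53 ceiling implied by the ValueError. The "< 4.53" bound is an assumption
# taken from the error text, not from the project's official requirements.
from importlib.metadata import version
from packaging.version import Version

tf_version = Version(version("transformers"))
print(f"transformers {tf_version} is installed")

if tf_version >= Version("4.53.0"):
    # Downgrading (e.g. pip install "transformers<4.53") is only a guess based
    # on the error text; please confirm the supported range upstream.
    print("transformers >= 4.53 detected, which matches the ValueError above")
```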

My full config is as follows:
project: "mix_chord"
name: "test_mix_chord"
checkpoint_root_dir: ${oc.env:TRINITY_CHECKPOINT_ROOT_DIR,./checkpoints}
algorithm:
algorithm_type: mix_chord
repeat_times: 8 # or 16 for better performance in math related tasks
kl_loss_fn_args:
kl_coef: 0.0
sample_strategy_args:
expert_data_ratio: 0.20
policy_loss_fn_args: # feel free to change, we encourage you to try out different hyperparameters
mu_warmup_steps: 200 # 0 for chord-mu and chord-phi
mu_decay_steps: 400 # 200 for chord-mu and 0 for chord-phi
mu_peak: 0.5 # 0.9 for chord-mu and 0.1 for chord-phi
mu_valley: 0.02 # 0.05 for chord-mu and 0.1 for chord-phi
enable_phi_function: true # false for chord-mu and true for chord-phi
clip_range: 0.2
use_token_level_loss_in_sft: true
use_dynamic_bsz: true
ppo_mini_batch_size: 320 # 320 = 256 + 64; if you set repeat times = 16, then it shoudle be 32 * 16 + 64
ppo_micro_batch_size_per_gpu: 4
ngpus_trainer: 4
train_batch_size_expert: 64
train_batch_size_usual: 256 # 32 batchsize * 8 repeat times
model:
model_path: ${oc.env:TRINITY_MODEL_PATH,/apdcephfs_qy3/share_301069248/users/yominyan/qwen25vl/LLaMA-Factory-main/Qwen2.5-VL-7B-Instruct}
max_response_tokens: 10240
max_model_len: 11264
cluster:
node_num: 1
gpu_per_node: 8
buffer:
total_epochs: 4
batch_size: 32
train_batch_size: 320
explorer_input:
taskset:
name: all_general_all
storage_type: file
path: ./data/all_general_all/ #${oc.env:TRINITY_TASKSET_PATH}
format:
prompt_key: 'problem'
response_key: 'answer'
rollout_args:
temperature: 1.0
logprobs: 0
workflow_args:
with_think: true
eval_tasksets: [] # you can add your own eval tasksets here
default_workflow_type: 'math_boxed_workflow'
trainer_input:
experience_buffer:
name: math_buffer
storage_type: queue
path: 'sqlite:///test_mix_chord.db'
auxiliary_buffers:
sft_dataset:
total_epochs: 25
name: SFT_data
storage_type: file
path: ${oc.env:TRINITY_SFT_DATASET_PATH,./data/}
split: 'train'
format:
prompt_type: messages
messages_key: 'messages'
images_key: 'images'
explorer:
eval_interval: 10
runner_per_model: 8
rollout_model:
engine_num: 4
tensor_parallel_size: 1
enable_prefix_caching: false
enforce_eager: true
dtype: bfloat16
seed: 42
synchronizer:
sync_method: 'nccl'
sync_interval: 1
sync_timeout: 1200
trainer:
save_interval: 50
trainer_config:
actor_rollout_ref:
model:
use_remove_padding: true
actor:
use_dynamic_bsz: true
ppo_max_token_len_per_gpu: 25600
ulysses_sequence_parallel_size: 2
optim:
lr: 1e-6 # or 5e-6, larger lr with warm up can result in better performance for SFT training.
ref:
log_prob_use_dynamic_bsz: ${trainer.trainer_config.actor_rollout_ref.actor.use_dynamic_bsz}
log_prob_max_token_len_per_gpu: ${trainer.trainer_config.actor_rollout_ref.actor.ppo_max_token_len_per_gpu}
ulysses_sequence_parallel_size: ${trainer.trainer_config.actor_rollout_ref.actor.ulysses_sequence_parallel_size}
monitor:
monitor_type: wandb
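As a side note on the batch-size comments above, here is the small sanity-check sketch of the arithmetic I followed. The relationship between batch_size, repeat_times, and the expert/usual split is my reading of the inline comments, not something taken from the Trinity-RFT docs.

```python
# Sanity check of the batch-size arithmetic implied by the config comments.
# Variable names mirror the config fields; the relationships are my assumption.
batch_size = 32                  # buffer.batch_size
repeat_times = 8                 # algorithm.repeat_times
train_batch_size_expert = 64     # expert (SFT) samples per step
train_batch_size_usual = batch_size * repeat_times                      # 32 * 8 = 256
ppo_mini_batch_size = train_batch_size_usual + train_batch_size_expert  # 256 + 64 = 320

assert train_batch_size_usual == 256
assert ppo_mini_batch_size == 320

# With repeat_times = 16, the same arithmetic gives 32 * 16 + 64 = 576,
# matching the "32 * 16 + 64" note in the config comment.
print(ppo_mini_batch_size)  # 320
```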

Environment Information

The main Python packages are listed below:

[Screenshot: installed package versions]

Expected Behavior

Is it possible to provide a complete mix_chord config for training a multi-modal large model (such as Qwen2.5-VL)?
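In the meantime, this is the kind of check I run on my SFT data to make sure it matches the messages/images keys declared in the config above. A minimal sketch only: the file name sft_train.jsonl is hypothetical, and the expectation that every record carries both keys is my assumption.

```python
from datasets import load_dataset

# The file name below is hypothetical; point it at your TRINITY_SFT_DATASET_PATH layout.
ds = load_dataset("json", data_files="./data/sft_train.jsonl", split="train")
sample = ds[0]

# I expect each multi-modal SFT record to carry both keys declared in the
# config (messages_key / images_key); this expectation is my assumption.
for key in ("messages", "images"):
    print(key, "present" if key in sample else "MISSING")
```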

Log Information

If applicable, include any relevant log output here.

Are You Willing to Fix This Issue?

  • Yes, I am willing to fix this issue!

Metadata


Assignees

No one assigned

    Labels

bug (Something isn't working)
