Phi3 has a nearly constant DPO loss of 0.69xx #17

@Arnav0400

Description

Issue: Implementing Iterative DPO on Phi-3-mini-4k-instruct

Hi, thanks for the great work and for open-sourcing it!

I am trying to implement iterative DPO on Phi-3-mini-4k-instruct. The following outlines my approach:

  1. Generation Step:

    python generation/gen_hf.py --ports 8000 8001 8002 8003 --tokenizer microsoft/Phi-3-mini-4k-instruct --dataset_name_or_path $jsonl_input --output_dir $json_output --K 8 --temperature 1.0
  2. Reward Annotation:

    accelerate launch annotate_data/get_rewards.py --dataset_name_or_path $json_output --output_dir $model_output

    Note: I have commented out line 124 and uncommented line 123 in this file so that Phi-3's chat template is handled differently from that of the Llama-3-based reward model. This might be incorrect, as I have not modified the change_of_format() function (see the sketch after this list for what I assume it would need to do)!

  3. DPO Iteration:

    accelerate launch dpo_iteration/run_dpo.py --run_name $iteration --output_dir $iteration --model_name_or_path microsoft/Phi-3-mini-4k-instruct --ref_model microsoft/Phi-3-mini-4k-instruct --learning_rate 5e-7 --max_steps 1200 --choose_type max_min --train_dir $model_output --eval_dir $model_output --loss_type sigmoid --lr_scheduler_type cosine
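
For reference on the change_of_format() point in step 2, here is a rough sketch of what I assume the helper would need to do when Phi-3 generations are scored by a Llama-3-based reward model: strip Phi-3's chat-template tags and let the reward model's own tokenizer re-apply its template. The checkpoint path, tag list, and function signature below are my guesses, not the repo's actual code.

    from transformers import AutoTokenizer

    # Placeholder path for the Llama-3-based reward model (an assumption, not from the repo).
    RM_NAME = "path/to/llama3-based-reward-model"
    rm_tokenizer = AutoTokenizer.from_pretrained(RM_NAME)

    # Phi-3-mini-4k-instruct chat-template control tokens.
    PHI3_TAGS = ["<|user|>", "<|assistant|>", "<|end|>", "<|endoftext|>"]

    def strip_phi3_tags(text: str) -> str:
        """Remove Phi-3 template tokens so only the plain text remains."""
        for tag in PHI3_TAGS:
            text = text.replace(tag, "")
        return text.strip()

    def change_of_format(prompt: str, response: str) -> str:
        """Assumed shape of the helper: return the string the reward model should score."""
        messages = [
            {"role": "user", "content": strip_phi3_tags(prompt)},
            {"role": "assistant", "content": strip_phi3_tags(response)},
        ]
        # Lay the conversation out in the reward model's own (Llama-3 style) format,
        # rather than feeding it Phi-3's raw template.
        return rm_tokenizer.apply_chat_template(messages, tokenize=False)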

After performing these steps, the DPO loss is stuck at 0.69xx. I am using a batch size of 128 and a learning rate of 5e-7.
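
For what it is worth, 0.6931 is ln 2, which is exactly the value of the sigmoid DPO loss when the policy's chosen-vs-rejected log-ratio margin equals the reference's (i.e. a margin of zero). So a loss pinned at 0.69xx suggests the implicit reward margin never moves away from zero. A quick check against the standard DPO formula (not the repo's code):

    import math

    def dpo_sigmoid_loss(beta, pi_chosen, pi_rejected, ref_chosen, ref_rejected):
        """Standard DPO loss: -log sigmoid(beta * (policy margin - reference margin))."""
        margin = (pi_chosen - pi_rejected) - (ref_chosen - ref_rejected)
        return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

    # With the policy still equal to the reference, the margin is 0 and the
    # loss is -log(0.5) = ln 2, about 0.6931, i.e. the "0.69xx" plateau.
    print(dpo_sigmoid_loss(beta=0.1,
                           pi_chosen=-10.0, pi_rejected=-12.0,
                           ref_chosen=-10.0, ref_rejected=-12.0))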

Any insights that would help me get an iterative-DPO variant of Phi-3 working would be greatly appreciated.

Thanks!
