Issue: Implementing Iterative DPO on Phi3-4k-instruct
Hi, thanks for the great work and for open-sourcing it!
I am trying to implement iterative DPO on Phi3-4k-instruct. The following outlines my approach:
- Generation Step:

  ```
  python generation/gen_hf.py --ports 8000 8001 8002 8003 --tokenizer microsoft/Phi-3-mini-4k-instruct --dataset_name_or_path $jsonl_input --output_dir $json_output --K 8 --temperature 1.0
  ```
- Reward Annotation:

  ```
  accelerate launch annotate_data/get_rewards.py --dataset_name_or_path $json_output --output_dir $model_output
  ```
  Note: I have commented out line 124 and uncommented line 123 in this file so that Phi3's chat template is handled differently from the Llama3-based reward model's. This might be incorrect, since I have not modified the `change_of_format()` function (see the sketch after this list).
- DPO Iteration:

  ```
  accelerate launch dpo_iteration/run_dpo.py --run_name $iteration --output_dir $iteration --model_name_or_path microsoft/Phi-3-mini-4k-instruct --ref_model microsoft/Phi-3-mini-4k-instruct --learning_rate 5e-7 --max_steps 1200 --choose_type max_min --train_dir $model_output --eval_dir $model_output --loss_type sigmoid --lr_scheduler_type cosine
  ```

  My understanding of how `--choose_type max_min` builds the preference pairs is also sketched below.
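To make the template question concrete, here is a minimal sketch of what I assume the conversion needs to do for Phi3: re-render each plain-text (prompt, response) pair with the reward model's own chat template, rather than passing the Phi3-formatted text through verbatim. `format_for_reward_model` is a hypothetical stand-in for `change_of_format()` (not code from `annotate_data/get_rewards.py`), and the model path is a placeholder for whichever Llama3-based reward model is loaded:

```python
from transformers import AutoTokenizer

# Placeholder path: whichever Llama3-based reward model get_rewards.py loads.
reward_tokenizer = AutoTokenizer.from_pretrained("path/to/llama3-based-reward-model")

def format_for_reward_model(prompt: str, response: str) -> str:
    """Hypothetical stand-in for change_of_format(): render the pair with the
    reward model's own chat template instead of Phi3's."""
    messages = [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": response},
    ]
    return reward_tokenizer.apply_chat_template(messages, tokenize=False)
```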
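For completeness, my reading of `--choose_type max_min` is that, out of the K=8 scored responses per prompt, the highest-reward one is taken as the chosen response and the lowest-reward one as the rejected response. That is an assumption on my part (`select_pair` below is my own hypothetical helper, not the repo's code), so please correct me if `run_dpo.py` does something different:

```python
def select_pair(responses: list[str], rewards: list[float]) -> tuple[str, str]:
    """Assumed max_min pairing: best-scored response as chosen, worst as rejected."""
    best = max(range(len(rewards)), key=lambda i: rewards[i])
    worst = min(range(len(rewards)), key=lambda i: rewards[i])
    return responses[best], responses[worst]  # (chosen, rejected)
```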
After performing these steps, the DPO loss is stuck at 0.69xx. I am running with a batch size of 128 and a learning rate of 5e-7.
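For reference on the 0.69xx value: with `--loss_type sigmoid`, the DPO loss equals ln 2 ≈ 0.693 whenever the policy's implicit reward margin over the reference is zero, which is exactly where training starts when the policy and reference models are identical. A flat 0.69 therefore suggests the chosen/rejected margin never opens up. A quick check of the standard formula (the beta value here is arbitrary, not necessarily the repo's default):

```python
import math

def dpo_sigmoid_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard sigmoid DPO loss: -log(sigmoid(beta * margin)), where the margin is the
    difference of policy-vs-reference log-prob ratios for the chosen and rejected responses."""
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# With policy == reference the margin is 0, so the loss is ln 2 ~= 0.693,
# matching the 0.69xx plateau.
print(dpo_sigmoid_loss(-10.0, -12.0, -10.0, -12.0))  # 0.6931...
```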
Any insights to help get a Phi3 variant of iterative DPO working would be greatly appreciated.
Thanks!