Phi3 has a nearly constant DPO loss of 0.69xx #17

@Arnav0400

Description

Issue: Implementing Iterative DPO on Phi-3-mini-4k-instruct

Hi, thanks for the great work and for open-sourcing it!

I am trying to implement iterative DPO on Phi-3-mini-4k-instruct. The following outlines my approach:

  1. Generation Step:

    python generation/gen_hf.py --ports 8000 8001 8002 8003 --tokenizer microsoft/Phi-3-mini-4k-instruct --dataset_name_or_path $jsonl_input --output_dir $json_output --K 8 --temperature 1.0
  2. Reward Annotation:

    accelerate launch annotate_data/get_rewards.py --dataset_name_or_path $json_output --output_dir $model_output

    Note: I have commented out line 124 and uncommented line 123 in this file so that Phi-3's chat template is handled differently from that of the Llama-3-based reward model. This might be incorrect, as I have not modified the change_of_format() function (see the sketch after this list for what I assume it would need to do)!

  3. DPO Iteration:

    accelerate launch dpo_iteration/run_dpo.py --run_name $iteration --output_dir $iteration --model_name_or_path microsoft/Phi-3-mini-4k-instruct --ref_model microsoft/Phi-3-mini-4k-instruct --learning_rate 5e-7 --max_steps 1200 --choose_type max_min --train_dir $model_output --eval_dir $model_output --loss_type sigmoid --lr_scheduler_type cosine
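
For reference on the change_of_format() point in step 2, here is a rough sketch of what I assume the helper would need to do when Phi-3 generations are scored by a Llama-3-based reward model: strip Phi-3's chat-template tags and let the reward model's own tokenizer re-apply its template. The checkpoint path, tag list, and function signature below are my guesses, not the repo's actual code.

    from transformers import AutoTokenizer

    # Placeholder path for the Llama-3-based reward model (an assumption, not from the repo).
    RM_NAME = "path/to/llama3-based-reward-model"
    rm_tokenizer = AutoTokenizer.from_pretrained(RM_NAME)

    # Phi-3-mini-4k-instruct chat-template control tokens.
    PHI3_TAGS = ["<|user|>", "<|assistant|>", "<|end|>", "<|endoftext|>"]

    def strip_phi3_tags(text: str) -> str:
        """Remove Phi-3 template tokens so only the plain text remains."""
        for tag in PHI3_TAGS:
            text = text.replace(tag, "")
        return text.strip()

    def change_of_format(prompt: str, response: str) -> str:
        """Assumed shape of the helper: return the string the reward model should score."""
        messages = [
            {"role": "user", "content": strip_phi3_tags(prompt)},
            {"role": "assistant", "content": strip_phi3_tags(response)},
        ]
        # Lay the conversation out in the reward model's own (Llama-3 style) format,
        # rather than feeding it Phi-3's raw template.
        return rm_tokenizer.apply_chat_template(messages, tokenize=False)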

After performing these steps, the DPO loss is stuck at 0.69xx. I am using a batch size of 128 and a learning rate of 5e-7.
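
For what it is worth, 0.6931 is ln 2, which is exactly the value of the sigmoid DPO loss when the policy's chosen-vs-rejected log-ratio margin equals the reference's (i.e. a margin of zero). So a loss pinned at 0.69xx suggests the implicit reward margin never moves away from zero. A quick check against the standard DPO formula (not the repo's code):

    import math

    def dpo_sigmoid_loss(beta, pi_chosen, pi_rejected, ref_chosen, ref_rejected):
        """Standard DPO loss: -log sigmoid(beta * (policy margin - reference margin))."""
        margin = (pi_chosen - pi_rejected) - (ref_chosen - ref_rejected)
        return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

    # With the policy still equal to the reference, the margin is 0 and the
    # loss is -log(0.5) = ln 2, about 0.6931, i.e. the "0.69xx" plateau.
    print(dpo_sigmoid_loss(beta=0.1,
                           pi_chosen=-10.0, pi_rejected=-12.0,
                           ref_chosen=-10.0, ref_rejected=-12.0))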

Any insights that would help me get an iterative-DPO variant of Phi-3 working would be greatly appreciated.

Thanks!
