
cannot reproduce the results reported in the Espresso paper #80

@Alex357853


Hi, this is a really good and useful codebase. I tried to reproduce the results reported in the paper, but failed. I used the command from README_ESE.md:

```bash
WANDB_MODE=disabled CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 --master_port=1234 -m angle_emb.angle_trainer \
--model_name_or_path WhereIsAI/UAE-Large-V1 \
--train_name_or_path SeanLee97/nli_for_simcse --save_dir ckpts/UAE-Large-Espresso \
--ibn_w 10.0 --cosine_w 0. --angle_w 1.0 --angle_tau 20.0 --learning_rate 1e-6 --maxlen 75 \
--workers 16 \
--pooling_strategy cls \
--epochs 1 \
--batch_size 128 \
--logging_steps 100 \
--warmup_steps 200 \
--save_steps 1000 \
--fp16 1 \
--gradient_accumulation_steps 4 \
--apply_ese 1 \
--ese_compression_size 128 \
--ese_kl_temperature 1.0
```

However, it only gave the following results:

| sts12 | sts13 | sts14 | sts15 | sts16 | STSB | SICKR | Avg. |
|-------|-------|-------|-------|-------|------|-------|------|
| 79.25 | 88.63 | 84.15 | 89.61 | 85.99 | 87.79 | 79.59 | 85.00 |
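
For reference, this is roughly how I score each checkpoint on a single STS task. The checkpoint path and the `mteb/stsbenchmark-sts` dataset id are just my own illustrative choices, not taken from your evaluation scripts, so please treat it as a sketch:

```python
# Sketch of my per-task scoring: cosine similarity of pair embeddings vs. gold scores.
# The checkpoint path and dataset id are assumptions on my side, not from the repo.
import numpy as np
from datasets import load_dataset
from scipy.stats import spearmanr
from angle_emb import AnglE

# may need to point at a checkpoint-* subdirectory inside save_dir
angle = AnglE.from_pretrained('ckpts/UAE-Large-Espresso', pooling_strategy='cls').cuda()

ds = load_dataset('mteb/stsbenchmark-sts', split='test')
emb1 = angle.encode(list(ds['sentence1']), to_numpy=True)
emb2 = angle.encode(list(ds['sentence2']), to_numpy=True)

# cosine similarity per pair, then Spearman correlation (x100) against the gold scores
cos = np.sum(emb1 * emb2, axis=1) / (
    np.linalg.norm(emb1, axis=1) * np.linalg.norm(emb2, axis=1)
)
corr, _ = spearmanr(cos, ds['score'])
print(round(corr * 100, 2))
```

I score the other STS tasks the same way; only the data loading differs.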

I also tried changing `--cosine_w 0.` to `--cosine_w 1.0` and `--ibn_w 10.0` to `--ibn_w 35.0`, but the results were even worse.

The results reported in your paper are:

| sts12 | sts13 | sts14 | sts15 | sts16 | STSB | SICKR | Avg. |
|-------|-------|-------|-------|-------|------|-------|------|
| 79.64 | 90.40 | 85.76 | 90.33 | 86.64 | 88.54 | 81.09 | 86.06 |

If I evaluate the off-the-shelf WhereIsAI/UAE-Large-V1 model without any fine-tuning, the results are:

| sts12 | sts13 | sts14 | sts15 | sts16 | STSB | SICKR | Avg. |
|-------|-------|-------|-------|-------|------|-------|------|
| 79.09 | 89.62 | 85.02 | 89.51 | 86.61 | 89.06 | 82.09 | 85.86 |
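
(For this row I load the released model straight from the Hub and run it through the same scoring loop as above; a minimal sanity check of that setup, with an example pair of my own, looks like this:)

```python
from angle_emb import AnglE

# Same scoring loop as above, just pointing at the released Hub model instead of my checkpoint.
angle = AnglE.from_pretrained('WhereIsAI/UAE-Large-V1', pooling_strategy='cls').cuda()
vecs = angle.encode(['A man is playing a guitar.', 'Someone plays an instrument.'], to_numpy=True)
print(vecs.shape)  # (2, embedding_dim)
```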

So fine-tuning actually made performance worse, and I noticed that the more epochs I train, the worse it gets.

Separately, I also tried the command in examples/NLI/README.md to train Qwen1.5-0.5B:

```bash
CUDA_VISIBLE_DEVICES=1,2,3,4 torchrun --nproc_per_node=4 --master_port=1234 train_angle.py \
--task NLI-STS --save_dir ckpts/NLI-STS-angle-Qwen1.5-0.5B \
--model_name Qwen/Qwen1.5-0.5B \
--w2 35 --learning_rate 1e-4 --maxlen 50 \
--lora_r 32 --lora_alpha 32 --lora_dropout 0.1 \
--save_steps 500 --batch_size 120 --seed 42 --do_eval 0 --load_kbit 4 --gradient_accumulation_steps 4 --epochs 1
```

It gave me an average score of 70.23, whereas the paper reports 82.82.
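
In case it helps diagnose the gap, this is roughly how I pull embeddings out of the LoRA checkpoint afterwards. The adapter path, fp16 dtype, and last-token pooling are my assumptions about what train_angle.py produces, so it is only a sketch; if there is a dedicated evaluation entry point in examples/NLI that I should use instead, please point me to it:

```python
# Sketch of loading the LoRA adapter on top of Qwen1.5-0.5B to extract embeddings.
# Adapter path, fp16 dtype, and last-token pooling are assumptions on my side.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained('Qwen/Qwen1.5-0.5B', torch_dtype=torch.float16).cuda()
model = PeftModel.from_pretrained(base, 'ckpts/NLI-STS-angle-Qwen1.5-0.5B').eval()

tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen1.5-0.5B')
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = 'right'

@torch.no_grad()
def embed(texts):
    batch = tokenizer(texts, padding=True, truncation=True, max_length=50,
                      return_tensors='pt').to(base.device)
    hidden = model(**batch, output_hidden_states=True).hidden_states[-1]  # (batch, seq, dim)
    last = batch['attention_mask'].sum(dim=1) - 1                         # last non-pad index
    return hidden[torch.arange(hidden.size(0), device=hidden.device), last]

print(embed(['A man is playing a guitar.', 'Someone plays an instrument.']).shape)
```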

I wonder whether these are the exact scripts and hyperparameter values you used to train your models. It would be really helpful if you could help me reproduce the reported results so I can rely on this codebase. I really appreciate your time and help. Thank you!
