I use Qwen2-0.5B as the anti_expert_model, Qwen2-0.5B fine-tuned on codex_humaneval as the expert_model, and Qwen2-7B as the base_model. The EM score of the proxy-tuned Qwen2-7B is 0.4167682926829268.
# Evaluating DExperts with codex_humaneval expert
size=13
echo "Results dir: results/codex_humaneval/dexperts-7B"
python -m eval.codex_humaneval.run_eval_new \
--data_file data/eval/codex_humaneval/HumanEval.jsonl \
--save_dir results/codex_humaneval/dexperts-7B \
--base_model_name_or_path Qwen2-7B \
--expert_model_name_or_path qwen-2-codealpaca-0.5b \
--eval_batch_size 20
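For context, proxy tuning (DExperts) steers the base model at each decoding step by adding the logit difference between the expert and the anti-expert to the base model's logits. The following is a minimal sketch of that combination, not the repo's actual eval code: the model paths, the alpha parameter, and the greedy loop are illustrative assumptions, and it presumes all three models produce logits over the same vocabulary (if the Qwen2 checkpoints differ in vocab size, the logits would need to be aligned first).
import torch
from transformers import AutoModelForCausalLM

# Illustrative sketch of the DExperts/proxy-tuning combination; paths and
# alpha are assumptions, not the repo's actual configuration.
base = AutoModelForCausalLM.from_pretrained("Qwen2-7B")
expert = AutoModelForCausalLM.from_pretrained("qwen-2-codealpaca-0.5b")
anti_expert = AutoModelForCausalLM.from_pretrained("Qwen2-0.5B")

@torch.no_grad()
def proxy_tuned_next_token(input_ids, alpha=1.0):
    # Proxy-tuned logits: base + alpha * (expert - anti_expert),
    # taken at the last position of the sequence.
    logits_base = base(input_ids).logits[:, -1, :]
    logits_expert = expert(input_ids).logits[:, -1, :]
    logits_anti = anti_expert(input_ids).logits[:, -1, :]
    combined = logits_base + alpha * (logits_expert - logits_anti)
    # Greedy pick of the next token; sampling is omitted for brevity.
    return combined.argmax(dim=-1, keepdim=True)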
But the EM score of Qwen2-7B alone as the base_model is 0.463109756097561, which is even higher than the proxy-tuned Qwen2-7B. (Run in https://github.com/allenai/open-instruct/.)
size=7
echo "Results dir: results/codex_humaneval/Qwen2-${size}B"
python -m eval.codex_humaneval.run_eval \
--data_file data/eval/codex_humaneval/HumanEval.jsonl \
--save_dir results/codex_humaneval/Qwen2-${size}B \
--model_name_or_path Qwen2-${size}B \
--eval_batch_size 20
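As an aside, the HumanEval score these scripts report is a pass@k-style metric rather than a literal exact match. If it is the standard unbiased pass@k estimator from the Codex paper, a sketch of the computation looks like this (the function name is mine, not open-instruct's):
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k),
    # computed stably as a running product over i = n-c+1 .. n.
    if n - c < k:
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 20 samples per problem, 8 of which pass the unit tests.
print(pass_at_k(n=20, c=8, k=1))  # 0.4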
I am not sure what is going wrong here.