proxy-tuning seems ineffective in some settings #8

@NuoJohnChen

Description

I use Qwen-2-0.5b as the anti-expert model, Qwen-2-0.5b tuned on codex_humaneval as the expert model, and Qwen2-7B as the base model. The EM score of the proxy-tuned Qwen2-7B is 0.4167682926829268.

# Evaluating DExperts with codex_humaneval expert
size=13
echo "Results dir: results/codex_humaneval/dexperts-7B"
python -m eval.codex_humaneval.run_eval_new \
    --data_file data/eval/codex_humaneval/HumanEval.jsonl \
    --save_dir results/codex_humaneval/dexperts-7B \
    --base_model_name_or_path Qwen2-7B \
    --expert_model_name_or_path qwen-2-codealpaca-0.5b \
    --eval_batch_size 20

But the EM score of plain Qwen2-7B as the base model is 0.463109756097561, which is even higher than the proxy-tuned Qwen2-7B. (Run with https://github.com/allenai/open-instruct/)

size=7
echo "Results dir: results/codex_humaneval/Qwen2-${size}B"
python -m eval.codex_humaneval.run_eval \
    --data_file data/eval/codex_humaneval/HumanEval.jsonl \
    --save_dir results/codex_humaneval/Qwen2-${size}B \
    --model_name_or_path Qwen2-${size}B \
    --eval_batch_size 20

I'm not sure what is going wrong here.
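For context, my understanding of the proxy-tuning / DExperts decoding rule is that at each step the base model's next-token logits are shifted by the expert/anti-expert difference. A minimal sketch of that combination on a toy 4-token vocabulary (the `alpha` scaling knob and all names here are my own illustration, not taken from the eval scripts above):

```python
import numpy as np

def dexperts_logits(base, expert, anti_expert, alpha=1.0):
    """Proxy-tuned next-token scores: base + alpha * (expert - anti_expert).

    Each argument is a vocab-sized logit vector for a single decoding step.
    alpha=1.0 is the plain DExperts combination; it is exposed here only
    to make the expert/anti-expert offset explicit.
    """
    base, expert, anti_expert = map(np.asarray, (base, expert, anti_expert))
    return base + alpha * (expert - anti_expert)

def softmax(x):
    z = x - np.max(x)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Toy vocab of 4 tokens: the expert boosts token 2 relative to a uniform
# anti-expert, so the combined distribution should shift toward token 2
# even though the base model alone prefers token 0.
base = np.log([0.4, 0.3, 0.2, 0.1])
expert = np.log([0.1, 0.1, 0.7, 0.1])
anti = np.log([0.25, 0.25, 0.25, 0.25])

p = softmax(dexperts_logits(base, expert, anti))
```

If the combined distribution were not shifting like this at decode time (e.g. the anti-expert logits being dropped or the offset applied with the wrong sign), the proxy-tuned score could easily fall below the plain base model's, which is the symptom above.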
