I use Qwen2-0.5B as the anti_expert_model, Qwen2-0.5B fine-tuned on codex_humaneval as the expert_model, and Qwen2-7B as the base_model. The EM score of the proxy-tuned Qwen2-7B is 0.4167682926829268.
# Evaluating DExperts with codex_humaneval expert
size=13
echo "Results dir: results/codex_humaneval/dexperts-7B"
python -m eval.codex_humaneval.run_eval_new \
--data_file data/eval/codex_humaneval/HumanEval.jsonl \
--save_dir results/codex_humaneval/dexperts-7B \
--base_model_name_or_path Qwen2-7B \
--expert_model_name_or_path qwen-2-codealpaca-0.5b \
--eval_batch_size 20
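For context, proxy tuning (DExperts) steers the base model at each decoding step by adding the logit difference between the expert and the anti-expert to the base model's logits. The following is a minimal sketch of that combination, not the repo's actual eval code: the model paths, the alpha parameter, and the greedy loop are illustrative assumptions, and it presumes all three models produce logits over the same vocabulary (if the Qwen2 checkpoints differ in vocab size, the logits would need to be aligned first).
import torch
from transformers import AutoModelForCausalLM

# Illustrative sketch of the DExperts/proxy-tuning combination; paths and
# alpha are assumptions, not the repo's actual configuration.
base = AutoModelForCausalLM.from_pretrained("Qwen2-7B")
expert = AutoModelForCausalLM.from_pretrained("qwen-2-codealpaca-0.5b")
anti_expert = AutoModelForCausalLM.from_pretrained("Qwen2-0.5B")

@torch.no_grad()
def proxy_tuned_next_token(input_ids, alpha=1.0):
    # Proxy-tuned logits: base + alpha * (expert - anti_expert),
    # taken at the last position of the sequence.
    logits_base = base(input_ids).logits[:, -1, :]
    logits_expert = expert(input_ids).logits[:, -1, :]
    logits_anti = anti_expert(input_ids).logits[:, -1, :]
    combined = logits_base + alpha * (logits_expert - logits_anti)
    # Greedy pick of the next token; sampling is omitted for brevity.
    return combined.argmax(dim=-1, keepdim=True)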
But the EM score of Qwen2-7B alone as the base_model is 0.463109756097561, which is even higher than the proxy-tuned Qwen2-7B. (Run in https://github.com/allenai/open-instruct/.)
size=7
echo "Results dir: results/codex_humaneval/Qwen2-${size}B"
python -m eval.codex_humaneval.run_eval \
--data_file data/eval/codex_humaneval/HumanEval.jsonl \
--save_dir results/codex_humaneval/Qwen2-${size}B \
--model_name_or_path Qwen2-${size}B \
--eval_batch_size 20
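As an aside, the HumanEval score these scripts report is a pass@k-style metric rather than a literal exact match. If it is the standard unbiased pass@k estimator from the Codex paper, a sketch of the computation looks like this (the function name is mine, not open-instruct's):
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k),
    # computed stably as a running product over i = n-c+1 .. n.
    if n - c < k:
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 20 samples per problem, 8 of which pass the unit tests.
print(pass_at_k(n=20, c=8, k=1))  # 0.4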
I am not sure what is going wrong here.