Significantly different results with different backends #2851

wenhuach21 opened this issue Mar 27, 2025 · 2 comments

Comments

wenhuach21 commented Mar 27, 2025

For the model kaitchup/Qwen2.5-72B-Instruct-AutoRoundGPTQ-8bit, the leaderboard_ifeval results differ significantly between the HF backend and the vLLM backend. Could you provide insights into the possible reasons or help debug the issue? Thanks in advance!
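
For reference, the same comparison can also be driven from the harness's Python API. A minimal sketch, assuming lm-evaluation-harness is installed with vLLM support and the quantized checkpoint sits in the current directory; the `model_args` strings simply mirror the CLI invocations below:

```python
import lm_eval

# Mirrors the two CLI runs below; limit=10 keeps the check quick.
# (In practice it may be cleaner to run each backend in a separate
# process so the 72B weights are not loaded twice on the same GPUs.)
TASKS = ["leaderboard_ifeval"]

hf_results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=./,parallelize=True,dtype=float16",
    tasks=TASKS,
    batch_size=16,
    limit=10,
)

vllm_results = lm_eval.simple_evaluate(
    model="vllm",
    model_args="pretrained=./,tensor_parallel_size=2,dtype=float16",
    tasks=TASKS,
    batch_size="auto",
    limit=10,
)

print(hf_results["results"]["leaderboard_ifeval"])
print(vllm_results["results"]["leaderboard_ifeval"])
```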

HF backend

```bash
CUDA_VISIBLE_DEVICES=0,1 lm-eval --model hf --model_args pretrained=./,parallelize=True,dtype=float16 --tasks leaderboard_ifeval --batch_size 16 --limit 10
```

hf (pretrained=./,parallelize=True,dtype=float16), gen_kwargs: (None), limit: 10.0, num_fewshot: None, batch_size: 16

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| leaderboard_ifeval | 3 | none | 0 | inst_level_loose_acc | 0.8333 | ± N/A |
| | | none | 0 | inst_level_strict_acc | 0.7222 | ± N/A |
| | | none | 0 | prompt_level_loose_acc | 0.7000 | ± 0.1528 |
| | | none | 0 | prompt_level_strict_acc | 0.5000 | ± 0.1667 |

vLLM backend

```bash
CUDA_VISIBLE_DEVICES=0,1 lm-eval --model vllm --model_args pretrained=./,tensor_parallel_size=2,dtype=float16 --tasks leaderboard_ifeval --batch_size auto --limit 10
```

vllm (pretrained=./,tensor_parallel_size=2,dtype=float16), gen_kwargs: (None), limit: 10.0, num_fewshot: None, batch_size: auto

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| leaderboard_ifeval | 3 | none | 0 | inst_level_loose_acc | 0.2222 | ± N/A |
| | | none | 0 | inst_level_strict_acc | 0.2222 | ± N/A |
| | | none | 0 | prompt_level_loose_acc | 0.1000 | ± 0.1000 |
| | | none | 0 | prompt_level_strict_acc | 0.1000 | ± 0.1000 |
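
One way to localize a gap like this is to take the harness out of the loop and compare greedy generations from the two backends on an identical prompt. A minimal sketch, assuming transformers (with a GPTQ-capable backend such as gptqmodel/auto-gptq) and vllm are installed and the checkpoint is at ./; the prompt is only illustrative, and the two halves are best run in separate processes so both engines don't hold the GPUs at once:

```python
# --- transformers (HF) backend: greedy generation on a fixed prompt ---
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

PATH = "./"   # the quantized checkpoint used above
PROMPT = "Write exactly three bullet points about the ocean."  # any fixed prompt works

tok = AutoTokenizer.from_pretrained(PATH)
model = AutoModelForCausalLM.from_pretrained(
    PATH, torch_dtype=torch.float16, device_map="auto"
)
inputs = tok(PROMPT, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
# Decode only the newly generated tokens.
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

```python
# --- vLLM backend: same prompt, greedy decoding ---
from vllm import LLM, SamplingParams

PATH = "./"
PROMPT = "Write exactly three bullet points about the ocean."

llm = LLM(model=PATH, dtype="float16", tensor_parallel_size=2)
out = llm.generate([PROMPT], SamplingParams(temperature=0.0, max_tokens=128))
print(out[0].outputs[0].text)
```

If the two completions already diverge badly here, the problem is more likely in how the backends load or execute the GPTQ weights than in the leaderboard_ifeval task itself; if they match, the gap more likely comes from generation parameters or prompt handling inside the harness.
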
@kunxiongzhu

For mlc-llm and llama-cpp-python, I have a similar problem.

@For-rest2005

@kunxiongzhu For llama-cpp-python, the problem may come from abetlen/llama-cpp-python#1983.
