evalscope 多次评测结果时发现输出不一致

我正在使用最新的evalscope评测humaneval数据集，并使用vllm推理，下面是我的评测脚本
使用evalscope后我观察最终的输出发现output并不完全一致
输入的prompt是同一个prompt，同时vllm使用temperature=0
请问这个现象是合理现象么？

```
# humaneval
evalscope eval \
  --model ${MODEL_PATH} \
  --api-url ${BASE_URL}/${VERSION}/chat/completions \
  --api-key "EMPTY" \
  --eval-type openai_api --datasets humaneval \
  --dataset-args '{"humaneval": {"subset_list":["openai_humaneval"], "filters": {"remove_until": "</think>"}}}' \
  --generation-config max_tokens=16384,temperature=0,safety_level=none \
  --timeout 10000 \
  --limit 2 \
  --seed 1234 \
```

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

evalscope 多次评测结果时发现输出不一致 #883

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

evalscope 多次评测结果时发现输出不一致 #883

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions