Great work on tool use! However, I have some questions about the results. I would be grateful for a reply.
1. In StableToolBench, the pass-rate results at https://zhichengg.github.io/stb.github.io/ show gpt-3.5 outperforming gpt-4 under DFS. Do you have any analysis of this result?


2. Large difference vs. the numbers reported in the paper.

Below are my rerun results for pass rate:

- `gpt-4-turbo-preview_cot` (reported on GitHub): the numbers reported in the paper.
- `gpt-4-turbo-preview_cot` (rerun from `data_baselines`): I first downloaded the inference answers from https://huggingface.co/datasets/stabletoolbench/baselines, then used the directory `gpt-4-turbo-preview_cot` and ran the pass-rate evaluation with `gpt-4-turbo-2024-04-09` as the eval model. The result differs substantially from the reported numbers.
- `gpt-4-turbo-2024-04-09_cot`: inference run via the script `inference_chatgpt_pipeline_virtual.sh` with `GPT_MODEL` set to `gpt-4-turbo-2024-04-09`; the eval model is also `gpt-4-turbo-2024-04-09`.
- `gpt-4-turbo-2024-04-09_cot_rerun`: the same setup as `gpt-4-turbo-2024-04-09_cot`, run a second time, which shows that the evaluated pass rate is stable across reruns.
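For context on what the numbers above compare: a pass rate over a set of test queries can be sketched as the fraction of queries the judge model labels as passed. This is a toy illustration only, not StableToolBench's actual evaluation code; the label names are assumptions.

```python
def pass_rate(judgments):
    """Toy sketch: fraction of queries judged 'passed'.

    judgments: list of per-query verdict strings, e.g. 'passed' / 'failed'.
    (Illustrative only -- not the StableToolBench implementation.)
    """
    if not judgments:
        return 0.0
    return sum(1 for j in judgments if j == "passed") / len(judgments)


# Example: 3 of 4 queries judged passed -> 0.75
print(pass_rate(["passed", "failed", "passed", "passed"]))
```

Under a definition like this, rerunning the judge on the same answers should give a similar value unless the judge model itself is noisy, which is consistent with the stable rerun above.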