Great work on tool use! However, I have some questions about the results. I would be grateful for a reply.
1. In StableToolBench, the pass-rate results at https://zhichengg.github.io/stb.github.io/ show gpt-3.5 outperforming gpt-4 under DFS. Do you have any analysis of this result?


2. Large difference vs. the numbers reported in the paper.

Below are my rerun results for pass rate:

- `gpt-4-turbo-preview_cot` (reported on GitHub): the numbers reported in the paper.
- `gpt-4-turbo-preview_cot` (rerun from `data_baselines`): I first downloaded the inference answers from https://huggingface.co/datasets/stabletoolbench/baselines, then used the directory `gpt-4-turbo-preview_cot` and ran the pass-rate evaluation with `gpt-4-turbo-2024-04-09` as the eval model. The result differs substantially from the reported numbers.
- `gpt-4-turbo-2024-04-09_cot`: inference run via the script `inference_chatgpt_pipeline_virtual.sh` with `GPT_MODEL` set to `gpt-4-turbo-2024-04-09`; the eval model is also `gpt-4-turbo-2024-04-09`.
- `gpt-4-turbo-2024-04-09_cot_rerun`: the same setup as `gpt-4-turbo-2024-04-09_cot`, run a second time, which shows that the evaluated pass rate is stable across reruns.
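For context on what the numbers above compare: a pass rate over a set of test queries can be sketched as the fraction of queries the judge model labels as passed. This is a toy illustration only, not StableToolBench's actual evaluation code; the label names are assumptions.

```python
def pass_rate(judgments):
    """Toy sketch: fraction of queries judged 'passed'.

    judgments: list of per-query verdict strings, e.g. 'passed' / 'failed'.
    (Illustrative only -- not the StableToolBench implementation.)
    """
    if not judgments:
        return 0.0
    return sum(1 for j in judgments if j == "passed") / len(judgments)


# Example: 3 of 4 queries judged passed -> 0.75
print(pass_rate(["passed", "failed", "passed", "passed"]))
```

Under a definition like this, rerunning the judge on the same answers should give a similar value unless the judge model itself is noisy, which is consistent with the stable rerun above.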