
gpt3.5 > gpt4 on pass rate? #14


Description

@stanpcf

Great work on tool use!
However, I have some questions about the results and would be grateful for a reply.

1. In StableToolBench, the pass rate results at https://zhichengg.github.io/stb.github.io/ show GPT-3.5 > GPT-4 with DFS. Do you have any analysis of this result?

[screenshot: StableToolBench pass rates]

In `ToolBench`, by contrast, GPT-4 > GPT-3.5 (with both ReAct and DFS).

[screenshot: ToolBench pass rates]

2. Large difference vs. the results reported in the paper

Below are my re-run pass rate results:

[screenshot: pass rate table]

- `gpt-4-turbo-preview_cot (report on github)`: the result reported in the paper.
- `gpt-4-turbo-preview_cot (based on data_baselines rerun)`: I first downloaded the inference answers from https://huggingface.co/datasets/stabletoolbench/baselines, then ran the pass rate evaluation on the `gpt-4-turbo-preview_cot` directory with gpt-4-turbo-2024-04-09 as the evaluation model. This differs a lot from the reported numbers.
- `gpt-4-turbo-2024-04-09_cot`: inference run via the script `inference_chatgpt_pipeline_virtual.sh` with `GPT_MODEL` set to gpt-4-turbo-2024-04-09; the evaluation model is also gpt-4-turbo-2024-04-09 (a rough command sketch is given after this list).
- `gpt-4-turbo-2024-04-09_cot_rerun`: the same as `gpt-4-turbo-2024-04-09_cot` but run a second time, which shows that the evaluated pass rate is stable.
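
For context, here is a minimal sketch of the reproduction steps above. The dataset URL, the script name `inference_chatgpt_pipeline_virtual.sh`, and the `GPT_MODEL` value come from this issue; the clone location and the evaluation entry point are assumptions, not confirmed against the repo.

```bash
# Reproduction sketch (local paths and the evaluation entry point are assumptions).

# 1. Fetch the published baseline inference answers referenced above.
git clone https://huggingface.co/datasets/stabletoolbench/baselines data_baselines

# 2. Re-run inference with the pinned model snapshot.
#    Inside inference_chatgpt_pipeline_virtual.sh, set GPT_MODEL="gpt-4-turbo-2024-04-09".
bash inference_chatgpt_pipeline_virtual.sh

# 3. Score the answers with the same snapshot as the judge.
#    The exact evaluation command below is assumed; adjust to the repo's tooleval scripts.
# bash run_pass_rate.sh   # evaluator model: gpt-4-turbo-2024-04-09
```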
