This repository contains a benchmark that measures the token usage and response time impact of passing different numbers of tools to a large language model.
When working with LLMs that support function/tool calling, there's a tradeoff between:
- Providing more tools (giving the model more capabilities)
- Increasing token usage (which affects cost and latency)
This benchmark quantifies that relationship by measuring:
- Cost overhead vs. number of tools provided
- Response time vs. number of tools provided
Set up your API keys, using `.env.example` as a template.
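A minimal sketch of how those keys might be loaded in Python with `python-dotenv`; the variable names below are assumptions, so use whatever `.env.example` actually defines:

```python
import os

from dotenv import load_dotenv  # from the python-dotenv package

# Read a local .env file into the process environment.
load_dotenv()

# These key names are assumptions; check .env.example for the real ones.
openai_key = os.getenv("OPENAI_API_KEY")
unifai_key = os.getenv("UNIFAI_API_KEY")
```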
To run the benchmark:
`uv run main.py`
The script will:
- Fetch all available toolkits from UnifAI
- For each benchmark run (see the sketch after this list):
  - Randomly select n toolkits (where n ranges from 1 to the total number)
  - Get the tools from these toolkits
  - Make an LLM call with these static tools
  - Record token usage and response time
  - Save results to `benchmark_results.jsonl`
- Plot the results with linear regression
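As a rough illustration of that loop (not the repository's actual implementation), the sketch below uses hypothetical helpers `fetch_toolkits`, `get_tools`, and `call_llm`; only the timing and JSONL bookkeeping are meant literally:

```python
import json
import random
import time

def run_benchmark(fetch_toolkits, get_tools, call_llm, runs=20,
                  results_path="benchmark_results.jsonl"):
    """Hypothetical sketch: measure token usage and latency vs. tool count."""
    toolkits = fetch_toolkits()  # all available toolkits (e.g. from UnifAI)
    with open(results_path, "a") as f:  # append so repeated runs accumulate
        for _ in range(runs):
            n = random.randint(1, len(toolkits))   # how many toolkits to pass
            selected = random.sample(toolkits, n)
            tools = [tool for tk in selected for tool in get_tools(tk)]

            start = time.monotonic()
            response = call_llm(tools=tools)        # one static-tools LLM call
            elapsed = time.monotonic() - start

            record = {
                "num_toolkits": n,
                "num_tools": len(tools),
                # assumes call_llm reports prompt-token usage in its result
                "prompt_tokens": response["prompt_tokens"],
                "response_time_s": elapsed,
            }
            f.write(json.dumps(record) + "\n")
```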
After running the benchmark, you'll get:
- A scatter plot showing cost overhead vs. number of tools
- A scatter plot showing response time vs. number of tools
- Linear regression lines for both relationships
The plots will be saved as `benchmark_results.png`.
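For reference, a minimal sketch of producing such plots from the JSONL records with `matplotlib` and `numpy` (the record field names are assumptions):

```python
import json

import matplotlib.pyplot as plt
import numpy as np

# Load accumulated records; the field names are assumptions, not the script's schema.
with open("benchmark_results.jsonl") as f:
    records = [json.loads(line) for line in f]

x = np.array([r["num_tools"] for r in records])

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, field, label in [
    (axes[0], "prompt_tokens", "Cost overhead (prompt tokens)"),
    (axes[1], "response_time_s", "Response time (s)"),
]:
    y = np.array([r[field] for r in records])
    slope, intercept = np.polyfit(x, y, 1)  # degree-1 fit = linear regression
    ax.scatter(x, y, alpha=0.6)
    ax.plot(x, slope * x + intercept, color="red")
    ax.set_xlabel("Number of tools")
    ax.set_ylabel(label)

fig.tight_layout()
fig.savefig("benchmark_results.png")
```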
Results are accumulated across runs in the `benchmark_results.jsonl` file, so you can run the script multiple times to gather more data points.