Open
56 commits
9e50bff
feat: async loading benchmark data
yaoyifan-yyf Sep 25, 2025
bc14084
opt: code format
yaoyifan-yyf Sep 25, 2025
4de41d1
feat: benchmark post_dispatch service
yaoyifan-yyf Sep 26, 2025
92c251d
opt: async load benchmark data on init
yaoyifan-yyf Sep 29, 2025
6293629
feat(benchmark): execute benchmark task
Oct 8, 2025
4875df1
Merge remote-tracking branch 'origin/feat_dataset_benchmark' into fea…
Oct 8, 2025
57e8bab
feat: query benchmark dataset api
yaoyifan-yyf Oct 9, 2025
c923e64
feat: add benchmark result query api
yaoyifan-yyf Oct 9, 2025
61e350f
Merge remote-tracking branch 'origin/feat_dataset_benchmark' into fea…
chenliang15405 Oct 10, 2025
0330dd3
chore: resolve confict
chenliang15405 Oct 10, 2025
f6351ae
feat(benchmark): optimize benchmark task and write evaluate result ro…
chenliang15405 Oct 10, 2025
81e2a1c
fix: add table mapping
yaoyifan-yyf Oct 11, 2025
8217408
feat(benchmark): create benchmark task
chenliang15405 Oct 11, 2025
5b27b6e
Merge remote-tracking branch 'origin/feat_dataset_benchmark' into fea…
chenliang15405 Oct 11, 2025
33a4e04
fix(benchmark): fix post dispatch param
chenliang15405 Oct 13, 2025
9b81a10
opt: compare result write to excel not db
yaoyifan-yyf Oct 13, 2025
87f11b5
feat(benchmark): multi model post process
chenliang15405 Oct 13, 2025
8d8d455
opt: multi model compare write result
yaoyifan-yyf Oct 13, 2025
c14b68d
feat(benchmark): query benchmark task list
chenliang15405 Oct 13, 2025
ab96ebb
opt: add standard result col to output excel
yaoyifan-yyf Oct 13, 2025
838bc35
Merge remote-tracking branch 'origin/feat_dataset_benchmark' into fea…
yaoyifan-yyf Oct 13, 2025
5df8d94
feat(benchmark): benchmark result file download
chenliang15405 Oct 14, 2025
92243cb
fix(benchmark): parse multi standard anwser
chenliang15405 Oct 15, 2025
65fd87b
fix(benchmark): update standard anwser result field
chenliang15405 Oct 15, 2025
24064d7
fix: ant_icube table mapping correct
yaoyifan-yyf Oct 15, 2025
05b1fb6
fix: col name sanitize modification
yaoyifan-yyf Oct 15, 2025
ff34064
fix: benchmark compare summary write to db
yaoyifan-yyf Oct 16, 2025
41da1b3
fix: benchmark compare summary write to db
yaoyifan-yyf Oct 16, 2025
fb83e30
opt: benchmark result api output adjust
yaoyifan-yyf Oct 16, 2025
ba80df5
opt: api name adjuest
yaoyifan-yyf Oct 16, 2025
2a823ee
feat: support multi benchmark datasets
yaoyifan-yyf Oct 16, 2025
8e025c8
feat(benchmark): update benchmark task status & benchmark task info list
chenliang15405 Oct 16, 2025
8f0b2c3
fix: fix page list request
chenliang15405 Oct 16, 2025
39ac73d
fix(benchmark): execute benchmark with model param
chenliang15405 Oct 16, 2025
cda5b74
fix(benchmark): process sql query timeout
chenliang15405 Oct 16, 2025
ed59a71
fix(benchmark): fix sql query db timeout for blocking thread
chenliang15405 Oct 17, 2025
b410e85
fix(benchmark): remove useless code
chenliang15405 Oct 17, 2025
d2e92e9
feat: add datasets evaluation page (#2908)
iterminatorheart Oct 17, 2025
19bbc6f
feat: evaluation dataset info pages (#2911)
iterminatorheart Oct 19, 2025
9de988f
fix(benchmark): custom model temperature and max token
chenliang15405 Oct 19, 2025
cfdb1db
feat: multi language for models evaluation (#2912)
iterminatorheart Oct 20, 2025
e69a4e5
fix: table error fix
yaoyifan-yyf Oct 20, 2025
3f0d63b
Merge branch 'feat_dataset_benchmark' of github.com:eosphoros-ai/DB-G…
yaoyifan-yyf Oct 20, 2025
48a3798
fix: table error fix
yaoyifan-yyf Oct 20, 2025
48aa204
feat(benchmark): show task name
chenliang15405 Oct 20, 2025
9ada90b
chore: web build file
chenliang15405 Oct 20, 2025
2caa317
fix(benchmark): fix download result url
chenliang15405 Oct 21, 2025
2cd22a2
chore: update ignore
chenliang15405 Oct 21, 2025
8c1bfed
Merge remote-tracking branch 'origin/feat_dataset_benchmark' into fea…
chenliang15405 Oct 21, 2025
6904d02
fix: answer adjustment
yaoyifan-yyf Oct 21, 2025
3beb281
Merge branch 'feat_dataset_benchmark' of github.com:eosphoros-ai/DB-G…
yaoyifan-yyf Oct 21, 2025
59e39b2
chore: update falcon repo url
chenliang15405 Oct 21, 2025
61bbba3
docs: add dataset benchmark docs
chenliang15405 Oct 21, 2025
81cee77
docs: update image
chenliang15405 Oct 22, 2025
f3263c8
docs: update benchmark doc
chenliang15405 Oct 22, 2025
f0de206
docs: update benchmark doc
chenliang15405 Oct 22, 2025
2 changes: 2 additions & 0 deletions .gitignore
@@ -152,6 +152,8 @@ logswebserver.log.*
.plugin_env
/pilot/meta_data/alembic/versions/*
/pilot/meta_data/*.db
/pilot/benchmark_meta_data/*.db
/pilot/benchmark_meta_data/result/*
# Ignore for now
thirdparty

309 changes: 309 additions & 0 deletions docs/docs/api/benchmark.md
@@ -0,0 +1,309 @@
# Datasets Benchmark

Get started with the Benchmark API


### Create Dataset Benchmark Task

```python
POST /api/v2/serve/evaluate/execute_benchmark_task
```

```shell
DBGPT_API_KEY=dbgpt
SPACE_ID={YOUR_SPACE_ID}

curl -X POST "http://localhost:5670/api/v2/serve/evaluate/execute_benchmark_task" \
-H "Authorization: Bearer $DBGPT_API_KEY" \
-H "accept: application/json" \
-H "Content-Type: application/json" \
-d '{
"scene_value": "Falcon_benchmark_01",
"model_list": ["DeepSeek-V3.1", "Qwen3-235B-A22B"]
}'

```

#### The Benchmark Request Object

________
<b>scene_key</b> <font color="gray"> string </font> <font color="red"> Required </font>

The scene type of the evaluation. Supported values include app and recall.

--------
<b>scene_value</b> <font color="gray"> string </font> <font color="red"> Required </font>

The scene value of the benchmark, i.e. the name of the benchmark evaluation task

--------
<b>model_list</b> <font color="gray"> object </font> <font color="red"> Required </font>

The list of model names the benchmark will run against, e.g. ["DeepSeek-V3.1","Qwen3-235B-A22B"]
Note: use the model names exactly as they are configured on the DB-GPT platform.

--------
<b>temperature</b> <font color="gray"> float </font>

The temperature of the LLM. Default is 0.7

--------
<b>max_tokens</b> <font color="gray"> int </font>

The maximum number of tokens for the LLM. Default is None

--------
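The request object above can also be assembled from Python. The following is a minimal sketch using only the standard library; the helper name `build_benchmark_request` is hypothetical, and the base URL and API key are the placeholder values from the curl example. Optional fields are omitted from the payload unless set, on the assumption that the server then applies its documented defaults.

```python
import json
import urllib.request
from typing import List, Optional

BASE_URL = "http://localhost:5670"  # host and port taken from the curl example
API_KEY = "dbgpt"  # placeholder key from the curl example


def build_benchmark_request(
    scene_value: str,
    model_list: List[str],
    scene_key: Optional[str] = None,
    temperature: Optional[float] = None,
    max_tokens: Optional[int] = None,
) -> urllib.request.Request:
    """Build (but do not send) the POST request that creates a benchmark task."""
    payload = {"scene_value": scene_value, "model_list": model_list}
    # Only include optional fields that were explicitly provided.
    if scene_key is not None:
        payload["scene_key"] = scene_key
    if temperature is not None:
        payload["temperature"] = temperature
    if max_tokens is not None:
        payload["max_tokens"] = max_tokens
    return urllib.request.Request(
        f"{BASE_URL}/api/v2/serve/evaluate/execute_benchmark_task",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "accept": "application/json",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

To send the request, pass the returned object to `urllib.request.urlopen` and decode the JSON body of the response.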


#### The Benchmark Result

________
<b>status</b> <font color="gray">string</font>

The benchmark status, e.g. success, failed, running
________


### Query Benchmark Task List

```python
GET /api/v2/serve/evaluate/benchmark_task_list
```

```shell
DBGPT_API_KEY=dbgpt
SPACE_ID={YOUR_SPACE_ID}

curl -X GET "http://localhost:5670/api/v2/serve/evaluate/benchmark_task_list?page=1&page_size=20" \
-H "Authorization: Bearer $DBGPT_API_KEY" \
-H "accept: application/json" \
-H "Content-Type: application/json"

```

#### The Benchmark Task List Request Object

________
<b>page</b> <font color="gray"> string </font> <font color="red"> Required </font>

The page number of the task list to query. Default is 1

--------
<b>page_size</b> <font color="gray"> string </font> <font color="red"> Required </font>

The number of tasks per page. Default is 20

--------
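As a sketch of how the two paging parameters combine, the helpers below build the list URL for one page and for every page of a result set. The function names are hypothetical; `total_pages` is assumed to come from the `total_pages` field of a previous list response.

```python
from typing import List
from urllib.parse import urlencode

BASE_URL = "http://localhost:5670"  # host and port taken from the curl example


def task_list_url(page: int = 1, page_size: int = 20) -> str:
    """Return the benchmark_task_list URL for a single page."""
    query = urlencode({"page": page, "page_size": page_size})
    return f"{BASE_URL}/api/v2/serve/evaluate/benchmark_task_list?{query}"


def all_task_list_urls(total_pages: int, page_size: int = 20) -> List[str]:
    """Return one URL per page; pages are numbered starting at 1."""
    return [task_list_url(page, page_size) for page in range(1, total_pages + 1)]
```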


#### The Benchmark Task List Result

```json
{
"success": true,
"err_code": null,
"err_msg": null,
"data": {
"items": [
{
"evaluate_code": "1ec15dcbf5d54124bd5a5d23992af35d",
"scene_key": "dataset",
"scene_value": "local_benchmark_task_for_Qwen",
"datasets_name": "Falcon评测集",
"input_file_path": "2025_07_27_public_500_standard_benchmark_question_list.xlsx",
"output_file_path": "/DB-GPT/pilot/benchmark_meta_data/result/1ec15dcbf5d54124bd5a5d23992af35d/202510201650_multi_round_benchmark_result.xlsx",
"model_list": [
"Qwen3-Coder-480B-A35B-Instruct"
],
"context": {
"benchmark_config": "{\"file_parse_type\":\"EXCEL\", \"format_type\":\"TEXT\", \"content_type\":\"SQL\", \"benchmark_mode_type\":\"EXECUTE\", \"scene_key\":\"dataset\", \"temperature\":0.6, \"max_tokens\":6000}"
},
"user_name": null,
"user_id": null,
"sys_code": "benchmark_system",
"parallel_num": 1,
"state": "running",
"temperature": null,
"max_tokens": null,
"log_info": null,
"gmt_create": "2025-10-20 16:50:46",
"gmt_modified": "2025-10-20 16:50:46",
"cost_time": null,
"round_time": 1
}
],
"total_count": 80,
"total_pages": 4,
"page": 1,
"page_size": 20
}
}
```

________
<b>evaluate_code</b> <font color="gray">string</font>

The unique code of the benchmark task
________
<b>scene_key</b> <font color="gray">string</font>

The benchmark task scene, e.g. dataset
________
<b>scene_value</b> <font color="gray">string</font>

The benchmark task name
________
<b>datasets_name</b> <font color="gray">string</font>

The name of the dataset the benchmark runs on
________
<b>input_file_path</b> <font color="gray">string</font>

The benchmark dataset file path
________
<b>output_file_path</b> <font color="gray">string</font>

The file path of the benchmark execution result
________
<b>model_list</b> <font color="gray">object</font>

The list of models the benchmark runs against
________
<b>context</b> <font color="gray">object</font>

The benchmark task context
________
<b>user_name</b> <font color="gray">string</font>

The benchmark task user name
________
<b>user_id</b> <font color="gray">string</font>

The benchmark task user id
________
<b>sys_code</b> <font color="gray">string</font>

The benchmark task system code, e.g. benchmark_system
________
<b>parallel_num</b> <font color="gray">int</font>

The number of parallel executions of the benchmark task
________
<b>state</b> <font color="gray">string</font>

The benchmark task state, e.g. running, success, failed
________
<b>temperature</b> <font color="gray">float</font>

The benchmark task LLM temperature
________
<b>max_tokens</b> <font color="gray">int</font>

The benchmark task LLM max tokens
________
<b>log_info</b> <font color="gray">string</font>

The error message, shown if the benchmark task fails during execution
________
<b>gmt_create</b> <font color="gray">string</font>

Task creation time
________
<b>gmt_modified</b> <font color="gray">string</font>

Task finish time
________
<b>cost_time</b> <font color="gray">int</font>

The time taken by the benchmark task
________
<b>round_time</b> <font color="gray">int</font>

The execution round of the benchmark task
________
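In the sample response, `context.benchmark_config` is itself a JSON document encoded as a string, so it needs a second decoding step after the response body has been parsed. A minimal sketch, with the hypothetical helper name `parse_benchmark_config`:

```python
import json
from typing import Any, Dict


def parse_benchmark_config(item: Dict[str, Any]) -> Dict[str, Any]:
    """Decode the JSON string stored under context.benchmark_config.

    Returns an empty dict when the context or the config string is missing.
    """
    context = item.get("context") or {}
    raw = context.get("benchmark_config")
    return json.loads(raw) if raw else {}
```

Note that in the sample item the top-level `temperature` and `max_tokens` are null while the decoded config carries 0.6 and 6000, so a client that wants the values used for the run may need to read them from the decoded config.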


### Benchmark Compare Result

```python
GET /api/v2/serve/evaluate/benchmark/result/{evaluate_code}
```

```shell
DBGPT_API_KEY=dbgpt
SPACE_ID={YOUR_SPACE_ID}

curl -X GET "http://localhost:5670/api/v2/serve/evaluate/benchmark/result/{evaluate_code}" \
-H "Authorization: Bearer $DBGPT_API_KEY" \
-H "accept: application/json" \
-H "Content-Type: application/json"

```

#### The Benchmark Request Object

________
<b>evaluate_code</b> <font color="gray"> string </font> <font color="red"> Required </font>

The unique code of the benchmark task

--------

#### The Benchmark Result

```json
{
"success": true,
"err_code": null,
"err_msg": null,
"data": {
"evaluate_code": "c827a274b4084f5dbce4c630f5267239",
"scene_value": "Falcon评测集_benchmark",
"summaries": [
{
"roundId": 1,
"llmCode": "Qwen3-Coder-480B-A35B-Instruct",
"right": 136,
"wrong": 269,
"failed": 95,
"exception": 0,
"accuracy": 0.272,
"execRate": 0.81,
"outputPath": "/DB-GPT/pilot/benchmark_meta_data/result/c827a274b4084f5dbce4c630f5267239/202510181449_multi_round_benchmark_result.xlsx"
}
]
}
}
```

________
<b>roundId</b> <font color="gray">int</font>

The execution round of the benchmark task
________
<b>llmCode</b> <font color="gray">string</font>

The name of the model the benchmark task was executed with
________
<b>right</b> <font color="gray">int</font>

The number of questions answered correctly
________
<b>wrong</b> <font color="gray">int</font>

The number of questions answered incorrectly
________
<b>failed</b> <font color="gray">int</font>

The number of questions that failed to execute
________
<b>exception</b> <font color="gray">int</font>

The number of questions that raised an exception
________
<b>accuracy</b> <font color="gray">float</font>

The accuracy rate over the benchmark question list
________
<b>execRate</b> <font color="gray">float</font>

The executable rate over the benchmark question list
________
<b>outputPath</b> <font color="gray">string</font>

The output file path of the benchmark execution result
________
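The two rates in the sample summary are consistent with simple ratios over the four counters: 136 + 269 + 95 + 0 = 500 questions, 136 / 500 = 0.272 for `accuracy`, and (136 + 269) / 500 = 0.81 for `execRate`. The sketch below reproduces those numbers. The formulas are inferred from the sample values only and are an assumption, not a confirmed description of the server-side computation; the function name is hypothetical.

```python
from typing import Dict


def summarize_counts(right: int, wrong: int, failed: int, exception: int) -> Dict[str, float]:
    """Recompute accuracy and execRate from the per-round counters.

    Assumed formulas (inferred from the sample response):
      accuracy = right / total
      execRate = (right + wrong) / total
    """
    total = right + wrong + failed + exception
    if total == 0:
        # Avoid division by zero for an empty round.
        return {"accuracy": 0.0, "execRate": 0.0}
    return {
        "accuracy": right / total,
        "execRate": (right + wrong) / total,
    }
```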
