BizFinBench: A Business-Driven Real-World Financial Benchmark for Evaluating LLMs

Guilong Lu^1,*, Xuntao Guo^1,2,*, Rongjunchen Zhang^1,♠, Wenqiao Zhu^1, Ji Liu^1,♠

^1 Hithink Research, ^2 Harbin Institute of Technology
^*Co-first authors, ^♠Corresponding author, {zhangrongjunchen, liuji}@myhexin.com

📖 Paper | 🏠 Homepage | 🤗 Hugging Face

Large language models excel across general tasks, yet judging their reliability in logic-heavy, precision-critical domains such as finance, law and healthcare is still difficult. To address this challenge, we propose BizFinBench, the first benchmark grounded in real-world financial applications. BizFinBench comprises more than 100,000 bilingual (English and Chinese) financial questions, each rooted in real-world business scenarios. The first public release, BizFinBench.v1, delivers 6,781 well-annotated Chinese queries covering five dimensions (numerical calculation, reasoning, information extraction, prediction recognition and knowledge-based question answering), which are mapped to nine fine-grained categories.

📢 News

🚀 [04/07/2025] External API support is now live—evaluate BizFinBench with your own endpoints in just a few calls.
🚀 [16/05/2025] We released the BizFinBench.v1 benchmark, the first benchmark grounded in real-world financial applications.

💡 Highlights

🔥 Benchmark: We propose BizFinBench, the first evaluation benchmark in the financial domain that integrates business-oriented tasks, covering 5 dimensions and 9 categories. It is designed to assess the capacity of LLMs in real-world financial scenarios.
🔥 Judge model: We design a novel evaluation method, IteraJudge, which enhances the capability of LLMs as judges by refining their decision boundaries in specific financial evaluation tasks.
🔥 Key insights: We conduct a comprehensive evaluation of 25 LLMs on BizFinBench, uncovering key insights into their strengths and limitations in financial applications.

📕 Data Distribution

This dataset contains multiple subtasks, each focusing on a different financial understanding and reasoning ability, as follows:

Dataset	Description	Evaluation Dimensions	Volume
Anomalous Event Attribution	A financial anomaly attribution evaluation dataset assessing models' ability to trace stock fluctuations based on given information (e.g., timestamps, news articles, financial reports, and stock movements).	Causal consistency, information relevance, noise resistance	1,064
Financial Numerical Computation	A financial numerical computation dataset evaluating models' ability to perform accurate numerical calculations in financial scenarios, including interest rate calculations, gain/loss computations, etc.	Calculation accuracy, unit consistency	581
Financial Time Reasoning	A financial temporal reasoning evaluation dataset assessing models' ability to comprehend and reason about time-based financial events, such as "the previous trading day" or "the first trading day of the year."	Temporal reasoning correctness	514
Financial Data Description	A financial data description evaluation dataset measuring models' ability to analyze and describe structured/unstructured financial data, e.g., "the stock price first rose to XX before falling to XX."	Trend accuracy, data consistency	1,461
Stock Price Prediction	A stock price movement prediction dataset evaluating models' ability to forecast future stock price trends based on historical data, financial indicators, and market news.	Trend judgment, causal rationality	497
Financial Named Entity Recognition	A financial named entity recognition dataset assessing models' ability to identify entities (Person, Organization, Market, Location, Financial Products, Date/Time) in short/long financial news.	Recognition accuracy, entity category correctness	433
Emotion Recognition	A financial sentiment recognition dataset evaluating models' ability to discern nuanced user emotions in complex financial market environments. Inputs include multi-dimensional data such as market conditions, news, research reports, user holdings, and queries, covering six emotion categories: optimism, anxiety, pessimism, excitement, calmness, and regret.	Emotion classification accuracy, implicit information extraction and reasoning correctness	600
Financial Tool Usage	A financial tool usage dataset evaluating models' ability to understand user queries and appropriately utilize various financial tools (investment analysis, market research, information retrieval, etc.) to solve real-world problems. Tools include calculators, financial encyclopedia queries, search engines, data queries, news queries, economic calendars, and company lookups. Models must accurately interpret user intent, select appropriate tools, input correct parameters, and coordinate multiple tools when necessary.	Tool selection rationality, parameter input accuracy, multi-tool coordination capability	641
Financial Knowledge QA	A financial encyclopedia QA dataset assessing models' understanding and response accuracy regarding core financial knowledge, covering key domains: financial fundamentals, markets, investment theories, macroeconomics, etc.	Query comprehension accuracy, knowledge coverage breadth, answer accuracy and professionalism	990
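
As a quick consistency check, the per-dataset volumes in the table above can be summed and compared with the 6,781 queries stated for BizFinBench.v1. The sketch below only copies the figures from the table; the dictionary name VOLUMES is illustrative and not part of the repository.

```python
# Volumes copied from the data distribution table (BizFinBench.v1).
# VOLUMES is an illustrative name, not an object defined in the repository.
VOLUMES = {
    "Anomalous Event Attribution": 1064,
    "Financial Numerical Computation": 581,
    "Financial Time Reasoning": 514,
    "Financial Data Description": 1461,
    "Stock Price Prediction": 497,
    "Financial Named Entity Recognition": 433,
    "Emotion Recognition": 600,
    "Financial Tool Usage": 641,
    "Financial Knowledge QA": 990,
}

total = sum(VOLUMES.values())
print(f"{len(VOLUMES)} datasets, {total:,} queries in total")
# prints: 9 datasets, 6,781 queries in total
```

The nine datasets correspond to the nine fine-grained categories, and their volumes add up to the stated total.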

📚 Example

🛠️ Usage

llm-eval
├── README.md
├── benchmark_code
├── config # All custom sample configs can be found in this folder
├── envs # Environment settings
├── inference # All inference-engine-related code is in this folder
├── post_eval.py # Evaluation launcher after inference is finished
├── requirements.txt
├── run.py # Entry point for the entire evaluation workflow
├── run.sh # Sample execution script for launching an evaluation; maintain your own run.sh as needed
├── scripts # Reference run.sh scripts
├── tools # tools
├── statistic.py # Aggregates final evaluation statistics
└── utils

Install requirements

pip install -r requirements.txt

Quick Start – Evaluate a Local Model

export MODEL_PATH=model/Qwen2.5-0.5B   # Path to the model to be evaluated
export REMOTE_MODEL_PORT=16668
export REMOTE_MODEL_URL=http://127.0.0.1:${REMOTE_MODEL_PORT}/model
export MODEL_NAME=Qwen2.5-0.5B
export PROMPT_TYPE=chat_template   # Options: Hithink, llama3, llama2, none, qwen, chat_template (recommended)
export TENSOR_PARALLEL=1   # Tensor-parallel degree used below; 1 is an assumed value, adjust to your GPU count

# First start the model as a service
python inference/predict_multi_gpu.py \
    --model ${MODEL_PATH} \
    --server_port ${REMOTE_MODEL_PORT} \
    --prompt ${PROMPT_TYPE} \
    --preprocess preprocess \
    --run_forever \
    --max_new_tokens 4096 \
    --tensor_parallel ${TENSOR_PARALLEL} & 

# Pass in the config file path to start evaluation
python run.py --config config/offical/eval_fin_eval_diamond.yaml --model_name ${MODEL_NAME}
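
Because the model service is launched in the background with &, the evaluation command may start before the server is listening. Whether run.py already waits for the service is not stated here, so the following is only a minimal sketch of a readiness check that could be run in between. The helper name wait_for_port and its parameters are hypothetical; it relies only on the Python standard library.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_s: float = 600.0,
                  interval_s: float = 2.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout.

    Hypothetical helper, not part of the repository.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            # Connection refused or timed out: wait a little and retry.
            time.sleep(interval_s)
    return False
```

For example, wait_for_port("127.0.0.1", 16668) could be called before launching run.py. An open port only shows that the server process is accepting connections, not that the model has finished loading, so treat it as a coarse check.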

Quick Start – Evaluate a Local Model and Score with a Judge Model

export MODEL_PATH=model/Qwen2.5-0.5B   # Path to the model to be evaluated
export REMOTE_MODEL_PORT=16668
export REMOTE_MODEL_URL=http://127.0.0.1:${REMOTE_MODEL_PORT}/model
export MODEL_NAME=Qwen2.5-0.5B
export PROMPT_TYPE=chat_template   # Options: llama3, llama2, none, qwen, chat_template (recommended)
export TENSOR_PARALLEL=1   # Tensor-parallel degree used below; 1 is an assumed value, adjust to your GPU count

# First start the model as a service
python inference/predict_multi_gpu.py \
    --model ${MODEL_PATH} \
    --server_port ${REMOTE_MODEL_PORT} \
    --prompt ${PROMPT_TYPE} \
    --preprocess preprocess \
    --run_forever \
    --max_new_tokens 4096 \
    --tensor_parallel ${TENSOR_PARALLEL} \
    --low_vram & 

# Start the judge model
export JUDGE_MODEL_PATH=/mnt/data/llm/models/base/Qwen2.5-7B
export JUDGE_TENSOR_PARALLEL=1
export JUDGE_MODEL_PORT=16667
python inference/predict_multi_gpu.py \
    --model ${JUDGE_MODEL_PATH} \
    --server_port ${JUDGE_MODEL_PORT} \
    --prompt chat_template \
    --preprocess preprocess \
    --run_forever \
    --manual_start \
    --max_new_tokens 4096 \
    --tensor_parallel ${JUDGE_TENSOR_PARALLEL} \
    --low_vram &

# Pass in the config file path to start evaluation
python run.py --config "config/offical/eval_fin_eval.yaml" --model_name ${MODEL_NAME}

Note: Add the --manual_start argument when launching the judge model, because the judge must wait until the main model finishes inference before starting (this is handled automatically by the maybe_start_judge_model function in run.py).

Quick Start – Evaluate External APIs (e.g., ChatGPT)

export API_NAME=chatgpt # API name; currently only chatgpt is supported
export API_KEY=xxx # Your API key
export MODEL_NAME=gpt-4.1

# Pass in the config file path to start evaluation
python run.py --config config/offical/eval_fin_eval_diamond.yaml --model_name ${MODEL_NAME}

Note: You can adjust the API's queries-per-second limit by modifying the semaphore_limit setting in envs/constants.py, e.g., GPTClient(api_name=api_name, api_key=api_key, model_name=model_name, base_url='https://api.openai.com/v1/chat/completions', timeout=600, semaphore_limit=5)
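
To illustrate what a semaphore limit does in general, the sketch below caps the number of simulated requests that are in flight at once. It is not the GPTClient implementation: whether GPTClient uses asyncio internally is an assumption, and call_api, run_all and the stats dictionary are hypothetical names. Strictly speaking, a semaphore bounds concurrent requests rather than requests per second, so the effective rate also depends on how long each request takes.

```python
import asyncio


async def call_api(i: int, sem: asyncio.Semaphore, stats: dict) -> int:
    """Stand-in for one API request; real code would send an HTTP call here."""
    async with sem:  # at most `semaphore_limit` coroutines are inside this block
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        await asyncio.sleep(0.01)  # simulated network latency
        stats["active"] -= 1
    return i


async def run_all(n_requests: int, semaphore_limit: int) -> dict:
    """Issue n_requests simulated calls and record the peak concurrency."""
    sem = asyncio.Semaphore(semaphore_limit)
    stats = {"active": 0, "peak": 0}
    results = await asyncio.gather(
        *(call_api(i, sem, stats) for i in range(n_requests))
    )
    stats["completed"] = len(results)
    return stats


stats = asyncio.run(run_all(n_requests=20, semaphore_limit=5))
print(stats)
```

With semaphore_limit=5, all 20 simulated requests complete while the recorded peak never exceeds 5. Lowering the limit trades throughput for a smaller chance of hitting the provider's rate limits.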

✒️Results

The models are evaluated across multiple tasks, with medals marking the top three performers for each task:

🥇 indicates the top-performing model.
🥈 represents the second-best result.
🥉 denotes the third-best performance.

Column abbreviations: AEA = Anomalous Event Attribution, FNC = Financial Numerical Computation, FTR = Financial Time Reasoning, FTU = Financial Tool Usage, FQA = Financial Knowledge QA, FDD = Financial Data Description, ER = Emotion Recognition, SP = Stock Price Prediction, FNER = Financial Named Entity Recognition.

Model	AEA	FNC	FTR	FTU	FQA	FDD	ER	SP	FNER	Average
Proprietary LLMs
ChatGPT-o3	🥈 86.23	61.30	🥈 75.36	🥇 89.15	🥈 91.25	🥉 98.55	🥉 44.48	53.27	65.13	🥇 73.86
ChatGPT-o4-mini	🥉 85.62	60.10	71.23	74.40	90.27	95.73	🥇 47.67	52.32	64.24	71.29
GPT-4o	79.42	56.51	🥇 76.20	82.37	87.79	🥇 98.84	🥈 45.33	54.33	65.37	🥉 71.80
Gemini-2.0-Flash	🥇 86.94	🥉 62.67	73.97	82.55	90.29	🥈 98.62	22.17	🥉 56.14	54.43	69.75
Claude-3.5-Sonnet	84.68	🥈 63.18	42.81	🥈 88.05	87.35	96.85	16.67	47.60	63.09	65.59
Open Source LLMs
Qwen2.5-7B-Instruct	73.87	32.88	39.38	79.03	83.34	78.93	37.50	51.91	30.31	56.35
Qwen2.5-72B-Instruct	69.27	54.28	70.72	85.29	87.79	97.43	35.33	55.13	54.02	67.70
Qwen2.5-VL-3B	53.85	15.92	17.29	8.95	81.60	59.44	39.50	52.49	21.57	38.96
Qwen2.5-VL-7B	73.87	32.71	40.24	77.85	83.94	77.41	38.83	51.91	33.40	56.68
Qwen2.5-VL-14B	37.12	41.44	53.08	82.07	84.23	7.97	37.33	54.93	47.47	49.52
Qwen2.5-VL-32B	76.79	50.00	62.16	83.57	85.30	95.95	40.50	54.93	🥉 68.36	68.62
Qwen2.5-VL-72B	69.55	54.11	69.86	85.18	87.37	97.34	35.00	54.94	54.41	67.53
Qwen3-1.7B	77.40	35.80	33.40	75.82	73.81	78.62	22.40	48.53	11.23	50.78
Qwen3-4B	83.60	47.40	50.00	78.19	82.24	80.16	42.20	50.51	25.19	59.94
Qwen3-14B	84.20	58.20	65.80	82.19	84.12	92.91	33.00	52.31	50.70	67.05
Qwen3-32B	83.80	59.60	64.60	85.12	85.43	95.37	39.00	52.26	49.19	68.26
Xuanyuan3-70B	12.14	19.69	15.41	80.89	86.51	83.90	29.83	52.62	37.33	46.48
Llama-3.1-8B-Instruct	73.12	22.09	2.91	77.42	76.18	69.09	29.00	54.21	36.56	48.95
Llama-3.1-70B-Instruct	16.26	34.25	56.34	80.64	79.97	86.90	33.33	🥇 62.16	45.95	55.09
Llama 4 Scout	73.60	45.80	44.20	85.02	85.21	92.32	25.60	55.76	43.00	61.17
DeepSeek-V3 (671B)	74.34	61.82	72.60	🥉 86.54	🥉 91.07	98.11	32.67	55.73	🥈 71.24	71.57
DeepSeek-R1 (671B)	80.36	🥇 64.04	🥉 75.00	81.96	🥇 91.44	98.41	39.67	55.13	🥇 71.46	🥈 73.05
QwQ-32B	84.02	52.91	64.90	84.81	89.60	94.20	34.50	🥈 56.68	30.27	65.77
DeepSeek-R1-Distill-Qwen-14B	71.33	44.35	16.95	81.96	85.52	92.81	39.50	50.20	52.76	59.49
DeepSeek-R1-Distill-Qwen-32B	73.68	51.20	50.86	83.27	87.54	97.81	41.50	53.92	56.80	66.29
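
The Average column appears to be the unweighted mean of the nine task scores. The sketch below checks this for two rows copied from the table; it is a sanity check on the published numbers and not the repository's statistic.py, and the name ROWS is illustrative.

```python
# Task scores copied from the results table, in column order:
# AEA, FNC, FTR, FTU, FQA, FDD, ER, SP, FNER.
# Each entry pairs the nine scores with the reported Average.
ROWS = {
    "ChatGPT-o3": ([86.23, 61.30, 75.36, 89.15, 91.25, 98.55, 44.48, 53.27, 65.13], 73.86),
    "GPT-4o": ([79.42, 56.51, 76.20, 82.37, 87.79, 98.84, 45.33, 54.33, 65.37], 71.80),
}

means = {}
for model, (scores, reported) in ROWS.items():
    # Unweighted mean over the nine tasks.
    means[model] = sum(scores) / len(scores)
    print(f"{model}: computed {means[model]:.2f}, reported {reported:.2f}")
```

For both rows the computed mean matches the reported Average to two decimal places, so every task contributes equally regardless of its dataset volume.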

✒️Citation

@article{lu2025bizfinbench,
  title={BizFinBench: A Business-Driven Real-World Financial Benchmark for Evaluating LLMs},
  author={Lu, Guilong and Guo, Xuntao and Zhang, Rongjunchen and Zhu, Wenqiao and Liu, Ji},
  journal={arXiv preprint arXiv:2505.19457},
  year={2025}
}

📄 License

Usage and License Notices: The data and code are intended and licensed for research use only. License: Attribution-NonCommercial 4.0 International (CC BY-NC 4.0). Use should also abide by the policy of OpenAI: https://openai.com/policies/terms-of-use

💖 Acknowledgement

We would like to thank Weijie Zhang for his contribution to the development of the inference engine.
This work leverages vLLM as the backend model server for evaluation purposes.
