TechSupportEval is an automated evaluation framework for technical support QA.
[Update 1] The paper "TechSupportEval: An Automated Evaluation Framework for Technical Support Question Answering" has been accepted at IJCNN 2025.
TechSupportEval introduces two novel techniques:

- ClozeFact formulates fact verification as a cloze test, using an LLM to fill in missing key terms, which ensures precise matching of key information.
- StepRestore shuffles the ground-truth steps and uses an LLM to reconstruct them in the correct order, verifying both the step sequence and completeness (a conceptual sketch follows this list).
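To make the two checks concrete, here is a minimal Python sketch of the underlying idea. It is not the repository's implementation: the prompts, the `llm` callable (for example, a thin wrapper around GPT-4o-mini), the assumption that key terms and steps are already extracted, and the crude exact-match scoring are all illustrative simplifications.

```python
# Conceptual sketch of ClozeFact and StepRestore, assuming the key terms and
# ground-truth steps have already been extracted. Prompts, the `llm` callable,
# and the scoring are illustrative, not the repository's code.
import random
from typing import Callable

Llm = Callable[[str], str]  # e.g. a thin wrapper around GPT-4o-mini


def clozefact_score(ground_truth: str, key_terms: list[str], answer: str, llm: Llm) -> float:
    """Mask each key term in the ground truth and ask the LLM to fill the
    blanks using only the candidate answer; return the fraction recovered."""
    cloze = ground_truth
    for i, term in enumerate(key_terms):
        cloze = cloze.replace(term, f"[BLANK_{i}]", 1)
    prompt = (
        "Fill each [BLANK_i] using only the candidate answer. "
        "Reply with one term per line, in order.\n\n"
        f"Candidate answer:\n{answer}\n\nPassage with blanks:\n{cloze}"
    )
    filled = [line.strip() for line in llm(prompt).splitlines() if line.strip()]
    hits = sum(f.lower() == t.lower() for f, t in zip(filled, key_terms))
    return hits / max(len(key_terms), 1)


def steprestore_score(steps: list[str], answer: str, llm: Llm, seed: int = 0) -> float:
    """Shuffle the ground-truth steps and ask the LLM to restore their order
    using only the candidate answer; return 1.0 on an exact restoration."""
    order = list(range(len(steps)))
    random.Random(seed).shuffle(order)
    shuffled = "\n".join(f"{i + 1}. {steps[j]}" for i, j in enumerate(order))
    prompt = (
        "The steps below are shuffled. Using only the candidate answer, reply "
        "with the step numbers in their correct order, comma-separated.\n\n"
        f"Candidate answer:\n{answer}\n\nShuffled steps:\n{shuffled}"
    )
    reply = llm(prompt).replace(" ", "")
    predicted = [int(x) - 1 for x in reply.split(",") if x.isdigit()]
    restored = [order[i] for i in predicted if 0 <= i < len(order)]
    return float(restored == list(range(len(steps))))
```

The exact-match scoring here is deliberately crude; it only illustrates the cloze-and-restore mechanism described above.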
An illustrative example is shown below:
We propose a benchmark dataset built upon the publicly available TechQA dataset, which includes responses generated by QA systems of varying capability levels. TechSupportEval achieves an AUC of 0.91, outperforming the state-of-the-art method by 7.6%.
Detailed results are shown in the table below:
To use this evaluation framework, follow these steps:

- Download the repository and navigate to the directory:

  ```
  cd TechSupportEval
  ```

- Set up the environment (Conda recommended):

  ```
  conda create --name=tseval python=3.10
  conda activate tseval
  pip install -r requirements.txt
  ```

- Configure the LLM for evaluation. By default, we use OpenAI's GPT-4o-mini:

  ```
  cp .env.example .env
  ```

  Then update `.env` with your `OPENAI_API_KEY` and, optionally, `OPENAI_API_BASE`.

- Run an example test:

  ```
  python -m tseval.metric examples/1.json
  ```
The general usage is:

```
python -m tseval.metric <input_path> [output_path]
```

`<input_path>` is a JSON file containing three required fields: `question`, `ground_truth`, and `answer`. The evaluation results will be displayed in the console. If `<output_path>` is specified, a JSON report will also be saved to that location.
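For a quick programmatic smoke test, the sketch below writes a minimal input file and invokes the CLI shown above. The file name, report name, and all field values are hypothetical placeholders; only the three field names and the command line come from this README, and the exact value format expected for each field may differ (see `examples/1.json` for a real input).

```python
# Hedged sketch: build a minimal input file and call the documented CLI.
# The field values below are hypothetical placeholders, not real TechQA data.
import json
import subprocess
import sys

example = {
    "question": "How do I resolve error XYZ when starting the server?",       # hypothetical
    "ground_truth": "1. Stop the server. 2. Clear the cache. 3. Restart it.",  # hypothetical
    "answer": "Stop the server, clear the cache, and then restart it.",        # hypothetical
}

with open("my_case.json", "w") as f:  # hypothetical file name
    json.dump(example, f, indent=2)

# Equivalent to: python -m tseval.metric my_case.json my_report.json
subprocess.run(
    [sys.executable, "-m", "tseval.metric", "my_case.json", "my_report.json"],
    check=True,
)
```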
The evaluation results for examples/1.json and examples/2.json are available in:
We constructed a benchmark dataset based on TechQA, the most comprehensive publicly available technical support QA dataset. We generated responses using multiple QA systems with varying capability levels and obtained human expert annotations for their true accuracy.
We implemented three RAG-based QA systems using LangChain, each leveraging a different foundation model:
- GPT-4o Mini
- LLaMA 3 (70B)
- LLaMA 3 (8B)
Each QA system was used to generate responses for all 282 filtered questions. We then conducted a human evaluation with 5 domain experts, who annotated each response for accuracy.
The implementations of these three QA systems can be found in the `rag` directory. The generated answers and human evaluation data correspond to `data/final/techqa{1,2,3}.json`, which collectively form the dataset used in this work.
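If you want to inspect the benchmark locally, the sketch below simply loads the three files and reports how many records each contains. The paths come from the text above; the per-record schema (and whether each file corresponds to one QA system) is not assumed here.

```python
# Hedged sketch: load the benchmark files and count their records.
# The per-record structure is intentionally not assumed.
import json

paths = [f"data/final/techqa{i}.json" for i in (1, 2, 3)]
for path in paths:
    with open(path) as f:
        data = json.load(f)
    print(f"{path}: {len(data)} records")
```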
We provide all experiment configurations, raw results, and evaluation reports:
- Experiment configurations: `config/experiments/`
- Raw experiment results: `output/`
- Evaluation reports: `output/reports/`
The experiment files related to the 3 research questions are:
- RQ1 (Effectiveness)
- RQ2 (Impact of LLM_eval)
- RQ3 (Efficiency and Cost)
You can find the raw experiment results in the output directory under their respective subdirectories.
To reproduce the experiments, first install the development dependencies:

```
pip install -r requirements.dev.txt
```

Then run:

```
python -m scripts.auto_run --schema <id> --metric_only
```

where `<id>` corresponds to the experiment configuration file `config/experiments/{id}.txt`.
To generate Table 1, run:

```
python -m scripts.gen_table1
```
To re-run the experiments from scratch, first remove the existing output files:

```
rm -rf outputs/*
```

Then execute:

```
python -m scripts.auto_run --schema <id>
```

This will run all experiments specified in `config/experiments/{id}.txt` and automatically configure environments and dependencies. The environment configurations can be found in `config/envs/`, and virtual environments will be created in the `venv` directory.
Note: Due to the stochastic nature of LLMs, results may vary slightly from those in the paper. If failures occur, the script will retry up to 3 times. If issues persist, rerun the command to complete unfinished or failed test cases.
To run the RQ2 experiments, ensure your `OPENAI_API_BASE` supports multiple models. We recommend using one-api.