
TechSupportEval

TechSupportEval is an automated evaluation framework for technical support QA.

Update: The paper "TechSupportEval: An Automated Evaluation Framework for Technical Support Question Answering" has been accepted at IJCNN 2025.

Introduction

TechSupportEval introduces two novel techniques:

  • ClozeFact formulates fact verification as a cloze test, using an LLM to fill in missing key terms and thereby ensuring precise matching of key information.

  • StepRestore shuffles the ground-truth steps and uses an LLM to reconstruct them in the correct order, verifying both the ordering and the completeness of the steps.

The two techniques are illustrated in the pipeline figure below; a minimal code sketch of both checks follows the figure.

[Figure: TechSupportEval pipeline]
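
To make the two checks concrete, here is a minimal, self-contained sketch of the idea in Python. It is not the framework's implementation: the ask_llm helper, the prompt wording, the key_terms input, and the scoring are all hypothetical placeholders.

    import random

    def ask_llm(prompt: str) -> str:
        """Hypothetical stand-in for a chat-completion call (e.g. GPT-4o mini)."""
        raise NotImplementedError

    def cloze_fact_check(answer: str, ground_truth: str, key_terms: list[str]) -> float:
        """ClozeFact idea: mask key terms in the ground truth, ask the LLM to fill
        them back in using only the candidate answer, then score exact matches."""
        cloze = ground_truth
        for term in key_terms:
            cloze = cloze.replace(term, "____")
        prompt = (
            "Fill in each blank using ONLY the information in the answer below.\n"
            f"Answer: {answer}\n"
            f"Cloze text: {cloze}\n"
            "Return the filled-in terms, one per line."
        )
        filled = [line.strip() for line in ask_llm(prompt).splitlines() if line.strip()]
        hits = sum(1 for want, got in zip(key_terms, filled) if want.lower() == got.lower())
        return hits / max(len(key_terms), 1)

    def step_restore_check(answer: str, ground_truth_steps: list[str]) -> bool:
        """StepRestore idea: shuffle the ground-truth steps and ask the LLM to restore
        their order from the candidate answer; a correct restoration indicates the
        answer preserves both the step sequence and the steps themselves."""
        shuffled = ground_truth_steps[:]
        random.shuffle(shuffled)
        prompt = (
            "Reorder the shuffled steps so they match the procedure in the answer.\n"
            f"Answer: {answer}\n"
            "Shuffled steps:\n"
            + "\n".join(f"- {step}" for step in shuffled)
            + "\nReturn the steps in the correct order, one per line."
        )
        restored = [line.lstrip("- ").strip() for line in ask_llm(prompt).splitlines() if line.strip()]
        return restored == ground_truth_steps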

We propose a benchmark dataset built upon the publicly available TechQA dataset, which includes responses generated by QA systems of varying capability levels. TechSupportEval achieves an AUC of 0.91, outperforming the state-of-the-art method by 7.6%.

Detailed results are shown in the table below:

[Figure: Evaluation results table]

Usage

To use this evaluation framework, follow these steps:

  1. Clone the repository and navigate to the directory

    git clone https://github.com/NetManAIOps/TechSupportEval.git
    cd TechSupportEval
  2. Set up the environment (Conda recommended)

    conda create --name=tseval python=3.10  
    conda activate tseval  
    pip install -r requirements.txt  
  3. Configure the LLM for evaluation

    By default, we use OpenAI's GPT-4o-mini for evaluation.

    cp .env.example .env  

    Then, update .env with your OPENAI_API_KEY and optionally OPENAI_API_BASE.
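
    For reference, the resulting .env might look like the following. The values are placeholders; OPENAI_API_BASE is only needed if you route requests through an OpenAI-compatible proxy or gateway.

    OPENAI_API_KEY=sk-xxxxxxxxxxxxxxxx
    OPENAI_API_BASE=https://your-gateway.example.com/v1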

  4. Run an example test

    python -m tseval.metric examples/1.json  

Command format

python -m tseval.metric <input_path> [output_path]

<input_path>: A JSON file containing three required fields: question, ground_truth, and answer.

The evaluation results will be displayed in the console.

If <output_path> is specified, a JSON report will be saved to that location.
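
As a quick reference, the sketch below writes a minimal input file with the three required fields. The question, ground-truth, and answer texts are hypothetical placeholders; examples/1.json remains the authoritative reference for the expected format.

    import json

    # Hypothetical example contents; only the three field names are required.
    example = {
        "question": "How do I resolve the startup error in product X?",
        "ground_truth": "1. Stop the service. 2. Update the configuration file. 3. Restart the service.",
        "answer": "Stop the service, edit the configuration, then restart it.",
    }

    with open("my_input.json", "w") as f:
        json.dump(example, f, indent=2)

    # Evaluate it afterwards (the report path is optional):
    #   python -m tseval.metric my_input.json my_report.json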

Example outputs

Evaluation results for examples/1.json and examples/2.json are provided in the repository.

Datasets

We constructed a benchmark dataset based on TechQA, the most comprehensive publicly available technical support QA dataset. We generated responses using multiple QA systems with varying capability levels and obtained human expert annotations for their true accuracy.

We implemented three RAG-based QA systems using LangChain, each leveraging a different foundation model:

  • GPT-4o Mini
  • LLaMA 3 (70B)
  • LLaMA 3 (8B)

Each QA system was used to generate responses for all 282 filtered questions. We then conducted a human evaluation with 5 domain experts, who annotated each response for accuracy.

The implementations of these three QA systems can be found in the rag directory. The generated answers and human evaluation data are stored in data/final/techqa{1,2,3}.json, which collectively form the dataset used in this work.
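
As a quick way to inspect these files, here is a small sketch that assumes each file is a JSON array of annotated response records (the exact schema is defined by the files themselves):

    import json

    # Count the annotated responses contributed by each QA system's file.
    for i in (1, 2, 3):
        path = f"data/final/techqa{i}.json"
        with open(path, encoding="utf-8") as f:
            records = json.load(f)  # assumed to be a list of response records
        print(f"{path}: {len(records)} records")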

Experiments

We provide all experiment configurations, raw results, and evaluation reports. The experiment files for the three research questions are:

  • RQ1 (Effectiveness)
  • RQ2 (Impact of LLM_eval)
    • Experiment configurations: 4.txt
    • Evaluation reports: 4.csv
  • RQ3 (Efficiency and Cost)
    • Experiment configurations: 5.txt
    • Evaluation reports: 5.csv

You can find the raw experiment results in the output directory under their respective subdirectories.

Reproducing Experiments

To reproduce experiments, install the development dependencies:

pip install -r requirements.dev.txt

Generate reports from existing results

python -m scripts.auto_run --schema <id> --metric_only

where <id> corresponds to the experiment configuration file config/experiments/{id}.txt.
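
For example, to regenerate the RQ2 reports from the provided results using config/experiments/4.txt:

    python -m scripts.auto_run --schema 4 --metric_only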

Generating LaTeX table for RQ1

python -m scripts.gen_table1

Running experiments from scratch

First, remove existing output files:

rm -rf outputs/*

Then, execute:

python -m scripts.auto_run --schema <id>

This will run all experiments specified in config/experiments/{id}.txt and automatically configure environments and dependencies. Various environment configurations can be found in config/envs/. Virtual environments will be created in the venv directory.

Note: Due to the stochastic nature of LLMs, results may vary slightly from those in the paper. If failures occur, the script will retry up to 3 times. If issues persist, rerun the command to complete unfinished or failed test cases.

Running RQ2 (Impact of LLM_eval) experiments

To run RQ2 experiments, ensure your OPENAI_API_BASE supports multiple models. We recommend using one-api (an OpenAI-compatible API gateway).
