- Run the model through vLLM with an OpenAI-compatible API (see the example launch after this list).
  - For Liquid models, run the on-prem stack, or use Liquid Labs.
  - For other models, use the `run-vllm.sh` script, or use third-party providers.
- Run the evaluation script with the model API endpoint and API key.
- The evaluation can be run with Docker (recommended) or locally without Docker.
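For reference, if you are not using the provided `run-vllm.sh` script, launching a model behind vLLM's OpenAI-compatible server looks roughly like the sketch below. The model id, port, and key are placeholders, and exact flags can vary by vLLM version:

```bash
# Serve a model behind vLLM's OpenAI-compatible HTTP API on port 8000.
# "Qwen/Qwen2.5-7B-Instruct" is only an example model id.
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-7B-Instruct \
    --port 8000 \
    --api-key <API-KEY>
```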
- Generate model answers:
```bash
bin/api/run_docker_eval.sh generate \
    --model-name <model-name> \
    --model-url <model-url> \
    --model-api-key <model-api-key>
```

Results will be output to `llm_judge/data/japanese_mt_bench/model_answer/<model-name>.jsonl`.
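To spot-check the generated answers, you can pretty-print the first record with `jq` (the records follow FastChat's JSONL answer schema):

```bash
# Inspect the first generated answer; substitute your actual model name.
head -n 1 llm_judge/data/japanese_mt_bench/model_answer/<model-name>.jsonl | jq .
```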
- Run the judge:
  The judge script uses the judge model to compare GPT-4's answers with the model's answers. The judge model defaults to GPT-4.
```bash
bin/api/run_docker_eval.sh judge \
    --model-name <model-name> \
    --judge-model-name <judge-model-name> \
    --judge-model-url <judge-model-url> \
    --judge-model-api-key <judge-model-api-key>
```

Judge results will be output to `llm_judge/data/japanese_mt_bench/model_judgment/<judge-model-name>_<model-name>.jsonl`.
The final scores will be output to `llm_judge/data/japanese_mt_bench/<judge-model-name>-score-<model-name>.json`.
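Likewise, a quick way to view the aggregated scores once judging finishes (paths use the same placeholders as above):

```bash
# Pretty-print the final score file for a given judge/model pair.
jq . llm_judge/data/japanese_mt_bench/<judge-model-name>-score-<model-name>.json
```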
Run the evaluation for lfm-3b-jp on-prem:

```bash
bin/api/run_docker_eval.sh generate \
    --model-name lfm-3b-jp \
    --model-url http://localhost:8000/v1 \
    --model-api-key <ON-PREM-API-SECRET>

bin/api/run_docker_eval.sh judge \
    --model-name lfm-3b-jp \
    --judge-model-name gpt-4o \
    --judge-model-url https://api.openai.com/v1 \
    --judge-model-api-key <OPENAI-API-KEY>
```

Run the evaluation for lfm-3b-ichikara on-prem:
```bash
bin/api/run_docker_eval.sh generate \
    --model-name lfm-3b-ichikara \
    --model-url http://localhost:8000/v1 \
    --model-api-key <ON-PREM-API-SECRET>

bin/api/run_docker_eval.sh judge \
    --model-name lfm-3b-ichikara \
    --judge-model-name gpt-4o \
    --judge-model-url https://api.openai.com/v1 \
    --judge-model-api-key <OPENAI-API-KEY>
```

Run the evaluation for lfm-3b-jp on Liquid Labs:
```bash
bin/api/run_docker_eval.sh generate \
    --model-name lfm-3b-jp \
    --model-url https://inference-1.liquid.ai/v1 \
    --model-api-key <API-KEY>

bin/api/run_docker_eval.sh judge \
    --model-name lfm-3b-jp \
    --judge-model-name gpt-4o \
    --judge-model-url https://api.openai.com/v1 \
    --judge-model-api-key <OPENAI-API-KEY>
```
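Before kicking off a run against any of these endpoints, it can help to confirm that the endpoint actually speaks the OpenAI API; listing the served models is a cheap check (the URL and key below are for the on-prem case):

```bash
# A healthy OpenAI-compatible endpoint returns a JSON list of served models.
curl -s -H "Authorization: Bearer <ON-PREM-API-SECRET>" \
    http://localhost:8000/v1/models
```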
To run the evaluation locally without Docker, it is recommended (though optional) to first create a fresh conda environment:

```bash
conda create -n mt_bench python=3.10
conda activate mt_bench
```

Run the following command to set up the environment and install the dependencies:

```bash
bin/api/prepare.sh
```

- Run the `bin/api/run_api_eval.sh` script to generate model answers.
```bash
bin/api/run_api_eval.sh \
    --model-name <model-name> \
    --model-url <model-url> \
    --model-api-key <API-KEY>
```

Results will be output to `llm_judge/data/japanese_mt_bench/model_answer/<model-name>.jsonl`.
- Run the following script to generate judgement scores for the model answers (the judge model defaults to GPT-4).
```bash
bin/api/run_openai_judge.sh \
    --model-name <model-name> \
    --judge-model-name <judge-model-name> \
    --judge-model-url <judge-model-url> \
    --judge-model-api-key <judge-model-api-key>

# examples:
bin/api/run_openai_judge.sh \
    --model-name lfm-3b-jp \
    --judge-model-name gpt-4o \
    --judge-model-url https://api.openai.com/v1 \
    --judge-model-api-key <OPENAI-API-KEY>

bin/api/run_openai_judge.sh \
    --model-name lfm-3b-ichikara \
    --judge-model-name gpt-4o \
    --judge-model-url https://api.openai.com/v1 \
    --judge-model-api-key <OPENAI-API-KEY>
```

Judge results will be output to `llm_judge/data/japanese_mt_bench/model_judgment/<judge-model-name>_<model-name>.jsonl`.
The final scores will be output to `llm_judge/data/japanese_mt_bench/<judge-model-name>-score-<model-name>.json`.
The following arguments apply to both `bin/api/run_docker_eval.sh generate` and `bin/api/run_api_eval.sh`.
| Argument | Description | Value for on-prem stack | Required |
|---|---|---|---|
| `--model-name` | Model name | `lfm-3b-jp`, `lfm-3b-ichikara` | Yes |
| `--model-url` | Model URL | `http://localhost:8000/v1` | Yes |
| `--model-api-key` | API key for the model | `API_SECRET` in `.env` | Yes |
| `--num-choices` | Number of responses to generate for each question | 5 | No. Defaults to 5. |
| `--question-count` | Number of questions to run | None | No. Defaults to None, which runs all questions. |
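For a quick smoke test before a full run, the optional arguments can shrink the workload; the values below (one response, two questions) are only an illustration:

```bash
bin/api/run_api_eval.sh \
    --model-name lfm-3b-jp \
    --model-url http://localhost:8000/v1 \
    --model-api-key <API-KEY> \
    --num-choices 1 \
    --question-count 2
```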
The following arguments apply to both `bin/api/run_docker_eval.sh judge` and `bin/api/run_openai_judge.sh`.
| Argument | Description | Required |
|---|---|---|
| `--model-name` | Model name to be evaluated | Yes |
| `--judge-model-name` | Name of the judge model (default: `gpt-4`) | No |
| `--judge-model-url` | Base URL for the judge model API | Yes |
| `--judge-model-api-key` | API key for the judge model | Yes |
| `--parallel` | Number of parallel API calls | No. Defaults to 5. |
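For example, raising `--parallel` speeds up judging at the cost of more concurrent API calls (the value 10 below is an arbitrary illustration):

```bash
bin/api/run_openai_judge.sh \
    --model-name lfm-3b-jp \
    --judge-model-name gpt-4o \
    --judge-model-url https://api.openai.com/v1 \
    --judge-model-api-key <OPENAI-API-KEY> \
    --parallel 10
```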
This repository is modified from [FastChat](https://github.com/lm-sys/FastChat).