- Run the model through vLLM with an OpenAI-compatible API (see the example launch after this list).
  - For Liquid models, run the on-prem stack, or use Liquid Labs.
  - For other models, use the `run-vllm.sh` script, or use third-party providers.
- Run the evaluation script with the model API endpoint and API key.
- The evaluation can be run with Docker (recommended) or locally without Docker.
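For reference, if you are not using the provided `run-vllm.sh` script, launching a model behind vLLM's OpenAI-compatible server looks roughly like the sketch below. The model id, port, and key are placeholders, and exact flags can vary by vLLM version:

```bash
# Serve a model behind vLLM's OpenAI-compatible HTTP API on port 8000.
# "Qwen/Qwen2.5-7B-Instruct" is only an example model id.
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-7B-Instruct \
    --port 8000 \
    --api-key <API-KEY>
```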
- Generate model answers:
```bash
bin/api/run_docker_eval.sh generate \
    --model-name <model-name> \
    --model-url <model-url> \
    --model-api-key <model-api-key>
```

Results will be output to `llm_judge/data/japanese_mt_bench/model_answer/<model-name>.jsonl`.
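To spot-check the generated answers, you can pretty-print the first record with `jq` (the records follow FastChat's JSONL answer schema):

```bash
# Inspect the first generated answer; substitute your actual model name.
head -n 1 llm_judge/data/japanese_mt_bench/model_answer/<model-name>.jsonl | jq .
```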
- Run the judge:
  The judge script uses the judge model to compare GPT-4's answers with the model's answers. The judge model defaults to GPT-4.
```bash
bin/api/run_docker_eval.sh judge \
    --model-name <model-name> \
    --judge-model-name <judge-model-name> \
    --judge-model-url <judge-model-url> \
    --judge-model-api-key <judge-model-api-key>
```

Judge results will be output to `llm_judge/data/japanese_mt_bench/model_judgment/<judge-model-name>_<model-name>.jsonl`.
The final scores will be output to `llm_judge/data/japanese_mt_bench/<judge-model-name>-score-<model-name>.json`.
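Likewise, a quick way to view the aggregated scores once judging finishes (paths use the same placeholders as above):

```bash
# Pretty-print the final score file for a given judge/model pair.
jq . llm_judge/data/japanese_mt_bench/<judge-model-name>-score-<model-name>.json
```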
Run the evaluation for lfm-3b-jp on-prem:

```bash
bin/api/run_docker_eval.sh generate \
    --model-name lfm-3b-jp \
    --model-url http://localhost:8000/v1 \
    --model-api-key <ON-PREM-API-SECRET>

bin/api/run_docker_eval.sh judge \
    --model-name lfm-3b-jp \
    --judge-model-name gpt-4o \
    --judge-model-url https://api.openai.com/v1 \
    --judge-model-api-key <OPENAI-API-KEY>
```

Run the evaluation for lfm-3b-ichikara on-prem:
```bash
bin/api/run_docker_eval.sh generate \
    --model-name lfm-3b-ichikara \
    --model-url http://localhost:8000/v1 \
    --model-api-key <ON-PREM-API-SECRET>

bin/api/run_docker_eval.sh judge \
    --model-name lfm-3b-ichikara \
    --judge-model-name gpt-4o \
    --judge-model-url https://api.openai.com/v1 \
    --judge-model-api-key <OPENAI-API-KEY>
```

Run the evaluation for lfm-3b-jp on Liquid Labs:
```bash
bin/api/run_docker_eval.sh generate \
    --model-name lfm-3b-jp \
    --model-url https://inference-1.liquid.ai/v1 \
    --model-api-key <API-KEY>

bin/api/run_docker_eval.sh judge \
    --model-name lfm-3b-jp \
    --judge-model-name gpt-4o \
    --judge-model-url https://api.openai.com/v1 \
    --judge-model-api-key <OPENAI-API-KEY>
```
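Before kicking off a run against any of these endpoints, it can help to confirm that the endpoint actually speaks the OpenAI API; listing the served models is a cheap check (the URL and key below are for the on-prem case):

```bash
# A healthy OpenAI-compatible endpoint returns a JSON list of served models.
curl -s -H "Authorization: Bearer <ON-PREM-API-SECRET>" \
    http://localhost:8000/v1/models
```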
To run the evaluation locally without Docker, it is recommended (though optional) to first create a fresh conda environment:

```bash
conda create -n mt_bench python=3.10
conda activate mt_bench
```

Run the following command to set up the environment and install the dependencies:

```bash
bin/api/prepare.sh
```

- Run the `bin/api/run_api_eval.sh` script to generate model answers.
```bash
bin/api/run_api_eval.sh \
    --model-name <model-name> \
    --model-url <model-url> \
    --model-api-key <API-KEY>
```

Results will be output to `llm_judge/data/japanese_mt_bench/model_answer/<model-name>.jsonl`.
- Run the following script to generate judgement scores for the model answers (the judge model defaults to GPT-4).
```bash
bin/api/run_openai_judge.sh \
    --model-name <model-name> \
    --judge-model-name <judge-model-name> \
    --judge-model-url <judge-model-url> \
    --judge-model-api-key <judge-model-api-key>

# examples:
bin/api/run_openai_judge.sh \
    --model-name lfm-3b-jp \
    --judge-model-name gpt-4o \
    --judge-model-url https://api.openai.com/v1 \
    --judge-model-api-key <OPENAI-API-KEY>

bin/api/run_openai_judge.sh \
    --model-name lfm-3b-ichikara \
    --judge-model-name gpt-4o \
    --judge-model-url https://api.openai.com/v1 \
    --judge-model-api-key <OPENAI-API-KEY>
```

Judge results will be output to `llm_judge/data/japanese_mt_bench/model_judgment/<judge-model-name>_<model-name>.jsonl`.
The final scores will be output to `llm_judge/data/japanese_mt_bench/<judge-model-name>-score-<model-name>.json`.
The following arguments apply to both `bin/api/run_docker_eval.sh generate` and `bin/api/run_api_eval.sh`.
| Argument | Description | Value for on-prem stack | Required |
|---|---|---|---|
| `--model-name` | Model name | `lfm-3b-jp`, `lfm-3b-ichikara` | Yes |
| `--model-url` | Model URL | `http://localhost:8000/v1` | Yes |
| `--model-api-key` | API key for the model | `API_SECRET` in `.env` | Yes |
| `--num-choices` | Number of responses to generate for each question | 5 | No. Defaults to 5. |
| `--question-count` | Number of questions to run | None | No. Defaults to None, which runs all questions. |
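For a quick smoke test before a full run, the optional arguments can shrink the workload; the values below (one response, two questions) are only an illustration:

```bash
bin/api/run_api_eval.sh \
    --model-name lfm-3b-jp \
    --model-url http://localhost:8000/v1 \
    --model-api-key <API-KEY> \
    --num-choices 1 \
    --question-count 2
```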
The following arguments apply to both `bin/api/run_docker_eval.sh judge` and `bin/api/run_openai_judge.sh`.
| Argument | Description | Required |
|---|---|---|
| `--model-name` | Model name to be evaluated | Yes |
| `--judge-model-name` | Name of the judge model (default: `gpt-4`) | No |
| `--judge-model-url` | Base URL for the judge model API | Yes |
| `--judge-model-api-key` | API key for the judge model | Yes |
| `--parallel` | Number of parallel API calls | No. Defaults to 5. |
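For example, raising `--parallel` speeds up judging at the cost of more concurrent API calls (the value 10 below is an arbitrary illustration):

```bash
bin/api/run_openai_judge.sh \
    --model-name lfm-3b-jp \
    --judge-model-name gpt-4o \
    --judge-model-url https://api.openai.com/v1 \
    --judge-model-api-key <OPENAI-API-KEY> \
    --parallel 10
```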
This repository is modified from [FastChat](https://github.com/lm-sys/FastChat).