FlexEval is a tool for designing custom metrics, completion functions, and LLM-graded rubrics for evaluating the behavior of LLM-powered systems.
Additional details about FlexEval can be found in our paper at the Educational Data Mining 2024 conference.
Basic usage:
import flexeval
from flexeval.schema import Eval, EvalRun, FileDataSource, Metrics, FunctionItem, Config
data_sources = [FileDataSource(path="vignettes/conversations.jsonl")]
eval = Eval(metrics=Metrics(function=[FunctionItem(name="flesch_reading_ease")]))
config = Config(clear_tables=True)
eval_run = EvalRun(
data_sources=data_sources,
database_path="eval_results.db",
eval=eval,
config=config,
)
flexeval.run(eval_run)
This example computes Flesch reading ease for every turn in a list of conversations provided in JSONL format. The metric values are stored in an SQLite database called eval_results.db.
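Once the run finishes, you can inspect the results with any SQLite client. The snippet below is a minimal sketch that only lists the tables FlexEval created; the exact table and column layout is not shown here, so inspect the schema (or see the forthcoming results vignette) before writing queries.
# Minimal sketch: inspect the output database with Python's sqlite3 module.
import sqlite3

conn = sqlite3.connect("eval_results.db")
# Discover which tables FlexEval created before querying them.
tables = conn.execute("SELECT name FROM sqlite_master WHERE type='table'").fetchall()
print(tables)
conn.close()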
See additional usage examples in the vignettes.
You can install FlexEval from the GitHub repository. FlexEval is not yet available on PyPI.
Using pip:
pip install git+https://github.com/DigitalHarborFoundation/FlexEval.git
Using uv:
uv add git+https://github.com/DigitalHarborFoundation/FlexEval.git
Using poetry:
poetry add git+https://github.com/DigitalHarborFoundation/FlexEval.git
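After installing, you can confirm that FlexEval is available by printing the CLI help:
python -m flexeval --help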
FlexEval's goal is to make evaluations easier for LLM-powered systems.
Thoroughly evaluating LLMs is difficult and best practices are rapidly developing. While it is not yet clear how to guarantee the behavior of LLM-powered systems, we can absolutely increase visibility into their behavior. Which features are important depends on the application, but they might include:
- safety
- verbosity
- use of emojis
- text complexity and reading ease
- appropriateness of function calling
- other things we haven't thought of
The most common method of evaluating LLMs is to prompt them and compare their responses to "ideal" responses. This is necessary for many applications, but is not sufficient to cover the cases above. Moreover, we're confident that as users continue to develop LLM-powered applications, they will want to compute metrics of their own devising to quantify and track the behavior of these applications during development and in production.
With this in mind, we've created a tool that makes it easier to write and apply custom metrics to conversations.
FlexEval is a tool for applying functions that produce quantitative metrics to conversational data.
Inputs:
- historical conversations
- Python functions that convert conversations and conversational turns into numbers (see the sketch after this list)
- rubrics that an LLM can use to convert conversations and conversational turns into numbers
- configurations for LLMs you would like to test
Process:
- (optional) generate conversational completions using an LLM or LLM-powered system
- apply each Python function to each conversation/turn/completion
Outputs:
- metric values in an SQLite database
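To give a sense of what such a Python function can look like, here is a minimal sketch of a turn-level metric. The signature is an assumption (FlexEval passing the turn's text as a string); see flexeval.configuration.function_metrics for working examples and the actual calling conventions.
# Illustrative sketch of a turn-level function metric.
# Assumption: FlexEval passes the turn's text content as a string;
# see flexeval.configuration.function_metrics for the actual conventions.
import re

def emoji_count(turn_text: str) -> int:
    """Count emoji-like characters in a single turn's text."""
    emoji_pattern = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")  # common emoji and symbol ranges
    return len(emoji_pattern.findall(turn_text))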
FlexEval evaluates Python functions and machine-graded rubrics for each provided conversation.
FlexEval began as an extension to OpenAI Evals, making it easier to use. It is now independent of OpenAI Evals and offers several usability improvements:
- Whereas OpenAI Evals requires users to write a new class with inheritance to define a new completion function (a generic term for a function that accepts a conversation or prompt and produces a response), FlexEval lets users define one by writing a function in configuration/completion_functions.py.
- Whereas OpenAI Evals requires users to create a new class with inheritance to define a new metric type, FlexEval lets users do this by writing a function in configuration/function_metrics.py.
- FlexEval makes it easy to use any LLM as a rubric-based grader.
- FlexEval makes it easy to write eval suites, that is, sets of multiple metrics to be evaluated against the same dataset of conversations.
- FlexEval allows metrics to be computed over entire conversations (e.g., how many turns are in this conversation?), conversations faceted by role (how many turns per role are in this conversation?), or individual turns faceted by role (what is the length of each turn?), and then aggregated (what is the average length of text produced by the user vs. the assistant?).
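For instance, a conversation-level metric can be an ordinary function over the list of turns; the faceting and aggregation described above are handled by FlexEval rather than by the metric itself. The signature here (a list of role/content message dictionaries) is an assumption for illustration; see flexeval.configuration.function_metrics for the actual conventions.
# Illustrative sketch of a conversation-level metric.
# Assumption: the conversation arrives as a list of
# {"role": ..., "content": ...} message dictionaries.
def turn_count(conversation: list[dict]) -> int:
    """Return the number of turns in the conversation."""
    return len(conversation)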
Prior to running an evaluation, you'll need to tell FlexEval which metrics you want to compute and what conversations you want to use.
- Write your data as a file in jsonl format. (In the future, we will support other formats and streaming inputs.) Each separate thread – one conversation between a user and an assistant – should be one line of the file. Each line is JSON with an input key whose value is a list of turns, like the following (see the sketch after these steps for one way to produce such a file):
{"input": [{"role": "user", "content": "Hi, Nice to meet you!"}, {"role": "assistant", "content": "Nice to meet you, too! How can I help you today?"}]}
- Add any Python modules containing function metrics to your configuration. Existing function metrics can be viewed in flexeval.configuration.function_metrics.
If desired, create any rubric metrics in a
rubric_metrics.yaml
file. Rubrics in this file will be used to evaluate conversations and completions using "chain-of-thoughts then classify" (COT classify) and will report a numeric score (e.g., 0 or 1) mapped to a choice string (e.g.,"Yes", "No") from the classification results. For more information on how to write and use rubrics in FlexEval, checkdoc/rubric_metric_guidelines.md
in this repo. (We will also add a vignette demonstrating a custom rubric.) -
- Run the evaluation in Python code or via the CLI.
See the vignettes.
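As referenced in the first step above, here is a minimal sketch of one way to produce a conversations file in the expected jsonl format. The file path and conversation content are illustrative.
# Minimal sketch: write conversations to a jsonl file, one thread per line.
import json

conversations = [
    [
        {"role": "user", "content": "Hi, Nice to meet you!"},
        {"role": "assistant", "content": "Nice to meet you, too! How can I help you today?"},
    ],
]

with open("conversations.jsonl", "w") as f:
    for thread in conversations:
        # Each line is a JSON object with an "input" key holding the list of turns.
        f.write(json.dumps({"input": thread}) + "\n")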
We will create a vignette about the CLI soon. For now, access the built-in CLI documentation directly:
python -m flexeval --help
Results are saved in an SQLite database. We will create a vignette demonstrating accessing and interpreting result metrics in July 2025.
This tool is intended to be "batteries included" for many basic use cases. It supports the following out-of-the-box:
- scoring historical conversations - useful for monitoring live systems.
- scoring LLMs (see the completion-function sketch after this list):
- LLMs hosted locally and served via an endpoint, using something like LM Studio
- LLMs accessible via a network call to a REST endpoint
- any OpenAI LLM
- a set of useful rubrics
- a set of useful Python functions
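To illustrate what a completion function might look like, the sketch below calls an OpenAI-compatible chat-completions REST endpoint (such as one served by LM Studio). The function name, signature, endpoint URL, model name, and return format are all assumptions made for illustration; FlexEval's actual conventions are documented by the examples in configuration/completion_functions.py.
# Illustrative sketch only: a completion function that calls an
# OpenAI-compatible chat-completions endpoint.
# The signature and return value are assumptions; see
# configuration/completion_functions.py for FlexEval's actual conventions.
import requests

def local_endpoint_completion(conversation: list[dict], base_url: str = "http://localhost:1234/v1") -> str:
    """Send the conversation to a chat-completions endpoint and return the reply text."""
    response = requests.post(
        f"{base_url}/chat/completions",
        json={"model": "local-model", "messages": conversation},
        timeout=60,
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]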
If this work is useful to you, please cite our EDM 2024 paper:
S. Thomas Christie, Baptiste Moreau-Pernet, Yu Tian, & John Whitmer. (2024). FlexEval: a customizable tool for chatbot performance evaluation and dialogue analysis. Proceedings of the 17th International Conference on Educational Data Mining, 903-908. Atlanta, Georgia, USA, July 2024. https://doi.org/10.5281/zenodo.12729993
FlexEval is a tool for executing evaluations.
An evaluation is represented by flexeval.schema.eval_schema.Eval, and contains a set of flexeval.schema.eval_schema.MetricItem objects to apply to the test data.
- Functions: flexeval.schema.eval_schema.FunctionItem objects apply a Python function to the test data, returning a numeric value.
- Rubrics: flexeval.schema.eval_schema.RubricItem objects use a configured GraderLlm function and the provided rubric template to generate a numeric score from an LLM's output.
You execute an Eval by creating an EvalRun. flexeval.schema.evalrun_schema.EvalRun contains:
- Data sources (conversations as inputs, an SQLite filepath as output)
- An Eval specification, containing the metrics to compute
- Sources for the metrics defined in the Eval, e.g. Python modules containing the functions referenced in FunctionItems or YAML files containing the rubric templates
- A Config specification, describing how evaluation should be executed
The Config includes details about multi-threaded metric computation, logging, and so on.
Metrics can operate at any of four levels of granularity:
- Thread: Full conversation
- Turn: A contiguous run of adjacent messages from the same user or assistant
- Message: Individual message from user or assistant
- ToolCall: Function/tool invocation within a message
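To make these levels concrete, consider the following thread in the jsonl input format shown earlier (the content is illustrative):
{"input": [{"role": "user", "content": "What's 2+2?"}, {"role": "assistant", "content": "Let me check."}, {"role": "assistant", "content": "That would be 4."}]}
This thread contains three messages and two turns: a user turn with one message, and an assistant turn made up of two adjacent messages. A ToolCall would be a function or tool invocation made from within one of those messages.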
FlexEval uses Python's logging module.
If you don't want to see FlexEval's logs:
import logging

# turn off all INFO and DEBUG log messages, but leave WARNING and ERROR messages
logging.getLogger('flexeval').setLevel(logging.WARNING)

# turn off all logging, including warnings and errors
logging.getLogger('flexeval').setLevel(logging.CRITICAL + 1)
Pull requests to expand the set of provided rubrics or functions are welcome.
To develop FlexEval, you should install uv:
curl -LsSf https://astral.sh/uv/install.sh | sh
Build the package:
uv build
Run the unit tests:
uv run python -m unittest discover -s tests.unit
To run a specific file's tests:
uv run python -m unittest tests.unit.{module_name}
There are integration tests in tests/integration that can be executed.
To add a dependency:
uv add {package_name}
To update dependencies:
uv lock --upgrade
Verify CLI:
uv run python -m flexeval --help
We lint and format code files using ruff.
uvx ruff check --fix
uvx ruff format
FlexEval exposes a CLI.
Run an eval set by specifying the .env file:
uv run --env-file=.env python -m flexeval --eval_name {eval_suite_name}
Or set the UV_ENV_FILE environment variable first:
export UV_ENV_FILE=.env
uv run python -m flexeval --eval_name {eval_suite_name}