
codeforamerica/eval_eval


EVAL EVAL

EVAL EVAL (Eval^2) is a Solutions Engineering and Data Science Discovery Cycle project that assesses LLM evaluation libraries and services. Here we implement and compare metrics for a shared AI problem space: Supplemental Nutrition Assistance Program (SNAP) notice quality analysis.

Given ten sample SNAP notices, this repository produces sample analyses through model and prompt iteration. Those analyses serve as inputs to the evaluations. The goal of the project is to build domain knowledge of LLM-evaluation frameworks and compare implementation factors; improving the SNAP notice analysis itself is an objective only insofar as it enables evaluation.

The command line tools in this repository add information to a JSON manifest across the analysis and evaluation steps. The manifest file's components are represented by Pydantic models, which allows for easy validation and use with LLMs. Contributors should add evaluations to the experiments directory as described below.
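As a purely illustrative sketch of that idea (the real models and field names live in this repository's code, not here), a manifest component defined with Pydantic might look like:

from pydantic import BaseModel

# Hypothetical field names for illustration only; see the repository's actual
# Pydantic models for the real manifest schema.
class NoticeAnalysis(BaseModel):
    notice_id: str       # which SNAP notice was analyzed
    model_name: str      # e.g. "llama3.1:8b"
    prompt_version: str  # which prompt iteration produced this analysis
    analysis_text: str   # the model output that evaluations consume

class Manifest(BaseModel):
    analyses: list[NoticeAnalysis] = []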

How to Contribute

To contribute to the Discovery Cycle, choose an evaluation and/or telemetry framework and create a script in the "experiments" folder. A template script is provided for convenience if you'd like to start from it. The main.py script should automatically pick up your script and run it. After writing and running the evaluation, don't forget to fill out the post-evaluation survey.
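The template script defines the actual interface that main.py expects; the sketch below is only a hedged illustration of the general shape of an experiment, using the rouge-score package and made-up field names:

# experiments/rouge_experiment.py -- illustrative sketch only; copy the
# provided template script for the real function signature main.py discovers.
from rouge_score import rouge_scorer  # example dependency, not a requirement

def evaluate(analyses: list[dict]) -> dict:
    """Score each analysis against a reference text (field names assumed)."""
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    results = {}
    for item in analyses:
        scores = scorer.score(item["reference"], item["analysis_text"])
        results[item["notice_id"]] = scores["rougeL"].fmeasure
    return results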

Getting Started

To take advantage of your machine's GPU, this project runs in a local Python environment. You'll need to install Ollama and the Python dependencies locally.

You must have Python 3.12 or later installed to run the project. Check your Python version with python3 --version. Newer versions of Python can be installed from python.org.

  1. Install Ollama with Homebrew: brew install ollama
  2. Start Ollama: ollama serve
  3. Pull the Ollama models you need: ollama pull "<model name>". Analysis currently uses "llama3.1:8b" and "deepseek-r1:8b".
  4. Set up a Python virtual environment: python3 -m venv .venv && source .venv/bin/activate
  5. Install requirements: pip install -e . && pip install -r requirements.txt
  6. Copy example.env to .env and fill in any API keys from our shared LastPass record.

NB: Steps 1-3 are only required if you'd like to run the analysis command or write evaluations with Ollama models. A quick sanity check for the Ollama setup is sketched below.
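If you want to confirm that the Ollama server from step 2 is reachable before running anything, a minimal check (assuming Ollama's default port, 11434) looks like this:

import urllib.request

# Assumes Ollama is listening on its default port (11434); adjust the URL if
# you changed the configuration.
try:
    with urllib.request.urlopen("http://localhost:11434", timeout=5) as resp:
        print(resp.read().decode())  # normally prints "Ollama is running"
except OSError as err:
    print(f"Ollama does not appear to be running: {err}")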

Commands

This repository provides two commands, "analyze" and "evaluate". The former creates analyses for use in experiments. The latter runs the evaluation experiments and is the most relevant command for most Discovery Cycle participants.

Sample Execution of "evaluate" command:

Run all experiments:

python main.py evaluate manifest_with_analysis.json --output_path="results.json"

Run specific experiments:

python main.py evaluate manifest_with_analysis.json --output_path="results.json" --metrics=rouge_experiment,another_experiment

Run analysis:

python main.py analyze manifest.json --output_path="results.json"

NB: Running the analysis command is not required for contributing evaluations. Manifests (with and without analysis) and the notice documents are available on Google Drive.

Adding Dependencies

You will almost certainly need to add additional Python packages to contribute evaluations. This repository uses pip and pip-chill to manage dependencies. To add a new dependency, run pip install <package>. When you are ready to commit your work, add your dependencies to the requirements file (requirements.txt) by running pip-chill > requirements.txt.

Adding Providers and Models

Feel free to add Python packages necessary to support other LLM Providers. See "Adding Dependencies" above. Often, this is handled by the evaluation framework. If you need to add support for a provider that has not been used yet in the project, make sure to include the API key in your .env file and example.env file. Entries in the .env file will be included as environment variables automatically (access with os.getenv("API_KEY")).
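For example, inside an experiment script a provider key added to .env can be read back with os.getenv; the key name below is just a placeholder for whatever name you added to example.env:

import os

# "PROVIDER_API_KEY" is a placeholder; use the exact name you added to .env
# and example.env.
api_key = os.getenv("PROVIDER_API_KEY")
if api_key is None:
    raise RuntimeError("PROVIDER_API_KEY is not set; add it to your .env file")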

If you need to download or install an additional non-Ollama model, please do so in your experiment file.
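As one possible approach (assuming you add the transformers package as a dependency; the model name here is only an example), a download inside an experiment file could look like:

from transformers import pipeline

# Downloads the model on first use and caches it locally; swap in whatever
# model your experiment actually needs.
classifier = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)
print(classifier("This notice clearly explains the benefit change."))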

If you need to add an Ollama model, include it in the list of supported models in main.py.
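The exact variable name and location are defined in main.py; purely as an illustration, the change is typically a one-line addition to that list:

# Illustrative only -- match the actual list name and models already in main.py.
SUPPORTED_OLLAMA_MODELS = [
    "llama3.1:8b",
    "deepseek-r1:8b",
    "mistral:7b",  # your new model, pulled first with: ollama pull mistral:7b
]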

