This is a toy project to learn about AI model evaluation benchmarks and their implementations.
With this project I want to learn:
- Integrations using Huggingface
- AI Model Benchmarks and their implementations
- Refresh python skills
- Zookeeper + Kafka as a message broker. In past jobs and experiences I used RabbitMQ, Redis, or AWS services for this, so this is a nice chance to get into Kafka
model-eval
: This folder contains only scripts. They were my first set of tests.

eval-node
: This folder contains the application that I explain how to run below.
Llama 3 8b through Ollama: https://ollama.com/library/llama3:8b
For this small prototype, a limited set of benchmarks will be used.
- Accuracy: Measure the model's correctness on a specific task (e.g., classification or Q&A).
- Toxicity: Evaluate how often the model generates harmful or offensive content.
- Readability: Assess the fluency and grammatical correctness of the output.
- Response Time: Measure the time taken for the model to generate a response.
- Consistency: Evaluate how consistent the model's responses are across repeated queries. Consistency is not evaluated at this point because model-eval/paraphrases.py does not generate enough variations of the questions. A sketch of how the other metrics could be computed follows this list.
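As a rough sketch, the first four benchmarks could be computed with the libraries installed below: the squad_v2 metric from evaluate for accuracy, unitary/toxic-bert for toxicity, and textstat's Flesch Reading Ease for readability (which maxes out around 206.8, matching the scores in the example output further down). The function and parameter names here are illustrative, not the project's actual code:

```python
import time

import evaluate
import textstat
from transformers import pipeline

squad_metric = evaluate.load("squad_v2")
toxicity_model = pipeline("text-classification", model="unitary/toxic-bert")

def score_prediction(example, generate_answer):
    # generate_answer(question, context) -> str is the model under test
    # (hypothetical callable; e.g. a wrapper around the Ollama API).
    start = time.time()
    answer = generate_answer(example["question"], example["context"])
    response_time = time.time() - start  # Response Time benchmark

    # Accuracy: squad_v2 expects ids, prediction text, and a no-answer probability.
    accuracy = squad_metric.compute(
        predictions=[{
            "id": example["id"],
            "prediction_text": answer,
            "no_answer_probability": 0.0,  # naive: always claim an answer
        }],
        references=[{"id": example["id"], "answers": example["answers"]}],
    )
    return {
        "accuracy": accuracy["exact"],
        "toxicity": toxicity_model(answer)[0]["score"],        # Toxicity benchmark
        "readability": textstat.flesch_reading_ease(answer),   # Readability benchmark
        "response_time": response_time,
    }
```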
A dataset that aligns with these benchmarks is needed; this project uses squad_v2.
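Loading that dataset with the datasets library is straightforward, and a small slice keeps runs fast (the completed-task example below evaluates 10 questions). A minimal sketch:

```python
from datasets import load_dataset

# Load the SQuAD v2 validation split and keep a small slice for quick runs.
dataset = load_dataset("squad_v2", split="validation")
sample = dataset.select(range(10))
print(sample[0]["question"])
```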
First, clone the project.
uv is used to manage the Python environment.
- Install uv
- Initialize the environment with a desired Python version
uv venv model_eval --python 3.12
- Activate the environment
source model_eval/bin/activate
- Install dependencies
uv pip install requests transformers nltk numpy pandas scikit-learn textstat datasets evaluate torch torchvision torchaudio sentencepiece confluent-kafka fastapi uvicorn
- Run the Docker Compose file to start Kafka + ZooKeeper
cd eval-node
docker compose up
- Create the topics in Kafka
# Create the evaluation_tasks topic
docker exec -it kafka kafka-topics --bootstrap-server localhost:9092 --create --topic evaluation_tasks --partitions 1 --replication-factor 1
# Create the evaluation_results topic
docker exec -it kafka kafka-topics --bootstrap-server localhost:9092 --create --topic evaluation_results --partitions 1 --replication-factor 1
# Where:
# --partitions: The number of partitions for the topic.
# --replication-factor: The number of replicas for fault tolerance. Use 1 for a single-node setup.
# Now check if the topics are created
docker exec -it kafka kafka-topics --bootstrap-server localhost:9092 --list
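Since the app already depends on confluent-kafka, the same check can be done from Python with its admin client; a quick sketch:

```python
from confluent_kafka.admin import AdminClient

# List the topics known to the broker and confirm ours exist.
admin = AdminClient({"bootstrap.servers": "localhost:9092"})
topics = admin.list_topics(timeout=5).topics
for name in ("evaluation_tasks", "evaluation_results"):
    print(name, "OK" if name in topics else "MISSING")
```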
- Run ollama in another terminal with this specific model
ollama run llama3:8b
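The application presumably reaches this model through Ollama's local HTTP API. A minimal sketch of such a call (the /api/generate endpoint and payload follow Ollama's documented REST interface; the prompt is just an example):

```python
import requests

# Ask the locally running llama3:8b for a completion via Ollama's REST API.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3:8b",
        "prompt": "What is the capital of France?",  # example prompt
        "stream": False,  # return one JSON object instead of a stream
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])
```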
- Run the application
uvicorn app:app --host 0.0.0.0 --port 8000
Optionally, run the app with DEBUG log level:
uvicorn app:app --host 0.0.0.0 --port 8000 --log-level debug
Once the application is up and running, you have two HTTP endpoints to interact with:
- [POST] http://localhost:8000/submit-task
This endpoint receives the task to start the evaluation. Body parameters:
{
// Model to run the evaluations on
"model_name": "llama3:8b",
// Dataset to load
"dataset": "squad_v2"
}
The endpoint will return a task_id so we can check the status later on.
Example:
curl -X POST "http://localhost:8000/submit-task" -H "Content-Type: application/json" -d '{"model_name": "llama3:8b", "dataset": "squad_v2"}'
{"task_id":"83c68245-3d05-44d4-a257-c63f95ecb3c8","status":"pending"}
- [GET] http://localhost:8000/task-status/<task_id>
This endpoint receives the task_id as a path parameter and returns the current status of the task as well as the status history.
Examples:
# Example output of a processing task
curl "http://localhost:8000/task-status/f0af065a-9d27-4fa7-af3c-2ad04b36af6b"
{
"task_id": "f0af065a-9d27-4fa7-af3c-2ad04b36af6b",
"status": "processing",
"timestamp": 1737386569.4359438,
"metadata": {},
"history": [
{ "status": "pending", "timestamp": 1737386568.8356962 },
{ "status": "processing", "timestamp": 1737386569.435947 }
]
}
# Example output of a completed task
curl "http://localhost:8000/task-status/f0af065a-9d27-4fa7-af3c-2ad04b36af6b"
{
"task_id": "f0af065a-9d27-4fa7-af3c-2ad04b36af6b",
"status": "completed",
"timestamp": 1737386634.716165,
"metadata": {
"task_id": "f0af065a-9d27-4fa7-af3c-2ad04b36af6b",
"status": "completed",
"model_name": "llama3:8b",
"dataset": "squad_v2",
"metrics": {
"accuracy": {
"exact": 40.0,
"f1": 40.0,
"total": 10,
"HasAns_exact": 0.0,
"HasAns_f1": 0.0,
"HasAns_total": 6,
"NoAns_exact": 100.0,
"NoAns_f1": 100.0,
"NoAns_total": 4,
"best_exact": 40.0,
"best_exact_thresh": 0.0,
"best_f1": 40.0,
"best_f1_thresh": 0.0
},
"toxicity": {
"toxicity_scores": [
0.0019954408053308725, 0.0019954408053308725, 0.0019954408053308725,
0.0019954408053308725, 0.0019954408053308725, 0.0019954408053308725,
0.0019954408053308725, 0.0019954408053308725, 0.0019954408053308725,
0.0019954408053308725
]
},
"readability": {
"readability_scores": [
206.84, 206.84, 206.84, 206.84, 206.84, 206.84, 206.84, 206.84,
206.84, 206.84
]
},
"evaluation_time": 0.6611873750807717
},
"timestamp": 1737386634.714082
},
"history": [
{ "status": "pending", "timestamp": 1737386568.8356962 },
{ "status": "processing", "timestamp": 1737386569.435947 },
{ "status": "completed", "timestamp": 1737386634.7161658 }
]
}
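And a matching sketch for the status endpoint plus the history bookkeeping, building on the tasks store from the previous sketch (again illustrative; the helper names are hypothetical):

```python
import json
import time
from threading import Thread

from confluent_kafka import Consumer
from fastapi import HTTPException

@app.get("/task-status/{task_id}")
def task_status(task_id: str):
    # Return the current status plus the full status history for the task.
    if task_id not in tasks:
        raise HTTPException(status_code=404, detail="task not found")
    return {"task_id": task_id, **tasks[task_id]}

def set_status(task_id, status, metadata=None):
    # Append every transition to the history, mirroring the JSON shown above.
    now = time.time()
    task = tasks[task_id]
    task["status"] = status
    task["timestamp"] = now
    if metadata is not None:
        task["metadata"] = metadata
    task["history"].append({"status": status, "timestamp": now})

def consume_results():
    # Background loop: read finished evaluations and mark their tasks completed.
    consumer = Consumer({
        "bootstrap.servers": "localhost:9092",
        "group.id": "api-results",
        "auto.offset.reset": "earliest",
    })
    consumer.subscribe(["evaluation_results"])
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        result = json.loads(msg.value())
        set_status(result["task_id"], "completed", metadata=result)

Thread(target=consume_results, daemon=True).start()
```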
- https://huggingface.co/datasets/rajpurkar/squad_v2
- https://github.com/huggingface/evaluate/blob/main/metrics/squad_v2/README.md
- https://huggingface.co/docs/hub/datasets-usage
- https://rajpurkar.github.io/SQuAD-explorer/
- https://huggingface.co/docs/evaluate/choosing_a_metric
- https://huggingface.co/docs/datasets/v1.13.2/package_reference/main_classes.html?highlight=select#dataset
- https://huggingface.co/unitary/toxic-bert
- https://huggingface.co/docs/transformers/model_doc/t5
- https://github.com/google/sentencepiece#installation
- https://www.confident-ai.com/blog/llm-evaluation-metrics-everything-you-need-for-llm-evaluation
- https://www.datacamp.com/blog/llm-evaluation
- https://www.singlestore.com/blog/complete-guide-to-evaluating-large-language-models/