Model Eval Prototype

This is a toy project to learn about AI model evaluation benchmarks and their implementations.

With this project I want to learn:

  • Integrations using Hugging Face
  • AI model benchmarks and their implementations
  • Refreshing my Python skills
  • ZooKeeper + Kafka as a message broker. In past jobs I used RabbitMQ, Redis, or AWS services for this, so this is a nice chance to get into Kafka.

Repo structure

  • model-eval: This folder contains only scripts; they were my first set of tests.
  • eval-node: This folder contains the application; how to run it is explained below.

Model

Llama 3 8b through Ollama: https://ollama.com/library/llama3:8b
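
Ollama exposes a local HTTP API (by default on port 11434) that evaluation code can call. A minimal sketch of querying the model that way; the helper function is illustrative and not necessarily how eval-node calls the model:

# Minimal sketch: ask llama3:8b a question through Ollama's local HTTP API.
# Assumes Ollama is running on its default port (11434); the helper name is illustrative.
import requests

def ask_llama(prompt: str, model: str = "llama3:8b") -> str:
    """Send a single non-streaming generate request to Ollama and return the text."""
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["response"]

if __name__ == "__main__":
    print(ask_llama("In one sentence, what is SQuAD v2?"))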

Define Benchmarks

For this small prototype, a small set of benchmarks is used (a sketch of how they might be computed follows this list).

  • Accuracy: Measure the model's correctness on a specific task (e.g., classification or Q&A).
  • Toxicity: Evaluate how often the model generates harmful or offensive content.
  • Readability: Assess the fluency and grammatical correctness of the output.
  • Response Time: Measure the time taken for the model to generate a response.
  • Consistency: Evaluate how consistent the model's responses are across repeated queries. Consistency is not evaluated at this point because model-eval/paraphrases.py does not generate enough variation in the questions.
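
A minimal sketch of how these benchmarks could be computed with the libraries from the install step below (evaluate and textstat); the sample prediction and reference are illustrative, not the exact code in eval-node:

# Sketch of the benchmarks using the libraries from the dependency list.
# The sample prediction/reference pair is illustrative only.
import time
import evaluate
import textstat

# Accuracy: the squad_v2 metric returns exact/f1 plus the HasAns_/NoAns_ breakdown,
# which is where the fields in the results JSON further below come from.
squad_metric = evaluate.load("squad_v2")
predictions = [{"id": "q1", "prediction_text": "Paris", "no_answer_probability": 0.0}]
references = [{"id": "q1", "answers": {"text": ["Paris"], "answer_start": [0]}}]
accuracy = squad_metric.compute(predictions=predictions, references=references)

# Toxicity: the `toxicity` measurement scores each text with a hate-speech classifier
# (it downloads the classifier on first use).
toxicity_metric = evaluate.load("toxicity", module_type="measurement")
toxicity_scores = toxicity_metric.compute(predictions=["Paris"])["toxicity"]

# Readability: Flesch Reading Ease per answer (higher means easier to read).
readability_scores = [textstat.flesch_reading_ease(text) for text in ["Paris"]]

# Response Time: wall-clock time around the model call (placeholder here).
start = time.perf_counter()
# ... call the model here ...
response_time = time.perf_counter() - start

print(accuracy, toxicity_scores, readability_scores, response_time)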

Datasets

A dataset that aligns with the benchmarks is needed. Using squad_v2.
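
A minimal sketch of loading a small slice of it with the datasets library (the slice size is arbitrary):

# Load a small slice of squad_v2 with the `datasets` library.
from datasets import load_dataset

# The validation split mixes answerable and unanswerable questions,
# which matches the HasAns_/NoAns_ breakdown in the results below.
sample = load_dataset("squad_v2", split="validation[:10]")
for row in sample:
    print(row["question"], "->", row["answers"]["text"])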

Instructions

First, clone the project.

Python environment

Using uv to manage the Python environment.

  1. Install uv
  2. Initialize the environment with the desired Python version
uv venv model_eval --python 3.12
  3. Activate the environment
source model_eval/bin/activate
  4. Install dependencies
uv pip install requests transformers nltk numpy pandas scikit-learn textstat datasets evaluate torch torchvision torchaudio sentencepiece confluent-kafka fastapi uvicorn

Running the app

  1. Run the Docker Compose file to start Kafka + ZooKeeper
cd eval-node
docker compose up
  2. Create the topics in Kafka (a sketch of how the app might use these topics follows these steps)
# Create the evaluation_tasks topic
docker exec -it kafka kafka-topics --bootstrap-server localhost:9092 --create --topic evaluation_tasks --partitions 1 --replication-factor 1

# Create the evaluation_results topic
docker exec -it kafka kafka-topics --bootstrap-server localhost:9092 --create --topic evaluation_results --partitions 1 --replication-factor 1

# Where:
#   --partitions: The number of partitions for the topic.
#   --replication-factor: The number of replicas for fault tolerance. Use 1 for a single-node setup.

# Now check if the topics are created
docker exec -it kafka kafka-topics --bootstrap-server localhost:9092 --list
  3. Run Ollama in another terminal with this specific model
ollama run llama3:8b
  4. Run the application
uvicorn app:app --host 0.0.0.0 --port 8000

Optionally, run the app with the DEBUG log level

uvicorn app:app --host 0.0.0.0 --port 8000 --log-level debug
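
The general flow, as the topic names suggest: tasks go into evaluation_tasks, a worker consumes them, runs the evaluation, and publishes the result to evaluation_results. A minimal sketch of such a worker loop with confluent-kafka; the topic names match the setup above, while the group id and message shape are assumptions:

# Sketch of a worker loop: consume tasks, evaluate, publish results.
# Topic names match the setup above; the group id and message shape are assumptions.
import json
from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "eval-workers",  # assumed group id
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["evaluation_tasks"])
producer = Producer({"bootstrap.servers": "localhost:9092"})

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        task = json.loads(msg.value())
        # ... run the benchmarks for task["model_name"] on task["dataset"] here ...
        result = {"task_id": task["task_id"], "status": "completed", "metrics": {}}
        producer.produce("evaluation_results", json.dumps(result).encode("utf-8"))
        producer.flush()
finally:
    consumer.close()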

Test the app

Once the application is up and running, there are two HTTP endpoints to interact with:

  • [POST] http://localhost:8000/submit-task This endpoint receives the task to start the evaluation. Body parameters:
{
    // Model to run the evaluations on
    "model_name": "llama3:8b",
    // Dataset to load
    "dataset": "squad_v2"
}

The endpoint will return a task_id so we can check the status later on.

Example:

 curl -X POST "http://localhost:8000/submit-task" -H "Content-Type: application/json" -d '{"model_name": "llama3:8b", "dataset": "squad_v2"}'
{"task_id":"83c68245-3d05-44d4-a257-c63f95ecb3c8","status":"pending"}

  • [GET] http://localhost:8000/task-status/<task_id> This endpoint returns the current status of a task and, once it is completed, its metrics.

Examples:

# Example output of a processing task
curl "http://localhost:8000/task-status/f0af065a-9d27-4fa7-af3c-2ad04b36af6b"
{
  "task_id": "f0af065a-9d27-4fa7-af3c-2ad04b36af6b",
  "status": "processing",
  "timestamp": 1737386569.4359438,
  "metadata": {},
  "history": [
    { "status": "pending", "timestamp": 1737386568.8356962 },
    { "status": "processing", "timestamp": 1737386569.435947 }
  ]
}

# Example output of a completed task
curl "http://localhost:8000/task-status/f0af065a-9d27-4fa7-af3c-2ad04b36af6b"
{
  "task_id": "f0af065a-9d27-4fa7-af3c-2ad04b36af6b",
  "status": "completed",
  "timestamp": 1737386634.716165,
  "metadata": {
    "task_id": "f0af065a-9d27-4fa7-af3c-2ad04b36af6b",
    "status": "completed",
    "model_name": "llama3:8b",
    "dataset": "squad_v2",
    "metrics": {
      "accuracy": {
        "exact": 40.0,
        "f1": 40.0,
        "total": 10,
        "HasAns_exact": 0.0,
        "HasAns_f1": 0.0,
        "HasAns_total": 6,
        "NoAns_exact": 100.0,
        "NoAns_f1": 100.0,
        "NoAns_total": 4,
        "best_exact": 40.0,
        "best_exact_thresh": 0.0,
        "best_f1": 40.0,
        "best_f1_thresh": 0.0
      },
      "toxicity": {
        "toxicity_scores": [
          0.0019954408053308725, 0.0019954408053308725, 0.0019954408053308725,
          0.0019954408053308725, 0.0019954408053308725, 0.0019954408053308725,
          0.0019954408053308725, 0.0019954408053308725, 0.0019954408053308725,
          0.0019954408053308725
        ]
      },
      "readability": {
        "readability_scores": [
          206.84, 206.84, 206.84, 206.84, 206.84, 206.84, 206.84, 206.84,
          206.84, 206.84
        ]
      },
      "evaluation_time": 0.6611873750807717
    },
    "timestamp": 1737386634.714082
  },
  "history": [
    { "status": "pending", "timestamp": 1737386568.8356962 },
    { "status": "processing", "timestamp": 1737386569.435947 },
    { "status": "completed", "timestamp": 1737386634.7161658 }
  ]
}
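
The same flow from Python, submitting a task and polling its status until the evaluation completes (just an alternative to the curl commands above):

# Submit a task and poll its status until the evaluation completes.
import time
import requests

resp = requests.post(
    "http://localhost:8000/submit-task",
    json={"model_name": "llama3:8b", "dataset": "squad_v2"},
)
task_id = resp.json()["task_id"]

while True:
    status = requests.get(f"http://localhost:8000/task-status/{task_id}").json()
    print(status["status"])
    if status["status"] == "completed":
        break
    time.sleep(5)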

Links and references
