This is a toy project to learn about AI model evaluation benchmarks and their implementations.
With this project I want to learn:
- Integrations using Huggingface
- AI Model Benchmarks and their implementations
- Refresh python skills
- Zookeeper + Kafka as a message broker. In past jobs and experiences I used RabbitMQ, Redis, or AWS services for this, so this is a nice chance to get into Kafka
model-eval
: This folder contains only scripts. They were my first set of tests.

eval-node
: This folder contains the application that I explain how to run below.
Llama 3 8b through Ollama: https://ollama.com/library/llama3:8b
For this small prototype, a limited set of benchmarks will be used.
- Accuracy: Measure the model's correctness on a specific task (e.g., classification or Q&A).
- Toxicity: Evaluate how often the model generates harmful or offensive content.
- Readability: Assess the fluency and grammatical correctness of the output.
- Response Time: Measure the time taken for the model to generate a response.
- Consistency: Evaluate how consistent the model's responses are across repeated queries. Consistency is not evaluated at this point because model-eval/paraphrases.py does not generate enough variations of the questions. A sketch of how the other metrics could be computed follows this list.
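As a rough sketch, the first four benchmarks could be computed with the libraries installed below: the squad_v2 metric from evaluate for accuracy, unitary/toxic-bert for toxicity, and textstat's Flesch Reading Ease for readability (which maxes out around 206.8, matching the scores in the example output further down). The function and parameter names here are illustrative, not the project's actual code:

```python
import time

import evaluate
import textstat
from transformers import pipeline

squad_metric = evaluate.load("squad_v2")
toxicity_model = pipeline("text-classification", model="unitary/toxic-bert")

def score_prediction(example, generate_answer):
    # generate_answer(question, context) -> str is the model under test
    # (hypothetical callable; e.g. a wrapper around the Ollama API).
    start = time.time()
    answer = generate_answer(example["question"], example["context"])
    response_time = time.time() - start  # Response Time benchmark

    # Accuracy: squad_v2 expects ids, prediction text, and a no-answer probability.
    accuracy = squad_metric.compute(
        predictions=[{
            "id": example["id"],
            "prediction_text": answer,
            "no_answer_probability": 0.0,  # naive: always claim an answer
        }],
        references=[{"id": example["id"], "answers": example["answers"]}],
    )
    return {
        "accuracy": accuracy["exact"],
        "toxicity": toxicity_model(answer)[0]["score"],        # Toxicity benchmark
        "readability": textstat.flesch_reading_ease(answer),   # Readability benchmark
        "response_time": response_time,
    }
```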
A dataset that aligns with these benchmarks is needed; this project uses squad_v2.
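Loading that dataset with the datasets library is straightforward, and a small slice keeps runs fast (the completed-task example below evaluates 10 questions). A minimal sketch:

```python
from datasets import load_dataset

# Load the SQuAD v2 validation split and keep a small slice for quick runs.
dataset = load_dataset("squad_v2", split="validation")
sample = dataset.select(range(10))
print(sample[0]["question"])
```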
First, clone the project.
uv is used to manage the Python environment.
- Install uv
- Initialize the environment with a desired Python version
uv venv model_eval --python 3.12
- Activate the environment
source model_eval/bin/activate
- Install dependencies
uv pip install requests transformers nltk numpy pandas scikit-learn textstat datasets evaluate torch torchvision torchaudio sentencepiece confluent-kafka fastapi uvicorn
- Run the Docker Compose file to start Kafka + ZooKeeper
cd eval-node
docker compose up
- Create the topics in Kafka
# Create the evaluation_tasks topic
docker exec -it kafka kafka-topics --bootstrap-server localhost:9092 --create --topic evaluation_tasks --partitions 1 --replication-factor 1
# Create the evaluation_results topic
docker exec -it kafka kafka-topics --bootstrap-server localhost:9092 --create --topic evaluation_results --partitions 1 --replication-factor 1
# Where:
# --partitions: The number of partitions for the topic.
# --replication-factor: The number of replicas for fault tolerance. Use 1 for a single-node setup.
# Now check if the topics are created
docker exec -it kafka kafka-topics --bootstrap-server localhost:9092 --list
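Since the app already depends on confluent-kafka, the same check can be done from Python with its admin client; a quick sketch:

```python
from confluent_kafka.admin import AdminClient

# List the topics known to the broker and confirm ours exist.
admin = AdminClient({"bootstrap.servers": "localhost:9092"})
topics = admin.list_topics(timeout=5).topics
for name in ("evaluation_tasks", "evaluation_results"):
    print(name, "OK" if name in topics else "MISSING")
```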
- Run ollama in another terminal with this specific model
ollama run llama3:8b
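The application presumably reaches this model through Ollama's local HTTP API. A minimal sketch of such a call (the /api/generate endpoint and payload follow Ollama's documented REST interface; the prompt is just an example):

```python
import requests

# Ask the locally running llama3:8b for a completion via Ollama's REST API.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3:8b",
        "prompt": "What is the capital of France?",  # example prompt
        "stream": False,  # return one JSON object instead of a stream
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])
```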
- Run the application
uvicorn app:app --host 0.0.0.0 --port 8000
Optionally, run the app with DEBUG log level:
uvicorn app:app --host 0.0.0.0 --port 8000 --log-level debug
Once the application is up and running, you have two HTTP endpoints to interact with:
- [POST] http://localhost:8000/submit-task
This endpoint receives the task to start the evaluation. Body parameters:
{
// Model to run the evaluations on
"model_name": "llama3:8b",
// Dataset to load
"dataset": "squad_v2"
}
The endpoint will return a task_id so we can check the status later on.
Example:
curl -X POST "http://localhost:8000/submit-task" -H "Content-Type: application/json" -d '{"model_name": "llama3:8b", "dataset": "squad_v2"}'
{"task_id":"83c68245-3d05-44d4-a257-c63f95ecb3c8","status":"pending"}
- [GET] http://localhost:8000/task-status/<task_id>
This endpoint receives the task_id as a path parameter and returns the current status of the task as well as the status history.
Examples:
# Example output of a processing task
curl "http://localhost:8000/task-status/f0af065a-9d27-4fa7-af3c-2ad04b36af6b"
{
"task_id": "f0af065a-9d27-4fa7-af3c-2ad04b36af6b",
"status": "processing",
"timestamp": 1737386569.4359438,
"metadata": {},
"history": [
{ "status": "pending", "timestamp": 1737386568.8356962 },
{ "status": "processing", "timestamp": 1737386569.435947 }
]
}
# Example output of a completed task
curl "http://localhost:8000/task-status/f0af065a-9d27-4fa7-af3c-2ad04b36af6b"
{
"task_id": "f0af065a-9d27-4fa7-af3c-2ad04b36af6b",
"status": "completed",
"timestamp": 1737386634.716165,
"metadata": {
"task_id": "f0af065a-9d27-4fa7-af3c-2ad04b36af6b",
"status": "completed",
"model_name": "llama3:8b",
"dataset": "squad_v2",
"metrics": {
"accuracy": {
"exact": 40.0,
"f1": 40.0,
"total": 10,
"HasAns_exact": 0.0,
"HasAns_f1": 0.0,
"HasAns_total": 6,
"NoAns_exact": 100.0,
"NoAns_f1": 100.0,
"NoAns_total": 4,
"best_exact": 40.0,
"best_exact_thresh": 0.0,
"best_f1": 40.0,
"best_f1_thresh": 0.0
},
"toxicity": {
"toxicity_scores": [
0.0019954408053308725, 0.0019954408053308725, 0.0019954408053308725,
0.0019954408053308725, 0.0019954408053308725, 0.0019954408053308725,
0.0019954408053308725, 0.0019954408053308725, 0.0019954408053308725,
0.0019954408053308725
]
},
"readability": {
"readability_scores": [
206.84, 206.84, 206.84, 206.84, 206.84, 206.84, 206.84, 206.84,
206.84, 206.84
]
},
"evaluation_time": 0.6611873750807717
},
"timestamp": 1737386634.714082
},
"history": [
{ "status": "pending", "timestamp": 1737386568.8356962 },
{ "status": "processing", "timestamp": 1737386569.435947 },
{ "status": "completed", "timestamp": 1737386634.7161658 }
]
}
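And a matching sketch for the status endpoint plus the history bookkeeping, building on the tasks store from the previous sketch (again illustrative; the helper names are hypothetical):

```python
import json
import time
from threading import Thread

from confluent_kafka import Consumer
from fastapi import HTTPException

@app.get("/task-status/{task_id}")
def task_status(task_id: str):
    # Return the current status plus the full status history for the task.
    if task_id not in tasks:
        raise HTTPException(status_code=404, detail="task not found")
    return {"task_id": task_id, **tasks[task_id]}

def set_status(task_id, status, metadata=None):
    # Append every transition to the history, mirroring the JSON shown above.
    now = time.time()
    task = tasks[task_id]
    task["status"] = status
    task["timestamp"] = now
    if metadata is not None:
        task["metadata"] = metadata
    task["history"].append({"status": status, "timestamp": now})

def consume_results():
    # Background loop: read finished evaluations and mark their tasks completed.
    consumer = Consumer({
        "bootstrap.servers": "localhost:9092",
        "group.id": "api-results",
        "auto.offset.reset": "earliest",
    })
    consumer.subscribe(["evaluation_results"])
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        result = json.loads(msg.value())
        set_status(result["task_id"], "completed", metadata=result)

Thread(target=consume_results, daemon=True).start()
```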
- https://huggingface.co/datasets/rajpurkar/squad_v2
- https://github.com/huggingface/evaluate/blob/main/metrics/squad_v2/README.md
- https://huggingface.co/docs/hub/datasets-usage
- https://rajpurkar.github.io/SQuAD-explorer/
- https://huggingface.co/docs/evaluate/choosing_a_metric
- https://huggingface.co/docs/datasets/v1.13.2/package_reference/main_classes.html?highlight=select#dataset
- https://huggingface.co/unitary/toxic-bert
- https://huggingface.co/docs/transformers/model_doc/t5
- https://github.com/google/sentencepiece#installation
- https://www.confident-ai.com/blog/llm-evaluation-metrics-everything-you-need-for-llm-evaluation
- https://www.datacamp.com/blog/llm-evaluation
- https://www.singlestore.com/blog/complete-guide-to-evaluating-large-language-models/