A workshop project demonstrating how to build and evaluate LLM-powered classification systems using real-world data from Bluesky social network.
This workshop presents a comprehensive methodology for productizing robust LLM applications through tracing, dataset creation, and evaluation. Using W&B Weave, we'll build a reliable evaluation pipeline step by step.
The project showcases a practical example of building an LLM evaluation pipeline using:
- Bluesky posts as source data
- OpenAI's GPT-4 and other LLMs for classification
- Weights & Biases (Weave) for evaluation tracking and dataset versioning
- Gradio for the interactive UI
You can run this workshop directly in Colab by clicking the badge above. To run locally:
- Clone the repository:

  ```bash
  git clone https://github.com/altryne/llm-evals-workshop
  ```

- Install dependencies (note: this project requires Python 3.10 or higher):

  ```bash
  pip install uv
  uv pip install -r requirements.txt
  ```
- Set up environment variables: copy `.env.example` to `.env` and fill in your credentials:

  ```
  WANDB_API_KEY=your-wandb-api-key-here
  OPENAI_API_KEY=your-openai-api-key-here
  GEMINI_API_KEY=your-gemini-api-key-here
  OPENROUTER_API_KEY=your-openrouter-api-key-here
  ```
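When running locally, the notebook reads these credentials from the environment. Here is a minimal sketch of loading them, assuming the `python-dotenv` package is available (its presence in `requirements.txt` is an assumption):

```python
# Hypothetical snippet: load credentials from .env before creating any API clients.
import os

from dotenv import load_dotenv  # assumes python-dotenv is installed

load_dotenv()  # reads key=value pairs from .env into the process environment

# Fail fast if the keys used throughout the workshop are missing.
for key in ("WANDB_API_KEY", "OPENAI_API_KEY"):
    if not os.environ.get(key):
        raise RuntimeError(f"{key} is not set; add it to your .env file")
```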
- Interactive UI: Built with Gradio for easy post classification and feedback collection
- Evaluation Pipeline: Uses Weights & Biases Weave (sketched below) for:
  - Tracing LLM calls and responses
  - Dataset versioning and management
  - Evaluation tracking and analysis
- Dataset Creation: Tools for building and annotating datasets from Bluesky posts
- Multi-Model Support: Supports multiple LLM providers (OpenAI, Gemini, OpenRouter)
- Comprehensive Evaluation Methods:
  - Programmatic scoring for structured outputs
  - Human-in-the-loop (HITL) annotations
  - LLM-as-judge evaluations
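To make the tracing bullet concrete, here is a minimal sketch of wrapping a classification call in a Weave op. The project name, model, prompt, and example post are illustrative assumptions, not the workshop's exact values:

```python
import os

import weave
from openai import OpenAI

# Initialize Weave; calls to @weave.op functions are traced under this project.
weave.init("bluesky-post-classification")  # hypothetical project name

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

@weave.op()  # records inputs, outputs, and latency for each call in the Weave UI
def classify_post(post_text: str) -> str:
    """Classify a Bluesky post into a single category."""
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative; the notebook lets you swap providers
        messages=[
            {"role": "system", "content": "Classify the post. Reply with exactly one word."},
            {"role": "user", "content": post_text},
        ],
    )
    return response.choices[0].message.content.strip()

if __name__ == "__main__":
    print(classify_post("Just tried the new evals workshop, highly recommend it!"))
```

Each call to `classify_post` then shows up as an inspectable trace in the Weave dashboard.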
The workshop covers three main evaluation methods:

- **Programmatic Scoring**
  - Fast and reliable for structured outputs
  - Uses string matching and regex
  - Best for exact match or pattern-based evaluation
  - Example: checking if the LLM classification matches ground truth (see the sketch after this list)
- **Human-in-the-Loop (HITL)**
  - Manual review and annotation
  - Creates high-quality ground truth data
  - Used for kickstarting evaluation datasets
  - Interactive UI for efficient annotation
- **LLM-as-Judge**
  - Uses LLMs to evaluate other LLMs (also sketched after this list)
  - Handles open-ended responses
  - Cost-effective alternative to human evaluation
  - Includes best practices and limitations
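For the programmatic and LLM-as-judge methods, here is a hedged sketch of what such scorers can look like. The argument names (`label`, `output`, `post_text`), the category set, and the judge prompt are assumptions made for illustration, not the notebook's exact implementation:

```python
import re

import weave
from openai import OpenAI

client = OpenAI()

# Programmatic scoring: exact match plus a regex sanity check against ground truth.
# Decorated with @weave.op so scorer calls are traced once weave.init has run.
@weave.op()
def programmatic_scorer(label: str, output: str) -> dict:
    prediction = output.strip().lower()
    return {
        "exact_match": prediction == label.strip().lower(),
        "valid_category": bool(re.fullmatch(r"news|opinion|spam|other", prediction)),
    }

# LLM-as-judge: a second model grades the classification of an open-ended post.
@weave.op()
def llm_judge_scorer(post_text: str, output: str) -> dict:
    verdict = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative judge model
        messages=[
            {"role": "system", "content": "You grade post classifications. Answer only YES or NO."},
            {"role": "user", "content": f"Post: {post_text}\nClassification: {output}\nIs this reasonable?"},
        ],
    )
    answer = verdict.choices[0].message.content.strip().upper()
    return {"judge_approves": answer.startswith("YES")}
```

Scorer functions like these can be plugged into `weave.Evaluation(dataset=..., scorers=[...])`, which is roughly how the notebook ties scoring to the versioned dataset (the exact wiring differs).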
To work through the workshop:

- Run the Jupyter notebook, `eval.ipynb`.
- Follow the instructions in the notebook. The items to do yourself are marked with `#TODO`.
The repository contains:

- `eval.ipynb`: Main notebook with the implementation and UI
- `templates/`: HTML templates for post display
- `data/`: JSON files containing Bluesky posts
- `.env`: Configuration for API keys and credentials
Created by Alex Volkov for Weights & Biases