"source": "# Optimizing Language Models with DSPy GEPA\n\n_Authored by: [Behrooz Azarkhalili](https://github.com/behroozazarkhalili)_\n\nThis notebook demonstrates how to use [DSPy](https://dspy.ai/)'s GEPA (Generalized Error-driven Prompt Augmentation) optimizer to improve language model performance on mathematical reasoning tasks. We'll work with the [NuminaMath-1.5 dataset](https://huggingface.co/datasets/AI-MO/NuminaMath-1.5) and show how GEPA can boost accuracy through automated prompt optimization.\n\n**What you'll learn:**\n- Setting up DSPy with language models ([OpenRouter](https://openrouter.ai/)) \n- Processing and filtering mathematical problem datasets\n- Building a baseline Chain-of-Thought reasoning program\n- Optimizing prompts with GEPA using error-driven feedback\n- Evaluating improvements in model accuracy\n\n\nGEPA works by analyzing errors, generating targeted feedback, and automatically refining prompts to address common failure patterns. This makes it particularly effective for complex reasoning tasks where prompt quality significantly impacts performance.\n\n**Key Resources:**\n- [DSPy Documentation](https://dspy.ai/learn/programming/)\n- [Chain-of-Thought Prompting Paper](https://arxiv.org/abs/2201.11903)\n- [GEPA Optimizer Guide](https://dspy.ai/api/optimizers/GEPA/)"
7
+
"source": [
8
+
"# Prompt Optimization for Language Models with DSPy GEPA\n",
"This notebook demonstrates how to use [DSPy](https://dspy.ai/)'s GEPA (Generalized Error-driven Prompt Augmentation) optimizer to improve language model performance on mathematical reasoning tasks. We'll work with the [NuminaMath-1.5 dataset](https://huggingface.co/datasets/AI-MO/NuminaMath-1.5) and show how GEPA can boost accuracy through automated prompt optimization.\n",
13
+
"\n",
14
+
"**What you'll learn:**\n",
15
+
"- Setting up DSPy with language models ([OpenRouter](https://openrouter.ai/)) \n",
16
+
"- Processing and filtering mathematical problem datasets\n",
17
+
"- Building a baseline Chain-of-Thought reasoning program\n",
18
+
"- Optimizing prompts with GEPA using error-driven feedback\n",
19
+
"- Evaluating improvements in model accuracy\n",
20
+
"\n",
21
+
"\n",
22
+
"GEPA works by analyzing errors, generating targeted feedback, and automatically refining prompts to address common failure patterns. This makes it particularly effective for complex reasoning tasks where prompt quality significantly impacts performance.\n",
"source": "## Installation and Setup\n\nInstall required dependencies and import libraries for DSPy, dataset processing, and model configuration.\n\n**Installation Options:**\n- **uv** - Fast Python package installer ([documentation](https://docs.astral.sh/uv/))\n- **pip** - Traditional Python package manager\n\n**Key Dependencies:**\n- `dspy` - DSPy framework for language model programming\n- `datasets` - Hugging Face datasets library for loading NuminaMath-1.5\n- `python-dotenv` - Environment variable management for API keys"
34
+
"source": [
35
+
"## Installation and Setup\n",
36
+
"\n",
37
+
"Install required dependencies and import libraries for DSPy, dataset processing, and model configuration.\n",
38
+
"\n",
39
+
"**Installation Options:**\n",
40
+
"- **uv** - Fast Python package installer ([documentation](https://docs.astral.sh/uv/))\n",
41
+
"- **pip** - Traditional Python package manager\n",
42
+
"\n",
43
+
"**Key Dependencies:**\n",
44
+
"- `dspy` - DSPy framework for language model programming\n",
45
+
"- `datasets` - Hugging Face datasets library for loading NuminaMath-1.5\n",
46
+
"- `python-dotenv` - Environment variable management for API keys"
"source": "### Understanding GEPA's Two-Model Architecture\n\nGEPA's breakthrough innovation lies in its **dual-model approach** for reflective prompt optimization, which fundamentally differs from traditional single-model optimizers.\n\n**Why Two Models?**\n\nTraditional prompt optimizers rely on scalar metrics (accuracy scores) to guide improvements, essentially using trial-and-error without understanding *why* predictions fail. GEPA introduces a revolutionary approach by separating concerns:\n\n**1. Student LM (Inference Model)**\n- **Role**: Primary model that executes tasks and generates predictions\n- **Characteristics**: Fast, cost-efficient, handles high-volume inference\n- **Usage Pattern**: ~90-95% of all API calls during optimization\n- **In This Notebook**: `openrouter/openai/gpt-4.1-nano`\n\n**2. Reflection LM (Meta-Cognitive Model)**\n- **Role**: Analyzes failures, identifies patterns, and generates prompt improvements\n- **Characteristics**: Stronger reasoning, analytical depth, interpretability\n- **Usage Pattern**: ~5-10% of API calls (only during reflection phases)\n- **In This Notebook**: `openrouter/qwen/qwen3-next-80b-a3b-thinking`\n\n**The Reflective Optimization Cycle:**\n\n```\n1. Student LM solves training problems → predictions\n2. Metric provides rich textual feedback on failures\n3. Reflection LM analyzes batches of failures → identifies patterns\n4. Reflection LM generates improved prompt instructions\n5. Student LM tests new prompts → validation\n6. Repeat until convergence\n```\n\n**Research Foundation:**\n\nThis approach is detailed in the paper [\"Reflective Prompt Evolution Can Outperform Reinforcement Learning\"](https://arxiv.org/abs/2507.19457), which demonstrates that reflective optimization with textual feedback outperforms reinforcement learning approaches on complex reasoning tasks.",
"GEPA's breakthrough innovation lies in its **dual-model approach** for reflective prompt optimization, which fundamentally differs from traditional single-model optimizers.\n",
147
+
"\n",
148
+
"**Why Two Models?**\n",
149
+
"\n",
150
+
"Traditional prompt optimizers rely on scalar metrics (accuracy scores) to guide improvements, essentially using trial-and-error without understanding *why* predictions fail. GEPA introduces a revolutionary approach by separating concerns:\n",
151
+
"\n",
152
+
"**1. Student LM (Inference Model)**\n",
153
+
"- **Role**: Primary model that executes tasks and generates predictions\n",
"This approach is detailed in the paper [\"Reflective Prompt Evolution Can Outperform Reinforcement Learning\"](https://arxiv.org/abs/2507.19457), which demonstrates that reflective optimization with textual feedback outperforms reinforcement learning approaches on complex reasoning tasks."
"source": "### Understanding the Optimization Results\n\n**Performance Improvement:**\n- **Baseline Accuracy**: 52.2% (47/90 correct)\n- **Optimized Accuracy**: 57.8% (52/90 correct)\n- **Improvement**: +5.6 percentage points (~11% relative improvement)\n\n**What Changed:**\nSee the instruction GEPA developed above.\n\n**Why the Modest Improvement?**\n\nThe ~6% gain is expected given:\n1. **Small Training Set**: Only 112 training examples (0.025% of full dataset)\n2. **Light Optimization**: Using `auto=\"light\"` for faster iteration\n3. **Simple Baseline**: Chain-of-Thought already provides decent reasoning structure\n4. **Model Limitations**: GPT-4.1 Nano's mathematical capabilities are the ceiling\n\n**Cost Efficiency:**\n\nThis entire experiment (baseline evaluation, GEPA optimization, and final evaluation on 224 examples) cost **less than $0.50** thanks to:\n- GPT-4.1 Nano's low pricing ($0.10/M input, $0.40/M output)\n- Asymmetric architecture (cheap model for 99% of calls, smart model for 1%)\n- Small sample size for demonstration purposes\n\n**Key Takeaway:**\n\nEven with limited data and light optimization, GEPA successfully identified failure patterns and generated targeted prompt improvements. With more training data (`sample_fraction=0.01` or higher) and heavier optimization (`auto=\"medium\"` or `\"heavy\"`), we'd expect 15-25% improvements, potentially reaching 65-70% accuracy."
"- Asymmetric architecture (cheap model for 99% of calls, smart model for 1%)\n",
2319
+
"- Small sample size for demonstration purposes\n",
2320
+
"\n",
2321
+
"**Key Takeaway:**\n",
2322
+
"\n",
2323
+
"Even with limited data and light optimization, GEPA successfully identified failure patterns and generated targeted prompt improvements. With more training data (`sample_fraction=0.01` or higher) and heavier optimization (`auto=\"medium\"` or `\"heavy\"`), we'd expect 15-25% improvements, potentially reaching 65-70% accuracy."
2324
+
]
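To make these knobs concrete, here is a sketch of how such a run could be wired together. The metric is a simplified stand-in for the notebook's actual feedback metric, and `program`, `trainset`, `valset`, and the `answer` field names are assumptions carried over from earlier cells:

```python
import dspy

def metric_with_feedback(gold, pred, trace=None, pred_name=None, pred_trace=None):
    """Simplified stand-in metric: a score plus the textual feedback GEPA reflects on."""
    # `answer` field name is an assumption about the signature used earlier
    correct = pred.answer.strip() == gold.answer.strip()
    feedback = (
        "Correct."
        if correct
        else f"Wrong answer: got {pred.answer!r}, expected {gold.answer!r}. "
             "Check each algebraic step before stating the final answer."
    )
    return dspy.Prediction(score=float(correct), feedback=feedback)

gepa = dspy.GEPA(
    metric=metric_with_feedback,
    reflection_lm=reflection_lm,  # the stronger model configured earlier
    auto="light",                 # try "medium" or "heavy" for larger budgets
)

# `program` is the baseline Chain-of-Thought module; `trainset`/`valset`
# are lists of dspy.Example (names assumed from earlier cells)
optimized_program = gepa.compile(program, trainset=trainset, valset=valset)
```

Switching `auto` to `"medium"` or `"heavy"` raises the reflection budget and is the first lever to pull alongside a larger `sample_fraction` when chasing bigger gains.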
"source": "## Learn More\n\nThis notebook introduced DSPy's GEPA optimizer for automated prompt improvement. Here are additional resources to deepen your understanding:\n\n### DSPy Framework\n- **[DSPy Documentation](https://dspy.ai/)** - Official documentation and guides\n- **[DSPy GitHub Repository](https://github.com/stanfordnlp/dspy)** - Source code and examples\n- **[DSPy Research Paper](https://arxiv.org/abs/2310.03714)** - \"DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines\"\n- **[DSPy Tutorial Series](https://dspy.ai/learn/programming/)** - Step-by-step learning path\n\n### Prompt Optimization\n- **[GEPA Optimizer Documentation](https://dspy.ai/api/optimizers/GEPA/)** - Technical details on GEPA\n- **[Chain-of-Thought Prompting](https://arxiv.org/abs/2201.11903)** - Foundational paper on CoT reasoning\n- **[Automatic Prompt Engineering](https://arxiv.org/abs/2211.01910)** - \"Large Language Models Are Human-Level Prompt Engineers\"\n- **[DSPy Optimizers Comparison](https://dspy.ai/api/optimizers/)** - Overview of different optimization strategies\n\n### Mathematical Reasoning\n- **[NuminaMath Dataset](https://huggingface.co/datasets/AI-MO/NuminaMath-1.5)** - The dataset used in this notebook\n- **[GSM8K Dataset](https://huggingface.co/datasets/gsm8k)** - Grade school math word problems benchmark\n- **[MATH Dataset](https://huggingface.co/datasets/hendrycks/competition_math)** - Competition-level mathematics problems\n- **[Mathematical Reasoning with LLMs](https://arxiv.org/abs/2206.14858)** - Survey of techniques\n\n### Related Techniques\n- **[Few-Shot Learning](https://arxiv.org/abs/2005.14165)** - \"Language Models are Few-Shot Learners\" (GPT-3 paper)\n- **[Self-Consistency](https://arxiv.org/abs/2203.11171)** - Improving reasoning via multiple sampling paths\n- **[ReAct Prompting](https://arxiv.org/abs/2210.03629)** - Reasoning and Acting in language models\n\n### Tools and Platforms\n- **[OpenRouter](https://openrouter.ai/)** - Unified API for multiple LLM providers\n- **[Hugging Face Datasets](https://huggingface.co/docs/datasets/)** - Dataset loading and processing\n- **[DSPy Optimizers Guide](https://dspy.ai/deep-dive/optimizers/)** - Deep dive into optimization strategies",
2226
-
"metadata": {}
2329
+
"metadata": {},
2330
+
"source": [
2331
+
"## Learn More\n",
2332
+
"\n",
2333
+
"This notebook introduced DSPy's GEPA optimizer for automated prompt improvement. Here are additional resources to deepen your understanding:\n",
2334
+
"\n",
2335
+
"### DSPy Framework\n",
2336
+
"- **[DSPy Documentation](https://dspy.ai/)** - Official documentation and guides\n",
2337
+
"- **[DSPy GitHub Repository](https://github.com/stanfordnlp/dspy)** - Source code and examples\n",
2338
+
"- **[DSPy Research Paper](https://arxiv.org/abs/2310.03714)** - \"DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines\"\n",