|
7 | 7 | "source": [ |
8 | 8 | "# Optimizing Language Models with DSPy GEPA: From 42% to 64% Accuracy\n", |
9 | 9 | "\n", |
| 10 | + "_Authored by: [Behrooz Azarkhalili](https://github.com/behroozazarkhalili)_\n", |
| 11 | + "\n", |
10 | 12 | "This notebook demonstrates how to use DSPy's GEPA (Genetic-Pareto) optimizer to improve language model performance on mathematical reasoning tasks. We'll work with the NuminaMath-1.5 dataset and show how GEPA can boost accuracy from 42% to 64% through automated prompt optimization.\n", |
11 | 13 | "\n", |
12 | 14 | "**What you'll learn:**\n", |
|
24 | 26 | "GEPA works by analyzing errors, generating targeted feedback, and automatically refining prompts to address common failure patterns. This makes it particularly effective for complex reasoning tasks where prompt quality significantly impacts performance." |
25 | 27 | ] |
26 | 28 | }, |
| 29 | + { |
| 30 | + "cell_type": "markdown", |
| 31 | + "id": "99b369f9", |
| 32 | + "metadata": {}, |
| 33 | + "source": [ |
| 34 | + "## Installation and Setup\n", |
| 35 | + "\n", |
| 36 | + "Install required dependencies and import libraries for DSPy, dataset processing, and model configuration." |
| 37 | + ] |
| 38 | + }, |
27 | 39 | { |
28 | 40 | "cell_type": "code", |
29 | 41 | "execution_count": null, |
|
67 | 79 | "print(\"🔄 Make sure Ollama is running: ollama run qwen3:8b\")" |
68 | 80 | ] |
69 | 81 | }, |
| 82 | + { |
| 83 | + "cell_type": "markdown", |
| 84 | + "id": "ee1fa682", |
| 85 | + "metadata": {}, |
| 86 | + "source": [ |
| 87 | + "## Language Model Configuration\n", |
| 88 | + "\n", |
| 89 | + "Configure your language model, either local (Ollama) or cloud-based (OpenRouter), for use with DSPy."
| 90 | + ] |
| 91 | + }, |
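A minimal configuration sketch for the local option (this assumes DSPy's `dspy.LM` LiteLLM-style interface and an Ollama server on its default port; the `qwen3:8b` model name follows the reminder printed earlier in the notebook):

```python
import dspy

# Local model served by Ollama (assumes `ollama run qwen3:8b` is already running).
lm = dspy.LM(
    "ollama_chat/qwen3:8b",
    api_base="http://localhost:11434",
    api_key="",  # Ollama needs no key
)

# Make this LM the default for all DSPy modules in the notebook.
dspy.configure(lm=lm)
```

For the cloud-based path, the same `dspy.LM` call works with an OpenRouter model string and an `api_key` from your environment instead.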
70 | 92 | { |
71 | 93 | "cell_type": "code", |
72 | 94 | "execution_count": null, |
|
99 | 121 | "train_split = load_dataset(\"AI-MO/NuminaMath-1.5\")['train']" |
100 | 122 | ] |
101 | 123 | }, |
| 124 | + { |
| 125 | + "cell_type": "markdown", |
| 126 | + "id": "aca72fbc", |
| 127 | + "metadata": {}, |
| 128 | + "source": [ |
| 129 | + "## Dataset Loading and Filtering\n", |
| 130 | + "\n", |
| 131 | + "Load the NuminaMath-1.5 dataset and filter for problems with numeric answers suitable for evaluation." |
| 132 | + ] |
| 133 | + }, |
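The numeric-answer filter can be sketched in plain Python. The `is_numeric_answer` helper below is illustrative, not the notebook's exact code; it keeps only examples whose ground-truth answer parses as a number, which is what makes exact-match scoring possible later:

```python
def is_numeric_answer(answer: str) -> bool:
    """Return True if the answer string parses as a plain number."""
    try:
        float(answer.strip().replace(",", ""))
        return True
    except ValueError:
        return False

# Keep only examples with numeric ground-truth answers.
examples = [
    {"problem": "2 + 2 = ?", "answer": "4"},
    {"problem": "Prove the identity.", "answer": "\\text{proof}"},
]
numeric_only = [ex for ex in examples if is_numeric_answer(ex["answer"])]
```

Proof-style answers like the second example are dropped, since they cannot be checked by numeric comparison.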
102 | 134 | { |
103 | 135 | "cell_type": "code", |
104 | 136 | "execution_count": null, |
|
180 | 212 | " return train_set, val_set, test_set" |
181 | 213 | ] |
182 | 214 | }, |
| 215 | + { |
| 216 | + "cell_type": "markdown", |
| 217 | + "id": "e6d6b6f9", |
| 218 | + "metadata": {}, |
| 219 | + "source": [ |
| 220 | + "## Dataset Preparation Functions\n", |
| 221 | + "\n", |
| 222 | + "Helper functions to process the dataset, split it into train/val/test sets, and preview examples." |
| 223 | + ] |
| 224 | + }, |
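A sketch of the split helper is below. The 60/20/20 ratios and the fixed seed are assumptions for illustration; the notebook's actual function returns `train_set, val_set, test_set` as shown in the cell above:

```python
import random

def split_dataset(examples, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle examples reproducibly, then split into train/val/test lists."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train_set = shuffled[:n_train]
    val_set = shuffled[n_train:n_train + n_val]
    test_set = shuffled[n_train + n_val:]
    return train_set, val_set, test_set
```

Fixing the seed matters here: GEPA's before/after accuracy comparison is only meaningful if both evaluations use the same test split.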
183 | 225 | { |
184 | 226 | "cell_type": "code", |
185 | 227 | "execution_count": null, |
|
234 | 276 | "program = dspy.ChainOfThought(GenerateResponse)" |
235 | 277 | ] |
236 | 278 | }, |
| 279 | + { |
| 280 | + "cell_type": "markdown", |
| 281 | + "id": "3659214d", |
| 282 | + "metadata": {}, |
| 283 | + "source": [ |
| 284 | + "## Baseline Chain-of-Thought Program\n", |
| 285 | + "\n", |
| 286 | + "Create a simple baseline using DSPy's Chain-of-Thought module to establish initial performance." |
| 287 | + ] |
| 288 | + }, |
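The baseline can be sketched as a typed DSPy signature wrapped in `dspy.ChainOfThought` (the field names in `GenerateResponse` below are an assumed shape, since the actual signature cell is elided from this diff):

```python
import dspy

class GenerateResponse(dspy.Signature):
    """Solve the math problem and give the final numeric answer."""
    problem: str = dspy.InputField()
    answer: str = dspy.OutputField()

# ChainOfThought adds an intermediate reasoning field before `answer`.
program = dspy.ChainOfThought(GenerateResponse)
```

Calling `program(problem="...")` would then return a prediction with both the reasoning trace and the final answer.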
237 | 289 | { |
238 | 290 | "cell_type": "code", |
239 | 291 | "execution_count": null, |
|
269 | 321 | "evaluate(program)" |
270 | 322 | ] |
271 | 323 | }, |
| 324 | + { |
| 325 | + "cell_type": "markdown", |
| 326 | + "id": "329bacee", |
| 327 | + "metadata": {}, |
| 328 | + "source": [ |
| 329 | + "## Evaluation Metric\n", |
| 330 | + "\n", |
| 331 | + "Define the evaluation metric to compare model predictions against ground truth answers." |
| 332 | + ] |
| 333 | + }, |
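A metric along these lines can be sketched in pure Python. Both helper names (`extract_final_number`, `answer_match`) and the last-number heuristic are illustrative assumptions, not the notebook's exact implementation:

```python
import re

def extract_final_number(text: str):
    """Pull the last number out of a model response (illustrative heuristic)."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text)
    return float(matches[-1]) if matches else None

def answer_match(gold: str, predicted_text: str, tol: float = 1e-6) -> bool:
    """True when the predicted final number equals the gold numeric answer."""
    pred = extract_final_number(predicted_text)
    return pred is not None and abs(pred - float(gold)) <= tol
```

A tolerance is used rather than string equality so that `42`, `42.0`, and small floating-point noise all score as correct.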
272 | 334 | { |
273 | 335 | "cell_type": "code", |
274 | 336 | "execution_count": null, |
|
303 | 365 | "outputs": [], |
304 | 366 | "source": [] |
305 | 367 | }, |
| 368 | + { |
| 369 | + "cell_type": "markdown", |
| 370 | + "id": "07134dea", |
| 371 | + "metadata": {}, |
| 372 | + "source": [ |
| 373 | + "## Baseline Evaluation\n", |
| 374 | + "\n", |
| 375 | + "Evaluate the baseline Chain-of-Thought program to establish our starting accuracy before optimization." |
| 376 | + ] |
| 377 | + }, |
306 | 378 | { |
307 | 379 | "cell_type": "code", |
308 | 380 | "execution_count": null, |
|
357 | 429 | ")\n" |
358 | 430 | ] |
359 | 431 | }, |
| 432 | + { |
| 433 | + "cell_type": "markdown", |
| 434 | + "id": "e5fe6dd8", |
| 435 | + "metadata": {}, |
| 436 | + "source": [ |
| 437 | + "## GEPA Optimization\n", |
| 438 | + "\n", |
| 439 | + "Apply the GEPA optimizer, which uses error-driven feedback to refine the prompt automatically and boost accuracy."
| 440 | + ] |
| 441 | + }, |
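The optimization step can be sketched as follows. This assumes a recent DSPy release exposing `dspy.GEPA`, and that `program`, `train_set`, `val_set`, and a feedback-returning `metric` exist as in the surrounding cells; the `auto="light"` budget and the reflection model choice are assumptions:

```python
import dspy

optimizer = dspy.GEPA(
    metric=metric,      # should return a score plus textual feedback on errors
    auto="light",       # preset controlling the optimization budget
    reflection_lm=dspy.LM("ollama_chat/qwen3:8b"),  # model that rewrites prompts
)

optimized_program = optimizer.compile(program, trainset=train_set, valset=val_set)

# Inspect the evolved instructions, as the notebook does below.
print(optimized_program.predict.signature.instructions)
```

GEPA uses the textual feedback from the metric, not just the score, to propose targeted prompt revisions, which is why the metric's error descriptions matter.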
360 | 442 | { |
361 | 443 | "cell_type": "code", |
362 | 444 | "execution_count": null, |
|
381 | 463 | "print(optimized_program.predict.signature.instructions)" |
382 | 464 | ] |
383 | 465 | }, |
| 466 | + { |
| 467 | + "cell_type": "markdown", |
| 468 | + "id": "74c7476f", |
| 469 | + "metadata": {}, |
| 470 | + "source": [ |
| 471 | + "## Optimized Program Evaluation\n", |
| 472 | + "\n", |
| 473 | + "Evaluate the GEPA-optimized program on the same test set to measure the accuracy improvement over the baseline."
| 474 | + ] |
| 475 | + }, |
384 | 476 | { |
385 | 477 | "cell_type": "code", |
386 | 478 | "execution_count": null, |
|
393 | 485 | } |
394 | 486 | ], |
395 | 487 | "metadata": { |
| 488 | + "accelerator": "GPU", |
| 489 | + "colab": { |
| 490 | + "gpuType": "L4", |
| 491 | + "provenance": [] |
| 492 | + }, |
396 | 493 | "kernelspec": { |
397 | | - "display_name": "behrooz", |
| 494 | + "display_name": "Python 3", |
398 | 495 | "language": "python", |
399 | 496 | "name": "python3" |
400 | 497 | }, |
|
408 | 505 | "name": "python", |
409 | 506 | "nbconvert_exporter": "python", |
410 | 507 | "pygments_lexer": "ipython3", |
411 | | - "version": "3.11.11" |
| 508 | + "version": "3.11.0" |
412 | 509 | } |
413 | 510 | }, |
414 | 511 | "nbformat": 4, |
|