diff --git a/docs/docs/api/optimizers/BootstrapFewShotWithRandomSearch.md b/docs/docs/api/optimizers/BootstrapFewShotWithRandomSearch.md
index f2df1530b2..889c887699 100644
--- a/docs/docs/api/optimizers/BootstrapFewShotWithRandomSearch.md
+++ b/docs/docs/api/optimizers/BootstrapFewShotWithRandomSearch.md
@@ -1,5 +1,9 @@
# dspy.BootstrapFewShotWithRandomSearch
+`BootstrapFewShotWithRandomSearch` is a prompt optimizer that automatically discovers the best few-shot examples for your language model. Instead of manually guessing which examples to include in your prompts, this optimizer **bootstraps new examples using your model's own successful outputs** and uses intelligent random search to find the combination that delivers the best performance.
+
+For tasks where you have training data available, this optimizer can significantly boost your model's performance by finding the most effective demonstrations to include in your prompts.
+
::: dspy.BootstrapFewShotWithRandomSearch
handler: python
@@ -16,3 +20,369 @@
separate_signature: false
inherited_members: true
+
+## The Problem We're Solving
+
+Here's a challenge every prompt engineer faces: **which examples should you include in your few-shot prompts?**
+
+When you're building a language model application, you know that good examples in your prompt can dramatically improve performance. But choosing the right examples is surprisingly difficult:
+
+1. **Manual selection is hit-or-miss**: You might pick examples that seem good to you but don't actually help the model learn the pattern
+2. **Limited perspective**: Your training data might not cover all the scenarios where good demonstrations would help
+3. **No systematic evaluation**: Without testing different combinations, you're essentially guessing
+4. **Time-consuming iteration**: Manually trying different example sets and evaluating them is slow and tedious
+
+This creates several specific problems:
+
+- **Suboptimal demonstrations**: The examples you choose might not be the best teachers for your model
+- **Missing coverage**: You might not have examples that show the model how to handle edge cases
+- **Wasted potential**: Your model could perform much better with the right examples, but finding them manually is impractical
+- **Inconsistent results**: Different people might choose different examples, leading to unpredictable performance
+
+## The BootstrapFewShotWithRandomSearch Solution
+
+BootstrapFewShotWithRandomSearch solves this by **automating the entire example selection process**. Think of it as having an expert prompt engineer that:
+
+- **Generates additional high-quality examples** by letting your model solve problems and keeping the ones it gets right
+- **Tests many different combinations** of examples systematically
+- **Finds the optimal set** that makes your model perform best on validation data
+- **Uses smart random search** to explore the space of possible prompts efficiently
+
+Instead of you manually crafting example sets, this optimizer uses your model as both teacher and student - it generates new examples with the model, then finds the best combination to teach that same model.
+
+## How BootstrapFewShotWithRandomSearch Works
+
+Let's break down the process into three main stages:
+
+### Stage 1: Building a Pool of Candidate Examples
+
+**What happens**: The optimizer gathers potential examples to use in prompts.
+
+**How it works**:
+
+- **Labeled examples**: Takes examples directly from your training data (up to `max_labeled_demos`)
+- **Bootstrapped examples**: Uses your model as a "teacher" to solve training problems, keeping only the ones it gets right (up to `max_bootstrapped_demos`)
+
+**Why this matters**: By combining real examples with model-generated ones, you get a richer pool of demonstrations that covers more scenarios and shows correct reasoning patterns.
+
+**Example**: For a math word problem task, you might have 5 real examples from your training set, plus 3 additional examples where your model solved problems correctly with step-by-step reasoning.
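+
+In code, the filtering idea at the heart of bootstrapping looks roughly like this. This is a simplified sketch, not DSPy's internal implementation; `program`, `trainset`, and `metric` follow the conventions used in the examples below:
+
+```python
+def bootstrap_demos(program, trainset, metric, max_demos):
+    """Keep only the teacher's successful attempts as demo candidates."""
+    demos = []
+    for example in trainset:
+        prediction = program(**example.inputs())   # teacher attempts the problem
+        if metric(example, prediction):            # keep it only if it passes your metric
+            demos.append((example, prediction))    # the input plus the model's own output
+        if len(demos) >= max_demos:
+            break
+    return demos
+```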
+
+### Stage 2: Creating Multiple Candidate Prompts
+
+**What happens**: The optimizer creates many different prompt variants by combining examples in different ways.
+
+**How it works**:
+
+- Creates `num_candidate_programs` different prompts, each with a different selection and ordering of examples
+- Uses randomization to explore different combinations you might never think to try manually
+- Always includes some baseline comparisons (like no examples, or just labeled examples)
+
+**The key insight**: Instead of betting everything on one example selection, the optimizer hedges by trying many different approaches simultaneously.
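+
+A minimal sketch of the candidate-creation step (illustrative only; DSPy's actual sampler also varies demo counts and orderings in ways this sketch simplifies):
+
+```python
+import random
+
+def make_candidate_demo_sets(demo_pool, num_candidate_programs, max_demos, seed=0):
+    rng = random.Random(seed)
+    candidates = [[]]  # always keep a zero-shot baseline for comparison
+    for _ in range(num_candidate_programs - 1):
+        size = rng.randint(1, min(max_demos, len(demo_pool)))
+        candidates.append(rng.sample(demo_pool, size))  # random subset, random order
+    return candidates
+```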
+
+### Stage 3: Testing and Selecting the Best Prompt
+
+**What happens**: Each candidate prompt is evaluated on validation data to find the winner.
+
+**How it works**:
+
+1. **Evaluation**: Each prompt variant is tested on your validation set using your metric
+2. **Comparison**: The optimizer compares performance across all candidates
+3. **Selection**: The prompt that achieves the highest score becomes your optimized program
+4. **Validation**: This ensures the chosen prompt generalizes well, not just fits the training data
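+
+The selection step can be pictured as a straightforward scoring loop (a sketch that assumes `candidate_programs`, `validation_data`, and `accuracy_metric` are defined as in the examples below, and that `dspy.Evaluate` returns an average metric score):
+
+```python
+import dspy
+
+evaluate = dspy.Evaluate(devset=validation_data, metric=accuracy_metric, num_threads=4)
+scores = [evaluate(program) for program in candidate_programs]
+best_program = candidate_programs[scores.index(max(scores))]
+```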
+
+## The Technical Details (For the Curious)
+
+### The Optimization Problem
+
+From a formal perspective, BootstrapFewShotWithRandomSearch is solving this optimization challenge:
+
+$$S^* = \arg\max_{S \subseteq D} \; M(\text{model prompted with } S,\; \text{validation data})$$
+
+Where:
+- $S$ is a set of examples to include in the prompt
+- $D$ is your pool of available examples (labeled + bootstrapped)
+- $M$ is your evaluation metric
+- We want to find the $S$ that gives the highest score
+
+**The challenge**: This is computationally hard because there are exponentially many possible subsets to choose from.
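+
+To make that concrete: with a pool of just $|D| = 20$ candidate examples and prompts holding at most 4 of them, there are already $\sum_{k=0}^{4} \binom{20}{k} = 6{,}196$ possible subsets before ordering is even considered, far too many to evaluate exhaustively.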
+
+### The Random Search Approach
+
+Instead of trying every combination (which would be impossibly expensive), the optimizer uses **intelligent random search**:
+
+1. **Smart sampling**: Generates multiple candidate example sets using randomization
+2. **Parallel evaluation**: Tests many candidates simultaneously to find good solutions efficiently
+3. **Best selection**: Picks the candidate that performs best on validation data
+
+**Why this works**: Random search is surprisingly effective for this type of problem, especially when you can evaluate many candidates in parallel. It avoids the local optima that greedy approaches might get stuck in.
+
+### The Bootstrapping Process
+
+The "bootstrapping" aspect works like this:
+
+1. **Teacher generation**: Your model (acting as teacher) attempts to solve training problems
+2. **Quality filtering**: Only solutions that pass your metric become examples
+3. **Demonstration creation**: Correct solutions (including reasoning steps) become new few-shot examples
+
+This creates a **positive feedback loop**: the model generates examples of its own successful problem-solving, which then help it solve similar problems even better.
+
+### Multi-Round Refinement (Optional)
+
+For `max_rounds > 1`, the process iterates:
+
+- **Round 1**: Generate initial bootstrapped examples with base model
+- **Round 2+**: Use the improved model from previous round as teacher, potentially finding even better examples
+- **Progressive improvement**: Each round can discover new successful patterns
+
+## Using BootstrapFewShotWithRandomSearch in Practice
+
+### Example 1: Text Classification Optimization
+
+Let's start with a simple classification task:
+
+```python
+import dspy
+from dspy.teleprompt import BootstrapFewShotWithRandomSearch
+
+# Define your classification task
+TOPICS = ["sports", "politics", "technology"]
+classifier = dspy.Predict(f"text -> topic: Literal[{', '.join(repr(t) for t in TOPICS)}]")
+
+# Define how to measure success
+def accuracy_metric(example, prediction, trace=None) -> bool:
+ return example.topic == prediction.topic
+
+# Set up the optimizer
+optimizer = BootstrapFewShotWithRandomSearch(
+ metric=accuracy_metric,
+ max_bootstrapped_demos=3, # Up to 3 model-generated examples
+ max_labeled_demos=3, # Up to 3 real examples
+ num_candidate_programs=8, # Try 8 different prompt variants
+ max_rounds=1 # Single round of bootstrapping
+)
+
+# Optimize your classifier
+optimized_classifier = optimizer.compile(
+ classifier,
+ trainset=train_data,
+ valset=validation_data
+)
+```
+
+**What this does**:
+
+- Takes your basic classifier and training data
+- Generates up to 3 additional examples by having the model classify training texts correctly
+- Creates 8 different prompt variants with different example combinations
+- Tests each variant and returns the best-performing one
+
+### Example 2: Math Problem Solving with Chain-of-Thought
+
+For more complex reasoning tasks:
+
+```python
+import dspy
+from dspy import ChainOfThought, Module
+
+# Configure your language model
+dspy.configure(lm=dspy.LM(model='openai/gpt-4o', max_tokens=300))
+
+# Define a chain-of-thought math solver
+class MathSolver(Module):
+    def __init__(self):
+        super().__init__()
+        self.solver = ChainOfThought("question -> answer")
+
+ def forward(self, question):
+ return self.solver(question=question)
+
+# Define success metric for math problems
+def math_accuracy(example, prediction, trace=None) -> bool:
+ return str(example.answer).strip() == str(prediction.answer).strip()
+
+# Set up more thorough optimization
+optimizer = BootstrapFewShotWithRandomSearch(
+ metric=math_accuracy,
+ max_bootstrapped_demos=5, # More examples for complex reasoning
+ max_labeled_demos=3, # Some real examples as foundation
+ num_candidate_programs=12, # Try more variants for better results
+ max_rounds=2, # Two rounds for iterative improvement
+ num_threads=4 # Parallelize for speed
+)
+
+# Optimize the math solver
+optimized_solver = optimizer.compile(
+ MathSolver(),
+ trainset=math_problems,
+ valset=validation_problems
+)
+```
+
+**What this does**:
+
+- Uses chain-of-thought reasoning to solve math problems step-by-step
+- Generates examples where the model shows correct reasoning patterns
+- Tries 12 different combinations across 2 rounds of refinement
+- Returns a solver with optimized demonstrations that improve reasoning
+
+### Key Parameters Explained
+
+- **`max_labeled_demos`**: Maximum real examples from your training data to include
+- **`max_bootstrapped_demos`**: Maximum model-generated examples to create
+- **`num_candidate_programs`**: How many different prompt variants to test (more = better results but higher cost)
+- **`max_rounds`**: Number of iterative improvement rounds (1 is usually sufficient)
+- **`num_threads`**: Parallel evaluation threads (higher = faster but more resource usage)
+- **`metric`**: Function that determines what "success" means for your task
+
+## What You Can Expect
+
+### The Good News
+
+**Significant Performance Improvements**:
+
+- Typical improvements range from 10-25% accuracy boost over unoptimized prompts
+- Works especially well for reasoning tasks where step-by-step examples help
+- Often discovers example combinations that perform better than intuitive manual choices
+
+**Automated Discovery**:
+
+- Finds effective example combinations you might not think of manually
+- Generates high-quality demonstrations by keeping only the model's successful attempts
+- Adapts to your specific task and data characteristics
+
+**Practical Benefits**:
+
+- **Time-saving**: Eliminates manual trial-and-error in example selection
+- **Systematic**: Evaluates options objectively using your chosen metric
+- **Scalable**: Can handle large datasets and complex reasoning tasks
+
+### The Realistic Expectations
+
+**Cost Considerations**:
+
+- **Time**: Typically takes 10-30 minutes depending on settings and data size
+- **API calls**: Makes many model calls during optimization (budget accordingly)
+- **Compute**: Benefits from parallel processing when available
+
+**Performance Factors**:
+
+- **Works best with sufficient data**: Needs enough training examples to bootstrap from (ideally 20+ examples)
+- **Depends on base model capability**: If your model can't solve training problems correctly, bootstrapping won't generate good examples
+- **Quality varies by task**: More effective for tasks where examples significantly help (like reasoning, complex formatting)
+
+**Not Magic**:
+
+- **Won't fix fundamental issues**: Can't overcome poor model choice or impossible tasks
+- **Metric-dependent**: Only as good as your evaluation function
+- **May overfit**: Can sometimes find examples too specific to validation data
+
+## Strengths and Limitations
+
+### Key Strengths
+
+**Automatic Example Discovery**:
+
+- Eliminates the guesswork in selecting few-shot examples
+- Uses the model's own successful outputs as teaching examples
+- Systematically explores combinations you might miss manually
+
+**Effective Search Strategy**:
+
+- Random search is simple but surprisingly powerful for this problem
+- Avoids local optima that greedy selection might get stuck in
+- Embarrassingly parallel - can evaluate many candidates simultaneously
+
+**Quality Assurance**:
+
+- Only includes bootstrapped examples that pass your quality metric
+- Validates final selection on held-out data to ensure generalization
+- Prevents overfitting to specific training examples
+
+**Flexibility**:
+
+- Works with any DSPy module and task type
+- Supports custom metrics for different quality measures
+- Can be combined with different base models and reasoning strategies
+
+### Key Limitations
+
+**Computational Cost**:
+
+- Requires many model evaluations during optimization
+- Can be expensive for large models or extensive search
+- Time scales with number of candidates and validation data size
+
+**Bootstrap Dependency**:
+
+- Effectiveness limited by base model's ability to solve training problems
+- Very weak models may not generate useful bootstrapped examples
+- Very strong models might not benefit much from few-shot examples
+
+**Search Limitations**:
+
+- Random search doesn't guarantee finding the global optimum
+- May miss good combinations that require more sophisticated search
+- No learning from previous trials to guide future searches
+
+**Data Requirements**:
+
+- Needs sufficient training data to bootstrap effectively
+- Requires representative validation data for proper selection
+- Quality depends on having a meaningful evaluation metric
+
+## Best Practices and Tips
+
+### Setting Up for Success
+
+1. **Start with good training data**: Ensure your examples are representative and high-quality
+2. **Choose meaningful metrics**: Your evaluation function should capture what you actually care about
+3. **Begin conservatively**: Start with fewer candidates and rounds, then scale up if promising
+4. **Monitor costs**: Keep track of API usage during optimization
+
+### Common Pitfalls to Avoid
+
+1. **Insufficient validation data**: Too small validation sets lead to unreliable optimization
+2. **Poor metric design**: Metrics that don't reflect real performance goals mislead optimization
+3. **Over-optimization**: Running too many rounds or candidates can lead to overfitting
+4. **Ignoring base performance**: Not checking if optimization actually improved over baseline
+
+## Comparison with Other Optimizers
+
+### vs. Manual Example Selection
+
+- **BootstrapFewShotWithRandomSearch**: Systematic, objective, discovers non-obvious combinations
+- **Manual**: Intuitive but subjective, time-consuming, limited exploration
+
+### vs. Simple BootstrapFewShot (without random search)
+
+- **With Random Search**: Tests multiple combinations, more robust results
+- **Without Random Search**: Single attempt, may get unlucky with initial selection
+
+### vs. MIPROv2 or Bayesian Optimizers
+
+- **BootstrapFewShotWithRandomSearch**: Simpler, more straightforward, good baseline performance
+- **Advanced optimizers**: More sample-efficient, can optimize instructions too, but more complex
+
+## When to Use BootstrapFewShotWithRandomSearch
+
+### Great For:
+
+- **Tasks where examples significantly help**: Complex reasoning, specific formatting, nuanced classification
+- **When you have sufficient training data**: At least 20-50 examples to bootstrap from
+- **Systematic optimization needs**: When manual example selection is too time-consuming
+- **Performance-critical applications**: Where the optimization cost is justified by improved results
+
+### Consider Alternatives When:
+
+- **Very limited data**: Fewer than 10-20 examples may not provide enough bootstrapping material
+- **Simple tasks**: Basic classification or generation where examples don't help much
+- **Tight resource constraints**: When optimization cost exceeds the value of improvement
+- **Already high performance**: If your current approach achieves 95%+ on your metric
+
+### Getting Started
+
+1. **Prepare your data**: Ensure you have training and validation sets
+2. **Define your metric**: Create a function that measures what success means for your task
+3. **Start small**: Begin with `num_candidate_programs=5` and `max_rounds=1`
+4. **Evaluate results**: Test the optimized program on held-out data
+5. **Scale up if promising**: Increase parameters for potentially better results
+
+BootstrapFewShotWithRandomSearch represents a powerful middle ground in prompt optimization - more sophisticated than manual selection, simpler than advanced Bayesian methods, and effective across a wide range of tasks. When you have good training data and clear success metrics, it can deliver substantial improvements with relatively straightforward setup and reasonable computational cost.
diff --git a/docs/docs/api/optimizers/MIPROv2.md b/docs/docs/api/optimizers/MIPROv2.md
index 34a6f2eaa2..b4d46dcfaf 100644
--- a/docs/docs/api/optimizers/MIPROv2.md
+++ b/docs/docs/api/optimizers/MIPROv2.md
@@ -2,6 +2,8 @@
`MIPROv2` (Multiprompt Instruction PRoposal Optimizer Version 2) is a prompt optimizer capable of optimizing both instructions and few-shot examples jointly. It does this by bootstrapping few-shot example candidates, proposing instructions grounded in different dynamics of the task, and finding an optimized combination of these options using Bayesian Optimization. It can be used for optimizing few-shot examples & instructions jointly, or just instructions for 0-shot optimization.
+For those interested in more details, more information on `MIPROv2` along with a study on `MIPROv2` compared with other DSPy optimizers can be found in [this paper](https://arxiv.org/abs/2406.11695).
+
::: dspy.MIPROv2
handler: python
@@ -19,50 +21,309 @@
inherited_members: true
-## Example Usage
+## The Problem We're Solving
+
+Let's start with a specific challenge in prompt optimization: **existing optimizers often only tackle half the problem**.
+
+Here's what typically happens with traditional prompt optimization approaches:
+
+1. **Bootstrap-style optimizers** focus primarily on generating good few-shot examples, but leave instruction wording largely unchanged
+2. **Instruction-only optimizers** improve the task description but don't systematically select the best examples to include
+3. **Manual approaches** require you to separately optimize instructions and examples, missing potential synergies between them
+4. **Simple search methods** test combinations randomly or greedily, often missing better solutions
+
+This creates several specific problems:
+
+- **Suboptimal combinations**: The best instruction might need different examples than you'd expect, and vice versa
+- **Inefficient search**: Random or exhaustive testing wastes compute on unlikely-to-succeed combinations
+- **Limited exploration**: Without systematic candidate generation, you might never discover effective instruction variants
+- **Fragmented optimization**: Optimizing pieces separately misses the interactions between instruction wording and example selection
+
+## The MIPROv2 Solution
+
+MIPROv2 solves this by **automating the entire prompt optimization process**. Think of it as having an AI assistant that:
+
+- **Generates multiple prompt variations** for you to try
+- **Tests each variation systematically** on your data
+- **Learns from the results** to propose even better variations
+- **Finds the optimal combination** of instruction wording and examples
+
+Instead of you manually crafting prompts, MIPROv2 uses your language model itself to suggest improvements, then uses smart search algorithms to find the best combination.
+
+## How MIPROv2 Works Under the Hood
+
+Let's break down MIPROv2's process into three main stages:
+
+### Stage 1: Gathering Building Blocks
+
+**What happens**: MIPROv2 collects potential few-shot examples to use in prompts.
+
+**How it works**:
+
+- **Labeled examples**: Takes some of your existing training examples directly
+- **Bootstrapped examples**: Runs your current program on training inputs to generate new examples, but only keeps the ones that score well on your metric
+
+**Why this matters**: Having a diverse pool of high-quality examples gives MIPROv2 raw materials to build effective prompts.
+
+### Stage 2: Generating Instruction Candidates
+
+**What happens**: MIPROv2 creates multiple versions of the instruction text.
+
+**How it works**:
+
+- Uses your language model to propose different ways to phrase the task instructions
+- Grounds these proposals in the actual examples and program context
+- Generates variations that might emphasize different aspects (e.g., step-by-step reasoning vs. brevity)
+
+**Example**:
+
+For a Q&A task, it might generate:
+
+- "Answer the question step by step, showing your reasoning"
+- "Provide a concise, accurate answer to the question"
+- "Based on the given context, answer the following question"
-The program below shows optimizing a math program with MIPROv2
+### Stage 3: Smart Search for the Best Combination
+
+**What happens**: MIPROv2 systematically tests different instruction + example combinations.
+
+**How it works**:
+
+1. **Initial trials**: Tests various combinations randomly to get baseline data
+2. **Learning phase**: Builds a model of what makes prompts successful
+3. **Guided search**: Uses Bayesian optimization to focus on promising combinations
+4. **Refinement**: Continues testing until it finds the best-performing prompt
+
+**The key insight**: Instead of trying every possible combination (which would be too expensive), MIPROv2 uses smart search to focus on the most promising options.
+
+## The Technical Details (For the Curious)
+
+### The Math Behind It
+
+From a formal perspective, MIPROv2 is solving this optimization problem:
+
+$$\max_{\theta} \; M(\text{Program}_\theta)$$
+
+Where:
+
+- $\theta$ represents all the prompt parameters (instructions + examples)
+- $M$ is your evaluation metric (like accuracy)
+- We want to find the $\theta$ that gives the highest score
+
+**The challenge**: This is a "black-box" optimization problem because:
+
+- We can't take gradients (the metric isn't differentiable)
+- Small changes in prompts can have unpredictable effects
+- We need to balance exploration (trying new things) with exploitation (refining what works)
+
+### The Bayesian Optimization Approach
+
+MIPROv2 tackles this using **Bayesian optimization**, which works like this:
+
+1. **Build a surrogate model**: Creates a statistical model that predicts how well a prompt will perform based on past evaluations
+2. **Acquisition function**: Uses this model to decide which prompt to test next (balancing trying promising options vs. exploring unknowns)
+3. **Update and repeat**: After each test, updates the model and selects the next candidate
+
+**Why this works**: Bayesian optimization is particularly good at handling noisy evaluations (which language models produce) and finding good solutions with relatively few trials.
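+
+Current DSPy releases implement this search with the Optuna library. Here is a standalone sketch of the idea, where `instruction_candidates`, `demo_candidates`, and the `evaluate` helper are assumed to exist:
+
+```python
+import optuna
+
+def objective(trial):
+    i = trial.suggest_int("instruction", 0, len(instruction_candidates) - 1)
+    j = trial.suggest_int("demos", 0, len(demo_candidates) - 1)
+    return evaluate(instruction_candidates[i], demo_candidates[j])  # metric score
+
+study = optuna.create_study(direction="maximize")  # TPE sampler by default
+study.optimize(objective, n_trials=15)
+print(study.best_params)
+```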
+
+### The Meta-Learning Aspect
+
+An advanced feature is that MIPROv2 can **learn how to propose better instructions over time**. As it discovers what types of instructions work well for your task, it can bias future proposals toward similar patterns.
+
+## Using MIPROv2 in Practice
+
+### Example 1: Zero-Shot Optimization (Instructions Only)
+
+Let's say you want to optimize just the instruction without adding examples:
```python
-import dspy
-from dspy.datasets.gsm8k import GSM8K, gsm8k_metric
+from dspy.teleprompt import MIPROv2
+
+# Set up the optimizer
+optimizer = MIPROv2(metric=accuracy_metric, auto="light")
-# Import the optimizer
+# Optimize only instructions (no examples)
+optimized_program = optimizer.compile(
+ qa_system,
+ trainset=train_data,
+ max_bootstrapped_demos=0, # No AI-generated examples
+ max_labeled_demos=0, # No manual examples
+ requires_permission_to_run=False
+)
+```
+
+**What this does**:
+
+- Takes your basic Q&A system
+- Generates different instruction wordings
+- Tests them to find the best one
+- Returns an improved version with better instructions
+
+### Example 2: Full Optimization (Instructions + Examples)
+
+For maximum performance, optimize both instructions and examples:
+
+```python
from dspy.teleprompt import MIPROv2
-# Initialize the LM
-lm = dspy.LM('openai/gpt-4o-mini', api_key='YOUR_OPENAI_API_KEY')
-dspy.configure(lm=lm)
+# Configure for more thorough optimization
+optimizer = MIPROv2(
+ metric=accuracy_metric,
+ num_candidates=7, # Generate 7 instruction variants
+ init_temperature=0.5 # Control randomness in proposals
+)
-# Initialize optimizer
-teleprompter = MIPROv2(
- metric=gsm8k_metric,
- auto="medium", # Can choose between light, medium, and heavy optimization runs
+# Run full optimization
+optimized_program = optimizer.compile(
+ qa_system.deepcopy(),
+ trainset=train_data,
+ max_bootstrapped_demos=3, # Up to 3 AI-generated examples
+ max_labeled_demos=4, # Up to 4 manual examples
+ num_trials=15, # Test 15 different combinations
+ minibatch_size=25, # Test each on 25 examples
+ requires_permission_to_run=False
)
+```
+
+**What this does**:
+
+- Creates 7 different instruction variants
+- Allows up to 7 examples total in the prompt
+- Tests 15 different combinations intelligently
+- Returns the best-performing combination
+
+### Key Parameters Explained
+
+- **`auto`**: Quick presets (`"light"`, `"medium"`, `"heavy"`) that balance speed vs. thoroughness
+- **`num_candidates`**: How many instruction variations to generate
+- **`num_trials`**: How many combinations to test (more trials = better results but higher cost)
+- **`minibatch_size`**: How many examples to test each combination on
+- **`max_bootstrapped_demos`**: Maximum AI-generated examples to include
+- **`max_labeled_demos`**: Maximum manual examples to include
+
+## What You Can Expect
+
+### The Good News
+
+**Significant Performance Improvements**: MIPROv2 often delivers substantial gains:
+
+- Typical improvements range from 5-15% accuracy boost
+- Some cases see even larger improvements (up to 20%+ in favorable conditions)
+- Works well even with limited training data
+
+**Automated Discovery**: MIPROv2 often finds prompt strategies you wouldn't think of:
+
+- Novel instruction phrasings that work better than obvious approaches
+- Unexpected combinations of examples that complement each other well
+- Task-specific optimizations tailored to your exact use case
+
+**Flexible Application**: Works for both:
+
+- **Zero-shot tasks**: Where you just want better instructions
+- **Few-shot tasks**: Where examples significantly help performance
+
+### The Realistic Expectations
+
+**Cost Considerations**:
+
+- **Time**: Light runs take ~5-10 minutes, medium runs ~20-30 minutes, heavy runs can take hours
+- **Compute**: Benefits from parallel processing if available
-# Optimize program
-print(f"Optimizing program with MIPROv2...")
-gsm8k = GSM8K()
-optimized_program = teleprompter.compile(
- dspy.ChainOfThought("question -> answer"),
- trainset=gsm8k.train,
- requires_permission_to_run=False,
+**Not Magic**: MIPROv2 has limitations:
+
+- **Can't fix fundamental model limitations**: If your base model isn't capable enough, even perfect prompts won't solve everything
+- **Depends on good metrics**: The optimizer is only as good as the evaluation function you provide
+- **May overfit**: Can sometimes create prompts too specific to your training examples
+
+**Quality Varies**: Results depend on:
+
+- How much room for improvement exists in your initial prompt
+- Quality and representativeness of your training data
+- Appropriateness of your evaluation metric
+- The specific task type
+
+## Best Practices and Tips
+
+### Setting Up for Success
+
+1. **Start with representative data**: Your training set should reflect real-world usage
+2. **Choose good metrics**: Use evaluation functions that capture what you actually care about
+3. **Begin with light runs**: Start with `auto="light"` to get quick wins before investing in heavier optimization
+4. **Monitor for overfitting**: Review the optimized prompts to ensure they're not too specific to your training data
+
+### Common Pitfalls to Avoid
+
+1. **Inadequate training data**: Too few or non-representative examples lead to poor optimization
+2. **Wrong metrics**: Optimizing for the wrong thing (e.g., brevity when you need accuracy)
+3. **Insufficient trials**: Stopping optimization too early before finding good solutions
+4. **Ignoring costs**: Running unnecessarily expensive optimization when lighter approaches would suffice
+
+### Advanced Usage Tips
+
+**Use different models for different purposes**:
+```python
+# Use a strong model for generating instructions, smaller one for the task
+optimizer = MIPROv2(
+ metric=accuracy_metric,
+ prompt_model=large_model, # Strong model for creative instruction generation
+ task_model=small_model # Efficient model for actual task execution
)
+```
-# Save optimize program for future use
-optimized_program.save(f"optimized.json")
+**Inspect optimization results**:
+```python
+# Save and examine the optimized program
+optimized_program.save("my_optimized_prompt.json")
+print("Final instruction:", optimized_program.predictors()[0].signature.instructions)
+print("Examples used:", optimized_program.predictors()[0].demos)
```
-## How `MIPROv2` works
+## Understanding the Results
-At a high level, `MIPROv2` works by creating both few-shot examples and new instructions for each predictor in your LM program, and then searching over these using Bayesian Optimization to find the best combination of these variables for your program. If you want a visual explanation check out this [twitter thread](https://x.com/michaelryan207/status/1804189184988713065).
+### What MIPROv2 Gives You
-These steps are broken down in more detail below:
+After optimization, you get:
-1) **Bootstrap Few-Shot Examples**: Randomly samples examples from your training set, and run them through your LM program. If the output from the program is correct for this example, it is kept as a valid few-shot example candidate. Otherwise, we try another example until we've curated the specified amount of few-shot example candidates. This step creates `num_candidates` sets of `max_bootstrapped_demos` bootstrapped examples and `max_labeled_demos` basic examples sampled from the training set.
+- **An improved DSPy program**: Drop-in replacement for your original with better prompts
+- **Optimized instructions**: Better-worded task descriptions
+- **Curated examples**: Carefully selected few-shot demonstrations
+- **Performance metrics**: Data on how much improvement was achieved
-2) **Propose Instruction Candidates**. The instruction proposer includes (1) a generated summary of properties of the training dataset, (2) a generated summary of your LM program's code and the specific predictor that an instruction is being generated for, (3) the previously bootstrapped few-shot examples to show reference inputs / outputs for a given predictor and (4) a randomly sampled tip for generation (i.e. "be creative", "be concise", etc.) to help explore the feature space of potential instructions. This context is provided to a `prompt_model` which writes high quality instruction candidates.
+### Interpreting Success
-3) **Find an Optimized Combination of Few-Shot Examples & Instructions**. Finally, we use Bayesian Optimization to choose which combinations of instructions and demonstrations work best for each predictor in our program. This works by running a series of `num_trials` trials, where a new set of prompts are evaluated over our validation set at each trial. The new set of prompts are only evaluated on a minibatch of size `minibatch_size` at each trial (when `minibatch`=`True`). The best averaging set of prompts is then evalauted on the full validation set every `minibatch_full_eval_steps`. At the end of the optimization process, the LM program with the set of prompts that performed best on the full validation set is returned.
+**Good signs**:
-For those interested in more details, more information on `MIPROv2` along with a study on `MIPROv2` compared with other DSPy optimizers can be found in [this paper](https://arxiv.org/abs/2406.11695).
+- Consistent improvement across validation examples
+- Instructions that make intuitive sense for your task
+- Examples that are representative and high-quality
+
+**Warning signs**:
+
+- Instructions that are overly specific to training data
+- Inconsistent performance across different test sets
+- Examples that don't generalize well
+
+## Conclusion: When and How to Use MIPROv2
+
+### MIPROv2 is Great For:
+
+- **Complex multi-step programs**: Where small prompt improvements compound across steps
+- **Tasks with clear success metrics**: Where you can easily measure what "better" means
+- **Scenarios where manual optimization is time-consuming**: Especially complex domains where good prompts aren't obvious
+- **When you have some training data**: Even a small representative set helps significantly
+
+### Consider Alternatives When:
+
+- **Your current prompts already work well**: If you're getting 95%+ accuracy, optimization might not be worth the cost
+- **You have very limited data**: With fewer than ~10-20 examples, results may be unreliable
+- **Time/cost constraints are tight**: Manual tweaking might be faster for simple cases
+- **Your task is very simple**: Basic tasks might not benefit much from sophisticated optimization
+
+### Getting Started
+
+1. **Try a light run first**: Use `auto="light"` to get a feel for potential improvements
+2. **Evaluate the results**: Test the optimized program on held-out data
+3. **Scale up if promising**: Move to medium or heavy runs if initial results are encouraging
+4. **Iterate and refine**: Use insights from optimization to inform further improvements
+
+MIPROv2 represents a significant advance in automated prompt engineering. When used appropriately, it can save substantial time while delivering better results than manual optimization. The key is understanding when and how to apply it effectively for your specific use case.
diff --git a/docs/docs/design-principles.md b/docs/docs/design-principles.md
new file mode 100644
index 0000000000..2f143a3a19
--- /dev/null
+++ b/docs/docs/design-principles.md
@@ -0,0 +1,157 @@
+# DSPy Philosophy and Design Principles
+
+> This document has been consolidated from discussions with the DSPy team, including [Omar Khattab](https://x.com/lateinteraction) and other core contributors, as well as from the official DSPy documentation, [Twitter](https://x.com/DSPyOSS) and community insights.
+
+DSPy is built on a simple idea: building with LLMs should feel like programming, not guessing at prompts. Instead of crafting brittle prompt strings through trial-and-error, DSPy lets you write structured, modular code that describes what you want the AI to do.
+
+This approach brings core software engineering principles – modularity, abstraction, and clear contracts – to AI development. At its heart, DSPy can be thought of as "compiling declarative AI functions into LM calls, with Signatures, Modules, and Optimizers." It's like having a compiler for LLM-based programs.
+
+By focusing on information flow and high-level contracts rather than hardcoded wording, DSPy aims to future-proof your AI programs against the fast-evolving landscape of models and techniques.
+
+## The Foundation: Signatures, Modules, and Optimizers
+
+Any robust LLM programming framework needs stable, high-level abstractions. DSPy provides three core building blocks:
+
+### Signatures: What, Not How
+
+A Signature is a declarative specification of a task's inputs, outputs, and intent. It tells the LM what it needs to do without prescribing how to do it.
+
+```python
+# This signature defines the contract, not the implementation
+question_answer = dspy.Predict("question -> answer")
+```
+
+Think of it like a function signature in traditional coding – you define the input and output fields with semantic names, describing the interface of an LM-powered function. By separating what from how, Signatures let you focus on the information that flows through your system rather than exact prompt wording.
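+
+When a one-line signature isn't descriptive enough, the same contract can be written as a class with documented fields (standard DSPy; the field names here are illustrative):
+
+```python
+import dspy
+
+class EmailSummary(dspy.Signature):
+    """Summarize an email in one or two sentences."""
+    email = dspy.InputField(desc="the full email body")
+    summary = dspy.OutputField(desc="a concise, faithful summary")
+
+summarize = dspy.Predict(EmailSummary)
+```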
+
+### Modules: Reusable Strategies
+
+A Module encapsulates how to accomplish a subtask in a composable, adaptable way. Modules are like functions or classes in software engineering – they can be combined to form complex pipelines.
+
+```python
+# Same module, different tasks
+cot = dspy.ChainOfThought("question -> answer") # Math reasoning
+classify = dspy.ChainOfThought("text -> sentiment") # Classification
+```
+
+The key insight: modules in DSPy are polymorphic and parameterized. The same module can adjust its behavior based on the Signature and can learn or be optimized. A `ChainOfThought` module provides a stable reasoning algorithm that's independent of any single prompt phrasing.
+
+### Optimizers: The Self-Improving Part
+
+Optimizers are DSPy's "compiler." Given a module and signature, an Optimizer tunes the prompts or parameters to maximize performance on your metric. DSPy treats prompt engineering as a search problem – much like a compiler explores optimizations to improve performance.
+
+```python
+# Let the optimizer find the best prompts
+teleprompter = dspy.BootstrapFewShot(metric=my_metric)
+optimized_program = teleprompter.compile(my_program, trainset=trainset)
+```
+
+This means improving your AI system doesn't require manually rewriting prompts. You compile and optimize, letting the framework refine the low-level details. The same DSPy program can be recompiled for better results without changing the high-level code.
+
+These three abstractions stay stable even as the LLM field evolves. Your program's logic remains separate from shifting prompt styles or training paradigms.
+
+## Core Principles: The Five Bets
+
+DSPy is built on five foundational principles that guide its design and long-term vision:
+
+### 1. Information Flow Over Everything
+
+The most critical aspect of effective AI software is information flow, not prompt phrasing.
+
+Modern foundation models are incredibly powerful reasoners, so the limiting factor is often how well you provide information to the model. Instead of obsessing over exact prompt wording, focus on ensuring the right information gets to the right place in your pipeline.
+
+DSPy enforces this through Signatures. By explicitly structuring inputs and outputs, you naturally concentrate on what data flows through your system. The framework's support for arbitrary control flow lets information be routed and transformed as needed.
+
+The key shift: concentrate on defining the right Signature rather than finding the perfect prompt. Your AI system becomes robust to changes in phrasing or model because the essential information being conveyed remains well-defined.
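+
+A tiny illustration of the shift, with hypothetical fields; the improvement comes from routing information, not rewording:
+
+```python
+# Weak: hope the phrasing alone makes the model use the source document
+qa = dspy.Predict("question -> answer")
+
+# Stronger: route the missing information explicitly through the Signature
+qa_grounded = dspy.Predict("context, question -> answer")
+```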
+
+### 2. Functional, Structured Interactions
+
+LLM interactions should be structured as predictable program components, not ad-hoc prompt strings.
+
+DSPy treats each LLM interaction like a function call. A Signature defines a functional contract: what inputs it expects, what it outputs, and how it should behave. This prevents the confusion of mixing instructions, context, and output format in one giant prompt string.
+
+```python
+# Instead of one giant prompt doing everything:
+summarize = dspy.Predict("email -> summary")
+classify = dspy.Predict("summary -> category")
+```
+
+Each module operates like a well-defined function with structured inputs and outputs. This yields clarity and modularity – each piece does one thing in a controlled way, making your programs transparent and logically composed.
+
+### 3. Polymorphic Inference Modules
+
+Inference strategies should be reusable, adaptable modules that work across many tasks.
+
+Different prompting techniques and reasoning methods should be encapsulated in modules that can be applied everywhere. A single module (like `ChainOfThought` for reasoning or `Retrieve` for RAG) can work across many tasks and Signatures.
+
+```python
+# Same reasoning strategy, different domains
+math_solver = dspy.ChainOfThought("problem -> solution")
+code_reviewer = dspy.ChainOfThought("code -> feedback")
+```
+
+This polymorphism is powerful: develop a prompting strategy once and reuse it everywhere. It clearly separates what's fixed (the strategy) from what adapts (the content). When new prompting techniques emerge, you can incorporate them by updating modules without rewriting your entire application.
+
+Polymorphic modules also distinguish which parts can be learned versus fixed. The reasoning template might be constant, but the actual content can be optimized for your specific problem.
+
+### 4. Decouple Specification from Execution
+
+What your AI should do must be independent from how it's implemented underneath.
+
+AI is fast-moving – new paradigms (few-shot prompting, fine-tuning, retrieval augmentation, RL) emerge constantly. DSPy future-proofs your system by separating what you want (the specification) from how it's achieved (the current technique).
+
+You write Signatures and compose Modules without hard-coding whether the model uses in-context examples, fine-tuning, or external tools. Those details are handled by your chosen modules and optimizers.
+
+```python
+# Same specification, different implementations
+translator = dspy.Predict("text -> translation") # Could use prompts, fine-tuning, or both
+```
+
+The same program can be instantiated under different paradigms. Write your code once, and the framework can optimize it as prompts today, fine-tuned models tomorrow, or something entirely new next year.
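+
+For instance, the same module can be compiled with a prompt-based optimizer or a weight-tuning one (a sketch; `metric` and `trainset` are assumed, and `BootstrapFinetune` requires an LM that supports fine-tuning):
+
+```python
+# Same specification, two optimization paradigms
+prompt_optimized = dspy.BootstrapFewShot(metric=metric).compile(translator, trainset=trainset)
+weight_optimized = dspy.BootstrapFinetune(metric=metric).compile(translator, trainset=trainset)
+```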
+
+### 5. Natural Language Optimization as First-Class
+
+Optimizing prompts and instructions through data is a powerful learning paradigm.
+
+Rather than viewing prompt crafting as a static human task, DSPy treats it as an optimization problem solvable with data and metrics. This approach elevates prompt optimization to be as important as traditional model training.
+
+```python
+# Systematic prompt optimization, not manual tweaking
+optimizer = dspy.MIPROv2(metric=accuracy, num_candidates=10)
+better_program = optimizer.compile(program, trainset=trainset)
+```
+
+DSPy provides optimizers that generate candidate prompts, evaluate them, and pick the best ones iteratively. This often achieves better sample efficiency than expensive model fine-tuning. By making this core to the framework, DSPy signals that algorithmic prompt tuning should replace manual prompt tweaking.
+
+This principle aligns with the belief that as LLMs become runtime engines, improving how we instruct them matters as much as improving the engines themselves.
+
+## Beyond Prompt Engineering
+
+A common misconception is that DSPy is just "fancy prompt templating." The approach is fundamentally different:
+
+**From Artisanal to Systematic**: Traditional prompt engineering is manual tweaking until output "seems good." DSPy replaces this with a systematic process: declare what you need via Signatures and let modules and optimizers construct the best prompts.
+
+**Modularity vs. Monolithic Prompts**: Instead of one giant prompt trying to do everything, DSPy encourages splitting functionality into modules. A retrieval module handles fetching info, a reasoning module handles thinking steps, a formatting module handles output. Each piece is easier to understand, test, and improve independently.
+
+**Reusability and Community**: Manual prompts are locked to specific tasks. In DSPy, strategies (modules and optimizers) are reusable. The community can contribute new modules that everyone can apply to their own Signatures. It's not a collection of templates – it's a framework where best practices accumulate.
+
+**Beyond Chat Interfaces**: DSPy isn't about writing clever ChatGPT prompts. It's about designing full AI systems and pipelines with multiple LMs and steps. The compiler can optimize your entire pipeline end-to-end, something manual prompt tinkering can't achieve.
+
+DSPy brings the rigor of compilers and optimizers to what was previously an informal process. Just as high-level programming languages replaced raw machine code, DSPy's creators believe high-level LLM programming will replace low-level prompt tweaking.
+
+## Long-Term Vision: The Future of LLM Programming
+
+DSPy anticipates a **paradigm shift** in how we build AI systems. As models become more central to applications, treating them as black boxes with handwritten prompts becomes *untenable*.
+
+We need **"system prompt learning"** – giving LLMs ways to learn and refine their instructions over time, not just their internal weights. DSPy's focus on prompt optimization aligns with this vision. You can think of a DSPy program as a *"living" system prompt* that improves iteratively.
+
+Because DSPy programs are **declarative and modular**, they're equipped to absorb advances. If a better prompting technique emerges, you can incorporate it by updating a module without redesigning your entire system. This is like how well-designed software can swap databases or libraries thanks to *abstraction boundaries*.
+
+The long-term bet: **LLM-based development** will standardize around such abstractions, moving away from one-off solutions. Programming with LLMs may become as mainstream as web development – and when that happens, having compiler-like frameworks to manage complexity will be *crucial*.
+
+We can imagine a future where AI developers design **Signatures** and plug in **Modules** like today's developers work with APIs and libraries. Type-safety analogies might become literal as research progresses on *specifying and verifying* LLM behavior.
+
+DSPy aims to bridge from today's prompt experiments to tomorrow's **rigorous discipline** of "LLM programming." The philosophy embraces structure and learning in a domain often approached ad-hoc. By raising the abstraction level – treating prompts and flows as code – we can build AI systems that are more *reliable*, *maintainable*, and *powerful*.
+
+This isn't just about making prompt engineering easier. It's laying groundwork for the **next generation** of AI software development, where humans and AI models collaborate through clear interfaces and continual improvement.
+
+The ultimate vision: making LLMs *first-class programmable entities* in our software stack.
\ No newline at end of file
diff --git a/docs/docs/js/mathjax-config.js b/docs/docs/js/mathjax-config.js
new file mode 100644
index 0000000000..ec8c28b31d
--- /dev/null
+++ b/docs/docs/js/mathjax-config.js
@@ -0,0 +1,17 @@
+window.MathJax = {
+ tex: {
+ inlineMath: [["\\(", "\\)"], ["$", "$"]],
+ displayMath: [["\\[", "\\]"], ["$$", "$$"]],
+ processEscapes: true,
+ processEnvironments: true,
+ packages: {'[+]': ['ams', 'newcommand', 'configmacros']}
+ },
+ options: {
+ ignoreHtmlClass: ".*|",
+ processHtmlClass: "arithmatex"
+ }
+};
+
+document$.subscribe(() => {
+ MathJax.typesetPromise()
+})
\ No newline at end of file
diff --git a/docs/docs/why-dspy.md b/docs/docs/why-dspy.md
new file mode 100644
index 0000000000..013b748a6c
--- /dev/null
+++ b/docs/docs/why-dspy.md
@@ -0,0 +1,85 @@
+# Why DSPy?
+
+> This document has been consolidated from discussions with the DSPy team, including [Omar Khattab](https://x.com/lateinteraction) and other core contributors, as well as from the official DSPy documentation, [Twitter](https://x.com/DSPyOSS) and community insights.
+
+If you've built anything with LLMs, you've probably hit the wall: prompts that work great in testing break in production, small changes cascade into system failures, and every new model requires rewriting everything from scratch.
+
+DSPy emerged from this frustration. Instead of treating prompts as strings to craft and re-craft, it treats them as programs to compile and optimize. You write what you want the system to do, and DSPy figures out how to make it work well.
+
+## The Problem with Prompt Engineering
+
+Most LLM development today feels like programming in assembly language. You're writing very specific instructions for each task, debugging by trial and error, and starting over when anything changes.
+
+Take a typical scenario: you spend days crafting the perfect prompt for email summarization. It works beautifully on your test emails. Then you switch from GPT to Claude, and everything breaks. Or your users start sending different types of emails, and suddenly your carefully tuned prompt produces garbage.
+
+This happens because prompts are brittle. They're optimized for specific contexts and fall apart when those contexts shift. Worse, they don't compose well – if you want to chain multiple LLM calls together, you end up with a mess of string concatenation and manual output parsing.
+
+The field changes quickly. New techniques like chain-of-thought, retrieval, and fine-tuning keep replacing each other. This means constantly rewriting your code to use the latest methods.
+
+## How DSPy Changes This
+
+DSPy treats LLM programming more like traditional software engineering. Instead of writing prompts, you write programs that describe what you want to happen. DSPy then compiles these programs into effective prompts automatically.
+
+Here's the key insight: you shouldn't have to manually optimize prompts any more than you should have to manually optimize assembly code. The computer should do that work for you.
+
+```python
+# Instead of crafting prompts, describe the task
+qa = dspy.ChainOfThought("question -> answer")
+
+# Let DSPy optimize it for your data
+compiled = optimizer.compile(qa, trainset=examples)
+```
+
+This compiled program often performs better than hand-tuned prompts because DSPy can try thousands of variations and pick the best ones. It's like having an expert prompt engineer working around the clock.
+
+The modular design means you can build complex pipelines by combining simple pieces:
+
+```python
+# Each piece has a clear job
+retriever = dspy.Retrieve(k=5)
+summarizer = dspy.ChainOfThought("context, question -> summary")
+classifier = dspy.Predict("summary -> category")
+
+# Compose them naturally
+def pipeline(question):
+    docs = retriever(question).passages
+    summary = summarizer(context=docs, question=question).summary
+    return classifier(summary=summary)
+```
+
+When you need to swap out components – maybe you want to try a different model, or add a reasoning step – you modify the high-level program and recompile. DSPy handles the prompt engineering.
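+
+For example, adding a reasoning step to the pipeline above is a one-line change rather than a prompt rewrite:
+
+```python
+# Upgrade the final step from direct prediction to chain-of-thought reasoning
+classifier = dspy.ChainOfThought("summary -> category")
+```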
+
+## Why Now?
+
+We're at an inflection point with LLMs. The models themselves are incredibly capable – GPT, Claude, Llama can handle almost any task if you ask them the right way. The problem isn't the models anymore; it's how we're programming them.
+
+There's a growing recognition that we might be missing a major paradigm for LLM learning – the idea that models should get better at how they're instructed, not just what they know. DSPy is built around this insight of "system prompt learning."
+
+We're also seeing LLMs move from research demos to real products. When you're prototyping, it's fine to manually tweak prompts until they work. But when you're serving millions of users, you need systems that are reliable, maintainable, and can improve automatically.
+
+The timing is right because we finally understand enough about how prompting works to systematize it. Patterns like chain-of-thought, few-shot learning, and retrieval augmentation aren't magic anymore – they're techniques we can encode into reusable modules.
+
+## Who Uses DSPy
+
+**Individual developers** love DSPy because it eliminates the tedious parts of LLM development. Instead of spending hours tweaking prompts, you can prototype new ideas quickly using built-in modules for common patterns. When something breaks, you debug structured code rather than mysterious prompt interactions.
+
+**Researchers** find DSPy invaluable for experimentation. Want to compare chain-of-thought reasoning with retrieval augmentation? Both approaches use the same framework, so you can swap them in and out easily. Your experiments become more reproducible because DSPy programs are concrete and version-controllable, unlike vague descriptions of prompts.
+
+**Engineering teams** adopt DSPy to manage complexity. When multiple engineers work on LLM features, DSPy's modular structure prevents the codebase from becoming a tangle of one-off prompts. You can enforce consistency across features, integrate with existing ML infrastructure, and optimize costs by automatically finding efficient model configurations.
+
+## About the Examples
+
+If you look at DSPy examples and think "this seems simple," you're seeing the point. A signature like `"question -> answer"` looks trivial, but it's doing a lot of work behind the scenes.
+
+The simplicity is intentional. DSPy examples are like "Hello World" programs – they demonstrate the core concepts without getting bogged down in application complexity. In practice, you'll combine these simple pieces to build sophisticated systems.
+
+Remember, when you see a minimal example, DSPy is handling prompt generation, optimization, and model interaction automatically. The few lines of code you write represent a lot of engineering effort you don't have to do yourself.
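+
+One way to see that work is to print the prompt DSPy actually generated for a one-line signature (assuming a configured LM; `inspect_history` is DSPy's built-in trace viewer):
+
+```python
+import dspy
+
+dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))
+qa = dspy.ChainOfThought("question -> answer")
+qa(question="What is 7 * 6?")
+dspy.inspect_history(n=1)  # shows the full generated prompt and the model's reply
+```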
+
+## The Bottom Line
+
+DSPy changes how you think about building with LLMs. Instead of crafting prompts by hand, you write programs that describe what you want to achieve. Instead of manually tuning for each model and dataset, you let DSPy optimize automatically.
+
+This isn't just about making prompt engineering easier – it's about making LLM development more like traditional software engineering. Reliable, maintainable, and cumulative.
+
+As LLMs become central to more applications, having systematic ways to program them becomes essential. DSPy provides that foundation, letting you build on solid abstractions rather than brittle prompts.
+
diff --git a/docs/mkdocs.yml b/docs/mkdocs.yml
index fcb2128b99..1380ab0421 100644
--- a/docs/mkdocs.yml
+++ b/docs/mkdocs.yml
@@ -65,6 +65,8 @@ nav:
- Community:
- Community Resources: community/community-resources.md
- Use Cases: community/use-cases.md
+ - Why DSPy?: why-dspy.md
+ - Design Principles: design-principles.md
- Roadmap: roadmap.md
- Contributing: community/how-to-contribute.md
- FAQ:
@@ -248,6 +250,9 @@ extra:
extra_javascript:
- "js/runllm-widget.js"
+ - "js/mathjax-config.js"
+ - https://polyfill.io/v3/polyfill.min.js?features=es6
+ - https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js
markdown_extensions:
- pymdownx.tabbed: