#ForTheLoveOfCode
GEPA-Lite is a lightweight implementation of the proposed GEPA prompt optimization method, custom-fit for single-task applications. It is built on the core principles of LLM self-reflection and self-improvement, streamlined.
Developed in the spirit of open-source initiatives like Google Summer of Code 2025 and For the Love of Code 2025, this project leverages Gemma (`ollama::gemma3n:e4b`) as its core model. The project also offers optional support for the Gemini API, allowing access to powerful models like `gemini-2.5-flash-lite`, `gemini-2.5-flash`, and `gemini-2.5-pro`.
Created by: Emmanuel G. Maminta* (GitHub: egmaminta, LinkedIn: egmaminta, Personal Blog: egmaminta.github.io, Email: egmaminta@up.edu.ph)

*University of the Philippines
GEPA-Lite demonstration on sampled GSM8K**. Initial prompt: "You are an assistant and your task is to answer the user's question." Final prompt: "Your task is to solve mathematical word problems and output only the final numerical answer."
**GSM8K consists of 8.5K high quality grade school math problems created by human problem writers.
Strategy | Trigger condition | Action taken | Goal |
---|---|---|---|
Exploit-Max | Only one best candidate remains | Select it directly | Focus on clear winner |
Exploit-Normal | Multiple best, Q branch | Sample one by frequency as "best" | Exploit robust, multi-task candidates |
Exploit-Merge | Multiple best, (1-Q) branch | Merge (mutate) all bests into a new candidate | Synthesize new, possibly better prompt |
Explore | With probability (1-P) | Pick a random candidate from the pool | Maintain diversity, avoid local optima |
Note: `P`: Exploit probability; `Q`: Probability to not merge (or not mutate) all bests.
Disclaimer: This implementation contains minor deviations from the reference paper. I, the creator of this repository, take responsibility for any part that is implemented incorrectly; the core idea, however, remains the same. Feel free to contribute!
Title: GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning
Authors: Lakshya A Agrawal¹, Shangyin Tan¹, Dilara Soylu², Noah Ziems⁴, Rishi Khare¹, Krista Opsahl-Ong⁵, Arnav Singhvi²,⁵, Herumb Shandilya², Michael J Ryan², Meng Jiang⁴, Christopher Potts², Koushik Sen¹, Alexandros G. Dimakis¹,³, Ion Stoica¹, Dan Klein¹, Matei Zaharia¹,⁵, Omar Khattab⁶
¹UC Berkeley, ²Stanford University, ³BespokeLabs.ai, ⁴Notre Dame, ⁵Databricks, ⁶MIT
TL;DR: GEPA uses natural language reflection to learn from trial and error. It outperforms the reinforcement learning method GRPO (Group Relative Policy Optimization) by an average of 10% and up to 20%, while using up to 35 times fewer rollouts. It also surpasses the leading prompt optimizer, MIPROv2, by over 10%. Additionally, GEPA shows potential as an effective inference-time search strategy for code optimization.
This implementation performs iterative prompt evolution using concurrent generation, evaluation, Pareto-style selection, reflection-based mutation, and optional merging. It is fully asynchronous around model I/O to maximize throughput under a fixed evaluation budget.
- Load config (`GEPA_cfg.yaml`) and datasets (`Dpareto.json`, `Dfeedback.json`).
- Generate an initial pool of diverse candidate prompts concurrently from a seed prompt.
- Evaluate each candidate over the full Pareto set (`Dpareto`) to build a score matrix S (candidates × tasks).
- Enter the optimization loop (budget-limited):
  - Sample a mini-batch of feedback examples (`Dfeedback`).
  - Choose a candidate via exploitation (probability P) or random exploration (1−P).
  - Under exploitation:
    - Use Pareto-based filtering + dominance removal to get a set of non-dominated "best-on-something" candidates.
    - With probability Q: sample one candidate proportional to its frequency of being per-task best.
    - With probability (1−Q): merge multiple top candidates into a synthesized prompt.
  - Evaluate the chosen prompt on the mini-batch and collect structured feedback text.
  - Reflect (mutation step): ask the reflection model to propose an improved prompt.
  - Evaluate the new prompt on the same mini-batch; if it meets or exceeds the parent's mean mini-batch score, evaluate it on the full Pareto set and append it to the pool.
- After budget exhaustion, report the prompt with the highest mean Pareto score.
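For orientation, the loop can be condensed into a short sketch. This is not the code in `GEPA.py`; the helpers `evaluate`, `reflect`, and `select` are hypothetical callables standing in for the real model calls and Pareto selection, and the merge branch is omitted for brevity.

```python
import random
from statistics import mean
from typing import Callable

def optimize(seed_pool: list[str],
             d_pareto: list[dict],
             d_feedback: list[dict],
             evaluate: Callable[[str, list[dict]], list[float]],
             reflect: Callable[[str, list[dict]], str],
             select: Callable[[list[str], list[list[float]]], str],
             budget: int,
             mini_batch_size: int = 4,
             exploit_prob: float = 0.8) -> str:
    pool = list(seed_pool)
    scores = [evaluate(p, d_pareto) for p in pool]        # score matrix S
    budget -= len(pool) * len(d_pareto)
    while budget > 0:
        batch = random.sample(d_feedback, mini_batch_size)  # mini-batch
        if random.random() < exploit_prob:
            parent = select(pool, scores)                   # exploit (P)
        else:
            parent = random.choice(pool)                    # explore (1-P)
        parent_batch = evaluate(parent, batch)
        child = reflect(parent, batch)                      # mutation step
        child_batch = evaluate(child, batch)
        budget -= 2 * len(batch)
        if mean(child_batch) >= mean(parent_batch):         # mini-batch gate
            pool.append(child)
            scores.append(evaluate(child, d_pareto))        # full Pareto eval
            budget -= len(d_pareto)
    best = max(range(len(pool)), key=lambda i: mean(scores[i]))
    return pool[best]
```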
- `Dpareto`: Ground-truth QA pairs used for stable, comparable scoring across candidates.
- `Dfeedback`: Additional sampled items for rapid, lower-cost iterative refinement.
- Scores: String similarity via `difflib.SequenceMatcher` (ratio ∈ [0, 1]); exact matches earn 1. (You have the option to implement your own `eval_metric` and `eval_feedback` functions; see the sketch below.)
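As a reference point for swapping in your own metric, a minimal version of the default string-similarity scoring described above could look like this (a sketch, not the exact code in the repo):

```python
from difflib import SequenceMatcher

def eval_metric(prediction: str, ground_truth: str) -> float:
    """Score a model answer against the ground truth in [0, 1]."""
    pred, truth = prediction.strip(), ground_truth.strip()
    if pred == truth:
        return 1.0                      # exact matches earn the full score
    return SequenceMatcher(None, pred, truth).ratio()

print(eval_metric("72", "72"))          # 1.0
print(eval_metric("72 apples", "72"))   # partial credit from the ratio
```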
- Target model is the worker that actually answers the user's questions. It runs candidate prompts and produces task outputs that get scored.
- Reflection model is the coach that looks at the worker's prompt plus feedback and suggests an improved prompt. It never answers the task directly; it just proposes new or merged prompts.
- For each task (column), collect candidates achieving the column max.
- Union these sets → initial elite pool.
- Remove dominated candidates: a candidate is dominated if another candidate is ≥ on all tasks and > on at least one.
- Count how often each survivor is task-best; convert frequencies to a sampling distribution.
- This concentrates probability mass on genuinely distinct trade-offs and eliminates strictly inferior prompts.
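A minimal, self-contained sketch of this selection logic (plain Python; the repo's `select_candidate` additionally handles the Q / 1−Q merge branch and exploration):

```python
import random

def select_candidate(scores: list[list[float]]) -> int:
    """Pick a candidate index from a (candidates x tasks) score matrix."""
    n_tasks = len(scores[0])
    # 1) For each task, collect candidates achieving the column max.
    best_per_task = []
    for t in range(n_tasks):
        col_max = max(row[t] for row in scores)
        best_per_task.append({i for i, row in enumerate(scores) if row[t] == col_max})
    elite = set().union(*best_per_task)          # 2) union -> initial elite pool
    # 3) Drop dominated candidates (another elite is >= everywhere, > somewhere).
    def dominated(i: int) -> bool:
        return any(all(scores[j][t] >= scores[i][t] for t in range(n_tasks))
                   and any(scores[j][t] > scores[i][t] for t in range(n_tasks))
                   for j in elite if j != i)
    survivors = [i for i in elite if not dominated(i)]
    # 4) Frequency of being task-best -> sampling distribution.
    freq = [sum(i in b for b in best_per_task) for i in survivors]
    return random.choices(survivors, weights=freq, k=1)[0]

# Example: candidate 0 is best on task 0, candidate 1 on task 1; candidate 2 never wins.
print(select_candidate([[0.9, 0.2], [0.3, 0.8], [0.1, 0.1]]))
```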
- A meta prompt (`META_PROMPT`) injects the current prompt and structured per-sample feedback (query, model answer, ground truth, score, textual feedback).
- The reflection model returns at most one improved prompt (keeps the loop tight, reduces drift).
- A mutant is only promoted to global evaluation if it does not regress on the mini-batch.
When multiple strong candidates exist, a merge step requests a synthesized prompt that preserves their strengths and addresses their weaknesses (controlled by the probability branch).
Exploit-Max
- When: Only one candidate survives Pareto filtering (i.e., it is the clear best).
- What happens: The algorithm selects this single best candidate deterministically for further evaluation and mutation.
- Purpose: Quickly exploit a clear winner, focusing resources on refining the top performer.
Exploit-Normal
- When: Multiple non-dominated candidates survive, and a random draw (with probability Q) chooses not to merge.
- What happens: One candidate is sampled from the survivors, with probability proportional to how often each is "best" across tasks.
- Purpose: Exploit strong candidates while maintaining diversity, allowing the process to focus on robust prompts that perform well on multiple tasks.
Exploit-Merge
- When: Multiple non-dominated candidates survive, and a random draw (with probability 1-Q) chooses to merge.
- What happens: The algorithm synthesizes a new prompt by merging the surviving candidates, aiming to combine their strengths and address their weaknesses.
- Purpose: Encourage innovation and avoid stagnation by creating new, potentially superior prompts from the best available options.
Explore
- When: With probability (1-P), the algorithm chooses to explore rather than exploit.
- What happens: A random candidate is selected from the entire pool, regardless of its current performance.
- Purpose: Maintain diversity, prevent premature convergence, and allow the discovery of overlooked or novel solutions.
- Async tasks evaluate prompts across samples in parallel (per candidate and per mini-batch mutation stage).
- A semaphore throttles concurrent initial generations to respect model rate limits.
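The concurrency pattern is roughly as follows; this sketch uses a dummy `query_model` with a sleep in place of a real Ollama/Gemini request:

```python
import asyncio

SEM = asyncio.Semaphore(4)            # throttle concurrent model calls

async def query_model(prompt: str, question: str) -> str:
    """Placeholder for a real Ollama/Gemini request."""
    async with SEM:                   # respect rate limits
        await asyncio.sleep(0.1)      # simulate network latency
        return f"answer to: {question}"

async def evaluate_prompt(prompt: str, samples: list[dict]) -> list[str]:
    # One task per sample; they run concurrently up to the semaphore limit.
    return await asyncio.gather(*(query_model(prompt, s["question"]) for s in samples))

samples = [{"question": f"q{i}", "answer": str(i)} for i in range(8)]
print(asyncio.run(evaluate_prompt("seed prompt", samples)))
```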
- `generate_initial_candidates_from_seed`: Concurrent single-prompt generations for diversity + deduplication.
- `select_candidate`: Pareto elite extraction, dominance pruning, probability assignment.
- `extract_response_from_target` / `extract_response_from_reflection`: Structured JSON-like parsing via a `Pydantic` schema.
- `GEPA`: Orchestrates the evolutionary loop with budget accounting.
- `extract_merged_prompt`: Synthesizes multiple prompts into one.
- `eval_metric` / `eval_feedback`: Scoring + diagnostic feedback synthesis.
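For the structured parsing, a schema in the style of the repo might look like the following sketch; the field names (`improved_prompt`, `rationale`) are illustrative, not the ones actually defined, and the Pydantic v2 API is assumed:

```python
from pydantic import BaseModel, ValidationError

class ReflectionResponse(BaseModel):
    """Illustrative schema; the repo defines its own fields."""
    improved_prompt: str
    rationale: str = ""

raw = '{"improved_prompt": "Solve the problem and output only the final number.", "rationale": "Be terse."}'
try:
    parsed = ReflectionResponse.model_validate_json(raw)   # Pydantic v2 API
    print(parsed.improved_prompt)
except ValidationError as err:
    print("malformed model output:", err)
```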
Every model query against a sample decrements the global budget. Full-set evaluations are only triggered when a mutation shows promise on the mini-batch, amortizing expensive scoring.
- Frequency weighting of per-task best appearances yields a simple proxy for multi-task robustness without solving a weighted scalarization.
- Using an explicit merge path combats stagnation when multiple partial specialists exist.
- Mini-batch gating reduces variance while avoiding full-set overfitting each iteration.
- Limiting reflection to one improvement per cycle simplifies acceptance logic and traceability.
- Similarity metric is surface-level; could plug in semantic or task-specific evaluators.
- Needs epsilon-based comparisons for floating scores to avoid brittle equality.
- Dominance check is O(k^2 * tasks); could be optimized for large pools.
- No archival mechanism (e.g., hall-of-fame) or diversity penalty to prevent convergence to local optima.
- Reflection may benefit from multiple candidates with bandit selection.
- Logging could be made structured (JSON) for analytics.
- Swap `eval_metric` for domain-specific scoring.
- Add alternative selection policies (e.g., Thompson sampling on mean + variance).
- Introduce early stopping when there is no improvement in N consecutive accepted mutations.
- Integrate caching for repeated model queries on unchanged (prompt, query) pairs (see the sketch below).
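For the caching extension, one simple option is a dictionary keyed by the (prompt, query) pair; a sketch (not part of the repo):

```python
from collections.abc import Callable

_cache: dict[tuple[str, str], str] = {}

def cached_query(prompt: str, query: str,
                 model_call: Callable[[str, str], str]) -> str:
    """Reuse answers for unchanged (prompt, query) pairs to save budget."""
    key = (prompt, query)
    if key not in _cache:
        _cache[key] = model_call(prompt, query)  # only new pairs hit the model
    return _cache[key]

# Toy usage: the second call is answered from the cache, not the model.
calls: list[str] = []
def fake_model(prompt: str, query: str) -> str:
    calls.append(query)
    return f"answer to {query}"

cached_query("seed prompt", "What is 2+2?", fake_model)
cached_query("seed prompt", "What is 2+2?", fake_model)
print(len(calls))  # 1
```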
- Ensure you have `miniconda`/`anaconda` installed. Create a virtual environment with Python 3.11.13.

  ```bash
  conda create -n "GEPA-Lite" python=3.11.13 -y
  ```

- Activate the virtual environment.

  ```bash
  conda activate GEPA-Lite
  ```

- Install the required libraries.

  ```bash
  pip install -r requirements.txt
  ```
- Open `GEPA_cfg.yml`.
- (Gemini API) Replace the placeholder `PROJECT` and `LOCATION` if `VERTEXAI: true`; else optionally set `VERTEXAI: false`. (Ollama API) Set `TARGET_MODEL` and `REFLECTION_MODEL` to the Ollama-provided `gemma3n:e4b` model.
- Adjust `BUDGET`, ensuring it's ≥ `(NUM_INITIAL_CANDIDATE_PROMPTS * |Dpareto|) + (MINI_BATCH_SIZE * 5 * |Dpareto|) + safety margin (~30%)`.
- Tune `EXPLOIT_PROB` and `MERGE_PROB` based on the desired exploration.
- Save and re-run the main script (`GEPA.py`).
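The `BUDGET` inequality can be sanity-checked in a few lines; the numbers below are placeholders, not recommended settings:

```python
# Placeholder values; substitute the actual settings from GEPA_cfg.yml.
num_initial_candidates = 5    # NUM_INITIAL_CANDIDATE_PROMPTS
pareto_size = 20              # |Dpareto|
mini_batch_size = 4           # MINI_BATCH_SIZE

baseline = (num_initial_candidates * pareto_size) + (mini_batch_size * 5 * pareto_size)
recommended_budget = int(baseline * 1.3)   # ~30% safety margin
print(baseline, recommended_budget)        # 500 650
```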
- Purpose
  - `Dpareto.json`: Stable evaluation set. Used to compute full score vectors per candidate prompt (deterministic performance basis).
  - `Dfeedback.json`: Fast adaptation pool. Small random mini-batches are sampled each iteration to generate feedback and mutate prompts.
- Expected schema. Required keys: `question`, `answer` (non-empty strings). Avoid extra keys unless you extend the code to read them.
```json
[
  {
    "question": "string input or query",
    "answer": "ground truth string answer"
  },
  ...
]
```
- Choosing content
  - `Dpareto.json`: Must be representative and cover all subtopics / formats / difficulty tiers. It must vary in length, phrasing, and edge cases (numbers, units, multi-step reasoning). Do not modify mid-run.
  - `Dfeedback.json`: A larger pool than `MINI_BATCH_SIZE` (≥ twice the size of `Dpareto`). Higher variability, includes tricky and failure-prone cases. Can include some overlap with `Dpareto`, but prefer mostly distinct items to broaden learning signals.
- Sizing guidelines
  - `Dpareto` size: Small tasks (QA-like) 15-40; moderate 40-100. Larger sets increase cost linearly.
  - `Dfeedback` size: 2–5× `Dpareto`, or at least 50 if you want variety; `minimum = MINI_BATCH_SIZE * 2 * |Dpareto|`.
  - Budget sanity: `Initial cost = NUM_INITIAL_CANDIDATE_PROMPTS * |Dpareto|`. Each accepted new candidate costs another `|Dpareto|`. Ensure `BUDGET ≥ initial + (expected improvements * |Dpareto|) + iteration minibatch costs`.
- Difficulty / Distribution Strategy
- Create a difficulty split: easy / medium / hard (e.g., 40/40/20). Hard items expose weaknesses; easy items stabilize averages.
- Ensure each sub-domain appears at least twice in `Dpareto` to reduce variance.
- Put rare but critical edge cases (formatting, units, rounding) in `Dpareto` so they can influence selection pressure.
- Workflow to build `Dpareto`
  - Gather raw candidate pairs.
  - Deduplicate (case-insensitive question hash).
  - Remove ambiguous items.
  - Normalize answers.
  - Tag coverage (spread topics).
  - Shuffle with a fixed seed; pick the top N.
  - Save to `Dpareto.json` and freeze (commit with a version tag).
- Workflow to build `Dfeedback`
  - Start with the remaining pool not used in `Dpareto`.
  - Inject a few "stress" / adversarial items (format traps, long inputs).
  - Ensure at least `MINI_BATCH_SIZE * 2 * |Dpareto|` items in total.
  - Optionally rotate / refresh between runs (version the file).
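The two workflows above can be scripted end to end; a sketch assuming a list of raw `{"question", "answer"}` dicts (normalization and coverage tagging are task-specific and omitted):

```python
import json
import random

def build_splits(raw_pairs: list[dict], n_pareto: int, seed: int = 42):
    """Deduplicate, freeze a Pareto split, and keep the rest for feedback."""
    seen, unique = set(), []
    for item in raw_pairs:
        key = item["question"].strip().lower()    # case-insensitive question hash
        if key not in seen:
            seen.add(key)
            unique.append({"question": item["question"].strip(),
                           "answer": item["answer"].strip()})
    random.Random(seed).shuffle(unique)           # shuffle with a fixed seed
    d_pareto, d_feedback = unique[:n_pareto], unique[n_pareto:]
    with open("Dpareto.json", "w") as f:
        json.dump(d_pareto, f, indent=2)
    with open("Dfeedback.json", "w") as f:
        json.dump(d_feedback, f, indent=2)
    return d_pareto, d_feedback
```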
I'd like to acknowledge the authors of the GEPA paper for their awesome work; the team (developers, engineers, and scientists) behind Gemma 3n and Gemini for their powerful models; and Google Summer of Code 2025 and For the Love of Code 2025 for inspiring me to continue contributing to open source. #ForTheLoveOfCode