Dynamic Validation in GEPAState #100

aria42 · 2025-10-08T01:41:26Z

Addresses #96 and enables add new validation evals. Also removes assumption that every program has been evaluated on all validation examples. Everything is still well-defined with this change with the exception that merge proposals must be careful to compare only on validation examples both candidates have been evaluated against.

Changes

Changes to GEPAState: The state information about validation (scores, outputs, etc.) are stored in terms of opaque identifier (ValId). Making the state agnostic about the set of validation examples enables you to extend the validation examples from existing states (and run directories). It also removes the assumption that we have "dense" evaluation where each program is evaluated on the full validation set.
Changes to GEPAEngine: The engine takes optional val_evaluation_policy to select the subset of validation examples to evaluate for a program. If none is provided the full valset is evaluated. The policy enables you to have flexible ways of choosing examples (e.g., picking validation examples not evaluated by other programs or curriculum-style staging of more harder examples over time). Note that GEPAEngine assumes the valset ids are indices of a validation set (which can be updated), but this assumption doesn't leak to GEPAState.
Changes to api.py: Add a parameter for merge_val_overlap_floor which is threaded through to MergeProposer. This parameter rejects merge proposal if the merge candidates don't have this many shared validation examples they are both evaluated against.
Running pre-commit on the affected files leaded a much larger set of changes from past commits that don't assume to be formatted. Sorry!

Tests

Existing tests pass and added the following:

Added test around extending validation and having custom validation selection policy
Added test around loading a legacy gepa.bin format and validating it can be read and upgraded to "sparse" format

Open Questions

In GEPAState there was per_program_tracked_scores and program_full_scores_val_set member variables which were meant to capture the train validation and valset scores for candidates. But prior to this change these two variables were identical (latter was set to the former see existing code here and here). I've left derived properties for these for backward compatibility but could remove them since I think the distinction is no longer needed.

LakshyAAAgrawal · 2025-10-13T04:07:32Z

Could we also address #79, which would be very aligned with this PR? It should also be a straightforward change. Just need to implement a DataLoader protocol, and load next training minibatch using it.

LakshyAAAgrawal · 2025-10-12T20:30:19Z

src/gepa/proposer/merge.py

            state.prog_candidate_val_subscores[id1],
            state.prog_candidate_val_subscores[id2],
        )
+        if not subsample_ids:


Question: I do several random selections to select id1 and id2 to see if we can find candidates which we are able to merge that also satisfy all the requirements. Do you think we can include this as part of those checks?

I pushed this deeper into the merge code while you're scanning for valid pairs by injecting a predicate. I also added some tests on the merge functions since I wasn't 100% sure on the types everywhere in 889e1a9

src/gepa/logging/utils.py

LakshyAAAgrawal · 2025-10-13T03:34:58Z

src/gepa/gepa_utils.py

+    new_program_at_pareto_front_valset = {
+        val_id: {prog_idx for prog_idx in front if prog_idx in dominators}
+        for val_id, front in program_at_pareto_front_valset.items()
+    }


The current candidate selection logic assumes that all valset scores are present. I expect the current code will break if this assumption is not satisfied.

It could perhaps be good to encode this as an assert statement somewhere in https://github.com/gepa-ai/gepa/blob/main/src/gepa/strategies/candidate_selector.py#L11-L32.

I'm not sure this is the best solution, but basically the CandidateSelector is responsible for saying if it's compatible with an EvaluationPolicy and the Pareto selector throws if the eval policy is "sparse". See 0d651a

LakshyAAAgrawal · 2025-10-13T04:04:17Z

src/gepa/gepa_utils.py

    sampling_list = [prog_idx for prog_idx, freq in program_frequency_in_validation_pareto_front.items() for _ in range(freq)]
-    assert len(sampling_list) > 0
+    if not sampling_list:
+        return idxmax(train_val_weighted_agg_scores_for_all_programs)


Will idxmax work on "train_val_weighted_agg_scores_for_all_programs" directly?

The only current call site for this passes in state.program_full_scores_val_set is a dense array of average validation scores for each program. So idxmax should work, but of course this PR changes the semantics so you're comparing averages that could be from different validation examples.

LakshyAAAgrawal · 2025-10-13T04:06:35Z

src/gepa/api.py

    # Reproducibility
    seed: int = 0,
    raise_on_exception: bool = True,
+    val_evaluation_policy: Callable[[GEPAState], list[int]] | None = None,


Tying back to the point about some candidate selectors not working with partial validation scores, maybe in the gepa.optimize call (entrypoint), we can add the assert here. That if val_evaluation_policy is something that doesn't work with the candidate selector chosen, then throw an error.

Handled in 0d651a and added a test case that this fails

LakshyAAAgrawal · 2025-10-13T04:23:39Z

Any thoughts about the scenario where we have an updating/expanding valset? I think the current valset still assumes in some places that the size of the valset is fixed. Since we are generalizing, we could perhaps try to aim for the most general case.

I like the concept of ValId introduced as a typevar. I think this enables us to expand validation sets?

src/gepa/core/engine.py

LakshyAAAgrawal · 2025-10-13T04:15:29Z

src/gepa/core/engine.py

-        valset_score = sum(valset_subscores) / len(valset_subscores)
+        valset_outputs, valset_subscores = self._evaluate_on_valset(new_program, state)
+        valset_score = (
+            sum(valset_subscores.values()) / len(valset_subscores) if len(valset_subscores) > 0 else float("-inf")


nit: Any thoughts about float(-inf) vs NaN?

Slight preference for -inf so comparisons with real numbers are still coherent, but a weak preference and happy to go with NaN if you prefer

LakshyAAAgrawal · 2025-10-13T04:44:55Z

src/gepa/core/state.py

-    program_candidates: list[dict[str, str]]
-    parent_program_for_candidate: list[list[int | None]]

-    program_full_scores_val_set: list[float]


I think we should still keep the aggregate scores for all programs? I am trying to think of arguments for both ways. What are your thoughts?

It's converted into a property which is derived from the raw data on the fly below. You have another comment to rename this which I'll do. I think keeping as a derived property is fine as long as the data sizes here are pretty small and we're not worried about the cost of on-the-fly aggregation right now.

LakshyAAAgrawal · 2025-10-13T04:49:21Z

src/gepa/core/state.py


-    program_full_scores_val_set: list[float]
+class GEPAState(Generic[RolloutOutput, ValId]):
+    _VALIDATION_SCHEMA_VERSION: ClassVar[int] = 2


Should we introduce something like this for GEPAResult class as well?

src/gepa/core/state.py

LakshyAAAgrawal · 2025-10-13T05:22:59Z

src/gepa/core/state.py

+        task_dir = os.path.join(output_dir, f"task_{val_id}")
+        os.makedirs(task_dir, exist_ok=True)
+        with open(os.path.join(task_dir, f"iter_{0}_prog_0.json"), "w") as f:
+            json.dump(score, f, indent=4, default=json_default)


This line was earlier writing both the score and output I think.

I don't think it was. I thought so as well, but the unit tests for this read these json files and check for just a score. I'm fine to have this dump both (which makes sense!) but wasn't sure what the intended behavior was @LakshyAAAgrawal

tests/test_module_selector.py

LakshyAAAgrawal · 2025-10-13T05:32:03Z

Dear @aria42,

Thank you so much for this wonderful PR. This is exactly the direction I wanted to take GEPA towards, and I am hoping that merging this will enable deep research into ideal, data and rollout efficient validation as well as candidate selection strategies. Thank you once again for the PR.

SInce the PR changes so many parts of the codebase, I have left a lot of comments. Please feel free to ignore some of them if you disagree, and I would love to hear your thoughts about it.

aria42 · 2025-10-13T05:35:25Z

SInce the PR changes so many parts of the codebase, I have left a lot of comments. Please feel free to ignore some of them if you disagree, and I would love to hear your thoughts about it.

Thanks @LakshyAAAgrawal! Comments are great will address in the morning.

README.md

src/gepa/api.py

src/gepa/core/adapter.py

src/gepa/examples/anymaths-bench/train_anymaths.py

src/gepa/examples/rag_adapter/rag_optimization.py

src/gepa/examples/terminal-bench/train_terminus.py

…lSignature

Added a note for users to reach out for listing use cases.

….load.

Added new references related to model tuning and optimization using GEPA.

…n permissions Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>

…ataLoader instances.

…ing "best" program on validation. Thread through GEPA{Engine, API}

resolve errors from merge conflicts

aria42 · 2025-10-14T08:19:10Z

Hey @LakshyAAAgrawal,

I've got a few items open (the unresolved comments), but wanted to update on some stuff to get your opinion. I can also squash my full diff into a single commit since merging the main commits has made this PR hard to read now 😄

Check out eval_policy.py this is the "validation eval policy" protocol that selects the validation examples to use for evaluation. With a tweak to batch_sampler.py it can re-use the same protocol used for training batch selection. It extends this with a method to get get_best_program on validation (to handle how to compare programs with different eval support coverage). I thought about also having get_best_program_for_instance to have a per-instance level example, but this gets a little messy in terms of interaction with the existing Pareto front logic.
I believe the interface for api.py and engine.py is now fully backward compatible since we accept lists or data loader for train/validation.

I still need to address open comments but big things left to tackle in morning
[x] Validating the candidate selector is compatible with the eval policy
[x] Pushing the check for sufficient overlap for merge proposals deeper into logic checks for selecting candidates
[x] Add more unit test coverage for changes

- sparse evaluation method for eval policy

…unit-test to make sure we fail when not compatible

aria42 · 2025-10-14T18:46:49Z

Hey @LakshyAAAgrawal,

I think I've covered the comments (but easily could've missed one). I'm not thrilled about the EvaluationPolicy interface; I think making it a BatchSampler (which is similar) makes the names a bit confusing. I think it all works but might be better to separate.

Let me know, totally happy to iterate more. And sorry the PR got so big, this just ended up touching lots of things.

aria42 added 2 commits October 7, 2025 18:15

dynamic validation in GEPAState

2766b36

remove uv.lock change

540b7d6

LakshyAAAgrawal mentioned this pull request Oct 9, 2025

Do not perform full validation evals for every prompt accepted on the trainset minibatch #103

Open

fix bugs and undo unneded to changes to reduce churn in PR

7b8815e

LakshyAAAgrawal reviewed Oct 13, 2025

View reviewed changes

tests/test_module_selector.py Outdated Show resolved Hide resolved

LakshyAAAgrawal mentioned this pull request Oct 13, 2025

Allow for objective and hybrid pareto frontier tracking #92

Open

Dataloader protocol for train/validation

3bdf2bc