
Conversation

@aria42 aria42 commented Oct 8, 2025

Addresses #96 and enables adding new validation evals. It also removes the assumption that every program has been evaluated on all validation examples. Everything remains well-defined with this change, with the exception that merge proposals must be careful to compare only on validation examples that both candidates have been evaluated against.

Changes

  • Changes to GEPAState: The state information about validation (scores, outputs, etc.) is stored in terms of opaque identifiers (ValId). Making the state agnostic about the set of validation examples lets you extend the validation examples of existing states (and run directories). It also removes the assumption of "dense" evaluation, where every program is evaluated on the full validation set.

  • Changes to GEPAEngine: The engine takes an optional val_evaluation_policy that selects the subset of validation examples to evaluate for a program. If none is provided, the full valset is evaluated. The policy enables flexible ways of choosing examples (e.g., picking validation examples not evaluated by other programs, or curriculum-style staging of harder examples over time); a sketch follows this list. Note that GEPAEngine assumes the valset ids are indices into a validation set (which can be updated), but this assumption doesn't leak into GEPAState.

  • Changes to api.py: Add a merge_val_overlap_floor parameter, which is threaded through to MergeProposer. It rejects a merge proposal if the merge candidates don't share at least this many validation examples that both have been evaluated against.

  • Running pre-commit on the affected files produced a much larger set of changes from past commits that hadn't been formatted. Sorry!
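
As a concrete (illustrative, not from the PR itself) example of the val_evaluation_policy hook: a policy that prefers validation ids evaluated by the fewest programs so far might look like this, matching the Callable[[GEPAState], list[int]] signature.

```python
# Hypothetical policy; the name, budget, and ranking rule are all illustrative.
def least_covered_policy(state) -> list[int]:
    """Pick up to `budget` validation ids evaluated by the fewest programs so far."""
    budget = 8  # illustrative per-program evaluation budget
    coverage: dict[int, int] = {}
    for subscores in state.prog_candidate_val_subscores:  # one {val_id: score} dict per program
        for val_id in subscores:
            coverage[val_id] = coverage.get(val_id, 0) + 1
    return sorted(coverage, key=coverage.get)[:budget]
```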

Tests

Existing tests pass, and I added the following:

  • Added a test around extending the validation set and using a custom validation selection policy
  • Added a test around loading the legacy gepa.bin format and validating that it can be read and upgraded to the "sparse" format

Open Questions

  • In GEPAState there were per_program_tracked_scores and program_full_scores_val_set member variables, which were meant to capture the train-validation and valset scores for candidates. But prior to this change the two were identical (the latter was set to the former; see the existing code here and here). I've left derived properties for these for backward compatibility, but could remove them since I think the distinction is no longer needed.


LakshyAAAgrawal commented Oct 13, 2025

Could we also address #79, which would be very aligned with this PR? It should also be a straightforward change: we just need to implement a DataLoader protocol and load the next training minibatch through it.
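
For concreteness, a minimal sketch of what such a protocol could look like (the DataLoader name is from the comment above; the next_minibatch method and the generics are just illustrative):

```python
from typing import Protocol, TypeVar

DataInst = TypeVar("DataInst")

class DataLoader(Protocol[DataInst]):
    """Illustrative protocol for streaming training minibatches to the engine."""

    def next_minibatch(self, size: int) -> list[DataInst]:
        """Return the next `size` training examples."""
        ...
```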

state.prog_candidate_val_subscores[id1],
state.prog_candidate_val_subscores[id2],
)
if not subsample_ids:
@LakshyAAAgrawal (Contributor)

Question: I do several random selections of id1 and id2 to see if we can find candidates that we are able to merge and that also satisfy all the requirements. Do you think we can include this as part of those checks?

@aria42 (Author)

I pushed this deeper into the merge code: while scanning for valid pairs, a predicate is injected to filter them. I also added some tests on the merge functions, since I wasn't 100% sure about the types everywhere; see 889e1a9.
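
Roughly, the injected predicate just checks the shared evaluated val ids against the floor; a sketch (names are approximate, 889e1a9 has the real code):

```python
def have_sufficient_val_overlap(state, id1: int, id2: int, min_overlap: int) -> bool:
    """Reject merge pairs whose candidates share too few evaluated validation ids."""
    ids1 = set(state.prog_candidate_val_subscores[id1])
    ids2 = set(state.prog_candidate_val_subscores[id2])
    return len(ids1 & ids2) >= min_overlap
```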

new_program_at_pareto_front_valset = {
val_id: {prog_idx for prog_idx in front if prog_idx in dominators}
for val_id, front in program_at_pareto_front_valset.items()
}
@LakshyAAAgrawal (Contributor)

The current candidate selection logic assumes that all valset scores are present. I expect the current code will break if this assumption is not satisfied.

It could perhaps be good to encode this as an assert statement somewhere in https://github.com/gepa-ai/gepa/blob/main/src/gepa/strategies/candidate_selector.py#L11-L32.

@aria42 (Author)

I'm not sure this is the best solution, but basically the CandidateSelector is now responsible for declaring whether it's compatible with an EvaluationPolicy, and the Pareto selector throws if the eval policy is "sparse". See 0d651a.
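
In sketch form (the method names validate_evaluation_policy / is_dense are placeholders; 0d651a has the actual shape):

```python
class ParetoCandidateSelector:
    def validate_evaluation_policy(self, eval_policy) -> None:
        # The valset Pareto front is only well-defined when every program has
        # scores for the same validation ids, so reject "sparse" policies up front.
        if not eval_policy.is_dense():  # hypothetical capability check
            raise ValueError("ParetoCandidateSelector requires a dense evaluation policy")
```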

sampling_list = [prog_idx for prog_idx, freq in program_frequency_in_validation_pareto_front.items() for _ in range(freq)]
assert len(sampling_list) > 0
if not sampling_list:
return idxmax(train_val_weighted_agg_scores_for_all_programs)
@LakshyAAAgrawal (Contributor)

Will idxmax work on "train_val_weighted_agg_scores_for_all_programs" directly?

@aria42 (Author)

The only current call site for this passes in state.program_full_scores_val_set, which is a dense array of average validation scores for each program. So idxmax should work, but of course this PR changes the semantics: you're comparing averages that could be computed from different validation examples.
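
(For context, idxmax here is just an argmax over that dense list, roughly:)

```python
def idxmax(scores: list[float]) -> int:
    """Index of the maximum score; ties resolve to the earliest index."""
    return max(range(len(scores)), key=lambda i: scores[i])
```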

src/gepa/api.py Outdated
# Reproducibility
seed: int = 0,
raise_on_exception: bool = True,
val_evaluation_policy: Callable[[GEPAState], list[int]] | None = None,
@LakshyAAAgrawal (Contributor)

Tying back to the point about some candidate selectors not working with partial validation scores, maybe we can add the assert here, in the gepa.optimize call (the entrypoint): if val_evaluation_policy is something that doesn't work with the chosen candidate selector, throw an error.

@aria42 (Author) Oct 14, 2025

Handled in 0d651a, and I added a test case verifying that this fails.


LakshyAAAgrawal commented Oct 13, 2025

Any thoughts about the scenario where we have an updating/expanding valset? I think the code still assumes in some places that the size of the valset is fixed. Since we are generalizing, we could perhaps aim for the most general case.

I like the concept of ValId introduced as a typevar. I think this enables us to expand validation sets?
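
For example, something like this layout, where newly added validation examples just show up as new keys (a sketch; the field name follows this PR, the values are made up):

```python
from typing import TypeVar

ValId = TypeVar("ValId")  # opaque identifier for a validation example

# Per-program sparse subscores keyed by ValId: appending validation examples
# simply introduces new ids, with no per-program resizing needed.
prog_candidate_val_subscores: list[dict[int, float]] = [
    {0: 0.8, 1: 0.5},          # program 0: evaluated on the original val ids
    {1: 0.6, 2: 0.9, 3: 0.4},  # program 1: also covers the newly added ids 2 and 3
]
```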

valset_score = sum(valset_subscores) / len(valset_subscores)
valset_outputs, valset_subscores = self._evaluate_on_valset(new_program, state)
valset_score = (
sum(valset_subscores.values()) / len(valset_subscores) if len(valset_subscores) > 0 else float("-inf")
@LakshyAAAgrawal (Contributor)

nit: Any thoughts about float(-inf) vs NaN?

@aria42 (Author) Oct 14, 2025

Slight preference for -inf so comparisons with real numbers are still coherent, but it's a weak preference and I'm happy to go with NaN if you prefer.
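
For what it's worth, the comparison behaviour is the part I care about:

```python
>>> float("-inf") < 0.5            # -inf orders below every real score
True
>>> max([float("-inf"), 0.3, 0.7])
0.7
>>> float("nan") < 0.5             # NaN is unordered: every comparison is False
False
>>> max([float("nan"), 0.3, 0.7])  # and depending on its position it can win a max()
nan
```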

program_candidates: list[dict[str, str]]
parent_program_for_candidate: list[list[int | None]]

program_full_scores_val_set: list[float]
@LakshyAAAgrawal (Contributor)

I think we should still keep the aggregate scores for all programs? I am trying to think of arguments for both ways. What are your thoughts?

@aria42 (Author)

It's converted into a property that is derived from the raw data on the fly (below). You have another comment asking to rename this, which I'll do. I think keeping it as a derived property is fine as long as the data sizes here stay small and we're not worried about the cost of on-the-fly aggregation right now.
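
Roughly, the derived property looks like this (a sketch, reusing the -inf convention from above for programs with no evaluated examples):

```python
@property
def program_full_scores_val_set(self) -> list[float]:
    """Average validation score per program, derived from the sparse subscores."""
    return [
        sum(subscores.values()) / len(subscores) if subscores else float("-inf")
        for subscores in self.prog_candidate_val_subscores
    ]
```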


program_full_scores_val_set: list[float]
class GEPAState(Generic[RolloutOutput, ValId]):
_VALIDATION_SCHEMA_VERSION: ClassVar[int] = 2
@LakshyAAAgrawal (Contributor)

Should we introduce something like this for GEPAResult class as well?

task_dir = os.path.join(output_dir, f"task_{val_id}")
os.makedirs(task_dir, exist_ok=True)
with open(os.path.join(task_dir, f"iter_{0}_prog_0.json"), "w") as f:
json.dump(score, f, indent=4, default=json_default)
@LakshyAAAgrawal (Contributor)

I think this line was earlier writing both the score and the output.

@aria42 (Author)

I don't think it was. I thought so as well, but the unit tests for this read these json files and check for just a score. I'm fine having this dump both (which makes sense!) but wasn't sure what the intended behavior was, @LakshyAAAgrawal.

@LakshyAAAgrawal (Contributor)

Dear @aria42,

Thank you so much for this wonderful PR. This is exactly the direction I wanted to take GEPA in, and I am hoping that merging this will enable deep research into ideal, data- and rollout-efficient validation as well as candidate-selection strategies. Thank you once again for the PR.

Since the PR changes so many parts of the codebase, I have left a lot of comments. Please feel free to ignore some of them if you disagree, and I would love to hear your thoughts about it.


aria42 commented Oct 13, 2025

Since the PR changes so many parts of the codebase, I have left a lot of comments. Please feel free to ignore some of them if you disagree, and I would love to hear your thoughts about it.

Thanks @LakshyAAAgrawal! The comments are great; I'll address them in the morning.

…ing "best" program on validation. Thread through GEPA{Engine, API}
resolve errors from merge conflicts

aria42 commented Oct 14, 2025

Hey @LakshyAAAgrawal,

I've got a few items open (the unresolved comments), but I wanted to update you on a few things and get your opinion. I can also squash my full diff into a single commit, since merging in main's commits has made this PR hard to read now 😄

  • Check out eval_policy.py: this is the "validation eval policy" protocol that selects which validation examples to use for evaluation. With a tweak to batch_sampler.py it can reuse the same protocol used for training batch selection. It extends that with a get_best_program method on validation (to handle comparing programs with different eval coverage); a rough sketch follows this list. I thought about also having a get_best_program_for_instance for a per-instance version, but that gets a little messy in terms of interaction with the existing Pareto front logic.

  • I believe the interface for api.py and engine.py is now fully backward compatible, since we accept either lists or a data loader for train/validation.
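
Rough shape of the protocol, for reference (get_best_program is the method mentioned above; the other method name is illustrative):

```python
from typing import Protocol

class EvaluationPolicy(Protocol):
    """Chooses which validation examples to run for a new candidate program, and
    knows how to pick the best program when coverage differs across programs."""

    def select_val_ids(self, state) -> list[int]:
        """Validation ids to evaluate the next candidate on (name is illustrative)."""
        ...

    def get_best_program(self, state) -> int:
        """Index of the best program on validation, comparing only what was evaluated."""
        ...
```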

I still need to address the open comments, but the big things left to tackle in the morning:
[x] Validating the candidate selector is compatible with the eval policy
[x] Pushing the check for sufficient overlap for merge proposals deeper into logic checks for selecting candidates
[x] Add more unit test coverage for changes

@aria42 aria42 force-pushed the dynamic-validation-squash branch from 924ca00 to e3d7a13 on October 14, 2025 08:44
@aria42 aria42 force-pushed the dynamic-validation-squash branch from e3d7a13 to 23c01a0 on October 14, 2025 08:47
- sparse evaluation method for eval policy
…unit-test to make sure we fail when not compatible

aria42 commented Oct 14, 2025

Hey @LakshyAAAgrawal,

I think I've covered the comments (but could easily have missed one). I'm not thrilled about the EvaluationPolicy interface; making it a BatchSampler (which is similar) makes the names a bit confusing. I think it all works, but it might be better to keep them separate.

Let me know, totally happy to iterate more. And sorry the PR got so big, this just ended up touching lots of things.
