Dynamic Validation in GEPAState #100
Conversation
Could we also address #79, which would be very aligned with this PR? It should also be a straightforward change: just implement a DataLoader protocol and load the next training minibatch using it.
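For context, a minimal sketch of what such a protocol could look like (the name `DataLoader` comes from the comment above; the method name and generic parameter are assumptions, not the repo's API):

```python
from typing import Protocol, TypeVar

DataInst = TypeVar("DataInst", covariant=True)

class DataLoader(Protocol[DataInst]):
    """Hypothetical protocol for supplying training minibatches (#79)."""

    def next_minibatch(self, size: int) -> list[DataInst]:
        """Return the next `size` training examples."""
        ...
```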
src/gepa/proposer/merge.py
```python
    state.prog_candidate_val_subscores[id1],
    state.prog_candidate_val_subscores[id2],
)
if not subsample_ids:
```
Question: I do several random selections of id1 and id2 to find candidates that can be merged and that also satisfy all the requirements. Do you think we can include this as part of those checks?
I pushed this deeper into the merge code, into the scan for valid pairs, by injecting a predicate. I also added some tests on the merge functions, since I wasn't 100% sure about the types everywhere; see 889e1a9.
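Roughly, the injected predicate lets the pair scan enforce requirements such as a validation-overlap floor. A minimal sketch of the idea (function and variable names here are assumptions, not the merge module's real API):

```python
import random

def find_mergeable_pair(num_candidates, is_valid_pair, rng=None, max_tries=10):
    """Retry random candidate pairs until one passes the injected predicate."""
    rng = rng or random.Random(0)
    for _ in range(max_tries):
        id1, id2 = rng.sample(range(num_candidates), 2)
        if is_valid_pair(id1, id2):
            return id1, id2
    return None  # no valid pair found within the retry budget

# Example predicate: require enough shared evaluated validation examples.
def enough_overlap(subscores, floor):
    return lambda id1, id2: len(set(subscores[id1]) & set(subscores[id2])) >= floor
```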
```python
new_program_at_pareto_front_valset = {
    val_id: {prog_idx for prog_idx in front if prog_idx in dominators}
    for val_id, front in program_at_pareto_front_valset.items()
}
```
The current candidate selection logic assumes that all valset scores are present; I expect it will break if this assumption is not satisfied.
It could perhaps be good to encode this as an assert statement somewhere in https://github.com/gepa-ai/gepa/blob/main/src/gepa/strategies/candidate_selector.py#L11-L32.
I'm not sure this is the best solution, but basically the CandidateSelector is responsible for declaring whether it's compatible with an EvaluationPolicy, and the Pareto selector throws if the eval policy is "sparse". See 0d651a.
```diff
 sampling_list = [prog_idx for prog_idx, freq in program_frequency_in_validation_pareto_front.items() for _ in range(freq)]
-assert len(sampling_list) > 0
+if not sampling_list:
+    return idxmax(train_val_weighted_agg_scores_for_all_programs)
```
Will idxmax work on "train_val_weighted_agg_scores_for_all_programs" directly?
The only current call site for this passes in state.program_full_scores_val_set, which is a dense array of average validation scores for each program. So idxmax should work, but of course this PR changes the semantics, so you could be comparing averages taken over different validation examples.
src/gepa/api.py
```python
# Reproducibility
seed: int = 0,
raise_on_exception: bool = True,
val_evaluation_policy: Callable[[GEPAState], list[int]] | None = None,
```
Tying back to the point about some candidate selectors not working with partial validation scores: maybe we can add the assert here, in the gepa.optimize call (the entrypoint), so that if val_evaluation_policy is something that doesn't work with the chosen candidate selector, we throw an error.
Handled in 0d651a; I also added a test case checking that this fails.
Any thoughts about the scenario where we have an updating/expanding valset? I think the current code still assumes in some places that the size of the valset is fixed. Since we are generalizing, we could perhaps aim for the most general case. I like the concept of ValId introduced as a typevar; I think this enables us to expand validation sets?
```diff
 valset_outputs, valset_subscores = self._evaluate_on_valset(new_program, state)
-valset_score = sum(valset_subscores) / len(valset_subscores)
+valset_score = (
+    sum(valset_subscores.values()) / len(valset_subscores) if len(valset_subscores) > 0 else float("-inf")
+)
```
nit: Any thoughts about float("-inf") vs NaN?
Slight preference for -inf, so that comparisons with real numbers are still coherent, but it's a weak preference and I'm happy to go with NaN if you prefer.
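For context on that preference, here is how the two sentinels behave in plain-Python comparisons:

```python
nan, neg_inf = float("nan"), float("-inf")
print(nan > 0.5, nan < 0.5)  # False False: NaN fails every ordering check
print(neg_inf < 0.5)         # True: -inf orders below any real score
print(max([nan, 0.3]))       # nan: max() keeps NaN when it comes first
print(max([neg_inf, 0.3]))   # 0.3, as expected
```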
```python
program_candidates: list[dict[str, str]]
parent_program_for_candidate: list[list[int | None]]

program_full_scores_val_set: list[float]
```
I think we should still keep the aggregate scores for all programs? I am trying to think of arguments for both ways. What are your thoughts?
It's converted into a property that is derived from the raw data on the fly (below). You have another comment asking to rename this, which I'll do. I think keeping it as a derived property is fine as long as the data sizes here are pretty small and we're not worried about the cost of on-the-fly aggregation right now.
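For concreteness, the derived property could look roughly like this (a sketch based on the fields visible in this diff; the exact body in the PR may differ):

```python
class GEPAState:
    # prog_candidate_val_subscores: list[dict[ValId, float]], per the diffs above

    @property
    def program_full_scores_val_set(self) -> list[float]:
        # Recompute the dense averages on the fly from sparse per-val-id subscores.
        return [
            sum(sub.values()) / len(sub) if sub else float("-inf")
            for sub in self.prog_candidate_val_subscores
        ]
```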
```python
program_full_scores_val_set: list[float]

class GEPAState(Generic[RolloutOutput, ValId]):
    _VALIDATION_SCHEMA_VERSION: ClassVar[int] = 2
```
Should we introduce something like this for the GEPAResult class as well?
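If we did, it might mirror the pattern above. A hypothetical sketch (the class layout, field, and upgrade hook are all assumptions, not code from this PR):

```python
from dataclasses import dataclass
from typing import Any, ClassVar

def upgrade_result_payload(payload: dict[str, Any], version: int) -> dict[str, Any]:
    """Hypothetical hook migrating an older payload to the current schema."""
    return payload

@dataclass
class GEPAResult:
    # Hypothetical mirror of GEPAState._VALIDATION_SCHEMA_VERSION.
    _RESULT_SCHEMA_VERSION: ClassVar[int] = 1
    best_candidate: dict[str, str] | None = None  # placeholder field

    @classmethod
    def from_dict(cls, payload: dict[str, Any]) -> "GEPAResult":
        version = payload.pop("schema_version", 0)
        if version < cls._RESULT_SCHEMA_VERSION:
            payload = upgrade_result_payload(payload, version)
        return cls(**payload)
```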
```python
task_dir = os.path.join(output_dir, f"task_{val_id}")
os.makedirs(task_dir, exist_ok=True)
with open(os.path.join(task_dir, f"iter_{0}_prog_0.json"), "w") as f:
    json.dump(score, f, indent=4, default=json_default)
```
This line was earlier writing both the score and the output, I think.
I don't think it was. I thought so as well, but the unit tests for this read these JSON files and check for just a score. I'm fine having this dump both (which makes sense!) but wasn't sure what the intended behavior was, @LakshyAAAgrawal.
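If the intended behavior is to write both, the dump would look roughly like this (same names as the excerpt above; the key names are assumptions):

```python
with open(os.path.join(task_dir, f"iter_{0}_prog_0.json"), "w") as f:
    json.dump({"score": score, "output": output}, f, indent=4, default=json_default)
```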
Dear @aria42, thank you so much for this wonderful PR. This is exactly the direction I wanted to take GEPA in, and I hope that merging this will enable deep research into ideal, data- and rollout-efficient validation as well as candidate-selection strategies. Thank you once again for the PR. Since the PR changes so many parts of the codebase, I have left a lot of comments. Please feel free to ignore any of them you disagree with; I would love to hear your thoughts.
Thanks @LakshyAAAgrawal! The comments are great; I'll address them in the morning.
Hey @LakshyAAAgrawal, I've still got a few items open (the unresolved comments), but I wanted to post an update on some things to get your opinion. I can also squash my full diff into a single commit, since merging the main commits has made this PR hard to read now 😄 I still need to address the open comments, but the big things are left to tackle in the morning.
Hey @LakshyAAAgrawal, I think I've covered the comments (but could easily have missed one). I'm not thrilled about the […]. Let me know; totally happy to iterate more. And sorry the PR got so big, this just ended up touching lots of things.
Addresses #96 and enables adding new validation evals. It also removes the assumption that every program has been evaluated on all validation examples. Everything is still well-defined with this change, with the exception that merge proposals must be careful to compare only on validation examples both candidates have been evaluated against.
Changes

- Changes to `GEPAState`: The state information about validation (scores, outputs, etc.) is stored in terms of an opaque identifier (`ValId`). Making the state agnostic to the set of validation examples enables you to extend the validation examples of existing states (and run directories). It also removes the assumption that we have "dense" evaluation, where each program is evaluated on the full validation set.
- Changes to `GEPAEngine`: The engine takes an optional `val_evaluation_policy` to select the subset of validation examples to evaluate for a program; if none is provided, the full valset is evaluated. The policy enables flexible ways of choosing examples (e.g., picking validation examples not evaluated by other programs, or curriculum-style staging of harder examples over time); see the sketch after this list. Note that `GEPAEngine` assumes the valset ids are indices into a validation set (which can be updated), but this assumption doesn't leak into `GEPAState`.
- Changes to `api.py`: Add a `merge_val_overlap_floor` parameter, which is threaded through to `MergeProposer`. This parameter rejects a merge proposal if the merge candidates don't share at least this many validation examples that both have been evaluated against.
- Running `pre-commit` on the affected files led to a much larger set of changes from past commits that weren't formatted. Sorry!
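As promised above, a minimal sketch of a `val_evaluation_policy` matching the `Callable[[GEPAState], list[int]]` signature from the diff (the state attributes used here, including `valset_size`, are assumptions based on the discussion, not the actual API):

```python
def evaluate_unseen_first(state: "GEPAState") -> list[int]:
    """Pick only the validation ids the newest program has not been scored on."""
    newest = len(state.prog_candidate_val_subscores) - 1
    seen = set(state.prog_candidate_val_subscores[newest])
    # GEPAEngine treats valset ids as indices into the (updatable) valset.
    return [val_id for val_id in range(state.valset_size) if val_id not in seen]
```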
Tests

Existing tests pass, and the following were added:
- `gepa.bin` format and validating it can be read and upgraded to the "sparse" format

Open Questions
- In `GEPAState` there were `per_program_tracked_scores` and `program_full_scores_val_set` member variables, which were meant to capture the train-validation and valset scores for candidates. But prior to this change these two variables were identical (the latter was set to the former; see the existing code here and here). I've left derived properties for these for backward compatibility, but could remove them, since I think the distinction is no longer needed.