
Confusion matrix-based metrics #1660


Open
chrico-bu-uab wants to merge 38 commits into main

Conversation


@chrico-bu-uab commented on Oct 21, 2024

Since the existing frameworks don't allow for confusion matrix-based evaluators or optimization, I created additional evaluator and optimizer classes (Confusion and MCCBootstrapFewShotWithRandomSearch, respectively).

These additions address #556.
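
For readers unfamiliar with the motivation: a whole-set statistic like MCC cannot be averaged item by item the way accuracy can. A minimal sketch using standard scikit-learn calls (not the PR's implementation):

```python
# MCC is defined over the full confusion matrix, so it needs every
# (gold, pred) pair before it can return a single score; mean accuracy
# does not. Standard scikit-learn calls, not the code from this PR.
from sklearn.metrics import accuracy_score, matthews_corrcoef

gold  = ["spam", "ham", "ham", "spam", "ham"]
preds = ["spam", "ham", "spam", "spam", "ham"]

print(accuracy_score(gold, preds))     # 0.8, also the mean of per-item scores
print(matthews_corrcoef(gold, preds))  # ~0.67, only defined on the whole set
```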

chrico-bu-uab marked this pull request as draft on October 22, 2024 14:09
chrico-bu-uab marked this pull request as ready for review on October 22, 2024 15:26
@stevegbrooks

Very much support this PR! How do we get attention from a maintainer?

@chrico-bu-uab (Author) commented on Mar 5, 2025

Have you tried from dspy import maintainer_attention?

@okhat (Collaborator) commented on Mar 5, 2025

Hey @chrico-bu-uab , what is this PR trying to achieve? It seems like it's a special case of evaluation where you need to evaluate the whole set at once, not each item? If so, how would that interact with optimization?

We typically recommend handling special cases like this in user code. Just make each dspy.Example carry a set of examples and handle that in your metric and your program. Then the normal dspy.Evaluate works just fine, and so do all optimizers.
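
A rough sketch of the pattern described above, folding a whole labelled set into one dspy.Example so the stock evaluator sees a single item. The field name items, the inner dict structure, and the prediction's labels attribute are illustrative assumptions, not dspy conventions:

```python
import dspy
from sklearn.metrics import matthews_corrcoef

# One "outer" Example carries a whole labelled set; the field name "items"
# and the inner dict structure are made up for illustration.
items = [
    {"text": "free money now!!!", "label": "spam"},
    {"text": "meeting moved to 3pm", "label": "ham"},
    # ...
]
batch_example = dspy.Example(items=items).with_inputs("items")

def batch_mcc_metric(example, prediction, trace=None):
    # Assumes the program returns one label per inner item, aligned with
    # example.items (e.g. as prediction.labels) -- an assumption, not an API.
    gold = [item["label"] for item in example.items]
    return matthews_corrcoef(gold, prediction.labels)

# With the set folded into single Examples, the normal machinery applies.
evaluate = dspy.Evaluate(devset=[batch_example], metric=batch_mcc_metric)
```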

@okhat (Collaborator) commented on Mar 5, 2025

(looks like importing maintainer_attention works)

@chrico-bu-uab (Author)

I haven't tried making each example instance carry a set of examples yet. I still want separate predictions for each example, though, and to potentially parallelize their evaluations. It's just the metrics that need to be calculated all at once.

I'm not sure I understand the question about optimization. I mimicked the Evaluator and BootstrapFewShotWithRandomSearch classes pretty exactly (I think), so however those two interact with optimization should have transferred over to my classes.

I just thought confusion matrices are required often enough to warrant a dedicated framework, or even be the default metric. I rarely use straight accuracy for any ML. But I may be old-school.

I appreciate the feedback and understand if having the user handle this instead makes more sense for the community.
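
For contrast, a minimal sketch of the workflow described in this comment: predictions stay per-example (and can be parallelized), while the score is computed only once the whole set is in hand. The field names and the thread pool are illustrative, not the PR's code:

```python
# Predictions stay per-example and can run in parallel; only the metric
# needs the full set at once. Field names (text, label) are illustrative.
from concurrent.futures import ThreadPoolExecutor
from sklearn.metrics import matthews_corrcoef

def evaluate_whole_set(program, devset, num_threads=8):
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        preds = list(pool.map(lambda ex: program(text=ex.text).label, devset))
    gold = [ex.label for ex in devset]
    # Only now is the confusion-matrix statistic well defined.
    return matthews_corrcoef(gold, preds)
```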

@stevegbrooks

Hi @okhat - also thanks for that explanation.

Would you be able to post some example code to show how to do this: "Just make each dspy.Example carry a set of examples and handle that in your metric and your program. Then the normal dspy.Evaluate works just fine, and so do all optimizers."?

@piemasbi commented on Mar 6, 2025

Thanks for all the nice posts, and thanks to @chrico-bu-uab for the code with the new classes.

If I understand correctly, the current optimizers use the mean of the metric value (computed on a per-example basis; let's call it the "sample metric") over the train/validation set for the optimization. It would be nice if the optimizers could compute other summary statistics over the sample metrics (for example, sensitivity, precision, F1 score, etc.).

I see the issue with this: for many sample metrics it is not always meaningful to use a summary statistic other than the mean (think of the semantic F1 score, for example). Still, it would make sense to support optimizing for other summary statistics, especially when datasets are imbalanced.

Hope this makes sense and helps pinpoint the issue; otherwise please correct me.
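
To make the "summary statistics other than the mean" idea concrete, here is one hedged way to think about it: each sample metric returns raw counts, and precision/recall/F1 are formed only after pooling. This is an illustrative sketch of the aggregation idea, not an existing dspy or PR interface:

```python
# Sample metrics return raw counts; the summary statistic (micro-averaged
# precision/recall/F1) is computed only after pooling across the set.
from dataclasses import dataclass

@dataclass
class Counts:
    tp: int = 0
    fp: int = 0
    fn: int = 0

def sample_counts(gold, pred, positive="spam"):
    return Counts(
        tp=int(gold == positive and pred == positive),
        fp=int(gold != positive and pred == positive),
        fn=int(gold == positive and pred != positive),
    )

def pooled_f1(counts):
    tp = sum(c.tp for c in counts)
    fp = sum(c.fp for c in counts)
    fn = sum(c.fn for c in counts)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```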

@stevegbrooks

> Hi @okhat - also thanks for that explanation.
>
> Would you be able to post some example code to show how to do this: "Just make each dspy.Example carry a set of examples and handle that in your metric and your program. Then the normal dspy.Evaluate works just fine, and so do all optimizers."?

Hi @okhat, just pinging you on this. Is this something you have offhand? I don't see any obvious way to implement it.

Thanks!
