MMLU Pro Evaluator #41860
Conversation
Thank you for your contribution @AbdelmohsenMS! We will review the pull request and get back to you soon.
API Change Check: APIView identified API-level changes in this PR and created the following API reviews.
@AbdelmohsenMS please read the following Contributor License Agreement (CLA). If you agree with the CLA, please reply with the following information.
self.regex_patterns = regex_patterns
self.is_missing_regex_patterns = regex_patterns is None
self.follow_instructions = []
self.scores = []
self.chain_of_thought_lengths = []
Do you want to expose any of them to external customers? If not, please rename them with a leading underscore, for example self.is_missing_regex_patterns ---> self._is_missing_regex_patterns
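A minimal sketch of the suggested rename (the class name and __init__ signature below are assumptions for illustration, not the PR's actual code):

# Hypothetical sketch: a leading underscore keeps internal state out of the public surface.
class MMLUProEvaluator:
    def __init__(self, regex_patterns=None):
        self._regex_patterns = regex_patterns
        self._is_missing_regex_patterns = regex_patterns is None
        self._follow_instructions = []
        self._scores = []
        self._chain_of_thought_lengths = []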
metrics both overall and grouped by subject and category. Use this evaluator when you want to
assess a model's general knowledge and reasoning abilities across diverse academic domains.

The MMLU score value is either 0 or 1, with higher scores indicating better performance.
Is this accurate? Should the score be exactly 0 or 1, or a value between 0 and 1?
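For illustration, a minimal sketch of the distinction the question is getting at (all values below are hypothetical): each per-sample score is binary, while the aggregate accuracy is a mean that falls between 0 and 1.

prediction, label = "B", "B"  # hypothetical extracted answer and gold label
scores = [1, 0, 1, 1]         # hypothetical per-sample binary scores
per_sample_score = 1 if prediction == label else 0       # always exactly 0 or 1
accuracy = sum(scores) / len(scores) if scores else 0.0  # falls between 0 and 1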
self.subject2scores = defaultdict(list)
self.category2scores = defaultdict(list)

def update(self, prediction: str, label: str, json_data: dict) -> Dict[str, Any]:
If you would not like to expose it to customers, please start the method name with an underscore (_).
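A minimal sketch of the suggested rename on the evaluator class; the method body and the "subject"/"category" keys in json_data are assumptions, not the PR's actual implementation:

from typing import Any, Dict

def _update(self, prediction: str, label: str, json_data: dict) -> Dict[str, Any]:
    # Hypothetical body: record a binary score overall and per subject/category.
    score = 1 if prediction == label else 0
    self.scores.append(score)
    self.subject2scores[json_data.get("subject", "unknown")].append(score)
    self.category2scores[json_data.get("category", "unknown")].append(score)
    return {"score": score}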
return result

def update(self, prediction: str, label: str, json_data: dict) -> Dict[str, Any]:
Please start the method name with an underscore if it is not to be exposed to customers.
"chain_of_thought_length": chain_of_thought_length, | ||
} | ||
|
||
def get_regex_patterns( |
Please start the method name with an underscore if it is not to be exposed to customers.
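A minimal sketch of the rename applied to this method; the default pattern below is an assumption based on the common "answer is (X)" extraction convention for MMLU-Pro, not the PR's actual pattern list:

import re
from typing import List

def _get_regex_patterns(self) -> List[re.Pattern]:
    # Hypothetical default used only when the caller supplied no patterns.
    if self._is_missing_regex_patterns:
        return [re.compile(r"answer is \(?([A-J])\)?", re.IGNORECASE)]
    return [re.compile(p) for p in self._regex_patterns]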
"mmlu_score": sample_metrics["score"], # Just needed for _real_call | ||
"accuracy": sample_metrics["score"], |
Why do we have two properties for the score?
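One hypothetical way to resolve the duplication, assuming "accuracy" is the intended public metric and "mmlu_score" exists only to satisfy the internal _real_call path:

sample_metrics = {"score": 1}  # hypothetical per-sample result
result = {"accuracy": sample_metrics["score"]}
# If _real_call genuinely requires "mmlu_score", alias it once and document
# the coupling instead of maintaining two independently computed values.
result["mmlu_score"] = result["accuracy"]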
Description
Please add an informative description that covers the changes made by the pull request and link all relevant issues.
If an SDK is being regenerated based on a new swagger spec, a link to the pull request containing these swagger spec changes has been included above.
All SDK Contribution checklist:
General Guidelines and Best Practices
Testing Guidelines