feat: multiple-choice metric support #79
base: main
Conversation
Implements two evaluation metrics for MMLU-style multiple choice questions:
- mmlu_exact_match: Flexible letter extraction with regex patterns
- mmlu_strict_match: Strict single-letter exact matching

The metrics handle various response formats:
- Direct letter answers: "B"
- Sentence responses: "The answer is B"
- Formatted responses: "B) Code can survive..."

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Signed-off-by: Máirín Duffy <[email protected]>
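A minimal sketch of the flexible letter extraction this commit describes; the helper name and regex patterns below are illustrative assumptions, not code taken from the PR:

```python
import re
from typing import Optional

# Hypothetical helper illustrating flexible letter extraction; the actual
# mmlu_style_eval.py may use different patterns and naming.
def extract_choice_letter(response: str) -> Optional[str]:
    """Pull a single choice letter (A-D) out of a free-form model response."""
    text = response.strip()

    # Direct letter answers: "B" or "b"
    if re.fullmatch(r"[A-Da-d]", text):
        return text.upper()

    # Formatted responses: "B) Code can survive..." or "B. Code can survive..."
    match = re.match(r"\s*([A-Da-d])[).:\s]", text)
    if match:
        return match.group(1).upper()

    # Sentence responses: "The answer is B"
    match = re.search(r"\banswer\s+is\s+([A-Da-d])\b", text, re.IGNORECASE)
    if match:
        return match.group(1).upper()

    return None
```

Under these assumed patterns, `extract_choice_letter("The answer is B")` and `extract_choice_letter("B) Code can survive...")` would both yield `"B"`, while the strict variant would accept only a bare `"B"`.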
Integrates the MMLU-style multiple choice metrics into the CustomMetrics handler and adds metric definitions to system configuration.

Changes:
- Import MMLUMetrics in CustomMetrics class
- Register multiple_choice_exact and multiple_choice_strict metrics
- Add wrapper methods to delegate to MMLUMetrics evaluator
- Update metric names from mmlu_* to multiple_choice_* for consistency
- Add metric metadata to system.yaml for validation

The metrics are now accessible via:
- custom:multiple_choice_exact (flexible letter extraction)
- custom:multiple_choice_strict (exact single-letter matching)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Signed-off-by: Máirín Duffy <[email protected]>
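A hedged sketch of the delegation pattern this commit describes; the registry shape, method names, and the MMLUMetrics interface shown here are assumptions for illustration, not the PR's actual custom.py:

```python
from typing import Any, Callable, Dict


class MMLUMetrics:
    """Stand-in for the evaluator in mmlu_style_eval.py (assumed interface)."""

    def evaluate(self, turn_data: Any, strict: bool = False) -> Dict[str, Any]:
        # Real implementation extracts and compares choice letters; stubbed here.
        return {"score": 1.0, "reason": "stub"}


class CustomMetrics:
    """Sketch of wrapper methods delegating to the MMLU evaluator."""

    def __init__(self) -> None:
        self.mmlu_metrics = MMLUMetrics()
        # Hypothetical registry mapping metric names to wrapper methods.
        self.metrics: Dict[str, Callable[..., Dict[str, Any]]] = {
            "multiple_choice_exact": self._multiple_choice_exact,
            "multiple_choice_strict": self._multiple_choice_strict,
        }

    def _multiple_choice_exact(self, turn_data: Any) -> Dict[str, Any]:
        # Flexible letter-extraction path
        return self.mmlu_metrics.evaluate(turn_data, strict=False)

    def _multiple_choice_strict(self, turn_data: Any) -> Dict[str, Any]:
        # Strict single-letter exact-match path
        return self.mmlu_metrics.evaluate(turn_data, strict=True)
```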
Provides example configuration for MMLU-style multiple choice evaluations demonstrating the custom:multiple_choice_exact metric usage.

Example includes:
- Multi-turn conversation with Red Hat training questions
- Questions covering Vim editor and file management
- Expected responses (A, B, C, D format)
- Response field set to null for API-based evaluation

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Signed-off-by: Máirín Duffy <[email protected]>
Post-rebase of mmlu-style-eval with custom metric updates.

Signed-off-by: Máirín Duffy <[email protected]>
Reset config to upstream defaults (openai provider, standard settings) and add MMLU multiple choice metric definitions.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Signed-off-by: Máirín Duffy <[email protected]>
Pre-merge checks and finishing touches
✅ Passed checks (3 passed)
Actionable comments posted: 0
🧹 Nitpick comments (1)
src/lightspeed_evaluation/core/metrics/custom/mmlu_style_eval.py (1)
59-66: Preserve expected/extracted details for all response lengths.
When the response is ≤100 chars, the conditional collapses the reason to only "Full response…", dropping the expected/extracted context you build for longer replies. Refactor so that the explanatory prefix is always included and only the trailing "Full response…" fragment changes.
```diff
-        reason = (
-            f"Expected: {expected_clean} | "
-            f"Extracted: {response_letter} | "
-            f"Result: {'✓ CORRECT' if is_correct else '✗ INCORRECT'} | "
-            f"Full response: '{response[:100]}...'"
-            if len(response) > 100
-            else f"Full response: '{response}'"
-        )
+        reason_prefix = (
+            f"Expected: {expected_clean} | "
+            f"Extracted: {response_letter} | "
+            f"Result: {'✓ CORRECT' if is_correct else '✗ INCORRECT'}"
+        )
+        if len(response) > 100:
+            reason = f"{reason_prefix} | Full response: '{response[:100]}...'"
+        else:
+            reason = f"{reason_prefix} | Full response: '{response}'"
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- config/mmlu_example.yaml (1 hunks)
- config/system.yaml (1 hunks)
- src/lightspeed_evaluation/core/metrics/custom/custom.py (3 hunks)
- src/lightspeed_evaluation/core/metrics/custom/mmlu_style_eval.py (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-09-08T11:11:54.516Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#47
File: config/system.yaml:78-82
Timestamp: 2025-09-08T11:11:54.516Z
Learning: For the custom:tool_eval metric, when threshold is not specified (None), the system defaults to checking if score > 0, providing less strict evaluation logic compared to exact matching. This allows for more flexible tool call evaluation where partial correctness is acceptable.
Applied to files:
config/system.yaml
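A small sketch of the pass/fail behavior that learning describes, purely illustrative and not taken from the repository; the function name and signature are assumptions:

```python
from typing import Optional

def passes(score: float, threshold: Optional[float]) -> bool:
    """Illustrative pass/fail check mirroring the learning above."""
    if threshold is None:
        # Lenient default: any positive score counts as a pass.
        return score > 0
    # Otherwise require the score to meet the configured threshold.
    return score >= threshold
```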
🧬 Code graph analysis (2)
src/lightspeed_evaluation/core/metrics/custom/custom.py (2)
- src/lightspeed_evaluation/core/metrics/custom/mmlu_style_eval.py (4)
  - MMLUMetrics (108-215)
  - evaluate (23-68)
  - evaluate (82-105)
  - evaluate (121-145)
- src/lightspeed_evaluation/core/models/data.py (2)
  - TurnData (35-135)
  - EvaluationScope (230-241)
src/lightspeed_evaluation/core/metrics/custom/mmlu_style_eval.py (2)
- src/lightspeed_evaluation/core/models/data.py (2)
  - EvaluationScope (230-241)
  - TurnData (35-135)
- src/lightspeed_evaluation/core/metrics/custom/custom.py (1)
  - evaluate (42-57)
asamal4
left a comment
Thank you for adding this.
Please update the readme also to include these new metrics.
There are a few minor lint issues.
| f"Result: {'✓ CORRECT' if is_correct else '✗ INCORRECT'} | " | ||
| f"Full response: '{response[:100]}...'" | ||
| if len(response) > 100 | ||
| else f"Full response: '{response}'" |
When the response length is greater than 100, the reason won't include the expected response.
```python
        Args:
            threshold: Score threshold for passing (default: 1.0).
        """
        self.threshold = threshold
```
We can remove threshold, as this is a binary metric.
```python
        return {"score": score, "reason": reason}


class MultipleChoiceStrictMatch:  # pylint: disable=too-few-public-methods
```
Optional: Perhaps we can simply convert this to a Python function.
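A rough sketch of what that conversion might look like; the return shape mirrors the `{"score": ..., "reason": ...}` dict visible in the diff above, but the function name and string-based signature are assumptions (the real code likely takes TurnData/EvaluationScope):

```python
from typing import Any, Dict

def multiple_choice_strict_match(response: str, expected: str) -> Dict[str, Any]:
    """Strict single-letter exact match as a plain function instead of a class."""
    is_correct = response.strip().upper() == expected.strip().upper()
    score = 1.0 if is_correct else 0.0
    reason = (
        f"Expected: {expected.strip().upper()} | "
        f"Response: '{response.strip()}' | "
        f"Result: {'✓ CORRECT' if is_correct else '✗ INCORRECT'}"
    )
    return {"score": score, "reason": reason}
```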
Summary by CodeRabbit
New Features
Documentation
Chores