feat: multiple-choice metric support #79
base: main
Conversation
Implements two evaluation metrics for MMLU-style multiple choice questions:
- mmlu_exact_match: Flexible letter extraction with regex patterns
- mmlu_strict_match: Strict single-letter exact matching

The metrics handle various response formats:
- Direct letter answers: "B"
- Sentence responses: "The answer is B"
- Formatted responses: "B) Code can survive..."

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Signed-off-by: Máirín Duffy <[email protected]>
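A minimal sketch of the flexible letter extraction this commit describes; the helper name and regex patterns below are illustrative assumptions, not code taken from the PR:

```python
import re
from typing import Optional

# Hypothetical helper illustrating flexible letter extraction; the actual
# mmlu_style_eval.py may use different patterns and naming.
def extract_choice_letter(response: str) -> Optional[str]:
    """Pull a single choice letter (A-D) out of a free-form model response."""
    text = response.strip()

    # Direct letter answers: "B" or "b"
    if re.fullmatch(r"[A-Da-d]", text):
        return text.upper()

    # Formatted responses: "B) Code can survive..." or "B. Code can survive..."
    match = re.match(r"\s*([A-Da-d])[).:\s]", text)
    if match:
        return match.group(1).upper()

    # Sentence responses: "The answer is B"
    match = re.search(r"\banswer\s+is\s+([A-Da-d])\b", text, re.IGNORECASE)
    if match:
        return match.group(1).upper()

    return None
```

Under these assumed patterns, `extract_choice_letter("The answer is B")` and `extract_choice_letter("B) Code can survive...")` would both yield `"B"`, while the strict variant would accept only a bare `"B"`.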
Integrates the MMLU-style multiple choice metrics into the CustomMetrics handler and adds metric definitions to system configuration.

Changes:
- Import MMLUMetrics in CustomMetrics class
- Register multiple_choice_exact and multiple_choice_strict metrics
- Add wrapper methods to delegate to MMLUMetrics evaluator
- Update metric names from mmlu_* to multiple_choice_* for consistency
- Add metric metadata to system.yaml for validation

The metrics are now accessible via:
- custom:multiple_choice_exact (flexible letter extraction)
- custom:multiple_choice_strict (exact single-letter matching)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Signed-off-by: Máirín Duffy <[email protected]>
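A hedged sketch of the delegation pattern this commit describes; the registry shape, method names, and the MMLUMetrics interface shown here are assumptions for illustration, not the PR's actual custom.py:

```python
from typing import Any, Callable, Dict


class MMLUMetrics:
    """Stand-in for the evaluator in mmlu_style_eval.py (assumed interface)."""

    def evaluate(self, turn_data: Any, strict: bool = False) -> Dict[str, Any]:
        # Real implementation extracts and compares choice letters; stubbed here.
        return {"score": 1.0, "reason": "stub"}


class CustomMetrics:
    """Sketch of wrapper methods delegating to the MMLU evaluator."""

    def __init__(self) -> None:
        self.mmlu_metrics = MMLUMetrics()
        # Hypothetical registry mapping metric names to wrapper methods.
        self.metrics: Dict[str, Callable[..., Dict[str, Any]]] = {
            "multiple_choice_exact": self._multiple_choice_exact,
            "multiple_choice_strict": self._multiple_choice_strict,
        }

    def _multiple_choice_exact(self, turn_data: Any) -> Dict[str, Any]:
        # Flexible letter-extraction path
        return self.mmlu_metrics.evaluate(turn_data, strict=False)

    def _multiple_choice_strict(self, turn_data: Any) -> Dict[str, Any]:
        # Strict single-letter exact-match path
        return self.mmlu_metrics.evaluate(turn_data, strict=True)
```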
Provides example configuration for MMLU-style multiple choice evaluations demonstrating the custom:multiple_choice_exact metric usage.

Example includes:
- Multi-turn conversation with Red Hat training questions
- Questions covering Vim editor and file management
- Expected responses (A, B, C, D format)
- Response field set to null for API-based evaluation

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Signed-off-by: Máirín Duffy <[email protected]>
Post-rebase of mmlu-style-eval with custom metric updates.

Signed-off-by: Máirín Duffy <[email protected]>
Reset config to upstream defaults (openai provider, standard settings) and add MMLU multiple choice metric definitions.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Signed-off-by: Máirín Duffy <[email protected]>
Pre-merge checks and finishing touches
✅ Passed checks (3 passed)
Actionable comments posted: 0
🧹 Nitpick comments (1)
src/lightspeed_evaluation/core/metrics/custom/mmlu_style_eval.py (1)
59-66: Preserve expected/extracted details for all response lengths.
When the response is ≤100 chars, the conditional collapses the reason to only "Full response…", dropping the expected/extracted context you build for longer replies. Refactor so that the explanatory prefix is always included and only the trailing "Full response…" fragment changes.
```diff
-        reason = (
-            f"Expected: {expected_clean} | "
-            f"Extracted: {response_letter} | "
-            f"Result: {'✓ CORRECT' if is_correct else '✗ INCORRECT'} | "
-            f"Full response: '{response[:100]}...'"
-            if len(response) > 100
-            else f"Full response: '{response}'"
-        )
+        reason_prefix = (
+            f"Expected: {expected_clean} | "
+            f"Extracted: {response_letter} | "
+            f"Result: {'✓ CORRECT' if is_correct else '✗ INCORRECT'}"
+        )
+        if len(response) > 100:
+            reason = f"{reason_prefix} | Full response: '{response[:100]}...'"
+        else:
+            reason = f"{reason_prefix} | Full response: '{response}'"
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- config/mmlu_example.yaml (1 hunks)
- config/system.yaml (1 hunks)
- src/lightspeed_evaluation/core/metrics/custom/custom.py (3 hunks)
- src/lightspeed_evaluation/core/metrics/custom/mmlu_style_eval.py (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-09-08T11:11:54.516Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#47
File: config/system.yaml:78-82
Timestamp: 2025-09-08T11:11:54.516Z
Learning: For the custom:tool_eval metric, when threshold is not specified (None), the system defaults to checking if score > 0, providing less strict evaluation logic compared to exact matching. This allows for more flexible tool call evaluation where partial correctness is acceptable.
Applied to files:
config/system.yaml
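A small sketch of the pass/fail behavior that learning describes, purely illustrative and not taken from the repository; the function name and signature are assumptions:

```python
from typing import Optional

def passes(score: float, threshold: Optional[float]) -> bool:
    """Illustrative pass/fail check mirroring the learning above."""
    if threshold is None:
        # Lenient default: any positive score counts as a pass.
        return score > 0
    # Otherwise require the score to meet the configured threshold.
    return score >= threshold
```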
🧬 Code graph analysis (2)
src/lightspeed_evaluation/core/metrics/custom/custom.py (2)
- src/lightspeed_evaluation/core/metrics/custom/mmlu_style_eval.py (4)
  - MMLUMetrics (108-215)
  - evaluate (23-68)
  - evaluate (82-105)
  - evaluate (121-145)
- src/lightspeed_evaluation/core/models/data.py (2)
  - TurnData (35-135)
  - EvaluationScope (230-241)
src/lightspeed_evaluation/core/metrics/custom/mmlu_style_eval.py (2)
- src/lightspeed_evaluation/core/models/data.py (2)
  - EvaluationScope (230-241)
  - TurnData (35-135)
- src/lightspeed_evaluation/core/metrics/custom/custom.py (1)
  - evaluate (42-57)
asamal4
left a comment
Thank you for adding this.
Please update the readme also to include these new metrics.
There are a few minor lint issues.
| f"Result: {'✓ CORRECT' if is_correct else '✗ INCORRECT'} | " | ||
| f"Full response: '{response[:100]}...'" | ||
| if len(response) > 100 | ||
| else f"Full response: '{response}'" |
When the response length is greater than 100, the reason won't include the expected response.
```python
        Args:
            threshold: Score threshold for passing (default: 1.0).
        """
        self.threshold = threshold
```
We can remove threshold, as this is a binary metric.
```python
        return {"score": score, "reason": reason}


class MultipleChoiceStrictMatch:  # pylint: disable=too-few-public-methods
```
Optional: Perhaps we can simply convert this to a Python function.
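A rough sketch of what that conversion might look like; the return shape mirrors the `{"score": ..., "reason": ...}` dict visible in the diff above, but the function name and string-based signature are assumptions (the real code likely takes TurnData/EvaluationScope):

```python
from typing import Any, Dict

def multiple_choice_strict_match(response: str, expected: str) -> Dict[str, Any]:
    """Strict single-letter exact match as a plain function instead of a class."""
    is_correct = response.strip().upper() == expected.strip().upper()
    score = 1.0 if is_correct else 0.0
    reason = (
        f"Expected: {expected.strip().upper()} | "
        f"Response: '{response.strip()}' | "
        f"Result: {'✓ CORRECT' if is_correct else '✗ INCORRECT'}"
    )
    return {"score": score, "reason": reason}
```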
Summary by CodeRabbit
New Features
Documentation
Chores