
Conversation

@mairin mairin (Collaborator) commented Oct 10, 2025

Summary by CodeRabbit

  • New Features

    • Introduces MMLU-style multiple-choice evaluation with two metrics: Exact Match (flexible letter extraction) and Strict Match (single-letter only).
    • Provides per-turn scoring with clear reasons for pass/fail.
    • Enables deterministic evaluation flows for multiple-choice prompts.
  • Documentation

    • Adds an example YAML configuration demonstrating an MMLU-style QA conversation group with expected answers and turn-level metrics.
  • Chores

    • Registers the new multiple-choice metrics in system configuration for easy selection during evaluations.

mairin and others added 5 commits October 10, 2025 12:30
Implements two evaluation metrics for MMLU-style multiple choice questions:
- mmlu_exact_match: Flexible letter extraction with regex patterns
- mmlu_strict_match: Strict single-letter exact matching

The metrics handle various response formats:
- Direct letter answers: "B"
- Sentence responses: "The answer is B"
- Formatted responses: "B) Code can survive..."

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Signed-off-by: Máirín Duffy <[email protected]>
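
To make the flexible extraction concrete, here is a minimal sketch of the kind of regex-based letter handling this commit describes; the helper name and the exact patterns are illustrative assumptions, not the actual code in mmlu_style_eval.py:

import re


def extract_choice_letter(response: str) -> str | None:
    """Hypothetical sketch: pull one choice letter (A-D) out of a free-form reply."""
    text = response.strip()
    # Direct letter answers: "B" or "B."
    match = re.fullmatch(r"([A-D])\.?", text, re.IGNORECASE)
    if match:
        return match.group(1).upper()
    # Sentence responses: "The answer is B"
    match = re.search(r"answer\s+is\s+([A-D])\b", text, re.IGNORECASE)
    if match:
        return match.group(1).upper()
    # Formatted responses: "B) Code can survive..."
    match = re.match(r"\s*([A-D])\)", text)
    if match:
        return match.group(1).upper()
    return None

Under this sketch, extract_choice_letter("The answer is B") and extract_choice_letter("B) Code can survive...") both return "B".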
Integrates the MMLU-style multiple choice metrics into the CustomMetrics
handler and adds metric definitions to system configuration.

Changes:
- Import MMLUMetrics in CustomMetrics class
- Register multiple_choice_exact and multiple_choice_strict metrics
- Add wrapper methods to delegate to MMLUMetrics evaluator
- Update metric names from mmlu_* to multiple_choice_* for consistency
- Add metric metadata to system.yaml for validation

The metrics are now accessible via:
- custom:multiple_choice_exact (flexible letter extraction)
- custom:multiple_choice_strict (exact single-letter matching)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Signed-off-by: Máirín Duffy <[email protected]>
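
A rough illustration of the registration-and-wrapper pattern this commit describes (a sketch only; the handler names and the MMLUMetrics.evaluate signature are assumptions, and the real wiring in custom.py may differ):

from lightspeed_evaluation.core.metrics.custom.mmlu_style_eval import MMLUMetrics


class CustomMetrics:
    """Dispatches custom:<name> metrics to their evaluators."""

    def __init__(self) -> None:
        self.mmlu_metrics = MMLUMetrics()
        # Metric names resolved when "custom:<name>" is requested.
        self._handlers = {
            "multiple_choice_exact": self._multiple_choice_exact,
            "multiple_choice_strict": self._multiple_choice_strict,
        }

    def _multiple_choice_exact(self, turn_data, scope):
        # Wrapper that delegates to the MMLUMetrics evaluator (flexible extraction).
        return self.mmlu_metrics.evaluate("exact", turn_data, scope)

    def _multiple_choice_strict(self, turn_data, scope):
        # Wrapper that delegates to the MMLUMetrics evaluator (single-letter match only).
        return self.mmlu_metrics.evaluate("strict", turn_data, scope)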
Provides example configuration for MMLU-style multiple choice evaluations
demonstrating the custom:multiple_choice_exact metric usage.

Example includes:
- Multi-turn conversation with Red Hat training questions
- Questions covering Vim editor and file management
- Expected responses (A, B, C, D format)
- Response field set to null for API-based evaluation

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Signed-off-by: Máirín Duffy <[email protected]>
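
A hypothetical sketch of what such a conversation group could look like; the field names approximate the general shape of lightspeed-evaluation configs and may not match the real config/mmlu_example.yaml:

# Hypothetical sketch; see config/mmlu_example.yaml for the real file.
- conversation_group: mmlu_style_example
  description: MMLU-style multiple choice questions from Red Hat training
  conversation:
    - query: |
        Which Vim command saves the current file and exits the editor?
        A) :q!
        B) :wq
        C) :w
        D) :e!
      response: null            # null -> response is fetched via the API
      expected_response: "B"
      turn_metrics:
        - custom:multiple_choice_exact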
Post-rebase of mmlu-style-eval with custom metric updates.

Signed-off-by: Máirín Duffy <[email protected]>
Reset config to upstream defaults (openai provider, standard settings)
and add MMLU multiple choice metric definitions.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Signed-off-by: Máirín Duffy <[email protected]>
@coderabbitai coderabbitai bot commented Oct 10, 2025

Pre-merge checks

✅ Passed checks (3 passed)
  • Description Check ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: the title succinctly highlights the main change (adding multiple-choice metric support) and aligns with the PR's focus on introducing MMLU-style exact and strict metrics.
  • Docstring Coverage ✅ Passed: docstring coverage is 100.00%, above the required threshold of 80.00%.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
src/lightspeed_evaluation/core/metrics/custom/mmlu_style_eval.py (1)

59-66: Preserve expected/extracted details for all response lengths.

When the response is ≤100 chars, the conditional collapses the reason to only “Full response…”, dropping the expected/extracted context you build for longer replies. Refactor so that the explanatory prefix is always included and only the trailing “Full response…” fragment changes.

-        reason = (
-            f"Expected: {expected_clean} | "
-            f"Extracted: {response_letter} | "
-            f"Result: {'✓ CORRECT' if is_correct else '✗ INCORRECT'} | "
-            f"Full response: '{response[:100]}...'"
-            if len(response) > 100
-            else f"Full response: '{response}'"
-        )
+        reason_prefix = (
+            f"Expected: {expected_clean} | "
+            f"Extracted: {response_letter} | "
+            f"Result: {'✓ CORRECT' if is_correct else '✗ INCORRECT'}"
+        )
+        if len(response) > 100:
+            reason = f"{reason_prefix} | Full response: '{response[:100]}...'"
+        else:
+            reason = f"{reason_prefix} | Full response: '{response}'"
📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between 928901e and c2d53e3.

📒 Files selected for processing (4)
  • config/mmlu_example.yaml (1 hunks)
  • config/system.yaml (1 hunks)
  • src/lightspeed_evaluation/core/metrics/custom/custom.py (3 hunks)
  • src/lightspeed_evaluation/core/metrics/custom/mmlu_style_eval.py (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-09-08T11:11:54.516Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#47
File: config/system.yaml:78-82
Timestamp: 2025-09-08T11:11:54.516Z
Learning: For the custom:tool_eval metric, when threshold is not specified (None), the system defaults to checking if score > 0, providing less strict evaluation logic compared to exact matching. This allows for more flexible tool call evaluation where partial correctness is acceptable.

Applied to files:

  • config/system.yaml
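
As an aside, the fallback that learning records can be captured in a few lines; this is a hypothetical sketch of the pass/fail decision, not the actual runner code:

def passes(score: float, threshold: float | None) -> bool:
    """Hypothetical sketch of the threshold fallback described in the learning above."""
    if threshold is None:
        # Lenient default: any partial credit counts as a pass.
        return score > 0
    return score >= threshold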
🧬 Code graph analysis (2)
src/lightspeed_evaluation/core/metrics/custom/custom.py (2)
src/lightspeed_evaluation/core/metrics/custom/mmlu_style_eval.py (4)
  • MMLUMetrics (108-215)
  • evaluate (23-68)
  • evaluate (82-105)
  • evaluate (121-145)
src/lightspeed_evaluation/core/models/data.py (2)
  • TurnData (35-135)
  • EvaluationScope (230-241)
src/lightspeed_evaluation/core/metrics/custom/mmlu_style_eval.py (2)
src/lightspeed_evaluation/core/models/data.py (2)
  • EvaluationScope (230-241)
  • TurnData (35-135)
src/lightspeed_evaluation/core/metrics/custom/custom.py (1)
  • evaluate (42-57)

@asamal4 asamal4 left a comment

Thank you for adding this.
Please update the README as well to include these new metrics.
There are a few minor lint issues.

f"Result: {'✓ CORRECT' if is_correct else '✗ INCORRECT'} | "
f"Full response: '{response[:100]}...'"
if len(response) > 100
else f"Full response: '{response}'"

When the response length is 100 characters or less, the reason won't include the expected response.

Args:
threshold: Score threshold for passing (default: 1.0).
"""
self.threshold = threshold

We can remove threshold, as this is a binary metric.

return {"score": score, "reason": reason}


class MultipleChoiceStrictMatch: # pylint: disable=too-few-public-methods

Optional: perhaps we can simply convert this to a Python function.
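
A function version along those lines might look like this (a minimal sketch, assuming the metric keeps returning the score/reason dict shown in the snippet above):

def multiple_choice_strict_match(response: str, expected: str) -> dict:
    """Hypothetical function form of the MultipleChoiceStrictMatch class."""
    expected_clean = expected.strip().upper()
    response_clean = response.strip().upper()
    is_correct = response_clean == expected_clean
    score = 1.0 if is_correct else 0.0
    reason = f"Expected: {expected_clean} | Got: '{response.strip()}'"
    return {"score": score, "reason": reason}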
