
Conversation

asamal4 (Collaborator) commented Oct 31, 2025

Background:
Optional tool calls: Imagine a scenario where we have a tool to fetch some information, but the same information is already present in the conversation history. The model may not make a fresh tool call, and that can be valid behavior for some use cases.
Argument with default value: The argument value implied by the user query may match the default value in the tool definition, so it is up to the model to pass the argument explicitly or omit it entirely.

We therefore need a way to handle this non-deterministic behavior.

Approach: Let the user provide multiple alternative sets of expected tool calls. The evaluation passes when any one of the alternatives matches the actual tool calls and fails when none of them match.
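
For illustration, a hypothetical turn configuration in the new format could look like the Python literal below. The structure follows the list[list[list[dict]]] shape described in this PR; the tool name and argument values are made up, and the "arguments" key is an assumption alongside the framework's "tool_name" key.

# Hypothetical example of the multi-alternative expected_tool_calls format.
# Outer list = alternatives; each alternative = a list of tool-call sequences;
# each sequence = a list of tool-call dicts.
expected_tool_calls = [
    # Alternative 1: the model makes a fresh tool call.
    [[{"tool_name": "get_cluster_info", "arguments": {"namespace": "default"}}]],
    # Alternative 2: the model reuses conversation history and skips the call.
    [[]],
]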

Summary by CodeRabbit

  • New Features

    • Tool evaluation now accepts multiple alternative sets of expected tool-call sequences and reports which alternative matched or failed.
  • Models

    • Turn data schema and validation updated to normalize and validate multi-set tool-call formats with backward-compatibility handling and explicit empty-set rules.
  • Documentation

    • README expanded with detailed Tool Evaluation guidance, examples, and script-evaluation notes.
  • Tests

    • Comprehensive unit tests added for multi-set matching, validation, messaging, and error handling.

coderabbitai bot (Contributor) commented Oct 31, 2025

Walkthrough

Adds multi-alternative support for expected tool-call sequences by changing expected_tool_calls to nested alternatives, updates validation and evaluation to handle multiple alternative sets, revises README examples, and adds unit tests validating new formats and backward compatibility.

Changes

Cohort / File(s) Summary
Documentation
README.md
Describe multi-alternative expected_tool_calls format, update tool-call examples, add Tool Evaluation section and script-based evaluation notes.
Evaluation Logic
src/lightspeed_evaluation/core/metrics/custom/tool_eval.py
evaluate_tool_calls signature updated to accept multiple alternative sets (list[list[list[dict]]]); iterates alternatives, returns on first match, and uses helper functions for success/failure messaging and error handling.
Data Model & Validation
src/lightspeed_evaluation/core/models/data.py
TurnData.expected_tool_calls type updated to Optional[list[list[list[dict[str, Any]]]]]; validate_expected_tool_calls refactored with compatibility and validation helpers (_ensure_multiple_sets_format, _validate_multiple_sets, _is_single_set_format, _reject_empty_sequences, _validate_empty_set_constraints, _validate_tool_call_sequences, etc.) enforcing multi-set structure and empty-set rules.
Test Init Files
tests/unit/core/metrics/__init__.py, tests/unit/core/metrics/custom/__init__.py, tests/unit/core/models/__init__.py
Add package initializers for test packages.
Unit Tests — Metrics
tests/unit/core/metrics/custom/test_tool_eval.py
New tests covering primary/alternative matches, empty-pattern behavior, failure and error paths, and low-level comparison helpers.
Unit Tests — Models
tests/unit/core/models/test_data.py
New tests validating conversion between single-set and multi-set formats, empty-alternative rules, regex argument handling, and edge-case validation errors.
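
The backward-compatibility handling described for data.py could, in spirit, look like the minimal sketch below. The helper names mirror the summary above, but the detection logic and signatures are assumptions, not the actual implementation.

from typing import Any


def _is_single_set_format(value: list) -> bool:
    """Assumed heuristic: legacy single-set data keeps tool-call dicts one nesting
    level shallower than the new multi-set format."""
    for sequence in value:
        if isinstance(sequence, list) and sequence:
            return isinstance(sequence[0], dict)
    return False


def _ensure_multiple_sets_format(value: list) -> list[list[list[dict[str, Any]]]]:
    """Wrap legacy list[list[dict]] data as a single alternative in the new format."""
    if _is_single_set_format(value):
        return [value]
    return value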

Sequence Diagram(s)

sequenceDiagram
    participant Caller
    participant ToolEval.evaluate_tool_calls as Evaluator
    participant compare_tool_calls as Comparator
    participant _create_success_message as SuccessMsg
    participant _create_failure_message as FailureMsg

    Caller->>Evaluator: evaluate_tool_calls(expected_alternatives, actual_calls)
    loop for each alternative (i)
        Evaluator->>Comparator: compare_tool_calls(expected_alternatives[i], actual_calls)
        alt match
            Comparator-->>Evaluator: True
            Evaluator->>SuccessMsg: build success message (alternative i)
            SuccessMsg-->>Evaluator: success message
            Evaluator-->>Caller: (True, message)
            Note right of Evaluator: returns on first matching alternative
        else no match
            Comparator-->>Evaluator: False
        end
    end
    alt none matched
        Evaluator->>FailureMsg: build failure message (details for all alternatives)
        FailureMsg-->>Evaluator: failure message
        Evaluator-->>Caller: (False, message)
    end
    alt exception
        Evaluator-->>Caller: (False, error_message) and log
    end
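Translated into code, the flow above is roughly the following sketch. It is an assumed outline, not the real evaluate_tool_calls: the comparator is injected, the shape of actual_calls is a guess, and the result messages are placeholders.

from typing import Any, Callable


def evaluate_tool_calls_sketch(
    expected_alternatives: list[list[list[dict[str, Any]]]],
    actual_calls: list[list[dict[str, Any]]],
    compare: Callable[..., bool],
) -> tuple[bool, str]:
    """Minimal sketch of the evaluation loop shown in the diagram (not the real code)."""
    try:
        for i, alternative in enumerate(expected_alternatives):
            if compare(alternative, actual_calls):  # return on the first matching alternative
                return True, f"Alternative {i + 1} matched"
        return False, "None of the expected alternatives matched the actual tool calls"
    except Exception as exc:  # broad catch mirrors the error path in the diagram
        return False, f"Tool evaluation error: {exc}"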

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Areas needing extra attention:
    • src/lightspeed_evaluation/core/models/data.py — new validation helpers, normalization logic, empty-set constraints.
    • src/lightspeed_evaluation/core/metrics/custom/tool_eval.py — alternative iteration, matching semantics, and message construction.
    • tests/unit/core/models/test_data.py and tests/unit/core/metrics/custom/test_tool_eval.py — ensure tests align with validation/evaluation behavior and cover edge cases.

Suggested reviewers

  • VladimirKadlec
  • tisnik

Poem

🐰 I hopped through nested lists and clues,

Alternatives hidden in tiny shoes,
I checked each path, one by one,
Matched the calls until the job was done,
Tests applaud — the rabbit twitches its nose.

Pre-merge checks

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The PR title "Ability to set alternate tool calls for eval" accurately summarizes the main change across all modified files. The changes enable users to specify multiple alternative sets of tool call sequences (updating the data model from single-set to multi-set format), update the evaluation logic to match against these alternatives, and add comprehensive test coverage. The title is concise, specific, and directly conveys the primary purpose without vague terminology or misleading claims. A teammate reviewing the commit history would immediately understand that this PR adds support for alternative tool call patterns in evaluation.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9a181e5 and ebacc05.

📒 Files selected for processing (8)
  • README.md (3 hunks)
  • src/lightspeed_evaluation/core/metrics/custom/tool_eval.py (2 hunks)
  • src/lightspeed_evaluation/core/models/data.py (3 hunks)
  • tests/unit/core/metrics/__init__.py (1 hunks)
  • tests/unit/core/metrics/custom/__init__.py (1 hunks)
  • tests/unit/core/metrics/custom/test_tool_eval.py (1 hunks)
  • tests/unit/core/models/__init__.py (1 hunks)
  • tests/unit/core/models/test_data.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (7)
tests/unit/**

📄 CodeRabbit inference engine (AGENTS.md)

Place unit tests under tests/unit/ mirroring the source structure

Files:

  • tests/unit/core/metrics/__init__.py
  • tests/unit/core/metrics/custom/test_tool_eval.py
  • tests/unit/core/metrics/custom/__init__.py
  • tests/unit/core/models/test_data.py
  • tests/unit/core/models/__init__.py
tests/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

tests/**/*.py: Use pytest for mocking (pytest-mock’s mocker), not unittest.mock
Name test functions as test_*
Name test classes as Test*
Add comprehensive tests for new features with mocked LLM calls using pytest

Files:

  • tests/unit/core/metrics/__init__.py
  • tests/unit/core/metrics/custom/test_tool_eval.py
  • tests/unit/core/metrics/custom/__init__.py
  • tests/unit/core/models/test_data.py
  • tests/unit/core/models/__init__.py
tests/**/test_*.py

📄 CodeRabbit inference engine (AGENTS.md)

Name test files as test_*.py

Files:

  • tests/unit/core/metrics/custom/test_tool_eval.py
  • tests/unit/core/models/test_data.py
src/lightspeed_evaluation/**

📄 CodeRabbit inference engine (AGENTS.md)

Add all new evaluation features under src/lightspeed_evaluation/ (do not add new features elsewhere)

Files:

  • src/lightspeed_evaluation/core/metrics/custom/tool_eval.py
  • src/lightspeed_evaluation/core/models/data.py
src/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

src/**/*.py: Require type hints for all public functions and methods
Use Google-style docstrings for all public APIs
Use custom exceptions from core.system.exceptions for error handling
Use structured logging with appropriate levels

Files:

  • src/lightspeed_evaluation/core/metrics/custom/tool_eval.py
  • src/lightspeed_evaluation/core/models/data.py
src/lightspeed_evaluation/core/metrics/custom/**

📄 CodeRabbit inference engine (AGENTS.md)

Add new custom metrics under src/lightspeed_evaluation/core/metrics/custom/

Files:

  • src/lightspeed_evaluation/core/metrics/custom/tool_eval.py
src/lightspeed_evaluation/core/metrics/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Register new metrics in MetricManager’s supported_metrics dictionary

Files:

  • src/lightspeed_evaluation/core/metrics/custom/tool_eval.py
🧠 Learnings (9)
📓 Common learnings
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 28
File: lsc_eval/runner.py:99-103
Timestamp: 2025-08-26T11:17:48.640Z
Learning: The lsc_eval generic evaluation tool is intended to become the primary evaluation framework, replacing an existing evaluation tool in the lightspeed-evaluation repository.
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 24
File: lsc_agent_eval/README.md:116-136
Timestamp: 2025-08-13T14:07:44.195Z
Learning: In the lsc_agent_eval framework, the expected_tool_calls configuration uses "tool_name" as the key for tool names, not "name". The tool call evaluation implementation specifically looks for the "tool_name" field when comparing expected vs actual tool calls.
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: config/system.yaml:78-82
Timestamp: 2025-09-08T11:11:54.516Z
Learning: For the custom:tool_eval metric, when threshold is not specified (None), the system defaults to checking if score > 0, providing less strict evaluation logic compared to exact matching. This allows for more flexible tool call evaluation where partial correctness is acceptable.
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 24
File: lsc_agent_eval/README.md:116-136
Timestamp: 2025-08-13T14:07:44.195Z
Learning: In the lsc_agent_eval framework, the expected_tool_calls configuration and tool call evaluation implementation consistently uses "tool_name" as the key for tool names, not "name". The tool call parsing, comparison, and formatting all expect the "tool_name" field throughout the codebase.
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/output/generator.py:140-145
Timestamp: 2025-09-11T12:47:06.747Z
Learning: User asamal4 prefers that non-critical comments are sent when actual code changes are pushed, not on unrelated commits.
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to src/lightspeed_evaluation/core/metrics/**/*.py : Register new metrics in MetricManager’s supported_metrics dictionary

Applied to files:

  • tests/unit/core/metrics/__init__.py
  • tests/unit/core/metrics/custom/__init__.py
📚 Learning: 2025-08-13T14:07:44.195Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 24
File: lsc_agent_eval/README.md:116-136
Timestamp: 2025-08-13T14:07:44.195Z
Learning: In the lsc_agent_eval framework, the expected_tool_calls configuration uses "tool_name" as the key for tool names, not "name". The tool call evaluation implementation specifically looks for the "tool_name" field when comparing expected vs actual tool calls.

Applied to files:

  • README.md
  • tests/unit/core/metrics/custom/test_tool_eval.py
  • src/lightspeed_evaluation/core/metrics/custom/tool_eval.py
  • src/lightspeed_evaluation/core/models/data.py
📚 Learning: 2025-08-13T14:07:44.195Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 24
File: lsc_agent_eval/README.md:116-136
Timestamp: 2025-08-13T14:07:44.195Z
Learning: In the lsc_agent_eval framework, the expected_tool_calls configuration and tool call evaluation implementation consistently uses "tool_name" as the key for tool names, not "name". The tool call parsing, comparison, and formatting all expect the "tool_name" field throughout the codebase.

Applied to files:

  • README.md
  • tests/unit/core/metrics/custom/test_tool_eval.py
  • src/lightspeed_evaluation/core/metrics/custom/tool_eval.py
📚 Learning: 2025-09-08T11:11:54.516Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: config/system.yaml:78-82
Timestamp: 2025-09-08T11:11:54.516Z
Learning: For the custom:tool_eval metric, when threshold is not specified (None), the system defaults to checking if score > 0, providing less strict evaluation logic compared to exact matching. This allows for more flexible tool call evaluation where partial correctness is acceptable.

Applied to files:

  • README.md
  • tests/unit/core/metrics/custom/test_tool_eval.py
  • src/lightspeed_evaluation/core/metrics/custom/tool_eval.py
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to tests/**/*.py : Add comprehensive tests for new features with mocked LLM calls using pytest

Applied to files:

  • tests/unit/core/metrics/custom/test_tool_eval.py
  • tests/unit/core/models/test_data.py
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to src/lightspeed_evaluation/core/metrics/custom/** : Add new custom metrics under src/lightspeed_evaluation/core/metrics/custom/

Applied to files:

  • tests/unit/core/metrics/custom/__init__.py
📚 Learning: 2025-08-26T11:17:48.640Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 28
File: lsc_eval/runner.py:99-103
Timestamp: 2025-08-26T11:17:48.640Z
Learning: The lsc_eval generic evaluation tool is intended to become the primary evaluation framework, replacing an existing evaluation tool in the lightspeed-evaluation repository.

Applied to files:

  • src/lightspeed_evaluation/core/metrics/custom/tool_eval.py
📚 Learning: 2025-09-18T23:59:37.026Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.

Applied to files:

  • src/lightspeed_evaluation/core/models/data.py
🧬 Code graph analysis (2)
tests/unit/core/metrics/custom/test_tool_eval.py (1)
src/lightspeed_evaluation/core/metrics/custom/tool_eval.py (1)
  • evaluate_tool_calls (10-34)
tests/unit/core/models/test_data.py (1)
src/lightspeed_evaluation/core/models/data.py (2)
  • TurnData (35-255)
  • _is_single_set_format (151-170)
🪛 GitHub Actions: Black
tests/unit/core/metrics/custom/test_tool_eval.py

[error] 1-1: Black formatting check failed. 4 files would be reformatted. Run 'black .' to format.

tests/unit/core/models/test_data.py

[error] 1-1: Black formatting check failed. 4 files would be reformatted. Run 'black .' to format.

src/lightspeed_evaluation/core/metrics/custom/tool_eval.py

[error] 1-1: Black formatting check failed. 4 files would be reformatted. Run 'black .' to format.

src/lightspeed_evaluation/core/models/data.py

[error] 1-1: Black formatting check failed. 4 files would be reformatted. Run 'black .' to format.

🪛 GitHub Actions: Pyright
src/lightspeed_evaluation/core/models/data.py

[error] 215-215: pyright: Type "list[Unknown] | bool" is not assignable to return type "bool" in data.py:215:16 (reportReturnType)

🪛 GitHub Actions: Python linter
src/lightspeed_evaluation/core/metrics/custom/tool_eval.py

[error] 169-169: Trailing whitespace (trailing-whitespace).


[error] 172-172: Trailing whitespace (trailing-whitespace).


[warning] 149-149: W0613: Unused argument 'actual' (unused-argument).

src/lightspeed_evaluation/core/models/data.py

[error] 116-116: Trailing whitespace (trailing-whitespace).


[error] 119-119: Trailing whitespace (trailing-whitespace).


[error] 133-133: Trailing whitespace (trailing-whitespace).


[error] 135-135: Line too long (113/100) (line-too-long).


[error] 139-139: Trailing whitespace (trailing-whitespace).


[error] 145-145: Trailing whitespace (trailing-whitespace).


[error] 155-155: Trailing whitespace (trailing-whitespace).


[error] 163-163: Line too long (107/100) (line-too-long).


[error] 169-169: Trailing whitespace (trailing-whitespace).


[error] 202-202: Trailing whitespace (trailing-whitespace).


[error] 208-208: Line too long (105/100) (line-too-long).


[error] 159-159: R1705: Unnecessary "elif" after "return", remove the leading "el" from "elif" (no-else-return).


[error] 162-162: R1705: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it (no-else-return).

🪛 GitHub Actions: Tests
tests/unit/core/metrics/custom/test_tool_eval.py

[error] 57-57: TestEvaluateToolCalls::test_empty_pattern_match_primary failed: AssertionError: 'No tool calls made (valid skip scenario)' not found in details which contains 'Primary pattern matched: No tool calls made (valid alternate skip scenario)'


[error] 71-71: TestEvaluateToolCalls::test_empty_pattern_match_alternative failed: AssertionError: 'valid skip scenario' not found in details which contains 'Alternative 2 matched: No tool calls made (valid alternate skip scenario)'


[error] 100-100: TestEvaluateToolCalls::test_error_handling failed: AssertionError: Expected message to include 'not an expected alternative' or 'Tool evaluation error', but details were 'No actual tool calls made and this is not set as an expected alternative'

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: mypy


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ebacc05 and 31e2708.

📒 Files selected for processing (8)
  • README.md (3 hunks)
  • src/lightspeed_evaluation/core/metrics/custom/tool_eval.py (2 hunks)
  • src/lightspeed_evaluation/core/models/data.py (3 hunks)
  • tests/unit/core/metrics/__init__.py (1 hunks)
  • tests/unit/core/metrics/custom/__init__.py (1 hunks)
  • tests/unit/core/metrics/custom/test_tool_eval.py (1 hunks)
  • tests/unit/core/models/__init__.py (1 hunks)
  • tests/unit/core/models/test_data.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (5)
  • tests/unit/core/models/test_data.py
  • tests/unit/core/metrics/custom/test_tool_eval.py
  • tests/unit/core/metrics/__init__.py
  • tests/unit/core/metrics/custom/__init__.py
  • tests/unit/core/models/__init__.py
🧰 Additional context used
📓 Path-based instructions (4)
src/lightspeed_evaluation/**

📄 CodeRabbit inference engine (AGENTS.md)

Add all new evaluation features under src/lightspeed_evaluation/ (do not add new features elsewhere)

Files:

  • src/lightspeed_evaluation/core/metrics/custom/tool_eval.py
  • src/lightspeed_evaluation/core/models/data.py
src/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

src/**/*.py: Require type hints for all public functions and methods
Use Google-style docstrings for all public APIs
Use custom exceptions from core.system.exceptions for error handling
Use structured logging with appropriate levels

Files:

  • src/lightspeed_evaluation/core/metrics/custom/tool_eval.py
  • src/lightspeed_evaluation/core/models/data.py
src/lightspeed_evaluation/core/metrics/custom/**

📄 CodeRabbit inference engine (AGENTS.md)

Add new custom metrics under src/lightspeed_evaluation/core/metrics/custom/

Files:

  • src/lightspeed_evaluation/core/metrics/custom/tool_eval.py
src/lightspeed_evaluation/core/metrics/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Register new metrics in MetricManager’s supported_metrics dictionary

Files:

  • src/lightspeed_evaluation/core/metrics/custom/tool_eval.py
🧠 Learnings (6)
📓 Common learnings
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 28
File: lsc_eval/runner.py:99-103
Timestamp: 2025-08-26T11:17:48.640Z
Learning: The lsc_eval generic evaluation tool is intended to become the primary evaluation framework, replacing an existing evaluation tool in the lightspeed-evaluation repository.
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 24
File: lsc_agent_eval/README.md:116-136
Timestamp: 2025-08-13T14:07:44.195Z
Learning: In the lsc_agent_eval framework, the expected_tool_calls configuration uses "tool_name" as the key for tool names, not "name". The tool call evaluation implementation specifically looks for the "tool_name" field when comparing expected vs actual tool calls.
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: config/system.yaml:78-82
Timestamp: 2025-09-08T11:11:54.516Z
Learning: For the custom:tool_eval metric, when threshold is not specified (None), the system defaults to checking if score > 0, providing less strict evaluation logic compared to exact matching. This allows for more flexible tool call evaluation where partial correctness is acceptable.
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 24
File: lsc_agent_eval/README.md:116-136
Timestamp: 2025-08-13T14:07:44.195Z
Learning: In the lsc_agent_eval framework, the expected_tool_calls configuration and tool call evaluation implementation consistently uses "tool_name" as the key for tool names, not "name". The tool call parsing, comparison, and formatting all expect the "tool_name" field throughout the codebase.
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/output/generator.py:140-145
Timestamp: 2025-09-11T12:47:06.747Z
Learning: User asamal4 prefers that non-critical comments are sent when actual code changes are pushed, not on unrelated commits.
📚 Learning: 2025-08-13T14:07:44.195Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 24
File: lsc_agent_eval/README.md:116-136
Timestamp: 2025-08-13T14:07:44.195Z
Learning: In the lsc_agent_eval framework, the expected_tool_calls configuration uses "tool_name" as the key for tool names, not "name". The tool call evaluation implementation specifically looks for the "tool_name" field when comparing expected vs actual tool calls.

Applied to files:

  • README.md
  • src/lightspeed_evaluation/core/metrics/custom/tool_eval.py
  • src/lightspeed_evaluation/core/models/data.py
📚 Learning: 2025-08-13T14:07:44.195Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 24
File: lsc_agent_eval/README.md:116-136
Timestamp: 2025-08-13T14:07:44.195Z
Learning: In the lsc_agent_eval framework, the expected_tool_calls configuration and tool call evaluation implementation consistently uses "tool_name" as the key for tool names, not "name". The tool call parsing, comparison, and formatting all expect the "tool_name" field throughout the codebase.

Applied to files:

  • README.md
  • src/lightspeed_evaluation/core/metrics/custom/tool_eval.py
📚 Learning: 2025-09-08T11:11:54.516Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: config/system.yaml:78-82
Timestamp: 2025-09-08T11:11:54.516Z
Learning: For the custom:tool_eval metric, when threshold is not specified (None), the system defaults to checking if score > 0, providing less strict evaluation logic compared to exact matching. This allows for more flexible tool call evaluation where partial correctness is acceptable.

Applied to files:

  • README.md
  • src/lightspeed_evaluation/core/metrics/custom/tool_eval.py
📚 Learning: 2025-08-26T11:17:48.640Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 28
File: lsc_eval/runner.py:99-103
Timestamp: 2025-08-26T11:17:48.640Z
Learning: The lsc_eval generic evaluation tool is intended to become the primary evaluation framework, replacing an existing evaluation tool in the lightspeed-evaluation repository.

Applied to files:

  • src/lightspeed_evaluation/core/metrics/custom/tool_eval.py
📚 Learning: 2025-09-11T12:47:06.747Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/output/generator.py:140-145
Timestamp: 2025-09-11T12:47:06.747Z
Learning: User asamal4 prefers that non-critical comments are sent when actual code changes are pushed, not on unrelated commits.

Applied to files:

  • src/lightspeed_evaluation/core/metrics/custom/tool_eval.py


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
tests/unit/core/metrics/custom/test_tool_eval.py (1)

90-103: Strengthen error handling test.

The current test passes an invalid type (string instead of list) but has an uncertain assertion with two possible error messages. The comment suggests the error handling might not be as robust as intended.

Consider:

  1. Using more specific assertions for the expected error scenario
  2. Adding separate test cases for different invalid input types (string, None, nested incorrect structure)
  3. Verifying that the error message clearly indicates the validation failure

Example improvement:

 def test_error_handling(self):
-    """Test error handling in evaluate_tool_calls."""
-    # Invalid expected format should be handled gracefully
-    expected = "invalid"  # Not a list
+    """Test error handling for invalid input type."""
+    expected = "invalid"  # String instead of list
     actual = []
 
     success, details = evaluate_tool_calls(expected, actual)
 
     assert success is False
-    # The function iterates over the string characters, so we get a different error
-    assert (
-        "not set as an expected alternative" in details
-        or "Tool evaluation error" in details
-    )
+    assert "Tool evaluation error" in details
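
One way to act on this is a parametrized test per invalid input type. The sketch below is hypothetical: the parameter values are illustrative, and the message assertion is deliberately left as a comment until the error-path behavior is pinned down.

import pytest

from lightspeed_evaluation.core.metrics.custom.tool_eval import evaluate_tool_calls


@pytest.mark.parametrize(
    "invalid_expected",
    ["invalid", None, [["not-a-dict"]]],
)
def test_error_handling_invalid_types(invalid_expected):
    """Each invalid input type should fail evaluation rather than raise."""
    success, details = evaluate_tool_calls(invalid_expected, [])
    assert success is False
    # Once a single explicit error message is settled on, also assert on it here,
    # e.g. assert "Tool evaluation error" in details.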
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 31e2708 and ec5f63d.

📒 Files selected for processing (6)
  • README.md (3 hunks)
  • tests/unit/core/metrics/__init__.py (1 hunks)
  • tests/unit/core/metrics/custom/__init__.py (1 hunks)
  • tests/unit/core/metrics/custom/test_tool_eval.py (1 hunks)
  • tests/unit/core/models/__init__.py (1 hunks)
  • tests/unit/core/models/test_data.py (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • tests/unit/core/metrics/custom/__init__.py
🚧 Files skipped from review as they are similar to previous changes (3)
  • tests/unit/core/models/__init__.py
  • tests/unit/core/metrics/__init__.py
  • tests/unit/core/models/test_data.py
🧰 Additional context used
📓 Path-based instructions (3)
tests/unit/**

📄 CodeRabbit inference engine (AGENTS.md)

Place unit tests under tests/unit/ mirroring the source structure

Files:

  • tests/unit/core/metrics/custom/test_tool_eval.py
tests/**/test_*.py

📄 CodeRabbit inference engine (AGENTS.md)

Name test files as test_*.py

Files:

  • tests/unit/core/metrics/custom/test_tool_eval.py
tests/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

tests/**/*.py: Use pytest for mocking (pytest-mock’s mocker), not unittest.mock
Name test functions as test_*
Name test classes as Test*
Add comprehensive tests for new features with mocked LLM calls using pytest

Files:

  • tests/unit/core/metrics/custom/test_tool_eval.py
🧠 Learnings (6)
📓 Common learnings
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 28
File: lsc_eval/runner.py:99-103
Timestamp: 2025-08-26T11:17:48.640Z
Learning: The lsc_eval generic evaluation tool is intended to become the primary evaluation framework, replacing an existing evaluation tool in the lightspeed-evaluation repository.
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 24
File: lsc_agent_eval/README.md:116-136
Timestamp: 2025-08-13T14:07:44.195Z
Learning: In the lsc_agent_eval framework, the expected_tool_calls configuration uses "tool_name" as the key for tool names, not "name". The tool call evaluation implementation specifically looks for the "tool_name" field when comparing expected vs actual tool calls.
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: config/system.yaml:78-82
Timestamp: 2025-09-08T11:11:54.516Z
Learning: For the custom:tool_eval metric, when threshold is not specified (None), the system defaults to checking if score > 0, providing less strict evaluation logic compared to exact matching. This allows for more flexible tool call evaluation where partial correctness is acceptable.
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 24
File: lsc_agent_eval/README.md:116-136
Timestamp: 2025-08-13T14:07:44.195Z
Learning: In the lsc_agent_eval framework, the expected_tool_calls configuration and tool call evaluation implementation consistently uses "tool_name" as the key for tool names, not "name". The tool call parsing, comparison, and formatting all expect the "tool_name" field throughout the codebase.
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 90
File: src/lightspeed_evaluation/core/models/data.py:198-208
Timestamp: 2025-10-31T11:54:59.126Z
Learning: In the lightspeed_evaluation framework, the expected_tool_calls validator intentionally rejects a single empty set `[[]]` as the only alternative. This is by design: if no tool calls are expected, the tool_eval metric should not be configured for that turn. Empty sets are only valid as fallback alternatives (e.g., `[[[tool_call]], [[]]]`), representing optional tool call scenarios, not as primary or sole expectations.
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/output/generator.py:140-145
Timestamp: 2025-09-11T12:47:06.747Z
Learning: User asamal4 prefers that non-critical comments are sent when actual code changes are pushed, not on unrelated commits.
📚 Learning: 2025-10-31T11:54:59.126Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 90
File: src/lightspeed_evaluation/core/models/data.py:198-208
Timestamp: 2025-10-31T11:54:59.126Z
Learning: In the lightspeed_evaluation framework, the expected_tool_calls validator intentionally rejects a single empty set `[[]]` as the only alternative. This is by design: if no tool calls are expected, the tool_eval metric should not be configured for that turn. Empty sets are only valid as fallback alternatives (e.g., `[[[tool_call]], [[]]]`), representing optional tool call scenarios, not as primary or sole expectations.

Applied to files:

  • README.md
  • tests/unit/core/metrics/custom/test_tool_eval.py
📚 Learning: 2025-08-13T14:07:44.195Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 24
File: lsc_agent_eval/README.md:116-136
Timestamp: 2025-08-13T14:07:44.195Z
Learning: In the lsc_agent_eval framework, the expected_tool_calls configuration uses "tool_name" as the key for tool names, not "name". The tool call evaluation implementation specifically looks for the "tool_name" field when comparing expected vs actual tool calls.

Applied to files:

  • README.md
  • tests/unit/core/metrics/custom/test_tool_eval.py
📚 Learning: 2025-08-13T14:07:44.195Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 24
File: lsc_agent_eval/README.md:116-136
Timestamp: 2025-08-13T14:07:44.195Z
Learning: In the lsc_agent_eval framework, the expected_tool_calls configuration and tool call evaluation implementation consistently uses "tool_name" as the key for tool names, not "name". The tool call parsing, comparison, and formatting all expect the "tool_name" field throughout the codebase.

Applied to files:

  • README.md
📚 Learning: 2025-09-08T11:11:54.516Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: config/system.yaml:78-82
Timestamp: 2025-09-08T11:11:54.516Z
Learning: For the custom:tool_eval metric, when threshold is not specified (None), the system defaults to checking if score > 0, providing less strict evaluation logic compared to exact matching. This allows for more flexible tool call evaluation where partial correctness is acceptable.

Applied to files:

  • README.md
  • tests/unit/core/metrics/custom/test_tool_eval.py
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to tests/**/*.py : Add comprehensive tests for new features with mocked LLM calls using pytest

Applied to files:

  • tests/unit/core/metrics/custom/test_tool_eval.py
🧬 Code graph analysis (1)
tests/unit/core/metrics/custom/test_tool_eval.py (1)
src/lightspeed_evaluation/core/metrics/custom/tool_eval.py (1)
  • evaluate_tool_calls (10-34)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: mypy
  • GitHub Check: tests (3.11)
  • GitHub Check: tests (3.13)
  • GitHub Check: pydocstyle
  • GitHub Check: tests (3.12)
🔇 Additional comments (14)
README.md (4)

294-294: LGTM!

The type change from list[list[dict]] to list[list[list[dict]]] correctly reflects the new multi-alternative structure for tool call evaluation.


305-305: LGTM!

The note clearly indicates that expected_tool_calls now requires the multi-alternative sets format for the custom:tool_eval metric.


317-325: LGTM!

The new "Tool Evaluation" section provides clear guidance on the multi-alternative format, including:

  • Format specification
  • Matching behavior
  • Use cases
  • Empty set constraints

This documentation will help users understand how to configure tool call evaluation with multiple alternatives.


329-348: LGTM!

The updated example clearly demonstrates the multi-alternative format with well-commented scenarios:

  • Primary approach with tool sequences
  • Alternative approach with different tools
  • Skip scenario with empty list

The inline comments help users understand each alternative's purpose.

tests/unit/core/metrics/custom/test_tool_eval.py (10)

1-9: LGTM!

The test file structure follows coding guidelines:

  • Located under tests/unit/
  • Named as test_*.py
  • Uses pytest ✓
  • Imports private functions for unit testing (acceptable practice) ✓

15-28: LGTM!

The test validates that the primary pattern matches correctly and returns the expected success message format.


30-46: LGTM!

The test correctly validates alternative pattern matching, ensuring that when the primary pattern doesn't match, the function tries subsequent alternatives.


48-57: Verify consistency with validation rules.

This test validates that [[]] (empty set as the only alternative) can be successfully evaluated. However, based on learnings, the data model validator intentionally rejects [[]] as the only alternative - such data should never reach the evaluate_tool_calls function in production.

Consider whether this test should:

  1. Be removed (if testing unreachable code paths)
  2. Be moved to a separate "invalid input handling" test suite
  3. Remain as-is (if testing low-level function behavior independent of validation)

Based on learnings.


59-71: LGTM!

The test correctly validates the skip scenario where an empty set appears as a fallback alternative (not as the only alternative), which aligns with the design.


73-88: LGTM!

The test validates the failure path when none of the expected patterns match the actual tool calls, including checking for appropriate error messaging.


109-137: LGTM!

The tests for compare_tool_calls provide good coverage:

  • Exact matches
  • Length mismatches
  • Empty sequences

All assertions are clear and validate the expected behavior.


143-168: LGTM!

The tests for _compare_tool_call_sequence validate both matching sequences and length mismatches correctly.


174-199: LGTM!

The tests for _compare_single_tool_call cover:

  • Successful tool name and argument matches
  • Tool name mismatches
  • Missing arguments

All test cases validate expected behavior correctly.


205-256: LGTM!

The tests for _compare_tool_arguments provide comprehensive coverage:

  • Exact argument matches
  • Regex pattern matching
  • Missing argument keys
  • Extra argument keys
  • Invalid regex patterns
  • Non-dictionary arguments

This thorough test suite validates all the critical argument comparison scenarios.
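
For context on what these scenarios imply, the argument comparison could be sketched roughly as below. The semantics are assumed (in particular how extra keys and regex anchoring are handled); the real _compare_tool_arguments may differ.

import re
from typing import Any


def compare_tool_arguments_sketch(expected: dict[str, Any], actual: Any) -> bool:
    """Assumed behavior: every expected key must be present in the actual arguments,
    and each expected value is applied as a regex against the actual value."""
    if not isinstance(actual, dict):
        return False  # non-dictionary arguments fail outright
    for key, pattern in expected.items():
        if key not in actual:
            return False  # missing argument key
        try:
            if not re.fullmatch(str(pattern), str(actual[key])):
                return False
        except re.error:
            return False  # invalid regex pattern is treated as a mismatch
    return True  # extra actual keys are ignored in this sketch (an assumption)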

asamal4 (Collaborator, Author) commented Nov 3, 2025

@VladimirKadlec @tisnik PTAL


@VladimirKadlec VladimirKadlec left a comment


LGTM


@tisnik tisnik left a comment


LGTM

@tisnik tisnik merged commit 64f8892 into lightspeed-core:main Nov 6, 2025
15 checks passed