Ability to set alternate tool calls for eval #90
Conversation
Walkthrough
Adds multi-alternative support for expected tool-call sequences by changing `expected_tool_calls` from `list[list[dict]]` to `list[list[list[dict]]]`.
Changes
Sequence Diagram(s)
```mermaid
sequenceDiagram
participant Caller
participant ToolEval.evaluate_tool_calls as Evaluator
participant compare_tool_calls as Comparator
participant _create_success_message as SuccessMsg
participant _create_failure_message as FailureMsg
Caller->>Evaluator: evaluate_tool_calls(expected_alternatives, actual_calls)
loop for each alternative (i)
Evaluator->>Comparator: compare_tool_calls(expected_alternatives[i], actual_calls)
alt match
Comparator-->>Evaluator: True
Evaluator->>SuccessMsg: build success message (alternative i)
SuccessMsg-->>Evaluator: success message
Evaluator-->>Caller: (True, message)
Note right of Evaluator: returns on first matching alternative
else no match
Comparator-->>Evaluator: False
end
end
alt none matched
Evaluator->>FailureMsg: build failure message (details for all alternatives)
FailureMsg-->>Evaluator: failure message
Evaluator-->>Caller: (False, message)
end
alt exception
Evaluator-->>Caller: (False, error_message) and log
end
```
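In code terms, the diagram boils down to a first-match-wins loop. A minimal Python sketch of that flow, assuming simplified message strings and a placeholder comparator (the real implementations live in `tool_eval.py` and differ in detail):

```python
from typing import Any

ToolCall = dict[str, Any]            # e.g. {"tool_name": "...", "arguments": {...}}
ToolCallSet = list[list[ToolCall]]   # one alternative: sequences of tool calls


def compare_tool_calls(expected: ToolCallSet, actual: ToolCallSet) -> bool:
    """Placeholder for the real comparator (exact names, regex-capable arguments)."""
    return expected == actual


def evaluate_tool_calls(
    expected_alternatives: list[ToolCallSet], actual: ToolCallSet
) -> tuple[bool, str]:
    """Return (success, details); the first matching alternative wins."""
    try:
        for i, alternative in enumerate(expected_alternatives):
            if compare_tool_calls(alternative, actual):
                # Success message names the alternative that matched;
                # remaining alternatives are not inspected.
                return True, f"Alternative {i + 1} matched"
        # No alternative matched: the failure message covers all of them.
        return False, f"None of {len(expected_alternatives)} alternatives matched"
    except Exception as exc:  # defensive guard, mirrors the diagram's exception path
        return False, f"Tool evaluation error: {exc}"
```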
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Suggested reviewers
Pre-merge checks and finishing touches
✅ Passed checks (3 passed)
Actionable comments posted: 3
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
- README.md (3 hunks)
- src/lightspeed_evaluation/core/metrics/custom/tool_eval.py (2 hunks)
- src/lightspeed_evaluation/core/models/data.py (3 hunks)
- tests/unit/core/metrics/__init__.py (1 hunks)
- tests/unit/core/metrics/custom/__init__.py (1 hunks)
- tests/unit/core/metrics/custom/test_tool_eval.py (1 hunks)
- tests/unit/core/models/__init__.py (1 hunks)
- tests/unit/core/models/test_data.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (7)
tests/unit/**
📄 CodeRabbit inference engine (AGENTS.md)
Place unit tests under tests/unit/ mirroring the source structure
Files:
- tests/unit/core/metrics/__init__.py
- tests/unit/core/metrics/custom/test_tool_eval.py
- tests/unit/core/metrics/custom/__init__.py
- tests/unit/core/models/test_data.py
- tests/unit/core/models/__init__.py
tests/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
tests/**/*.py: Use pytest for mocking (pytest-mock’s mocker), not unittest.mock
Name test functions as test_*
Name test classes as Test*
Add comprehensive tests for new features with mocked LLM calls using pytest
Files:
- tests/unit/core/metrics/__init__.py
- tests/unit/core/metrics/custom/test_tool_eval.py
- tests/unit/core/metrics/custom/__init__.py
- tests/unit/core/models/test_data.py
- tests/unit/core/models/__init__.py
tests/**/test_*.py
📄 CodeRabbit inference engine (AGENTS.md)
Name test files as test_*.py
Files:
- tests/unit/core/metrics/custom/test_tool_eval.py
- tests/unit/core/models/test_data.py
src/lightspeed_evaluation/**
📄 CodeRabbit inference engine (AGENTS.md)
Add all new evaluation features under src/lightspeed_evaluation/ (do not add new features elsewhere)
Files:
- src/lightspeed_evaluation/core/metrics/custom/tool_eval.py
- src/lightspeed_evaluation/core/models/data.py
src/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
src/**/*.py: Require type hints for all public functions and methods
Use Google-style docstrings for all public APIs
Use custom exceptions from core.system.exceptions for error handling
Use structured logging with appropriate levels
Files:
- src/lightspeed_evaluation/core/metrics/custom/tool_eval.py
- src/lightspeed_evaluation/core/models/data.py
src/lightspeed_evaluation/core/metrics/custom/**
📄 CodeRabbit inference engine (AGENTS.md)
Add new custom metrics under src/lightspeed_evaluation/core/metrics/custom/
Files:
src/lightspeed_evaluation/core/metrics/custom/tool_eval.py
src/lightspeed_evaluation/core/metrics/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Register new metrics in MetricManager’s supported_metrics dictionary
Files:
src/lightspeed_evaluation/core/metrics/custom/tool_eval.py
🧠 Learnings (9)
📓 Common learnings
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 28
File: lsc_eval/runner.py:99-103
Timestamp: 2025-08-26T11:17:48.640Z
Learning: The lsc_eval generic evaluation tool is intended to become the primary evaluation framework, replacing an existing evaluation tool in the lightspeed-evaluation repository.
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 24
File: lsc_agent_eval/README.md:116-136
Timestamp: 2025-08-13T14:07:44.195Z
Learning: In the lsc_agent_eval framework, the expected_tool_calls configuration uses "tool_name" as the key for tool names, not "name". The tool call evaluation implementation specifically looks for the "tool_name" field when comparing expected vs actual tool calls.
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: config/system.yaml:78-82
Timestamp: 2025-09-08T11:11:54.516Z
Learning: For the custom:tool_eval metric, when threshold is not specified (None), the system defaults to checking if score > 0, providing less strict evaluation logic compared to exact matching. This allows for more flexible tool call evaluation where partial correctness is acceptable.
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 24
File: lsc_agent_eval/README.md:116-136
Timestamp: 2025-08-13T14:07:44.195Z
Learning: In the lsc_agent_eval framework, the expected_tool_calls configuration and tool call evaluation implementation consistently uses "tool_name" as the key for tool names, not "name". The tool call parsing, comparison, and formatting all expect the "tool_name" field throughout the codebase.
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/output/generator.py:140-145
Timestamp: 2025-09-11T12:47:06.747Z
Learning: User asamal4 prefers that non-critical comments are sent when actual code changes are pushed, not on unrelated commits.
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to src/lightspeed_evaluation/core/metrics/**/*.py : Register new metrics in MetricManager’s supported_metrics dictionary
Applied to files:
- tests/unit/core/metrics/__init__.py
- tests/unit/core/metrics/custom/__init__.py
📚 Learning: 2025-08-13T14:07:44.195Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 24
File: lsc_agent_eval/README.md:116-136
Timestamp: 2025-08-13T14:07:44.195Z
Learning: In the lsc_agent_eval framework, the expected_tool_calls configuration uses "tool_name" as the key for tool names, not "name". The tool call evaluation implementation specifically looks for the "tool_name" field when comparing expected vs actual tool calls.
Applied to files:
- README.md
- tests/unit/core/metrics/custom/test_tool_eval.py
- src/lightspeed_evaluation/core/metrics/custom/tool_eval.py
- src/lightspeed_evaluation/core/models/data.py
📚 Learning: 2025-08-13T14:07:44.195Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 24
File: lsc_agent_eval/README.md:116-136
Timestamp: 2025-08-13T14:07:44.195Z
Learning: In the lsc_agent_eval framework, the expected_tool_calls configuration and tool call evaluation implementation consistently uses "tool_name" as the key for tool names, not "name". The tool call parsing, comparison, and formatting all expect the "tool_name" field throughout the codebase.
Applied to files:
- README.md
- tests/unit/core/metrics/custom/test_tool_eval.py
- src/lightspeed_evaluation/core/metrics/custom/tool_eval.py
📚 Learning: 2025-09-08T11:11:54.516Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: config/system.yaml:78-82
Timestamp: 2025-09-08T11:11:54.516Z
Learning: For the custom:tool_eval metric, when threshold is not specified (None), the system defaults to checking if score > 0, providing less strict evaluation logic compared to exact matching. This allows for more flexible tool call evaluation where partial correctness is acceptable.
Applied to files:
- README.md
- tests/unit/core/metrics/custom/test_tool_eval.py
- src/lightspeed_evaluation/core/metrics/custom/tool_eval.py
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to tests/**/*.py : Add comprehensive tests for new features with mocked LLM calls using pytest
Applied to files:
- tests/unit/core/metrics/custom/test_tool_eval.py
- tests/unit/core/models/test_data.py
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to src/lightspeed_evaluation/core/metrics/custom/** : Add new custom metrics under src/lightspeed_evaluation/core/metrics/custom/
Applied to files:
tests/unit/core/metrics/custom/__init__.py
📚 Learning: 2025-08-26T11:17:48.640Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 28
File: lsc_eval/runner.py:99-103
Timestamp: 2025-08-26T11:17:48.640Z
Learning: The lsc_eval generic evaluation tool is intended to become the primary evaluation framework, replacing an existing evaluation tool in the lightspeed-evaluation repository.
Applied to files:
src/lightspeed_evaluation/core/metrics/custom/tool_eval.py
📚 Learning: 2025-09-18T23:59:37.026Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.
Applied to files:
src/lightspeed_evaluation/core/models/data.py
🧬 Code graph analysis (2)
tests/unit/core/metrics/custom/test_tool_eval.py (1)
src/lightspeed_evaluation/core/metrics/custom/tool_eval.py (1)
evaluate_tool_calls(10-34)
tests/unit/core/models/test_data.py (1)
src/lightspeed_evaluation/core/models/data.py (2)
TurnData(35-255)
_is_single_set_format(151-170)
🪛 GitHub Actions: Black
tests/unit/core/metrics/custom/test_tool_eval.py
[error] 1-1: Black formatting check failed. 4 files would be reformatted. Run 'black .' to format.
tests/unit/core/models/test_data.py
[error] 1-1: Black formatting check failed. 4 files would be reformatted. Run 'black .' to format.
src/lightspeed_evaluation/core/metrics/custom/tool_eval.py
[error] 1-1: Black formatting check failed. 4 files would be reformatted. Run 'black .' to format.
src/lightspeed_evaluation/core/models/data.py
[error] 1-1: Black formatting check failed. 4 files would be reformatted. Run 'black .' to format.
🪛 GitHub Actions: Pyright
src/lightspeed_evaluation/core/models/data.py
[error] 215-215: pyright: Type "list[Unknown] | bool" is not assignable to return type "bool" in data.py:215:16 (reportReturnType)
🪛 GitHub Actions: Python linter
src/lightspeed_evaluation/core/metrics/custom/tool_eval.py
[error] 169-169: Trailing whitespace (trailing-whitespace).
[error] 172-172: Trailing whitespace (trailing-whitespace).
[warning] 149-149: W0613: Unused argument 'actual' (unused-argument).
src/lightspeed_evaluation/core/models/data.py
[error] 116-116: Trailing whitespace (trailing-whitespace).
[error] 119-119: Trailing whitespace (trailing-whitespace).
[error] 133-133: Trailing whitespace (trailing-whitespace).
[error] 135-135: Line too long (113/100) (line-too-long).
[error] 139-139: Trailing whitespace (trailing-whitespace).
[error] 145-145: Trailing whitespace (trailing-whitespace).
[error] 155-155: Trailing whitespace (trailing-whitespace).
[error] 163-163: Line too long (107/100) (line-too-long).
[error] 169-169: Trailing whitespace (trailing-whitespace).
[error] 202-202: Trailing whitespace (trailing-whitespace).
[error] 208-208: Line too long (105/100) (line-too-long).
[error] 159-159: R1705: Unnecessary "elif" after "return", remove the leading "el" from "elif" (no-else-return).
[error] 162-162: R1705: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it (no-else-return).
🪛 GitHub Actions: Tests
tests/unit/core/metrics/custom/test_tool_eval.py
[error] 57-57: TestEvaluateToolCalls::test_empty_pattern_match_primary failed: AssertionError: 'No tool calls made (valid skip scenario)' not found in details which contains 'Primary pattern matched: No tool calls made (valid alternate skip scenario)'
[error] 71-71: TestEvaluateToolCalls::test_empty_pattern_match_alternative failed: AssertionError: 'valid skip scenario' not found in details which contains 'Alternative 2 matched: No tool calls made (valid alternate skip scenario)'
[error] 100-100: TestEvaluateToolCalls::test_error_handling failed: AssertionError: Expected message to include 'not an expected alternative' or 'Tool evaluation error', but details were 'No actual tool calls made and this is not set as an expected alternative'
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: mypy
Assisted-by: Cursor
ebacc05 to 31e2708
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
- README.md (3 hunks)
- src/lightspeed_evaluation/core/metrics/custom/tool_eval.py (2 hunks)
- src/lightspeed_evaluation/core/models/data.py (3 hunks)
- tests/unit/core/metrics/__init__.py (1 hunks)
- tests/unit/core/metrics/custom/__init__.py (1 hunks)
- tests/unit/core/metrics/custom/test_tool_eval.py (1 hunks)
- tests/unit/core/models/__init__.py (1 hunks)
- tests/unit/core/models/test_data.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (5)
- tests/unit/core/models/test_data.py
- tests/unit/core/metrics/custom/test_tool_eval.py
- tests/unit/core/metrics/__init__.py
- tests/unit/core/metrics/custom/__init__.py
- tests/unit/core/models/__init__.py
🧰 Additional context used
📓 Path-based instructions (4)
src/lightspeed_evaluation/**
📄 CodeRabbit inference engine (AGENTS.md)
Add all new evaluation features under src/lightspeed_evaluation/ (do not add new features elsewhere)
Files:
- src/lightspeed_evaluation/core/metrics/custom/tool_eval.py
- src/lightspeed_evaluation/core/models/data.py
src/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
src/**/*.py: Require type hints for all public functions and methods
Use Google-style docstrings for all public APIs
Use custom exceptions from core.system.exceptions for error handling
Use structured logging with appropriate levels
Files:
- src/lightspeed_evaluation/core/metrics/custom/tool_eval.py
- src/lightspeed_evaluation/core/models/data.py
src/lightspeed_evaluation/core/metrics/custom/**
📄 CodeRabbit inference engine (AGENTS.md)
Add new custom metrics under src/lightspeed_evaluation/core/metrics/custom/
Files:
src/lightspeed_evaluation/core/metrics/custom/tool_eval.py
src/lightspeed_evaluation/core/metrics/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Register new metrics in MetricManager’s supported_metrics dictionary
Files:
src/lightspeed_evaluation/core/metrics/custom/tool_eval.py
🧠 Learnings (6)
📓 Common learnings
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 28
File: lsc_eval/runner.py:99-103
Timestamp: 2025-08-26T11:17:48.640Z
Learning: The lsc_eval generic evaluation tool is intended to become the primary evaluation framework, replacing an existing evaluation tool in the lightspeed-evaluation repository.
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 24
File: lsc_agent_eval/README.md:116-136
Timestamp: 2025-08-13T14:07:44.195Z
Learning: In the lsc_agent_eval framework, the expected_tool_calls configuration uses "tool_name" as the key for tool names, not "name". The tool call evaluation implementation specifically looks for the "tool_name" field when comparing expected vs actual tool calls.
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: config/system.yaml:78-82
Timestamp: 2025-09-08T11:11:54.516Z
Learning: For the custom:tool_eval metric, when threshold is not specified (None), the system defaults to checking if score > 0, providing less strict evaluation logic compared to exact matching. This allows for more flexible tool call evaluation where partial correctness is acceptable.
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 24
File: lsc_agent_eval/README.md:116-136
Timestamp: 2025-08-13T14:07:44.195Z
Learning: In the lsc_agent_eval framework, the expected_tool_calls configuration and tool call evaluation implementation consistently uses "tool_name" as the key for tool names, not "name". The tool call parsing, comparison, and formatting all expect the "tool_name" field throughout the codebase.
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/output/generator.py:140-145
Timestamp: 2025-09-11T12:47:06.747Z
Learning: User asamal4 prefers that non-critical comments are sent when actual code changes are pushed, not on unrelated commits.
📚 Learning: 2025-08-13T14:07:44.195Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 24
File: lsc_agent_eval/README.md:116-136
Timestamp: 2025-08-13T14:07:44.195Z
Learning: In the lsc_agent_eval framework, the expected_tool_calls configuration uses "tool_name" as the key for tool names, not "name". The tool call evaluation implementation specifically looks for the "tool_name" field when comparing expected vs actual tool calls.
Applied to files:
- README.md
- src/lightspeed_evaluation/core/metrics/custom/tool_eval.py
- src/lightspeed_evaluation/core/models/data.py
📚 Learning: 2025-08-13T14:07:44.195Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 24
File: lsc_agent_eval/README.md:116-136
Timestamp: 2025-08-13T14:07:44.195Z
Learning: In the lsc_agent_eval framework, the expected_tool_calls configuration and tool call evaluation implementation consistently uses "tool_name" as the key for tool names, not "name". The tool call parsing, comparison, and formatting all expect the "tool_name" field throughout the codebase.
Applied to files:
- README.md
- src/lightspeed_evaluation/core/metrics/custom/tool_eval.py
📚 Learning: 2025-09-08T11:11:54.516Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: config/system.yaml:78-82
Timestamp: 2025-09-08T11:11:54.516Z
Learning: For the custom:tool_eval metric, when threshold is not specified (None), the system defaults to checking if score > 0, providing less strict evaluation logic compared to exact matching. This allows for more flexible tool call evaluation where partial correctness is acceptable.
Applied to files:
- README.md
- src/lightspeed_evaluation/core/metrics/custom/tool_eval.py
📚 Learning: 2025-08-26T11:17:48.640Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 28
File: lsc_eval/runner.py:99-103
Timestamp: 2025-08-26T11:17:48.640Z
Learning: The lsc_eval generic evaluation tool is intended to become the primary evaluation framework, replacing an existing evaluation tool in the lightspeed-evaluation repository.
Applied to files:
src/lightspeed_evaluation/core/metrics/custom/tool_eval.py
📚 Learning: 2025-09-11T12:47:06.747Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/output/generator.py:140-145
Timestamp: 2025-09-11T12:47:06.747Z
Learning: User asamal4 prefers that non-critical comments are sent when actual code changes are pushed, not on unrelated commits.
Applied to files:
src/lightspeed_evaluation/core/metrics/custom/tool_eval.py
31e2708 to ec5f63d
Actionable comments posted: 0
🧹 Nitpick comments (1)
tests/unit/core/metrics/custom/test_tool_eval.py (1)
90-103: Strengthen error handling test.
The current test passes an invalid type (a `str` instead of a `list`) but has an uncertain assertion with two possible error messages. The comment suggests the error handling might not be as robust as intended. Consider:
- Using more specific assertions for the expected error scenario
- Adding separate test cases for different invalid input types (string, None, nested incorrect structure)
- Verifying that the error message clearly indicates the validation failure
Example improvement:
```diff
 def test_error_handling(self):
-    """Test error handling in evaluate_tool_calls."""
-    # Invalid expected format should be handled gracefully
-    expected = "invalid"  # Not a list
+    """Test error handling for invalid input type."""
+    expected = "invalid"  # String instead of list
     actual = []
     success, details = evaluate_tool_calls(expected, actual)
     assert success is False
-    # The function iterates over the string characters, so we get a different error
-    assert (
-        "not set as an expected alternative" in details
-        or "Tool evaluation error" in details
-    )
+    assert "Tool evaluation error" in details
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- README.md (3 hunks)
- tests/unit/core/metrics/__init__.py (1 hunks)
- tests/unit/core/metrics/custom/__init__.py (1 hunks)
- tests/unit/core/metrics/custom/test_tool_eval.py (1 hunks)
- tests/unit/core/models/__init__.py (1 hunks)
- tests/unit/core/models/test_data.py (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- tests/unit/core/metrics/custom/__init__.py
🚧 Files skipped from review as they are similar to previous changes (3)
- tests/unit/core/models/__init__.py
- tests/unit/core/metrics/__init__.py
- tests/unit/core/models/test_data.py
🧰 Additional context used
📓 Path-based instructions (3)
tests/unit/**
📄 CodeRabbit inference engine (AGENTS.md)
Place unit tests under tests/unit/ mirroring the source structure
Files:
tests/unit/core/metrics/custom/test_tool_eval.py
tests/**/test_*.py
📄 CodeRabbit inference engine (AGENTS.md)
Name test files as test_*.py
Files:
tests/unit/core/metrics/custom/test_tool_eval.py
tests/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
tests/**/*.py: Use pytest for mocking (pytest-mock’s mocker), not unittest.mock
Name test functions as test_*
Name test classes as Test*
Add comprehensive tests for new features with mocked LLM calls using pytest
Files:
tests/unit/core/metrics/custom/test_tool_eval.py
🧠 Learnings (6)
📓 Common learnings
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 28
File: lsc_eval/runner.py:99-103
Timestamp: 2025-08-26T11:17:48.640Z
Learning: The lsc_eval generic evaluation tool is intended to become the primary evaluation framework, replacing an existing evaluation tool in the lightspeed-evaluation repository.
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 24
File: lsc_agent_eval/README.md:116-136
Timestamp: 2025-08-13T14:07:44.195Z
Learning: In the lsc_agent_eval framework, the expected_tool_calls configuration uses "tool_name" as the key for tool names, not "name". The tool call evaluation implementation specifically looks for the "tool_name" field when comparing expected vs actual tool calls.
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: config/system.yaml:78-82
Timestamp: 2025-09-08T11:11:54.516Z
Learning: For the custom:tool_eval metric, when threshold is not specified (None), the system defaults to checking if score > 0, providing less strict evaluation logic compared to exact matching. This allows for more flexible tool call evaluation where partial correctness is acceptable.
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 24
File: lsc_agent_eval/README.md:116-136
Timestamp: 2025-08-13T14:07:44.195Z
Learning: In the lsc_agent_eval framework, the expected_tool_calls configuration and tool call evaluation implementation consistently uses "tool_name" as the key for tool names, not "name". The tool call parsing, comparison, and formatting all expect the "tool_name" field throughout the codebase.
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 90
File: src/lightspeed_evaluation/core/models/data.py:198-208
Timestamp: 2025-10-31T11:54:59.126Z
Learning: In the lightspeed_evaluation framework, the expected_tool_calls validator intentionally rejects a single empty set `[[]]` as the only alternative. This is by design: if no tool calls are expected, the tool_eval metric should not be configured for that turn. Empty sets are only valid as fallback alternatives (e.g., `[[[tool_call]], [[]]]`), representing optional tool call scenarios, not as primary or sole expectations.
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/output/generator.py:140-145
Timestamp: 2025-09-11T12:47:06.747Z
Learning: User asamal4 prefers that non-critical comments are sent when actual code changes are pushed, not on unrelated commits.
📚 Learning: 2025-10-31T11:54:59.126Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 90
File: src/lightspeed_evaluation/core/models/data.py:198-208
Timestamp: 2025-10-31T11:54:59.126Z
Learning: In the lightspeed_evaluation framework, the expected_tool_calls validator intentionally rejects a single empty set `[[]]` as the only alternative. This is by design: if no tool calls are expected, the tool_eval metric should not be configured for that turn. Empty sets are only valid as fallback alternatives (e.g., `[[[tool_call]], [[]]]`), representing optional tool call scenarios, not as primary or sole expectations.
Applied to files:
- README.md
- tests/unit/core/metrics/custom/test_tool_eval.py
📚 Learning: 2025-08-13T14:07:44.195Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 24
File: lsc_agent_eval/README.md:116-136
Timestamp: 2025-08-13T14:07:44.195Z
Learning: In the lsc_agent_eval framework, the expected_tool_calls configuration uses "tool_name" as the key for tool names, not "name". The tool call evaluation implementation specifically looks for the "tool_name" field when comparing expected vs actual tool calls.
Applied to files:
- README.md
- tests/unit/core/metrics/custom/test_tool_eval.py
📚 Learning: 2025-08-13T14:07:44.195Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 24
File: lsc_agent_eval/README.md:116-136
Timestamp: 2025-08-13T14:07:44.195Z
Learning: In the lsc_agent_eval framework, the expected_tool_calls configuration and tool call evaluation implementation consistently uses "tool_name" as the key for tool names, not "name". The tool call parsing, comparison, and formatting all expect the "tool_name" field throughout the codebase.
Applied to files:
README.md
📚 Learning: 2025-09-08T11:11:54.516Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: config/system.yaml:78-82
Timestamp: 2025-09-08T11:11:54.516Z
Learning: For the custom:tool_eval metric, when threshold is not specified (None), the system defaults to checking if score > 0, providing less strict evaluation logic compared to exact matching. This allows for more flexible tool call evaluation where partial correctness is acceptable.
Applied to files:
- README.md
- tests/unit/core/metrics/custom/test_tool_eval.py
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to tests/**/*.py : Add comprehensive tests for new features with mocked LLM calls using pytest
Applied to files:
tests/unit/core/metrics/custom/test_tool_eval.py
🧬 Code graph analysis (1)
tests/unit/core/metrics/custom/test_tool_eval.py (1)
src/lightspeed_evaluation/core/metrics/custom/tool_eval.py (1)
evaluate_tool_calls(10-34)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: mypy
- GitHub Check: tests (3.11)
- GitHub Check: tests (3.13)
- GitHub Check: pydocstyle
- GitHub Check: tests (3.12)
🔇 Additional comments (14)
README.md (4)
294-294: LGTM! The type change from `list[list[dict]]` to `list[list[list[dict]]]` correctly reflects the new multi-alternative structure for tool call evaluation.
305-305: LGTM! The note clearly indicates that `expected_tool_calls` now requires the multi-alternative sets format for the `custom:tool_eval` metric.
317-325: LGTM! The new "Tool Evaluation" section provides clear guidance on the multi-alternative format, including:
- Format specification
- Matching behavior
- Use cases
- Empty set constraints
This documentation will help users understand how to configure tool call evaluation with multiple alternatives.
329-348: LGTM! The updated example clearly demonstrates the multi-alternative format with well-commented scenarios:
- Primary approach with tool sequences
- Alternative approach with different tools
- Skip scenario with empty list
The inline comments help users understand each alternative's purpose.
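For orientation, here is a hypothetical `expected_tool_calls` value in that nested shape, written as Python literals mirroring the YAML structure (the tool names and arguments below are illustrative, not taken from the actual README example):

```python
# list[list[list[dict]]]: alternatives -> sequences -> tool calls
expected_tool_calls = [
    # Alternative 1: primary approach, one sequence of two calls
    [
        [
            {"tool_name": "get_namespaces", "arguments": {}},
            {"tool_name": "get_pods", "arguments": {"namespace": "openshift-.*"}},
        ]
    ],
    # Alternative 2: a different tool achieves the same result
    [
        [
            {"tool_name": "list_pods_all_namespaces", "arguments": {}},
        ]
    ],
    # Alternative 3: valid skip scenario -- no tool calls at all,
    # allowed only because it is a fallback, not the sole alternative
    [[]],
]
```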
tests/unit/core/metrics/custom/test_tool_eval.py (10)
1-9: LGTM! The test file structure follows coding guidelines:
- Located under tests/unit/ ✓
- Named as test_*.py ✓
- Uses pytest ✓
- Imports private functions for unit testing (acceptable practice) ✓
15-28: LGTM! The test validates that the primary pattern matches correctly and returns the expected success message format.
30-46: LGTM! The test correctly validates alternative pattern matching, ensuring that when the primary pattern doesn't match, the function tries subsequent alternatives.
48-57: Verify consistency with validation rules.
This test validates that `[[]]` (an empty set as the only alternative) can be successfully evaluated. However, based on learnings, the data model validator intentionally rejects `[[]]` as the only alternative, so such data should never reach the `evaluate_tool_calls` function in production. Consider whether this test should:
- Be removed (if testing unreachable code paths)
- Be moved to a separate "invalid input handling" test suite
- Remain as-is (if testing low-level function behavior independent of validation)
Based on learnings.
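For context, a rough sketch of the validation rule that learning describes, assuming a plain function rather than the actual Pydantic validator in data.py:

```python
def validate_expected_tool_calls(alternatives: list[list[list[dict]]]) -> None:
    """Reject configurations whose only alternative is an empty set."""

    def is_empty(alt: list[list[dict]]) -> bool:
        return all(len(seq) == 0 for seq in alt)

    if len(alternatives) == 1 and is_empty(alternatives[0]):
        # If no tool calls are expected at all, tool_eval should simply not be
        # configured for that turn, so a lone empty alternative is rejected.
        raise ValueError("expected_tool_calls must not be a single empty alternative")
```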
59-71: LGTM! The test correctly validates the skip scenario where an empty set appears as a fallback alternative (not as the only alternative), which aligns with the design.
73-88: LGTM! The test validates the failure path when none of the expected patterns match the actual tool calls, including checking for appropriate error messaging.
109-137: LGTM! The tests for `compare_tool_calls` provide good coverage:
- Exact matches
- Length mismatches
- Empty sequences
All assertions are clear and validate the expected behavior.
143-168: LGTM! The tests for `_compare_tool_call_sequence` validate both matching sequences and length mismatches correctly.
174-199: LGTM! The tests for `_compare_single_tool_call` cover:
- Successful tool name and argument matches
- Tool name mismatches
- Missing arguments
All test cases validate expected behavior correctly.
205-256: LGTM! The tests for `_compare_tool_arguments` provide comprehensive coverage:
- Exact argument matches
- Regex pattern matching
- Missing argument keys
- Extra argument keys
- Invalid regex patterns
- Non-dictionary arguments
This thorough test suite validates all the critical argument comparison scenarios.
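As a rough sketch of the comparison semantics those tests appear to exercise (assuming expected string values act as full-match regex patterns and that both missing and extra keys count as mismatches; this is not the module's actual `_compare_tool_arguments` code):

```python
import re
from typing import Any


def compare_tool_arguments(expected: Any, actual: Any) -> bool:
    """Compare argument dicts; expected string values may be regex patterns."""
    if not isinstance(expected, dict) or not isinstance(actual, dict):
        return False  # non-dictionary arguments are rejected outright
    if set(expected) != set(actual):
        return False  # missing or extra argument keys fail the match
    for key, exp_value in expected.items():
        act_value = actual[key]
        if exp_value == act_value:
            continue  # exact match
        try:
            # Fall back to treating the expected value as a full-match regex.
            if not re.fullmatch(str(exp_value), str(act_value)):
                return False
        except re.error:
            return False  # an invalid regex pattern counts as a mismatch
    return True
```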
@VladimirKadlec @tisnik PTAL
VladimirKadlec
left a comment
LGTM
tisnik
left a comment
LGTM
Background:
Optional tool calls: Imagine a scenario where we have a tool to fetch some information, but the same information is already present in the conversation history. The model may not make a fresh tool call, and that can be valid for some use cases.
Argument with default value: The argument value for a user query may be the same as the default value in the tool definition, so again it is up to the model to pass the argument explicitly or skip it completely.
So we have to provide the ability to handle this non-deterministic behavior.
Approach: Let the user provide multiple expected tool-call alternatives; the eval passes when one of those alternatives matches the actual calls and fails when none of them match.
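As a small illustration of that approach (the import path and function name come from this PR's changed files; the tool name and the shape of the actual calls are assumptions):

```python
from lightspeed_evaluation.core.metrics.custom.tool_eval import evaluate_tool_calls

# Two acceptable outcomes: the model makes a fresh tool call, or it answers
# from conversation history without calling any tool (empty fallback alternative).
expected_alternatives = [
    [[{"tool_name": "get_cluster_version", "arguments": {}}]],  # hypothetical tool
    [[]],  # valid skip scenario, allowed only as a fallback alternative
]
actual_calls = []  # assumed shape: no tool calls were made in this turn

success, details = evaluate_tool_calls(expected_alternatives, actual_calls)
print(success, details)  # expected to pass via the empty fallback alternative
```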
Summary by CodeRabbit
New Features
Models
Documentation
Tests