Ability to set alternate tool calls for eval #90
Conversation
Walkthrough
Adds multi-alternative support for expected tool-call sequences by changing `expected_tool_calls` from `list[list[dict]]` to `list[list[list[dict]]]`.
Changes
Sequence Diagram(s)
```mermaid
sequenceDiagram
participant Caller
participant ToolEval.evaluate_tool_calls as Evaluator
participant compare_tool_calls as Comparator
participant _create_success_message as SuccessMsg
participant _create_failure_message as FailureMsg
Caller->>Evaluator: evaluate_tool_calls(expected_alternatives, actual_calls)
loop for each alternative (i)
Evaluator->>Comparator: compare_tool_calls(expected_alternatives[i], actual_calls)
alt match
Comparator-->>Evaluator: True
Evaluator->>SuccessMsg: build success message (alternative i)
SuccessMsg-->>Evaluator: success message
Evaluator-->>Caller: (True, message)
Note right of Evaluator: returns on first matching alternative
else no match
Comparator-->>Evaluator: False
end
end
alt none matched
Evaluator->>FailureMsg: build failure message (details for all alternatives)
FailureMsg-->>Evaluator: failure message
Evaluator-->>Caller: (False, message)
end
alt exception
Evaluator-->>Caller: (False, error_message) and log
end
```
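In code terms, the diagram boils down to a first-match-wins loop. A minimal Python sketch of that flow, assuming simplified message strings and a placeholder comparator (the real implementations live in `tool_eval.py` and differ in detail):

```python
from typing import Any

ToolCall = dict[str, Any]            # e.g. {"tool_name": "...", "arguments": {...}}
ToolCallSet = list[list[ToolCall]]   # one alternative: sequences of tool calls


def compare_tool_calls(expected: ToolCallSet, actual: ToolCallSet) -> bool:
    """Placeholder for the real comparator (exact names, regex-capable arguments)."""
    return expected == actual


def evaluate_tool_calls(
    expected_alternatives: list[ToolCallSet], actual: ToolCallSet
) -> tuple[bool, str]:
    """Return (success, details); the first matching alternative wins."""
    try:
        for i, alternative in enumerate(expected_alternatives):
            if compare_tool_calls(alternative, actual):
                # Success message names the alternative that matched;
                # remaining alternatives are not inspected.
                return True, f"Alternative {i + 1} matched"
        # No alternative matched: the failure message covers all of them.
        return False, f"None of {len(expected_alternatives)} alternatives matched"
    except Exception as exc:  # defensive guard, mirrors the diagram's exception path
        return False, f"Tool evaluation error: {exc}"
```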
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Suggested reviewers
Pre-merge checks and finishing touches
✅ Passed checks (3 passed)
Actionable comments posted: 3
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
- README.md (3 hunks)
- src/lightspeed_evaluation/core/metrics/custom/tool_eval.py (2 hunks)
- src/lightspeed_evaluation/core/models/data.py (3 hunks)
- tests/unit/core/metrics/__init__.py (1 hunks)
- tests/unit/core/metrics/custom/__init__.py (1 hunks)
- tests/unit/core/metrics/custom/test_tool_eval.py (1 hunks)
- tests/unit/core/models/__init__.py (1 hunks)
- tests/unit/core/models/test_data.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (7)
tests/unit/**
📄 CodeRabbit inference engine (AGENTS.md)
Place unit tests under tests/unit/ mirroring the source structure
Files:
- tests/unit/core/metrics/__init__.py
- tests/unit/core/metrics/custom/test_tool_eval.py
- tests/unit/core/metrics/custom/__init__.py
- tests/unit/core/models/test_data.py
- tests/unit/core/models/__init__.py
tests/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
tests/**/*.py: Use pytest for mocking (pytest-mock’s mocker), not unittest.mock
Name test functions as test_*
Name test classes as Test*
Add comprehensive tests for new features with mocked LLM calls using pytest
Files:
- tests/unit/core/metrics/__init__.py
- tests/unit/core/metrics/custom/test_tool_eval.py
- tests/unit/core/metrics/custom/__init__.py
- tests/unit/core/models/test_data.py
- tests/unit/core/models/__init__.py
tests/**/test_*.py
📄 CodeRabbit inference engine (AGENTS.md)
Name test files as test_*.py
Files:
- tests/unit/core/metrics/custom/test_tool_eval.py
- tests/unit/core/models/test_data.py
src/lightspeed_evaluation/**
📄 CodeRabbit inference engine (AGENTS.md)
Add all new evaluation features under src/lightspeed_evaluation/ (do not add new features elsewhere)
Files:
- src/lightspeed_evaluation/core/metrics/custom/tool_eval.py
- src/lightspeed_evaluation/core/models/data.py
src/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
src/**/*.py: Require type hints for all public functions and methods
Use Google-style docstrings for all public APIs
Use custom exceptions from core.system.exceptions for error handling
Use structured logging with appropriate levels
Files:
- src/lightspeed_evaluation/core/metrics/custom/tool_eval.py
- src/lightspeed_evaluation/core/models/data.py
src/lightspeed_evaluation/core/metrics/custom/**
📄 CodeRabbit inference engine (AGENTS.md)
Add new custom metrics under src/lightspeed_evaluation/core/metrics/custom/
Files:
src/lightspeed_evaluation/core/metrics/custom/tool_eval.py
src/lightspeed_evaluation/core/metrics/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Register new metrics in MetricManager’s supported_metrics dictionary
Files:
src/lightspeed_evaluation/core/metrics/custom/tool_eval.py
🧠 Learnings (9)
📓 Common learnings
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 28
File: lsc_eval/runner.py:99-103
Timestamp: 2025-08-26T11:17:48.640Z
Learning: The lsc_eval generic evaluation tool is intended to become the primary evaluation framework, replacing an existing evaluation tool in the lightspeed-evaluation repository.
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 24
File: lsc_agent_eval/README.md:116-136
Timestamp: 2025-08-13T14:07:44.195Z
Learning: In the lsc_agent_eval framework, the expected_tool_calls configuration uses "tool_name" as the key for tool names, not "name". The tool call evaluation implementation specifically looks for the "tool_name" field when comparing expected vs actual tool calls.
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: config/system.yaml:78-82
Timestamp: 2025-09-08T11:11:54.516Z
Learning: For the custom:tool_eval metric, when threshold is not specified (None), the system defaults to checking if score > 0, providing less strict evaluation logic compared to exact matching. This allows for more flexible tool call evaluation where partial correctness is acceptable.
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 24
File: lsc_agent_eval/README.md:116-136
Timestamp: 2025-08-13T14:07:44.195Z
Learning: In the lsc_agent_eval framework, the expected_tool_calls configuration and tool call evaluation implementation consistently uses "tool_name" as the key for tool names, not "name". The tool call parsing, comparison, and formatting all expect the "tool_name" field throughout the codebase.
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/output/generator.py:140-145
Timestamp: 2025-09-11T12:47:06.747Z
Learning: User asamal4 prefers that non-critical comments are sent when actual code changes are pushed, not on unrelated commits.
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to src/lightspeed_evaluation/core/metrics/**/*.py : Register new metrics in MetricManager’s supported_metrics dictionary
Applied to files:
- tests/unit/core/metrics/__init__.py
- tests/unit/core/metrics/custom/__init__.py
📚 Learning: 2025-08-13T14:07:44.195Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 24
File: lsc_agent_eval/README.md:116-136
Timestamp: 2025-08-13T14:07:44.195Z
Learning: In the lsc_agent_eval framework, the expected_tool_calls configuration uses "tool_name" as the key for tool names, not "name". The tool call evaluation implementation specifically looks for the "tool_name" field when comparing expected vs actual tool calls.
Applied to files:
- README.md
- tests/unit/core/metrics/custom/test_tool_eval.py
- src/lightspeed_evaluation/core/metrics/custom/tool_eval.py
- src/lightspeed_evaluation/core/models/data.py
📚 Learning: 2025-08-13T14:07:44.195Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 24
File: lsc_agent_eval/README.md:116-136
Timestamp: 2025-08-13T14:07:44.195Z
Learning: In the lsc_agent_eval framework, the expected_tool_calls configuration and tool call evaluation implementation consistently uses "tool_name" as the key for tool names, not "name". The tool call parsing, comparison, and formatting all expect the "tool_name" field throughout the codebase.
Applied to files:
- README.md
- tests/unit/core/metrics/custom/test_tool_eval.py
- src/lightspeed_evaluation/core/metrics/custom/tool_eval.py
📚 Learning: 2025-09-08T11:11:54.516Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: config/system.yaml:78-82
Timestamp: 2025-09-08T11:11:54.516Z
Learning: For the custom:tool_eval metric, when threshold is not specified (None), the system defaults to checking if score > 0, providing less strict evaluation logic compared to exact matching. This allows for more flexible tool call evaluation where partial correctness is acceptable.
Applied to files:
- README.md
- tests/unit/core/metrics/custom/test_tool_eval.py
- src/lightspeed_evaluation/core/metrics/custom/tool_eval.py
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to tests/**/*.py : Add comprehensive tests for new features with mocked LLM calls using pytest
Applied to files:
- tests/unit/core/metrics/custom/test_tool_eval.py
- tests/unit/core/models/test_data.py
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to src/lightspeed_evaluation/core/metrics/custom/** : Add new custom metrics under src/lightspeed_evaluation/core/metrics/custom/
Applied to files:
tests/unit/core/metrics/custom/__init__.py
📚 Learning: 2025-08-26T11:17:48.640Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 28
File: lsc_eval/runner.py:99-103
Timestamp: 2025-08-26T11:17:48.640Z
Learning: The lsc_eval generic evaluation tool is intended to become the primary evaluation framework, replacing an existing evaluation tool in the lightspeed-evaluation repository.
Applied to files:
src/lightspeed_evaluation/core/metrics/custom/tool_eval.py
📚 Learning: 2025-09-18T23:59:37.026Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.
Applied to files:
src/lightspeed_evaluation/core/models/data.py
🧬 Code graph analysis (2)
tests/unit/core/metrics/custom/test_tool_eval.py (1)
src/lightspeed_evaluation/core/metrics/custom/tool_eval.py (1)
evaluate_tool_calls(10-34)
tests/unit/core/models/test_data.py (1)
src/lightspeed_evaluation/core/models/data.py (2)
TurnData(35-255)
_is_single_set_format(151-170)
🪛 GitHub Actions: Black
tests/unit/core/metrics/custom/test_tool_eval.py
[error] 1-1: Black formatting check failed. 4 files would be reformatted. Run 'black .' to format.
tests/unit/core/models/test_data.py
[error] 1-1: Black formatting check failed. 4 files would be reformatted. Run 'black .' to format.
src/lightspeed_evaluation/core/metrics/custom/tool_eval.py
[error] 1-1: Black formatting check failed. 4 files would be reformatted. Run 'black .' to format.
src/lightspeed_evaluation/core/models/data.py
[error] 1-1: Black formatting check failed. 4 files would be reformatted. Run 'black .' to format.
🪛 GitHub Actions: Pyright
src/lightspeed_evaluation/core/models/data.py
[error] 215-215: pyright: Type "list[Unknown] | bool" is not assignable to return type "bool" in data.py:215:16 (reportReturnType)
🪛 GitHub Actions: Python linter
src/lightspeed_evaluation/core/metrics/custom/tool_eval.py
[error] 169-169: Trailing whitespace (trailing-whitespace).
[error] 172-172: Trailing whitespace (trailing-whitespace).
[warning] 149-149: W0613: Unused argument 'actual' (unused-argument).
src/lightspeed_evaluation/core/models/data.py
[error] 116-116: Trailing whitespace (trailing-whitespace).
[error] 119-119: Trailing whitespace (trailing-whitespace).
[error] 133-133: Trailing whitespace (trailing-whitespace).
[error] 135-135: Line too long (113/100) (line-too-long).
[error] 139-139: Trailing whitespace (trailing-whitespace).
[error] 145-145: Trailing whitespace (trailing-whitespace).
[error] 155-155: Trailing whitespace (trailing-whitespace).
[error] 163-163: Line too long (107/100) (line-too-long).
[error] 169-169: Trailing whitespace (trailing-whitespace).
[error] 202-202: Trailing whitespace (trailing-whitespace).
[error] 208-208: Line too long (105/100) (line-too-long).
[error] 159-159: R1705: Unnecessary "elif" after "return", remove the leading "el" from "elif" (no-else-return).
[error] 162-162: R1705: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it (no-else-return).
🪛 GitHub Actions: Tests
tests/unit/core/metrics/custom/test_tool_eval.py
[error] 57-57: TestEvaluateToolCalls::test_empty_pattern_match_primary failed: AssertionError: 'No tool calls made (valid skip scenario)' not found in details which contains 'Primary pattern matched: No tool calls made (valid alternate skip scenario)'
[error] 71-71: TestEvaluateToolCalls::test_empty_pattern_match_alternative failed: AssertionError: 'valid skip scenario' not found in details which contains 'Alternative 2 matched: No tool calls made (valid alternate skip scenario)'
[error] 100-100: TestEvaluateToolCalls::test_error_handling failed: AssertionError: Expected message to include 'not an expected alternative' or 'Tool evaluation error', but details were 'No actual tool calls made and this is not set as an expected alternative'
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: mypy
Assisted-by: Cursor
ebacc05 to 31e2708
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
- README.md (3 hunks)
- src/lightspeed_evaluation/core/metrics/custom/tool_eval.py (2 hunks)
- src/lightspeed_evaluation/core/models/data.py (3 hunks)
- tests/unit/core/metrics/__init__.py (1 hunks)
- tests/unit/core/metrics/custom/__init__.py (1 hunks)
- tests/unit/core/metrics/custom/test_tool_eval.py (1 hunks)
- tests/unit/core/models/__init__.py (1 hunks)
- tests/unit/core/models/test_data.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (5)
- tests/unit/core/models/test_data.py
- tests/unit/core/metrics/custom/test_tool_eval.py
- tests/unit/core/metrics/__init__.py
- tests/unit/core/metrics/custom/__init__.py
- tests/unit/core/models/__init__.py
🧰 Additional context used
📓 Path-based instructions (4)
src/lightspeed_evaluation/**
📄 CodeRabbit inference engine (AGENTS.md)
Add all new evaluation features under src/lightspeed_evaluation/ (do not add new features elsewhere)
Files:
- src/lightspeed_evaluation/core/metrics/custom/tool_eval.py
- src/lightspeed_evaluation/core/models/data.py
src/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
src/**/*.py: Require type hints for all public functions and methods
Use Google-style docstrings for all public APIs
Use custom exceptions from core.system.exceptions for error handling
Use structured logging with appropriate levels
Files:
- src/lightspeed_evaluation/core/metrics/custom/tool_eval.py
- src/lightspeed_evaluation/core/models/data.py
src/lightspeed_evaluation/core/metrics/custom/**
📄 CodeRabbit inference engine (AGENTS.md)
Add new custom metrics under src/lightspeed_evaluation/core/metrics/custom/
Files:
src/lightspeed_evaluation/core/metrics/custom/tool_eval.py
src/lightspeed_evaluation/core/metrics/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Register new metrics in MetricManager’s supported_metrics dictionary
Files:
src/lightspeed_evaluation/core/metrics/custom/tool_eval.py
🧠 Learnings (6)
📓 Common learnings
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 28
File: lsc_eval/runner.py:99-103
Timestamp: 2025-08-26T11:17:48.640Z
Learning: The lsc_eval generic evaluation tool is intended to become the primary evaluation framework, replacing an existing evaluation tool in the lightspeed-evaluation repository.
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 24
File: lsc_agent_eval/README.md:116-136
Timestamp: 2025-08-13T14:07:44.195Z
Learning: In the lsc_agent_eval framework, the expected_tool_calls configuration uses "tool_name" as the key for tool names, not "name". The tool call evaluation implementation specifically looks for the "tool_name" field when comparing expected vs actual tool calls.
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: config/system.yaml:78-82
Timestamp: 2025-09-08T11:11:54.516Z
Learning: For the custom:tool_eval metric, when threshold is not specified (None), the system defaults to checking if score > 0, providing less strict evaluation logic compared to exact matching. This allows for more flexible tool call evaluation where partial correctness is acceptable.
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 24
File: lsc_agent_eval/README.md:116-136
Timestamp: 2025-08-13T14:07:44.195Z
Learning: In the lsc_agent_eval framework, the expected_tool_calls configuration and tool call evaluation implementation consistently uses "tool_name" as the key for tool names, not "name". The tool call parsing, comparison, and formatting all expect the "tool_name" field throughout the codebase.
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/output/generator.py:140-145
Timestamp: 2025-09-11T12:47:06.747Z
Learning: User asamal4 prefers that non-critical comments are sent when actual code changes are pushed, not on unrelated commits.
📚 Learning: 2025-08-13T14:07:44.195Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 24
File: lsc_agent_eval/README.md:116-136
Timestamp: 2025-08-13T14:07:44.195Z
Learning: In the lsc_agent_eval framework, the expected_tool_calls configuration uses "tool_name" as the key for tool names, not "name". The tool call evaluation implementation specifically looks for the "tool_name" field when comparing expected vs actual tool calls.
Applied to files:
- README.md
- src/lightspeed_evaluation/core/metrics/custom/tool_eval.py
- src/lightspeed_evaluation/core/models/data.py
📚 Learning: 2025-08-13T14:07:44.195Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 24
File: lsc_agent_eval/README.md:116-136
Timestamp: 2025-08-13T14:07:44.195Z
Learning: In the lsc_agent_eval framework, the expected_tool_calls configuration and tool call evaluation implementation consistently uses "tool_name" as the key for tool names, not "name". The tool call parsing, comparison, and formatting all expect the "tool_name" field throughout the codebase.
Applied to files:
- README.md
- src/lightspeed_evaluation/core/metrics/custom/tool_eval.py
📚 Learning: 2025-09-08T11:11:54.516Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: config/system.yaml:78-82
Timestamp: 2025-09-08T11:11:54.516Z
Learning: For the custom:tool_eval metric, when threshold is not specified (None), the system defaults to checking if score > 0, providing less strict evaluation logic compared to exact matching. This allows for more flexible tool call evaluation where partial correctness is acceptable.
Applied to files:
- README.md
- src/lightspeed_evaluation/core/metrics/custom/tool_eval.py
📚 Learning: 2025-08-26T11:17:48.640Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 28
File: lsc_eval/runner.py:99-103
Timestamp: 2025-08-26T11:17:48.640Z
Learning: The lsc_eval generic evaluation tool is intended to become the primary evaluation framework, replacing an existing evaluation tool in the lightspeed-evaluation repository.
Applied to files:
src/lightspeed_evaluation/core/metrics/custom/tool_eval.py
📚 Learning: 2025-09-11T12:47:06.747Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/output/generator.py:140-145
Timestamp: 2025-09-11T12:47:06.747Z
Learning: User asamal4 prefers that non-critical comments are sent when actual code changes are pushed, not on unrelated commits.
Applied to files:
src/lightspeed_evaluation/core/metrics/custom/tool_eval.py
31e2708 to ec5f63d
Actionable comments posted: 0
🧹 Nitpick comments (1)
tests/unit/core/metrics/custom/test_tool_eval.py (1)
90-103: Strengthen error handling test.
The current test passes an invalid type (a `str` instead of a `list`) but has an uncertain assertion with two possible error messages. The comment suggests the error handling might not be as robust as intended. Consider:
- Using more specific assertions for the expected error scenario
- Adding separate test cases for different invalid input types (string, None, nested incorrect structure)
- Verifying that the error message clearly indicates the validation failure
Example improvement:
```diff
 def test_error_handling(self):
-    """Test error handling in evaluate_tool_calls."""
-    # Invalid expected format should be handled gracefully
-    expected = "invalid"  # Not a list
+    """Test error handling for invalid input type."""
+    expected = "invalid"  # String instead of list
     actual = []
     success, details = evaluate_tool_calls(expected, actual)
     assert success is False
-    # The function iterates over the string characters, so we get a different error
-    assert (
-        "not set as an expected alternative" in details
-        or "Tool evaluation error" in details
-    )
+    assert "Tool evaluation error" in details
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- README.md (3 hunks)
- tests/unit/core/metrics/__init__.py (1 hunks)
- tests/unit/core/metrics/custom/__init__.py (1 hunks)
- tests/unit/core/metrics/custom/test_tool_eval.py (1 hunks)
- tests/unit/core/models/__init__.py (1 hunks)
- tests/unit/core/models/test_data.py (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- tests/unit/core/metrics/custom/__init__.py
🚧 Files skipped from review as they are similar to previous changes (3)
- tests/unit/core/models/__init__.py
- tests/unit/core/metrics/__init__.py
- tests/unit/core/models/test_data.py
🧰 Additional context used
📓 Path-based instructions (3)
tests/unit/**
📄 CodeRabbit inference engine (AGENTS.md)
Place unit tests under tests/unit/ mirroring the source structure
Files:
tests/unit/core/metrics/custom/test_tool_eval.py
tests/**/test_*.py
📄 CodeRabbit inference engine (AGENTS.md)
Name test files as test_*.py
Files:
tests/unit/core/metrics/custom/test_tool_eval.py
tests/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
tests/**/*.py: Use pytest for mocking (pytest-mock’s mocker), not unittest.mock
Name test functions as test_*
Name test classes as Test*
Add comprehensive tests for new features with mocked LLM calls using pytest
Files:
tests/unit/core/metrics/custom/test_tool_eval.py
🧠 Learnings (6)
📓 Common learnings
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 28
File: lsc_eval/runner.py:99-103
Timestamp: 2025-08-26T11:17:48.640Z
Learning: The lsc_eval generic evaluation tool is intended to become the primary evaluation framework, replacing an existing evaluation tool in the lightspeed-evaluation repository.
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 24
File: lsc_agent_eval/README.md:116-136
Timestamp: 2025-08-13T14:07:44.195Z
Learning: In the lsc_agent_eval framework, the expected_tool_calls configuration uses "tool_name" as the key for tool names, not "name". The tool call evaluation implementation specifically looks for the "tool_name" field when comparing expected vs actual tool calls.
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: config/system.yaml:78-82
Timestamp: 2025-09-08T11:11:54.516Z
Learning: For the custom:tool_eval metric, when threshold is not specified (None), the system defaults to checking if score > 0, providing less strict evaluation logic compared to exact matching. This allows for more flexible tool call evaluation where partial correctness is acceptable.
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 24
File: lsc_agent_eval/README.md:116-136
Timestamp: 2025-08-13T14:07:44.195Z
Learning: In the lsc_agent_eval framework, the expected_tool_calls configuration and tool call evaluation implementation consistently uses "tool_name" as the key for tool names, not "name". The tool call parsing, comparison, and formatting all expect the "tool_name" field throughout the codebase.
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 90
File: src/lightspeed_evaluation/core/models/data.py:198-208
Timestamp: 2025-10-31T11:54:59.126Z
Learning: In the lightspeed_evaluation framework, the expected_tool_calls validator intentionally rejects a single empty set `[[]]` as the only alternative. This is by design: if no tool calls are expected, the tool_eval metric should not be configured for that turn. Empty sets are only valid as fallback alternatives (e.g., `[[[tool_call]], [[]]]`), representing optional tool call scenarios, not as primary or sole expectations.
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/output/generator.py:140-145
Timestamp: 2025-09-11T12:47:06.747Z
Learning: User asamal4 prefers that non-critical comments are sent when actual code changes are pushed, not on unrelated commits.
📚 Learning: 2025-10-31T11:54:59.126Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 90
File: src/lightspeed_evaluation/core/models/data.py:198-208
Timestamp: 2025-10-31T11:54:59.126Z
Learning: In the lightspeed_evaluation framework, the expected_tool_calls validator intentionally rejects a single empty set `[[]]` as the only alternative. This is by design: if no tool calls are expected, the tool_eval metric should not be configured for that turn. Empty sets are only valid as fallback alternatives (e.g., `[[[tool_call]], [[]]]`), representing optional tool call scenarios, not as primary or sole expectations.
Applied to files:
- README.md
- tests/unit/core/metrics/custom/test_tool_eval.py
📚 Learning: 2025-08-13T14:07:44.195Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 24
File: lsc_agent_eval/README.md:116-136
Timestamp: 2025-08-13T14:07:44.195Z
Learning: In the lsc_agent_eval framework, the expected_tool_calls configuration uses "tool_name" as the key for tool names, not "name". The tool call evaluation implementation specifically looks for the "tool_name" field when comparing expected vs actual tool calls.
Applied to files:
- README.md
- tests/unit/core/metrics/custom/test_tool_eval.py
📚 Learning: 2025-08-13T14:07:44.195Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 24
File: lsc_agent_eval/README.md:116-136
Timestamp: 2025-08-13T14:07:44.195Z
Learning: In the lsc_agent_eval framework, the expected_tool_calls configuration and tool call evaluation implementation consistently uses "tool_name" as the key for tool names, not "name". The tool call parsing, comparison, and formatting all expect the "tool_name" field throughout the codebase.
Applied to files:
README.md
📚 Learning: 2025-09-08T11:11:54.516Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: config/system.yaml:78-82
Timestamp: 2025-09-08T11:11:54.516Z
Learning: For the custom:tool_eval metric, when threshold is not specified (None), the system defaults to checking if score > 0, providing less strict evaluation logic compared to exact matching. This allows for more flexible tool call evaluation where partial correctness is acceptable.
Applied to files:
- README.md
- tests/unit/core/metrics/custom/test_tool_eval.py
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to tests/**/*.py : Add comprehensive tests for new features with mocked LLM calls using pytest
Applied to files:
tests/unit/core/metrics/custom/test_tool_eval.py
🧬 Code graph analysis (1)
tests/unit/core/metrics/custom/test_tool_eval.py (1)
src/lightspeed_evaluation/core/metrics/custom/tool_eval.py (1)
evaluate_tool_calls(10-34)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: mypy
- GitHub Check: tests (3.11)
- GitHub Check: tests (3.13)
- GitHub Check: pydocstyle
- GitHub Check: tests (3.12)
🔇 Additional comments (14)
README.md (4)
294-294: LGTM! The type change from `list[list[dict]]` to `list[list[list[dict]]]` correctly reflects the new multi-alternative structure for tool call evaluation.
305-305: LGTM! The note clearly indicates that `expected_tool_calls` now requires the multi-alternative sets format for the `custom:tool_eval` metric.
317-325: LGTM! The new "Tool Evaluation" section provides clear guidance on the multi-alternative format, including:
- Format specification
- Matching behavior
- Use cases
- Empty set constraints
This documentation will help users understand how to configure tool call evaluation with multiple alternatives.
329-348: LGTM! The updated example clearly demonstrates the multi-alternative format with well-commented scenarios:
- Primary approach with tool sequences
- Alternative approach with different tools
- Skip scenario with empty list
The inline comments help users understand each alternative's purpose.
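For orientation, here is a hypothetical `expected_tool_calls` value in that nested shape, written as Python literals mirroring the YAML structure (the tool names and arguments below are illustrative, not taken from the actual README example):

```python
# list[list[list[dict]]]: alternatives -> sequences -> tool calls
expected_tool_calls = [
    # Alternative 1: primary approach, one sequence of two calls
    [
        [
            {"tool_name": "get_namespaces", "arguments": {}},
            {"tool_name": "get_pods", "arguments": {"namespace": "openshift-.*"}},
        ]
    ],
    # Alternative 2: a different tool achieves the same result
    [
        [
            {"tool_name": "list_pods_all_namespaces", "arguments": {}},
        ]
    ],
    # Alternative 3: valid skip scenario -- no tool calls at all,
    # allowed only because it is a fallback, not the sole alternative
    [[]],
]
```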
tests/unit/core/metrics/custom/test_tool_eval.py (10)
1-9: LGTM! The test file structure follows coding guidelines:
- Located under tests/unit/ ✓
- Named as test_*.py ✓
- Uses pytest ✓
- Imports private functions for unit testing (acceptable practice) ✓
15-28: LGTM! The test validates that the primary pattern matches correctly and returns the expected success message format.
30-46: LGTM! The test correctly validates alternative pattern matching, ensuring that when the primary pattern doesn't match, the function tries subsequent alternatives.
48-57: Verify consistency with validation rules.
This test validates that `[[]]` (an empty set as the only alternative) can be successfully evaluated. However, based on learnings, the data model validator intentionally rejects `[[]]` as the only alternative, so such data should never reach the `evaluate_tool_calls` function in production. Consider whether this test should:
- Be removed (if testing unreachable code paths)
- Be moved to a separate "invalid input handling" test suite
- Remain as-is (if testing low-level function behavior independent of validation)
Based on learnings.
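For context, a rough sketch of the validation rule that learning describes, assuming a plain function rather than the actual Pydantic validator in data.py:

```python
def validate_expected_tool_calls(alternatives: list[list[list[dict]]]) -> None:
    """Reject configurations whose only alternative is an empty set."""

    def is_empty(alt: list[list[dict]]) -> bool:
        return all(len(seq) == 0 for seq in alt)

    if len(alternatives) == 1 and is_empty(alternatives[0]):
        # If no tool calls are expected at all, tool_eval should simply not be
        # configured for that turn, so a lone empty alternative is rejected.
        raise ValueError("expected_tool_calls must not be a single empty alternative")
```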
59-71: LGTM! The test correctly validates the skip scenario where an empty set appears as a fallback alternative (not as the only alternative), which aligns with the design.
73-88: LGTM! The test validates the failure path when none of the expected patterns match the actual tool calls, including checking for appropriate error messaging.
109-137: LGTM! The tests for `compare_tool_calls` provide good coverage:
- Exact matches
- Length mismatches
- Empty sequences
All assertions are clear and validate the expected behavior.
143-168: LGTM! The tests for `_compare_tool_call_sequence` validate both matching sequences and length mismatches correctly.
174-199: LGTM! The tests for `_compare_single_tool_call` cover:
- Successful tool name and argument matches
- Tool name mismatches
- Missing arguments
All test cases validate expected behavior correctly.
205-256: LGTM! The tests for `_compare_tool_arguments` provide comprehensive coverage:
- Exact argument matches
- Regex pattern matching
- Missing argument keys
- Extra argument keys
- Invalid regex patterns
- Non-dictionary arguments
This thorough test suite validates all the critical argument comparison scenarios.
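As a rough sketch of the comparison semantics those tests appear to exercise (assuming expected string values act as full-match regex patterns and that both missing and extra keys count as mismatches; this is not the module's actual `_compare_tool_arguments` code):

```python
import re
from typing import Any


def compare_tool_arguments(expected: Any, actual: Any) -> bool:
    """Compare argument dicts; expected string values may be regex patterns."""
    if not isinstance(expected, dict) or not isinstance(actual, dict):
        return False  # non-dictionary arguments are rejected outright
    if set(expected) != set(actual):
        return False  # missing or extra argument keys fail the match
    for key, exp_value in expected.items():
        act_value = actual[key]
        if exp_value == act_value:
            continue  # exact match
        try:
            # Fall back to treating the expected value as a full-match regex.
            if not re.fullmatch(str(exp_value), str(act_value)):
                return False
        except re.error:
            return False  # an invalid regex pattern counts as a mismatch
    return True
```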
@VladimirKadlec @tisnik PTAL
VladimirKadlec
left a comment
LGTM
tisnik
left a comment
LGTM
Background:
Optional tool calls: Imagine a scenario where we have a tool to fetch some information, but the same information is already present in the conversation history. The model may not make a fresh tool call, and that can be valid for some use cases.
Argument with default value: The argument value for a user query may be the same as the default value in the tool definition, so again it is up to the model to pass the argument explicitly or skip it completely.
So we have to provide the ability to handle this non-deterministic behavior.
Approach: Let the user provide multiple expected tool-call alternatives; the eval passes when one of those alternatives matches the actual calls and fails when none of them match.
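As a small illustration of that approach (the import path and function name come from this PR's changed files; the tool name and the shape of the actual calls are assumptions):

```python
from lightspeed_evaluation.core.metrics.custom.tool_eval import evaluate_tool_calls

# Two acceptable outcomes: the model makes a fresh tool call, or it answers
# from conversation history without calling any tool (empty fallback alternative).
expected_alternatives = [
    [[{"tool_name": "get_cluster_version", "arguments": {}}]],  # hypothetical tool
    [[]],  # valid skip scenario, allowed only as a fallback alternative
]
actual_calls = []  # assumed shape: no tool calls were made in this turn

success, details = evaluate_tool_calls(expected_alternatives, actual_calls)
print(success, details)  # expected to pass via the empty fallback alternative
```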
Summary by CodeRabbit
New Features
Models
Documentation
Tests