
Conversation

@asamal4 (Collaborator) commented Nov 13, 2025

Handle the no-tool-call alternative.

Summary by CodeRabbit

Bug Fixes

  • Tool call evaluation now proceeds when tool calls are absent, enabling proper assessment of alternative scenarios where no tools are expected.

Tests

  • Added test coverage for evaluation scenarios with missing tool calls.

@coderabbitai bot (Contributor) commented Nov 13, 2025

Walkthrough

The change removes an early return in _evaluate_tool_calls when tool calls are absent or empty. Instead of immediately returning a 0.0 score, missing/empty tool_calls are now treated as an empty list and forwarded to evaluate_tool_calls for determination. A unit test is added to verify correct handling of None tool_calls with alternative expectations.
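
Since the diff itself is not reproduced in this conversation, here is a minimal before/after sketch of the control flow the walkthrough describes; everything except the `_evaluate_tool_calls` and `evaluate_tool_calls` names is an assumption.

```python
# Illustrative sketch only -- attribute and parameter names other than
# _evaluate_tool_calls / evaluate_tool_calls are assumptions, not the PR diff.

def _evaluate_tool_calls_before(turn_data, evaluate_tool_calls):
    """Old behavior: missing/empty tool calls short-circuit to a 0.0 score."""
    if not turn_data.tool_calls:
        return 0.0
    return evaluate_tool_calls(turn_data.tool_calls, turn_data.expected_tool_calls)


def _evaluate_tool_calls_after(turn_data, evaluate_tool_calls):
    """New behavior: treat missing tool calls as [] and forward them."""
    tool_calls = turn_data.tool_calls or []
    # evaluate_tool_calls can now match the empty list against an empty
    # fallback alternative in expected_tool_calls.
    return evaluate_tool_calls(tool_calls, turn_data.expected_tool_calls)
```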

Changes

  • Production Logic (src/lightspeed_evaluation/core/metrics/custom/custom.py): Removed the early return in _evaluate_tool_calls when tool_calls are missing/empty; they are now treated as an empty list and forwarded to evaluate_tool_calls instead of immediately returning a 0.0 score.
  • Test Coverage (tests/unit/core/metrics/custom/test_custom.py): New unit test module for CustomMetrics; validates that _evaluate_tool_calls correctly handles None tool_calls and matches alternative expectations.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

  • Focus on verifying the control flow change logic is correct and doesn't introduce unintended side effects
  • Confirm the test properly validates the new behavior with None tool_calls and alternative matching expectations

Possibly related PRs

  • Ability to set alternate tool calls for eval #90: Implements multi-alternative/empty-set matching and messaging in evaluate_tool_calls, which complements this change by providing the evaluation logic that now receives the forwarded empty tool_calls
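
The matching logic referenced above is not shown in this PR, so the following is only a rough sketch of the "try each alternative, including an empty fallback" idea; the framework's real expected_tool_calls nesting (per-alternative sequences of calls) is flattened here, and the tool name is hypothetical.

```python
from typing import Any, Callable

ToolCallList = list[dict[str, Any]]


def matches_any_alternative(
    actual: ToolCallList,
    alternatives: list[ToolCallList],
    matches: Callable[[ToolCallList, ToolCallList], bool],
) -> bool:
    """Return True if the actual tool calls satisfy any expected alternative."""
    return any(matches(actual, alternative) for alternative in alternatives)


# With an empty fallback alternative configured, forwarding [] (instead of
# short-circuiting to 0.0) lets a turn with no tool calls still pass.
alternatives = [[{"tool_name": "search"}], []]  # hypothetical primary + empty fallback
assert matches_any_alternative([], alternatives, lambda a, e: a == e) is True
```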

Suggested reviewers

  • VladimirKadlec
  • tisnik

Poem

🐰 No early returns today,
Let empty calls find their way,
Forward to evaluate's care,
Alternatives blooming fair!
Logic flows, not stops—hooray!

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
  • Description Check: ✅ Passed. Check skipped; CodeRabbit’s high-level summary is enabled.
  • Title Check: ✅ Passed. The title 'handle no tool call alternative' directly describes the main change: removing the early return for missing tool calls to allow evaluation via alternative paths.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 24f68c4 and e647043.

📒 Files selected for processing (2)
  • src/lightspeed_evaluation/core/metrics/custom/custom.py (1 hunks)
  • tests/unit/core/metrics/custom/test_custom.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (7)
src/lightspeed_evaluation/**

📄 CodeRabbit inference engine (AGENTS.md)

Add all new evaluation features under src/lightspeed_evaluation/ (do not add new features elsewhere)

Files:

  • src/lightspeed_evaluation/core/metrics/custom/custom.py
src/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

src/**/*.py: Require type hints for all public functions and methods
Use Google-style docstrings for all public APIs
Use custom exceptions from core.system.exceptions for error handling
Use structured logging with appropriate levels

Files:

  • src/lightspeed_evaluation/core/metrics/custom/custom.py
src/lightspeed_evaluation/core/metrics/custom/**

📄 CodeRabbit inference engine (AGENTS.md)

Add new custom metrics under src/lightspeed_evaluation/core/metrics/custom/

Files:

  • src/lightspeed_evaluation/core/metrics/custom/custom.py
src/lightspeed_evaluation/core/metrics/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Register new metrics in MetricManager’s supported_metrics dictionary

Files:

  • src/lightspeed_evaluation/core/metrics/custom/custom.py
tests/unit/**

📄 CodeRabbit inference engine (AGENTS.md)

Place unit tests under tests/unit/ mirroring the source structure

Files:

  • tests/unit/core/metrics/custom/test_custom.py
tests/**/test_*.py

📄 CodeRabbit inference engine (AGENTS.md)

Name test files as test_*.py

Files:

  • tests/unit/core/metrics/custom/test_custom.py
tests/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

tests/**/*.py: Use pytest for mocking (pytest-mock’s mocker), not unittest.mock
Name test functions as test_*
Name test classes as Test*
Add comprehensive tests for new features with mocked LLM calls using pytest

Files:

  • tests/unit/core/metrics/custom/test_custom.py
🧠 Learnings (10)
📓 Common learnings
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: config/system.yaml:78-82
Timestamp: 2025-09-08T11:11:54.516Z
Learning: For the custom:tool_eval metric, when threshold is not specified (None), the system defaults to checking if score > 0, providing less strict evaluation logic compared to exact matching. This allows for more flexible tool call evaluation where partial correctness is acceptable.
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 90
File: src/lightspeed_evaluation/core/models/data.py:198-208
Timestamp: 2025-10-31T11:54:59.126Z
Learning: In the lightspeed_evaluation framework, the expected_tool_calls validator intentionally rejects a single empty set `[[]]` as the only alternative. This is by design: if no tool calls are expected, the tool_eval metric should not be configured for that turn. Empty sets are only valid as fallback alternatives (e.g., `[[[tool_call]], [[]]]`), representing optional tool call scenarios, not as primary or sole expectations.
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/pipeline/evaluation/evaluator.py:85-89
Timestamp: 2025-09-10T06:57:46.326Z
Learning: For binary metrics like custom:tool_eval, using an explicit threshold of 0.5 is preferred over None threshold with special case handling. This provides consistent behavior where 0.0 scores fail and 1.0 scores pass.
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 24
File: lsc_agent_eval/README.md:116-136
Timestamp: 2025-08-13T14:07:44.195Z
Learning: In the lsc_agent_eval framework, the expected_tool_calls configuration uses "tool_name" as the key for tool names, not "name". The tool call evaluation implementation specifically looks for the "tool_name" field when comparing expected vs actual tool calls.
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 24
File: lsc_agent_eval/README.md:116-136
Timestamp: 2025-08-13T14:07:44.195Z
Learning: In the lsc_agent_eval framework, the expected_tool_calls configuration and tool call evaluation implementation consistently uses "tool_name" as the key for tool names, not "name". The tool call parsing, comparison, and formatting all expect the "tool_name" field throughout the codebase.
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/output/generator.py:140-145
Timestamp: 2025-09-11T12:47:06.747Z
Learning: User asamal4 prefers that non-critical comments are sent when actual code changes are pushed, not on unrelated commits.
📚 Learning: 2025-09-08T11:11:54.516Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: config/system.yaml:78-82
Timestamp: 2025-09-08T11:11:54.516Z
Learning: For the custom:tool_eval metric, when threshold is not specified (None), the system defaults to checking if score > 0, providing less strict evaluation logic compared to exact matching. This allows for more flexible tool call evaluation where partial correctness is acceptable.

Applied to files:

  • src/lightspeed_evaluation/core/metrics/custom/custom.py
  • tests/unit/core/metrics/custom/test_custom.py
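
A minimal sketch of the pass/fail rule the learning above describes, assuming a float score and an optional per-metric threshold; the comparison operator for the explicit-threshold case is an assumption (the cited learnings only state that 0.0 fails and 1.0 passes against 0.5).

```python
def tool_eval_passes(score: float, threshold: float | None = None) -> bool:
    """Illustrative pass/fail rule for custom:tool_eval (not the framework's code)."""
    if threshold is None:
        # No threshold configured: any positive score passes, so partial
        # correctness is acceptable.
        return score > 0
    # Explicit threshold (e.g. 0.5 for binary metrics); ">=" is assumed here.
    return score >= threshold


assert tool_eval_passes(0.3) is True          # partial credit, no threshold
assert tool_eval_passes(0.0, 0.5) is False    # binary metric: 0.0 fails
assert tool_eval_passes(1.0, 0.5) is True     # binary metric: 1.0 passes
```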
📚 Learning: 2025-10-31T11:54:59.126Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 90
File: src/lightspeed_evaluation/core/models/data.py:198-208
Timestamp: 2025-10-31T11:54:59.126Z
Learning: In the lightspeed_evaluation framework, the expected_tool_calls validator intentionally rejects a single empty set `[[]]` as the only alternative. This is by design: if no tool calls are expected, the tool_eval metric should not be configured for that turn. Empty sets are only valid as fallback alternatives (e.g., `[[[tool_call]], [[]]]`), representing optional tool call scenarios, not as primary or sole expectations.

Applied to files:

  • src/lightspeed_evaluation/core/metrics/custom/custom.py
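
Written out as Python literals, the shapes from the learning above look like this (the tool name and fields are hypothetical; only the nesting mirrors the learning's examples):

```python
# A tool-call entry; "search" is a hypothetical tool name.
tool_call = {"tool_name": "search"}

# Valid: an empty set only as a fallback alternative ("tool call optional").
valid_expected_tool_calls = [[[tool_call]], [[]]]

# Rejected by the validator: a single empty set as the only alternative.
# If no tool calls are expected at all, don't configure tool_eval for that turn.
invalid_expected_tool_calls = [[]]
```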
📚 Learning: 2025-08-26T11:17:48.640Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 28
File: lsc_eval/runner.py:99-103
Timestamp: 2025-08-26T11:17:48.640Z
Learning: The lsc_eval generic evaluation tool is intended to become the primary evaluation framework, replacing an existing evaluation tool in the lightspeed-evaluation repository.

Applied to files:

  • src/lightspeed_evaluation/core/metrics/custom/custom.py
📚 Learning: 2025-09-10T06:57:46.326Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/pipeline/evaluation/evaluator.py:85-89
Timestamp: 2025-09-10T06:57:46.326Z
Learning: For binary metrics like custom:tool_eval, using an explicit threshold of 0.5 is preferred over None threshold with special case handling. This provides consistent behavior where 0.0 scores fail and 1.0 scores pass.

Applied to files:

  • src/lightspeed_evaluation/core/metrics/custom/custom.py
📚 Learning: 2025-09-19T12:32:06.403Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:18-31
Timestamp: 2025-09-19T12:32:06.403Z
Learning: When analyzing method calls, always examine the complete call site including all parameters before suggesting fixes. In the lightspeed-evaluation codebase, mark_all_metrics_as_error in processor.py correctly passes both resolved_turn_metrics and resolved_conversation_metrics parameters.

Applied to files:

  • src/lightspeed_evaluation/core/metrics/custom/custom.py
📚 Learning: 2025-09-09T14:58:10.630Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/pipeline/evaluation/amender.py:32-41
Timestamp: 2025-09-09T14:58:10.630Z
Learning: In the lightspeed-evaluation framework, when API is enabled, every turn should make a fresh API call regardless of whether the turn already has response or tool_calls data. This ensures consistency and fresh responses for each evaluation run.

Applied to files:

  • src/lightspeed_evaluation/core/metrics/custom/custom.py
📚 Learning: 2025-08-13T14:07:44.195Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 24
File: lsc_agent_eval/README.md:116-136
Timestamp: 2025-08-13T14:07:44.195Z
Learning: In the lsc_agent_eval framework, the expected_tool_calls configuration uses "tool_name" as the key for tool names, not "name". The tool call evaluation implementation specifically looks for the "tool_name" field when comparing expected vs actual tool calls.

Applied to files:

  • src/lightspeed_evaluation/core/metrics/custom/custom.py
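
A small illustration of the key naming that learning calls out (the tool name is hypothetical):

```python
expected = {"tool_name": "create_namespace"}  # matched: the comparison reads "tool_name"
not_matched = {"name": "create_namespace"}    # "name" is not the key the evaluation looks for
```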
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to tests/**/*.py : Add comprehensive tests for new features with mocked LLM calls using pytest

Applied to files:

  • tests/unit/core/metrics/custom/test_custom.py
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to src/lightspeed_evaluation/core/metrics/custom/** : Add new custom metrics under src/lightspeed_evaluation/core/metrics/custom/

Applied to files:

  • tests/unit/core/metrics/custom/test_custom.py
🧬 Code graph analysis (1)
tests/unit/core/metrics/custom/test_custom.py (2)
src/lightspeed_evaluation/core/metrics/custom/custom.py (1)
  • _evaluate_tool_calls (173-199)
src/lightspeed_evaluation/core/models/data.py (1)
  • TurnData (35-261)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: tests (3.12)
  • GitHub Check: tests (3.11)
  • GitHub Check: tests (3.13)
  • GitHub Check: mypy
🔇 Additional comments (2)
src/lightspeed_evaluation/core/metrics/custom/custom.py (1)

191-191: LGTM! Correctly handles None/missing tool_calls.

The logic now properly treats None or missing tool_calls as an empty list, allowing evaluate_tool_calls to match against empty alternatives in the expected tool calls configuration. This aligns with the PR objective to support optional tool call scenarios.

tests/unit/core/metrics/custom/test_custom.py (1)

1-38: LGTM! Well-structured test for None tool_calls handling.

The test correctly validates that None tool_calls are handled by matching against the empty alternative. The test structure follows pytest best practices with proper mocking and appropriate assertions.
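
The merged test module is not reproduced in this conversation; below is only a hedged sketch of what such a test could look like. The TurnData field names, the CustomMetrics constructor, and the return shape of _evaluate_tool_calls are all assumptions.

```python
"""Illustrative sketch -- not the actual tests/unit/core/metrics/custom/test_custom.py."""
from lightspeed_evaluation.core.metrics.custom.custom import CustomMetrics
from lightspeed_evaluation.core.models.data import TurnData


def test_evaluate_tool_calls_with_none_and_empty_alternative():
    """None tool_calls should be forwarded as [] and match an empty fallback alternative."""
    metrics = CustomMetrics()  # assumed: no required constructor arguments
    turn = TurnData(           # assumed field names
        turn_id="1",
        query="hello",
        response="No tool was needed.",
        tool_calls=None,                                          # the case this PR enables
        expected_tool_calls=[[[{"tool_name": "search"}]], [[]]],  # empty set as fallback
    )

    result = metrics._evaluate_tool_calls(turn)  # return shape assumed (score or (score, reason))

    assert result  # truthy/positive result: the empty alternative matched
```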



@asamal4 (Collaborator, Author) commented Nov 13, 2025

@VladimirKadlec @tisnik PTAL

@VladimirKadlec (Contributor) left a comment

LGTM

Side note: maybe we should find a way to get rid of all (or at least most of) the Optional attributes 😅

@tisnik (Contributor) left a comment

LGTM

@tisnik merged commit 395631e into lightspeed-core:main on Nov 13, 2025 (15 checks passed).