BFCLv4 support #908
base: main
Conversation
Signed-off-by: Michal Bien <[email protected]>
Walkthrough
Introduces BFCL v4: new dataset metadata and splits, scoring extended with memory/web-search/format_sensitivity and agentic/hallucination components, data prepare/utils for v4 ingestion and OpenAI-style tools, evaluator/inference memory & web-search integrations, new WebSearchAPI, and related infra updates.
Sequence Diagram(s)
sequenceDiagram
autonumber
participant Prep as prepare.py
participant Git as GitHub
participant FS as Local FS
participant Utils as utils.convert_to_tool
Note over Prep,Git: BFCL v4 download & per-category prepare
Prep->>Git: shallow clone(repo_url, depth=1)
Git-->>Prep: repo_root
Prep->>Prep: discover BFCL_v4 JSON files (filter memory/web_search/format_sensitivity)
loop per category file
Prep->>Prep: load_dataset_entry(target_folder, category)
Prep->>Utils: func_doc_language_specific_pre_processing + convert_to_tool
Utils-->>Prep: tool descriptors
Prep->>FS: write split dir, `__init__.py`, and `test.jsonl`
end
sequenceDiagram
autonumber
participant Eval as Evaluator
participant Score as bfcl_v4.bfcl_score
participant Metrics as Metrics Store
Note over Eval,Score: Scoring includes agentic (memory/web_search) & hallucination components
Eval->>Score: compute_score(metrics)
activate Score
Score->>Score: validate GLOBAL_MAX_K
Score->>Score: get_accuracy_dict(metrics, category, optional=True)
Score->>Score: calculate_agentic_accuracy(metrics)
Score->>Metrics: read MEMORY & WEB_SEARCH category metrics
Metrics-->>Score: per-category accuracies
Score->>Score: aggregate -> overall, agentic, hallucination sub-scores
Score-->>Eval: bfcl_v4 result (overall + sub-scores)
deactivate Score
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Pre-merge checks and finishing touches: ❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Signed-off-by: Michal Bien <[email protected]>
Actionable comments posted: 3
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- nemo_skills/dataset/bfcl_v4/__init__.py (1 hunks)
- nemo_skills/dataset/bfcl_v4/bfcl_score.py (4 hunks)
- nemo_skills/dataset/bfcl_v4/constants.py (2 hunks)
- nemo_skills/dataset/bfcl_v4/prepare.py (7 hunks)
- nemo_skills/dataset/bfcl_v4/utils.py (2 hunks)
- nemo_skills/evaluation/evaluator/bfcl.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
nemo_skills/dataset/bfcl_v4/prepare.py (1)
nemo_skills/dataset/bfcl_v4/utils.py (7)
- convert_to_tool (172-184)
- func_doc_language_specific_pre_processing (83-134)
- is_multi_turn (39-43)
- load_file (62-71)
- clean_up_memory_prereq_entries (187-217)
- populate_initial_settings_for_memory_test_cases (220-239)
- populate_initial_settings_for_web_search_test_cases (242-258)
🪛 Ruff (0.13.3)
nemo_skills/dataset/bfcl_v4/prepare.py
108-108: Undefined name test_cases_to_generate
(F821)
145-145: Starting a process with a partial executable path
(S607)
169-169: Using .strip() with multi-character strings is misleading
(B005)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: unit-tests
- GitHub Check: pre-commit
🔇 Additional comments (15)
nemo_skills/evaluation/evaluator/bfcl.py (1)
44-44: LGTM - Consistent v3 to v4 migration. The file naming and category extraction updates correctly align with BFCL v4 conventions across all three locations.
Also applies to: 51-51, 79-79
nemo_skills/dataset/bfcl_v4/__init__.py (2)
35-40: LGTM - New module exports added. The new splits for web search, memory, and format sensitivity are properly integrated into the SPLITS list.
45-47: LGTM - Version references updated. Both SCORE_MODULE and BENCHMARKS correctly reference bfcl_v4.
nemo_skills/dataset/bfcl_v4/utils.py (4)
27-43: LGTM - Helper predicates added. The new category detection functions (is_memory, is_memory_prereq, is_web_search, is_multi_turn) provide clear, reusable predicates for identifying test categories.
46-59: LGTM - Test category extraction added. The extract_test_category_from_id function properly handles both normal and format_sensitivity test entry IDs, with clear logic for removing the "_prereq" suffix when requested.
187-217: LGTM - Memory prereq cleanup logic added. The clean_up_memory_prereq_entries function correctly:
- Removes orphaned memory-prerequisite test cases
- Filters out already-generated entries from dependency lists
220-258: LGTM - Initial settings population added. Both populate_initial_settings_for_memory_test_cases and populate_initial_settings_for_web_search_test_cases provide clear configuration injection for their respective test categories.
nemo_skills/dataset/bfcl_v4/prepare.py (2)
95-95: LGTM - Signature simplified. Removing the model_type parameter simplifies the interface. Ensure downstream callers are updated accordingly.
169-169: LGTM - Correct prefix removal. Using .lstrip("BFCL_v4_") correctly removes the prefix. The static analysis warning about multi-character strings is a false positive in this case, as the intent is to remove the exact prefix string.
nemo_skills/dataset/bfcl_v4/bfcl_score.py (4)
51-62: LGTM - New agentic categories added. The MEMORY, WEB_SEARCH, and FORMAT_SENSITIVITY constants properly define the new test categories for BFCL v4.
94-98: LGTM - Optional category lookup added. The optional parameter allows graceful handling of missing categories like format_sensitivity, which may not always be present.
186-203: LGTM - Agentic accuracy calculation added. The calculate_agentic_accuracy function correctly computes combined accuracy for memory and web search categories, with optional format_sensitivity inclusion (a small illustrative sketch follows).
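For context only, a hedged sketch of the aggregation idea described here, not the repository's actual function; the per-category dicts are assumed to carry "accuracy" and "num_entries", matching the get_accuracy_dict shape suggested elsewhere in this review:

```python
from typing import Optional

# Hedged sketch: combine per-category accuracy dicts of the assumed shape
# {"accuracy": float, "num_entries": int}, weighting by entry counts.
def combine_agentic_accuracy(
    memory: dict, web_search: dict, format_sensitivity: Optional[dict] = None
) -> dict:
    parts = [memory, web_search]
    total = sum(p["num_entries"] for p in parts)
    agentic = sum(p["accuracy"] * p["num_entries"] for p in parts) / total if total else 0.0
    result = {"memory": memory["accuracy"], "web_search": web_search["accuracy"], "agentic": agentic}
    if format_sensitivity and format_sensitivity.get("num_entries"):
        # Optional category: only reported when metrics for it exist.
        result["format_sensitivity"] = format_sensitivity["accuracy"]
    return result
```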
209-229: LGTM - Agentic metrics integrated. The compute_score function correctly integrates agentic accuracy metrics (memory and web search) into the overall accuracy calculation and result structure.
nemo_skills/dataset/bfcl_v4/constants.py (2)
18-68: LGTM - Comprehensive category definitions. The category constants are well-organized with clear derivations (a toy sketch of the pattern follows the list):
- MEMORY_CATEGORY derived from ALL_AVAILABLE_MEMORY_BACKENDS
- SINGLE_TURN_CATEGORY composed from NON_LIVE_CATEGORY and LIVE_CATEGORY
- AGENTIC_CATEGORY combines MEMORY_CATEGORY and WEB_SEARCH_CATEGORY
- Clear separation between scoring and non-scoring categories
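To illustrate the composition pattern only; the backend and split names below are placeholders, not the file's actual values:

```python
# Illustrative sketch of how such category constants compose.
ALL_AVAILABLE_MEMORY_BACKENDS = ["kv", "vector", "rec_sum"]
MEMORY_CATEGORY = [f"memory_{backend}" for backend in ALL_AVAILABLE_MEMORY_BACKENDS]
WEB_SEARCH_CATEGORY = ["web_search_base", "web_search_no_snippet"]

NON_LIVE_CATEGORY = ["simple_python", "multiple", "parallel"]
LIVE_CATEGORY = ["live_simple", "live_multiple"]

SINGLE_TURN_CATEGORY = NON_LIVE_CATEGORY + LIVE_CATEGORY
AGENTIC_CATEGORY = MEMORY_CATEGORY + WEB_SEARCH_CATEGORY
```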
70-97: LGTM - Test collection mapping updated. The TEST_COLLECTION_MAPPING uses the new category constants consistently and adds new top-level groups ("memory", "web_search", "agentic") for v4.
Actionable comments posted: 1
♻️ Duplicate comments (1)
nemo_skills/dataset/bfcl_v4/bfcl_score.py (1)
231-232: Fix typo in dictionary key. The key "format_sensistivity" contains a typo and should be "format_sensitivity" to match the constant and source key.
Apply this diff to fix the typo:

      if "format_sensitivity" in agentic_accuracy:
    -     res["format_sensistivity"] = agentic_accuracy["format_sensitivity"]
    +     res["format_sensitivity"] = agentic_accuracy["format_sensitivity"]
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- nemo_skills/dataset/bfcl_v4/__init__.py (2 hunks)
- nemo_skills/dataset/bfcl_v4/bfcl_score.py (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- nemo_skills/dataset/bfcl_v4/__init__.py
🧰 Additional context used
🪛 GitHub Actions: Lint and Format
nemo_skills/dataset/bfcl_v4/bfcl_score.py
[warning] 1-1: ruff-format reformatted this file during hooks. Please review formatting changes.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: unit-tests
🔇 Additional comments (2)
nemo_skills/dataset/bfcl_v4/bfcl_score.py (2)
51-63: LGTM! New agentic constants are well-defined. The new constants for memory, web search, and format sensitivity categories align with the BFCL v4 update.
186-203: LGTM! Agentic accuracy calculation is sound. The function correctly aggregates memory and web search accuracies, and conditionally includes format sensitivity. However, this depends on fixing the critical bug in get_accuracy_dict (lines 94-98) for the optional parameter to work correctly.
    def get_accuracy_dict(metrics, category, optional=False):
        # reporting aggregation for pass@1[avg-of-k] (for highest k) if available
        category_dict = metrics[f"bfcl_v3.{category}"]
        if optional and f"bfcl_v4.{category}" not in metrics:
            category_dict = {}
        category_dict = metrics[f"bfcl_v4.{category}"]
Critical bug: unconditional key access after optional check.
Line 98 unconditionally accesses metrics[f"bfcl_v4.{category}"] even when optional=True and the key is missing. This causes a KeyError instead of using the empty dict set on line 97.
Apply this diff to fix the logic:
     def get_accuracy_dict(metrics, category, optional=False):
         # reporting aggregation for pass@1[avg-of-k] (for highest k) if available
         if optional and f"bfcl_v4.{category}" not in metrics:
             category_dict = {}
    -    category_dict = metrics[f"bfcl_v4.{category}"]
    +    else:
    +        category_dict = metrics[f"bfcl_v4.{category}"]

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
    def get_accuracy_dict(metrics, category, optional=False):
        # reporting aggregation for pass@1[avg-of-k] (for highest k) if available
        if optional and f"bfcl_v4.{category}" not in metrics:
            category_dict = {}
        else:
            category_dict = metrics[f"bfcl_v4.{category}"]
🤖 Prompt for AI Agents
In nemo_skills/dataset/bfcl_v4/bfcl_score.py around lines 94 to 98, the function
sets category_dict = {} when optional is True and the metrics key is missing but
then unconditionally accesses metrics[f"bfcl_v4.{category}"] causing a KeyError;
update the control flow to only read metrics[...] when the key exists (e.g.,
return the empty dict immediately when optional is True and key missing, or use
an else branch to assign category_dict from metrics), ensuring no unconditional
key access occurs.
Signed-off-by: Michal Bien <[email protected]>
- Update gorilla clone
- Fix preparation code
- Fixes for bfclv3 in v4 codebase
- Fix multi-turn files position
- Fixed pre-processing
- Use BFCLv3 snapshot for legacy runs
- Cleanup
- Populate initial configs
- Back to new structure for bfclv3
- Structure the scores better
- Better score compute for v4
- Stabilize API
- Custom web search without SerpAPI
- Fixing the memory output parsing
- Initialize prereqs
- Fix parsing and scoring
- Fix dataloading
- Give full generated outputs for agentic eval

Signed-off-by: Michal Bien <[email protected]>
Actionable comments posted: 11
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
nemo_skills/inference/eval/bfcl.py (1)
169-177: Normalize client parser output for no-tool-call case to list-of-strings. Avoid returning a string (a later extend will split it). Keep generation as a list.

    - except Exception:
    -     generation = parsed_response["content"] if isinstance(parsed_response["content"], str) else ""
    -     tool_call_ids = []
    + except (KeyError, TypeError, ValueError):
    +     text = parsed_response["content"] if isinstance(parsed_response["content"], str) else ""
    +     generation = [text] if text else []
    +     tool_call_ids = []
♻️ Duplicate comments (2)
nemo_skills/dataset/bfcl_v4/bfcl_score.py (1)
95-101: Fix optional access to avoid KeyError when category missing. Unconditionally indexing metrics[...] defeats the optional guard.

    -def get_accuracy_dict(metrics, category, optional=False):
    +def get_accuracy_dict(metrics, category, optional=False):
         # reporting aggregation for pass@1[avg-of-k] (for highest k) if available
    -    if optional and f"bfcl_v4.{category}" not in metrics:
    -        category_dict = {}
    -    category_dict = metrics[f"bfcl_v4.{category}"]
    +    key = f"bfcl_v4.{category}"
    +    if optional and key not in metrics:
    +        return {"accuracy": 0, "num_entries": 0}
    +    category_dict = metrics[key]

nemo_skills/dataset/bfcl_v4/prepare.py (1)
155-157: Version pinning concern remains unresolved. As noted in a previous review, shallow-cloning HEAD without version pinning risks future breakage if the upstream repository structure changes. This concern from the prior review has not been addressed in the current implementation.
Consider one of these approaches:
- Clone with an explicit tag: ["git", "clone", "--depth=1", "--branch", "v4", repo_url, temp_dir]
- Add a post-clone checkout: subprocess.run(["git", "-C", temp_dir, "checkout", KNOWN_COMMIT], check=True)

Store the commit hash or tag as a constant at the module level for easy updates (sketched below).
🧹 Nitpick comments (8)
nemo_skills/inference/eval/bfcl.py (5)
403-404: Prevent accidental char-wise extend when generation is a string. Even with parser fixes, defensively normalize to a list.

    - current_turn_response.extend(model_response["generation"])
    + gen = model_response["generation"]
    + current_turn_response.extend(gen if isinstance(gen, list) else [gen])
256-263: Use pathlib and robust key extraction for MemoryAPI path fix. Avoid string replace and single-element list indexing; use pathlib and next(iter(...)).

    - for datapoint in data:
    -     if "initial_config" in datapoint and list(datapoint["initial_config"].keys())[0].startswith("MemoryAPI"):
    -         datapoint["initial_config"][list(datapoint["initial_config"].keys())[0]]["model_result_dir"] = self.cfg.output_file.replace("/output.jsonl", "")
    + for datapoint in data:
    +     if "initial_config" in datapoint:
    +         api_key = next(iter(datapoint["initial_config"]), None)
    +         if api_key and api_key.startswith("MemoryAPI"):
    +             datapoint["initial_config"][api_key]["model_result_dir"] = str(Path(self.cfg.output_file).parent)

Add this import at the top of the file:

    + from pathlib import Path
270-273: Avoid asyncio.run in environments with a running event loop. This will raise at runtime (e.g., Jupyter/async runners). Consider refactoring to async load_data and invoking from the existing loop, or provide a helper that schedules coroutines when a loop is running (a sketch follows).
Would you like a small refactor to make prereq processing async-safe?
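One possible shape for such a helper (a sketch, not part of the PR): run a coroutine with asyncio.run when no loop is active, and fall back to a worker thread when one is already running.

```python
import asyncio
import concurrent.futures


def run_coroutine_sync(coro):
    """Run `coro` to completion whether or not an event loop is already running."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop in this thread: safe to use asyncio.run directly.
        return asyncio.run(coro)
    # A loop is already running (e.g. Jupyter): run the coroutine in a fresh
    # loop on a worker thread and block until it finishes.
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(asyncio.run, coro).result()
```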
430-436: Use iterator for first (and only) instance. Satisfy RUF015 and avoid list materialization.
Based on static analysis hints

    - memory_instance: "MemoryAPI" = list(involved_instances.values())[0]
    + memory_instance: "MemoryAPI" = next(iter(involved_instances.values()))
451-455: Same iterator fix when flushing memory. Based on static analysis hints

    - memory_instance: "MemoryAPI" = list(involved_instances.values())[0]
    + memory_instance: "MemoryAPI" = next(iter(involved_instances.values()))

nemo_skills/dataset/bfcl_v4/utils.py (1)
96-103: Guard nested type mapping to avoid KeyError on unseen types. Use mapping.get(..., "string") for robustness; recurse safely.

    - properties[key]["items"]["type"] = mapping[properties[key]["items"]["type"]]
    + item_type = properties[key]["items"].get("type")
    + properties[key]["items"]["type"] = mapping.get(item_type, "string")
      if properties[key]["items"]["type"] == "array" and "items" in properties[key]["items"]:
    -     properties[key]["items"]["items"]["type"] = mapping[properties[key]["items"]["items"]["type"]]
    +     nested_item_type = properties[key]["items"]["items"].get("type")
    +     properties[key]["items"]["items"]["type"] = mapping.get(nested_item_type, "string")
      elif properties[key]["items"]["type"] == "object" and "properties" in properties[key]["items"]:
          properties[key]["items"]["properties"] = _cast_to_openai_type(
              properties[key]["items"]["properties"], mapping
          )

nemo_skills/inference/eval/bfcl_web_search.py (1)
265-267: Narrow exception handling in fetch_url_content. Catching broad Exception hides actionable errors; prefer requests.exceptions.RequestException.

    - except Exception as e:
    + except requests.exceptions.RequestException as e:
          return {"error": f"An error occurred while fetching {url}: {str(e)}"}

nemo_skills/dataset/bfcl_v4/prepare.py (1)
201-208: Remove unused function parameter. The args parameter is not used in the function body. If the --model_type argument parsed on line 213 was intended for future use, consider removing both the argument parsing and the parameter until it's actually needed.
Apply this diff:

    -def main(args):
    +def main():
         LOG.warning(
             "Currently processing according to the OpenAI model style which works for most models, including Qwen/Llama-Nemotron/DeepSeek."
         )
         download_and_process_bfcl_data(
             REPO_URL,
             DATA_FOLDER_PATH,
             output_dir=os.path.join(os.path.dirname(__file__)),
         )

And remove the unused argument parsing:

     if __name__ == "__main__":
    -    parser = argparse.ArgumentParser()
    -    parser.add_argument("--model_type", type=str, default=None, required=False)
    -    args = parser.parse_args()
    -
    -    main(args)
    +    main()
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (15)
- .gitignore (1 hunks)
- dockerfiles/Dockerfile.nemo-skills (1 hunks)
- nemo_skills/dataset/bfcl_v3/__init__.py (1 hunks)
- nemo_skills/dataset/bfcl_v3/bfcl_score.py (3 hunks)
- nemo_skills/dataset/bfcl_v3/constants.py (4 hunks)
- nemo_skills/dataset/bfcl_v3/prepare.py (4 hunks)
- nemo_skills/dataset/bfcl_v4/__init__.py (1 hunks)
- nemo_skills/dataset/bfcl_v4/bfcl_score.py (1 hunks)
- nemo_skills/dataset/bfcl_v4/constants.py (1 hunks)
- nemo_skills/dataset/bfcl_v4/prepare.py (1 hunks)
- nemo_skills/dataset/bfcl_v4/utils.py (1 hunks)
- nemo_skills/evaluation/evaluator/bfcl.py (3 hunks)
- nemo_skills/inference/eval/bfcl.py (6 hunks)
- nemo_skills/inference/eval/bfcl_utils.py (3 hunks)
- nemo_skills/inference/eval/bfcl_web_search.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (6)
nemo_skills/dataset/bfcl_v3/prepare.py (1)
nemo_skills/dataset/bfcl_v4/prepare.py (1)
download_and_process_bfcl_data(140-198)
nemo_skills/inference/eval/bfcl.py (2)
nemo_skills/inference/generate.py (3)
- setup_prompt (365-378)
- load_data (418-435)
- process_single_datapoint (537-562)
nemo_skills/inference/eval/bfcl_utils.py (3)
- convert_to_function_call (80-91)
- is_empty_execute_response (184-189)
- execute_multi_turn_func_call (94-181)
nemo_skills/evaluation/evaluator/bfcl.py (1)
nemo_skills/utils.py (2)
- get_logger_name (130-134)
- nested_dataclass (49-82)
nemo_skills/inference/eval/bfcl_utils.py (1)
nemo_skills/inference/eval/bfcl_web_search.py (1)
_load_scenario(42-45)
nemo_skills/dataset/bfcl_v4/prepare.py (4)
nemo_skills/dataset/bfcl_v4/utils.py (2)
- func_doc_language_specific_pre_processing (17-68)
- convert_to_tool (106-118)
nemo_skills/utils.py (1)
- get_logger_name (130-134)
nemo_skills/dataset/bfcl_v3/utils.py (2)
- is_multi_turn (51-55)
- load_file (39-48)
nemo_skills/dataset/bfcl_v3/prepare.py (1)
download_and_process_bfcl_data(117-178)
nemo_skills/dataset/bfcl_v4/bfcl_score.py (1)
nemo_skills/dataset/bfcl_v3/bfcl_score.py (6)
- calculate_combined_accuracy (52-74)
- get_accuracy_dict (77-112)
- calculate_non_live_single_turn_accuracy (115-137)
- calculate_live_single_turn_accuracy (140-156)
- calculate_multi_turn_accuracy (159-165)
- compute_score (168-190)
🪛 GitHub Actions: Copyright check
nemo_skills/inference/eval/bfcl_web_search.py
[error] 1-1: Missing copyright notice in the first 10 lines of this Python file.
nemo_skills/dataset/bfcl_v4/utils.py
[error] 1-1: Missing copyright notice in the first 10 lines of this Python file.
🪛 GitHub Actions: Lint and Format
nemo_skills/dataset/bfcl_v3/prepare.py
[error] 1-1: Code adjustments via pre-commit (git clone invocation formatting) in prepare.py.
nemo_skills/inference/eval/bfcl.py
[error] 1-1: Code adjustments via pre-commit (imports cleanup) in bfcl.py.
nemo_skills/dataset/bfcl_v4/constants.py
[error] 1-1: Code style cleanup: fixed spacing around VERSION_PREFIX constant.
nemo_skills/inference/eval/bfcl_web_search.py
[error] 1-1: Code adjustments via pre-commit (import cleanup) in bfcl_web_search.py.
nemo_skills/dataset/bfcl_v4/utils.py
[error] 1-1: Ruff check noted missing newline at end of file; newline added during formatting.
nemo_skills/evaluation/evaluator/bfcl.py
[error] 1-1: Code style adjustments via pre-commit (imports/formatting) in bfcl.py.
nemo_skills/dataset/bfcl_v3/bfcl_score.py
[error] 1-1: Ruff check reformatted code in bfcl_score.py. 9 issues fixed across the worktree.
[error] 1-1: Code adjustments via pre-commit (dictionary literal formatting) in bfcl_score.py.
nemo_skills/inference/eval/bfcl_utils.py
[error] 1-1: Code adjustments via pre-commit (string literal formatting) in bfcl_utils.py.
nemo_skills/dataset/bfcl_v4/prepare.py
[error] 1-1: Code adjustments via pre-commit (imports/order and formatting) applied in bfcl_v4/prepare.py.
[error] 1-1: Code adjustments via pre-commit (path construction and calls) in prepare.py.
nemo_skills/dataset/bfcl_v4/bfcl_score.py
[error] 1-1: Ruff check reformatted code in bfcl_score.py. Several formatting changes applied.
[error] 1-1: Code adjustments via pre-commit (dictionary formatting) in bfcl_score.py.
🪛 Ruff (0.14.0)
nemo_skills/dataset/bfcl_v3/prepare.py
133-133: Starting a process with a partial executable path
(S607)
160-160: Using .strip() with multi-character strings is misleading
(B005)
nemo_skills/inference/eval/bfcl.py
219-219: Do not catch blind exception: Exception
(BLE001)
261-261: Prefer next(iter(datapoint["initial_config"].keys())) over single element slice
Replace with next(iter(datapoint["initial_config"].keys()))
(RUF015)
262-262: Prefer next(iter(datapoint["initial_config"].keys())) over single element slice
Replace with next(iter(datapoint["initial_config"].keys()))
(RUF015)
412-412: Do not catch blind exception: Exception
(BLE001)
430-430: Prefer next(iter(involved_instances.values())) over single element slice
Replace with next(iter(involved_instances.values()))
(RUF015)
453-453: Prefer next(iter(involved_instances.values())) over single element slice
Replace with next(iter(involved_instances.values()))
(RUF015)
nemo_skills/inference/eval/bfcl_web_search.py
38-38: Standard pseudo-random generators are not suitable for cryptographic purposes
(S311)
40-40: Standard pseudo-random generators are not suitable for cryptographic purposes
(S311)
42-42: Unused method argument: long_context
(ARG002)
140-140: Standard pseudo-random generators are not suitable for cryptographic purposes
(S311)
150-150: Standard pseudo-random generators are not suitable for cryptographic purposes
(S311)
163-163: Use explicit conversion flag
Replace with conversion flag
(RUF010)
169-169: Comment contains ambiguous – (EN DASH). Did you mean - (HYPHEN-MINUS)?
(RUF003)
210-210: Avoid specifying long messages outside the exception class
(TRY003)
263-263: Abstract raise to an inner function
(TRY301)
263-263: Avoid specifying long messages outside the exception class
(TRY003)
265-265: Do not catch blind exception: Exception
(BLE001)
266-266: Use explicit conversion flag
Replace with conversion flag
(RUF010)
270-270: Docstring contains ambiguous ‑ (NON-BREAKING HYPHEN). Did you mean - (HYPHEN-MINUS)?
(RUF002)
nemo_skills/dataset/bfcl_v4/prepare.py
155-155: subprocess call: check for execution of untrusted input
(S603)
156-156: Starting a process with a partial executable path
(S607)
164-166: Avoid specifying long messages outside the exception class
(TRY003)
197-197: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
198-198: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
201-201: Unused function argument: args
(ARG001)
nemo_skills/dataset/bfcl_v4/bfcl_score.py
126-130: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: unit-tests
🔇 Additional comments (20)
.gitignore (1)
42-42: LGTM! The ignore pattern correctly excludes BFCL v4 dataset directories, consistent with the new BFCL v4 support added in this PR.
nemo_skills/dataset/bfcl_v3/__init__.py (1)
18-35: LGTM! The split renaming (e.g., "simple" → "simple_python") and expansion with new live/multi-turn entries align with the BFCL v4 migration pattern and improve clarity.
nemo_skills/dataset/bfcl_v4/constants.py (1)
22-23: LGTM! The Path constants are well-defined and provide centralized location references for BFCL v4 data handling.
nemo_skills/dataset/bfcl_v3/constants.py (1)
19-92: LGTM! The systematic renaming of split identifiers (e.g., "simple" → "simple_python") across all test collection mappings is consistent with the BFCL v3 __init__.py updates.
nemo_skills/dataset/bfcl_v3/prepare.py (3)
117-117: LGTM! Updating the default file_prefix to "BFCL_v4" aligns with the BFCL v4 migration pattern throughout the PR.
149-150: LGTM! The filtering logic appropriately excludes "format_sensitivity", "memory", and "web_search" patterns from the BFCL v4 data processing, which aligns with the specialized handling of these categories introduced elsewhere in the PR.
133-133: Default branch "main" contains BFCL v4; removing -b v1.3 is safe.
nemo_skills/dataset/bfcl_v3/bfcl_score.py (3)
17-19: LGTM! The updated SIMPLE_AST list entries align with the split renaming across BFCL v3 modules.
136-136: LGTM! Renaming the key to "non_live_irrelevance" improves clarity and consistency with other output keys in the function.
183-189: LGTM! Nesting the results under a "bfcl_v3" top-level key clearly differentiates BFCL v3 scores from BFCL v4 scores in the overall output structure.
nemo_skills/evaluation/evaluator/bfcl.py (4)
31-31: Verify the new default model configuration. The default model was updated from "o3-mini-2025-01-31-FC" to "o4-mini-2025-04-16-FC". Ensure this model exists and is compatible with the BFCL evaluation pipeline.
46-46: LGTM! Adding support for both "bfcl_v3." and "bfcl_v4." prefixes ensures backward compatibility while enabling BFCL v4 evaluation.
82-88: LGTM! The updated file naming to use "BFCL_v4" and the addition of memory prerequisite filtering align with the BFCL v4 evaluation flow. The filtering logic correctly excludes memory prereq samples that are only needed during inference but not for evaluation scoring.
24-24: Ensure bfcl_eval.utils exports the imported utilities. Confirm that get_directory_structure_by_category and is_memory_prereq are present in the installed bfcl_eval package (and update the dependency version or import paths if not).
nemo_skills/inference/eval/bfcl_utils.py (4)
37-37: LGTM! The Path import supports the new Path conversion logic added later in the file.
49-66: LGTM! The introduction of BACKEND_PATH_PREFIX and the updated CLASS_FILE_PATH_MAPPING with f-strings provides a cleaner, more maintainable approach to constructing module paths. The addition of WebSearchAPI and MemoryAPI variants aligns with the BFCL v4 support being added.
117-118: LGTM! The instance naming changes improve robustness:
- Preserving the original class_name case maintains consistency
- The regex substitution re.sub(r'[-./]', '_', instance_name) ensures the instance name is a valid Python identifier by replacing potentially problematic characters
125-129: LGTM! The deep copy of class_initial_config prevents unintended mutations of the original configuration dictionary. Converting model_result_dir to a Path object before passing to _load_scenario ensures type consistency.
nemo_skills/dataset/bfcl_v4/__init__.py (1)
15-48: BFCL v4 dataset metadata looks good. SPLITS, group flags, and SCORE_MODULE align with scoring code.
nemo_skills/dataset/bfcl_v4/prepare.py (1)
1-1: Address pre-commit formatting failures. The pipeline indicates that pre-commit hooks will apply automatic adjustments for import ordering, formatting, and path construction. Ensure you run the pre-commit hooks locally or allow the automated fixes to be applied before merging.
1-1: Address pre-commit formatting failures.The pipeline indicates that pre-commit hooks will apply automatic adjustments for import ordering, formatting, and path construction. Ensure you run the pre-commit hooks locally or allow the automated fixes to be applied before merging.
| if self.cfg.remove_thinking: | ||
| # If no tool calling was used, apply reasoning cleanup to both the message and generation | ||
| trimmed_response_text = self._remove_thinking_from_message_content( | ||
| self.message_parser.get_response_text(model_response["message"]) | ||
| ) | ||
| # If no tool calling was used, apply reasoning cleanup to both the message and generation | ||
| if model_response["message"].content == model_response["generation"]: | ||
| model_response["generation"] = [trimmed_response_text] | ||
|
|
||
| self.message_parser.set_response_text(model_response["message"], trimmed_response_text) |
Fix remove_thinking gating; use tool-call presence, not equality check.
Equality between message content and generation is unreliable (types differ across parsers). Gate by tool-call absence; set generation to trimmed text list.
    - if self.cfg.remove_thinking:
    -     # If no tool calling was used, apply reasoning cleanup to both the message and generation
    + if self.cfg.remove_thinking:
          trimmed_response_text = self._remove_thinking_from_message_content(
              self.message_parser.get_response_text(model_response["message"])
          )
    -     # If no tool calling was used, apply reasoning cleanup to both the message and generation
    -     if model_response["message"].content == model_response["generation"]:
    -         model_response["generation"] = [trimmed_response_text]
    +     # If no tool calling was used, apply reasoning cleanup to both the message and generation
    +     if not model_response.get("tool_calls"):
    +         model_response["generation"] = [trimmed_response_text]
          self.message_parser.set_response_text(model_response["message"], trimmed_response_text)

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
    if self.cfg.remove_thinking:
        trimmed_response_text = self._remove_thinking_from_message_content(
            self.message_parser.get_response_text(model_response["message"])
        )
        # If no tool calling was used, apply reasoning cleanup to both the message and generation
        if not model_response.get("tool_calls"):
            model_response["generation"] = [trimmed_response_text]
        self.message_parser.set_response_text(model_response["message"], trimmed_response_text)
🤖 Prompt for AI Agents
In nemo_skills/inference/eval/bfcl.py around lines 387 to 396, the current
gating uses equality between message.content and generation to decide whether to
replace generation, which is unreliable across parsers; instead check for
absence of a tool call on the message (e.g., if
model_response["message"].get("tool_call") is falsy) and when no tool call is
present set model_response["generation"] = [trimmed_response_text] (ensuring it
becomes a list of the trimmed text), and still call
self.message_parser.set_response_text(...) as before.
     RUN git clone https://github.com/ShishirPatil/gorilla.git /opt/gorilla
    -RUN cd /opt/gorilla && git checkout d2177992bbba9aa228b53c0645bf8f5613a5a7c6
    +RUN cd /opt/gorilla && git checkout d85e9ad3a3009be3d609c5550799389a16599f88
I think we want two checkouts.. Maybe checkout in /opt/bfcl_v3/gorilla and /opt/bfcl_v4/gorilla
Ok, let's do that!
Wait, actually - we cannot easily pip install both versions of bfcl_eval. I made the changes in such a way that there is a minimum set of modifications and now testing if the original BFCLv3 scores are reproducible - they should be as both the logic and the datasets didn't change (v4 is a superset of v3)
     SPLITS = [
    -    "simple",  # Simple function calls
    +    "simple_python",  # Simple function calls
Can you revert the changes made to bfcl_v3?
     TEST_COLLECTION_MAPPING = {
         "all": [
    -        "simple",
    +        "simple_python",
Needs to be reverted
All the changes in this file I mean
        non_live_ast_accuracy_list + [non_live_irrelevance_accuracy], weighted=False
    )

    return {
Needs to be reverted
    -def download_and_process_bfcl_data(repo_url, subfolder_path, output_dir, file_prefix="BFCL_v3", model_type="nemotron"):
    +def download_and_process_bfcl_data(repo_url, subfolder_path, output_dir, file_prefix="BFCL_v4", model_type="nemotron"):
This file also needs to be reverted
    @@ -0,0 +1,118 @@
    import json
Is it the same as bfcl_v3/utils.py? If yes, you can just refer to that file, and avoid duplication.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
nemo_skills/evaluation/evaluator/bfcl.py (2)
63-67: Avoid shell=True; pass args and env explicitly. Prevents shell injection via model/test_category and makes OPENAI_API_KEY scoping explicit.
Apply:

    - cmd = f"OPENAI_API_KEY=dummy bfcl evaluate --model {eval_config.model} --test-category {test_category}"
    -
    - LOG.info(f"Running BFCL evaluation: {cmd}")
    - subprocess.run(cmd, shell=True, check=True, timeout=eval_config.timeout)
    + LOG.info("Running BFCL evaluation")
    + args = ["bfcl", "evaluate", "--model", eval_config.model, "--test-category", test_category]
    + env = {**os.environ, "OPENAI_API_KEY": "dummy"}
    + subprocess.run(args, check=True, timeout=eval_config.timeout, env=env)

Also add import os near the imports at the top.
122-133: Merging with zip() drops lines and can misalign IDs; also marks unevaluated items as correct. Because memory prereqs are filtered out upstream, bfcl_fmted_file has fewer rows than generation_file. zip() truncates and mislabels is_correct. Merge by id; set is_correct only for evaluated ids.

    - for gen_line, bfcl_line in zip(gen_f, bfcl_f):
    -     gen_instance = json.loads(gen_line)
    -     # Add the bfcl result to the generation instance
    -     gen_instance.update(json.loads(bfcl_line))
    -
    -     if gen_instance["id"] in wrong_instance_ids:
    -         gen_instance["is_correct"] = False
    -     else:
    -         gen_instance["is_correct"] = True
    -
    -     fout.write(json.dumps(gen_instance) + "\n")
    + # Index BFCL-evaluated entries by id
    + bfcl_by_id = {}
    + for bfcl_line in bfcl_f:
    +     obj = json.loads(bfcl_line)
    +     bfcl_by_id[obj.get("id")] = obj
    +
    + for gen_line in gen_f:
    +     gen_instance = json.loads(gen_line)
    +     _id = gen_instance.get("id")
    +
    +     if _id in bfcl_by_id:
    +         # Merge back evaluated fields
    +         gen_instance.update(bfcl_by_id[_id])
    +         gen_instance["is_correct"] = _id not in wrong_instance_ids
    +     else:
    +         # Not evaluated (e.g., memory prereq); preserve record and omit correctness
    +         gen_instance.pop("is_correct", None)
    +
    +     fout.write(json.dumps(gen_instance) + "\n")

nemo_skills/inference/eval/bfcl.py (1)
202-219: Normalize client parser outputs; never return bare strings; narrow exceptions. generation can be a bare string, causing .extend() to split into characters. Also catches broad Exception.

    - try:
    -     generation = [
    -         {func_call["name"]: json.dumps(func_call["arguments"])} for func_call in model_response["tool_calls"]
    -     ]
    -     tool_call_ids = [idx for idx in range(len(generation))]
    - except Exception:
    -     generation = parsed_response["content"] if isinstance(parsed_response["content"], str) else ""
    -     tool_call_ids = []
    + try:
    +     tool_calls = model_response.get("tool_calls") or []
    +     if tool_calls:
    +         generation = [
    +             {tc["name"]: json.dumps(tc["arguments"])} for tc in tool_calls if isinstance(tc, dict)
    +         ]
    +         tool_call_ids = [str(i) for i in range(len(generation))]
    +     else:
    +         content = parsed_response.get("content", "")
    +         generation = [content] if isinstance(content, str) else [""]
    +         tool_call_ids = []
    + except (KeyError, TypeError, ValueError):
    +     content = parsed_response.get("content", "")
    +     generation = [content] if isinstance(content, str) else [""]
    +     tool_call_ids = []
♻️ Duplicate comments (3)
nemo_skills/dataset/bfcl_v4/bfcl_score.py (1)
95-101: Fix optional access to avoid KeyError when category missing. Unconditionally indexing metrics[...] defeats the optional guard.

    -def get_accuracy_dict(metrics, category, optional=False):
    +def get_accuracy_dict(metrics, category, optional=False):
         # reporting aggregation for pass@1[avg-of-k] (for highest k) if available
    -    if optional and f"bfcl_v4.{category}" not in metrics:
    -        category_dict = {}
    -    category_dict = metrics[f"bfcl_v4.{category}"]
    +    key = f"bfcl_v4.{category}"
    +    if optional and key not in metrics:
    +        return {"accuracy": 0, "num_entries": 0}
    +    category_dict = metrics[key]

nemo_skills/inference/eval/bfcl.py (2)
442-451: Fix remove_thinking gating; key off tool-call absence, not equality.

    - if self.cfg.remove_thinking:
    -     # If no tool calling was used, apply reasoning cleanup to both the message and generation
    + if self.cfg.remove_thinking:
          trimmed_response_text = self._remove_thinking_from_message_content(
              self.message_parser.get_response_text(model_response["message"])
          )
    -     # If no tool calling was used, apply reasoning cleanup to both the message and generation
    -     if model_response["message"].content == model_response["generation"]:
    -         model_response["generation"] = [trimmed_response_text]
    +     # If no tool calling was used, apply reasoning cleanup to both the message and generation
    +     if not model_response.get("tool_calls"):
    +         model_response["generation"] = [trimmed_response_text]
          self.message_parser.set_response_text(model_response["message"], trimmed_response_text)

This avoids cross-parser type comparisons and ensures generation is a list. Based on learnings
🧹 Nitpick comments (10)
nemo_skills/evaluation/evaluator/bfcl.py (2)
51-55: Hardcoded BFCL repo path; make it configurable. Bakes in /opt/gorilla/... which breaks non-standard deployments. Move base path into BFCLEvaluatorConfig (or cfg) with a sane default (a sketch follows).
Would introducing eval_config.bfcl_repo_root (defaulting to /opt/gorilla/berkeley-function-call-leaderboard) work for your runners?
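A hedged sketch of that knob (the field name follows the suggestion above; the surrounding config class, its other fields, and the derived path layout are assumptions, not the repository's actual code):

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class BFCLEvaluatorConfig:  # sketch only; the real config class has more fields
    model: str = "o4-mini-2025-04-16-FC"
    bfcl_repo_root: str = "/opt/gorilla/berkeley-function-call-leaderboard"


def score_dir(cfg: BFCLEvaluatorConfig) -> Path:
    # Derive repo-relative paths from the configurable root instead of a
    # baked-in constant (subdirectory layout here is illustrative).
    return Path(cfg.bfcl_repo_root) / "score" / cfg.model
```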
45-47: Minor: update comment to v4. The comment still references bfcl_v3 structure; this code supports both prefixes.
nemo_skills/dataset/bfcl_v4/bfcl_score.py (2)
121-130: TRY003: shorten or centralize long exception message. Either compress the text or define a small custom exception with the message in the class docstring.
58-64: Unused constant FORMAT_SENSITIVITY. If not reported in v4 output, remove; otherwise, integrate it into agentic or separate reporting.
Is format_sensitivity still a tracked metric in your pipeline?
nemo_skills/inference/eval/bfcl.py (6)
52-57: Avoid hard dependency on bfcl_eval; make imports lazy/TYPE_CHECKING. Top-level imports will break module import for users without bfcl_eval (earlier code intentionally delayed such imports). Move to local/lazy imports and keep only type hints under TYPE_CHECKING.

    -from bfcl_eval.utils import is_memory_prereq, is_memory
    -from bfcl_eval.model_handler.utils import add_memory_instruction_system_prompt
    -from bfcl_eval.eval_checker.multi_turn_eval.func_source_code.memory_api_metaclass import (
    -    MemoryAPI,
    -)
    +from typing import TYPE_CHECKING
    +if TYPE_CHECKING:
    +    from bfcl_eval.eval_checker.multi_turn_eval.func_source_code.memory_api_metaclass import MemoryAPI

Additionally, import where used:
- Inside load_data():

    + from bfcl_eval.utils import is_memory_prereq  # local import to keep bfcl_eval optional

- Inside _generate_single_data_point_multi_turn() before the memory block:

    + from bfcl_eval.utils import is_memory
    + from bfcl_eval.model_handler.utils import add_memory_instruction_system_prompt
288-307: Make path handling robust and simplify key access; optional: event-loop guard.
- Use Path().parent instead of string replace.
- Use next(iter(...)) instead of list(...)[0] (RUF015).
- Confirm load_data will never run inside an active event loop before calling asyncio.run.
    - # First, fix the target paths to point to the actual target paths for memory stores
    - for datapoint in data:
    -     if "initial_config" in datapoint and list(datapoint["initial_config"].keys())[0].startswith("MemoryAPI"):
    -         datapoint["initial_config"][list(datapoint["initial_config"].keys())[0]]["model_result_dir"] = self.cfg.output_file.replace("/output.jsonl", "")
    + # First, fix the target paths to point to the actual target paths for memory stores
    + from pathlib import Path
    + for datapoint in data:
    +     if "initial_config" in datapoint:
    +         key = next(iter(datapoint["initial_config"]))
    +         if key.startswith("MemoryAPI"):
    +             datapoint["initial_config"][key]["model_result_dir"] = str(Path(self.cfg.output_file).parent)

Please confirm this method is not invoked under a running event loop (e.g., Jupyter) since it calls asyncio.run.
386-408: Harden memory-instance setup; avoid assert in prod and use RUF015 pattern.
- Replace assert with explicit check and error.
- Use next(iter(...)) for instance extraction.
- Optionally define memory_instance in outer scope for later flush.
    - if is_memory(test_category):
    + memory_instance = None
    + if is_memory(test_category):
          # Execute no function call, but just to get a reference to all the instances to get the initial state for logging purpose
          _, involved_instances = execute_multi_turn_func_call(
              [],
              initial_config,
              involved_classes,
              test_entry_id=test_entry_id,
              long_context=("long_context" in test_category or "composite" in test_category),
          )
    -
    -     assert (
    -         len(involved_instances) == 1
    -     ), "Memory category should only involve one class."
    -
    -     memory_instance: "MemoryAPI" = list(involved_instances.values())[0]
    +     if len(involved_instances) != 1:
    +         raise ValueError("Memory category should only involve one class.")
    +     memory_instance = next(iter(involved_instances.values()))
          data_point["question"] = add_memory_instruction_system_prompt(
              data_point["question"],
              test_category,
              data_point["scenario"],
              memory_instance,
          )
457-459: Defensive: ensure generation is a list before extend. Protect against accidental bare strings.

    - current_turn_response.extend(model_response["generation"])
    + gens = model_response["generation"]
    + if not isinstance(gens, list):
    +     gens = [gens]
    + current_turn_response.extend(gens)

464-469: Narrow except when decoding tool calls; avoid BLE001; keep log level low. Catch only expected parse errors from convert_to_function_call/json.

    - except Exception:
    -     LOG.info("No tools to execute in this turn. Proceed to next turn.")
    + except (json.JSONDecodeError, KeyError, TypeError, ValueError):
    +     LOG.debug("No tools to execute in this turn. Proceed to next turn.")
          break

483-487: Ensure tool_call_id is a string for downstream compatibility. Tokenizer/chat templates typically expect string IDs.

    - tool_message = {
    + tool_message = {
          "role": "tool",
          "content": execution_result,
    -     "tool_call_id": tool_call_id,
    +     "tool_call_id": str(tool_call_id),
      }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- dockerfiles/Dockerfile.nemo-skills (1 hunks)
- nemo_skills/dataset/bfcl_v4/bfcl_score.py (1 hunks)
- nemo_skills/evaluation/evaluator/bfcl.py (3 hunks)
- nemo_skills/inference/eval/bfcl.py (7 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- dockerfiles/Dockerfile.nemo-skills
🧰 Additional context used
🧬 Code graph analysis (3)
nemo_skills/inference/eval/bfcl.py (2)
nemo_skills/inference/generate.py (3)
- setup_prompt (368-381)
- load_data (427-444)
- process_single_datapoint (546-571)
nemo_skills/inference/eval/bfcl_utils.py (3)
- execute_multi_turn_func_call (94-181)
- convert_to_function_call (80-91)
- is_empty_execute_response (184-189)
nemo_skills/evaluation/evaluator/bfcl.py (1)
nemo_skills/utils.py (2)
- get_logger_name (131-135)
- nested_dataclass (50-83)
nemo_skills/dataset/bfcl_v4/bfcl_score.py (1)
nemo_skills/dataset/bfcl_v3/bfcl_score.py (6)
- calculate_combined_accuracy (52-74)
- get_accuracy_dict (77-112)
- calculate_non_live_single_turn_accuracy (115-137)
- calculate_live_single_turn_accuracy (140-156)
- calculate_multi_turn_accuracy (159-165)
- compute_score (168-190)
🪛 Ruff (0.14.1)
nemo_skills/inference/eval/bfcl.py
252-252: Do not catch blind exception: Exception
(BLE001)
294-294: Prefer next(iter(datapoint["initial_config"].keys())) over single element slice
Replace with next(iter(datapoint["initial_config"].keys()))
(RUF015)
295-295: Prefer next(iter(datapoint["initial_config"].keys())) over single element slice
Replace with next(iter(datapoint["initial_config"].keys()))
(RUF015)
400-400: Prefer next(iter(involved_instances.values())) over single element slice
Replace with next(iter(involved_instances.values()))
(RUF015)
466-466: Do not catch blind exception: Exception
(BLE001)
511-511: Prefer next(iter(involved_instances.values())) over single element slice
Replace with next(iter(involved_instances.values()))
(RUF015)
nemo_skills/dataset/bfcl_v4/bfcl_score.py
126-130: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: pre-commit
- GitHub Check: unit-tests
🔇 Additional comments (2)
nemo_skills/dataset/bfcl_v4/bfcl_score.py (1)
206-213: Good: overall uses numeric accuracies.
This fixes the prior bug of adding dicts to floats. Please confirm the weights (0.4/0.3/0.1/0.1/0.1) match the latest BFCL v4 spec for your run.
nemo_skills/inference/eval/bfcl.py (1)
20-20: OK to add asyncio import. No issues.
    def compute_score(metrics: dict):
        non_live_single_turn_accuracy = calculate_non_live_single_turn_accuracy(metrics)
        live_single_turn_accuracy = calculate_live_single_turn_accuracy(metrics)
        multi_turn_accuracy = calculate_multi_turn_accuracy(metrics)
        agentic_accuracy = calculate_agentic_accuracy(metrics)
        hallucination_accuracy = calculate_hallucination_measurement(metrics)
Reset GLOBAL_MAX_K per compute to avoid cross-call contamination.
Without reset, prior calls can spuriously raise ValueError on later datasets.
     def compute_score(metrics: dict):
    +    global GLOBAL_MAX_K
    +    GLOBAL_MAX_K = None

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
    def compute_score(metrics: dict):
        global GLOBAL_MAX_K
        GLOBAL_MAX_K = None
        non_live_single_turn_accuracy = calculate_non_live_single_turn_accuracy(metrics)
        live_single_turn_accuracy = calculate_live_single_turn_accuracy(metrics)
        multi_turn_accuracy = calculate_multi_turn_accuracy(metrics)
        agentic_accuracy = calculate_agentic_accuracy(metrics)
        hallucination_accuracy = calculate_hallucination_measurement(metrics)
🤖 Prompt for AI Agents
In nemo_skills/dataset/bfcl_v4/bfcl_score.py around lines 199 to 205, reset the
module-level GLOBAL_MAX_K at the start of compute_score to avoid cross-call
contamination; specifically, add a line that reinitializes GLOBAL_MAX_K to its
default (e.g., None or the original initial value) before calling the individual
calculate_* functions so each compute_score invocation starts with a clean
GLOBAL_MAX_K and cannot spuriously raise ValueError due to state left over from
previous calls.
    # Remove memory prereq from eval as those are only needed for inference
    if is_memory_prereq(sample["id"]):
        continue
    if sample.get("result", None) is None:
Guard against missing ids when filtering memory prereqs.
If a line lacks "id", this raises KeyError. Use get() and skip safely.
    - if is_memory_prereq(sample["id"]):
    + sid = sample.get("id")
    + if sid is not None and is_memory_prereq(sid):
          continue

🤖 Prompt for AI Agents
In nemo_skills/evaluation/evaluator/bfcl.py around lines 86 to 89, the code
accesses sample["id"] directly which can raise KeyError for samples missing an
"id"; change the check to safely retrieve the id first (e.g., id_val =
sample.get("id")) and only call is_memory_prereq when id_val is present,
skipping the sample otherwise (for example: if id_val and
is_memory_prereq(id_val): continue), keeping the existing check for
sample.get("result", None) afterward.
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
nemo_skills/dataset/bfcl_v3/prepare.py (1)
168-169: Fix Path type passed to process_file to avoid TypeError. process_multi_turn_test_case uses Path arithmetic ("/") with repo_root_dir; passing a str will fail. Pass a Path.

    - process_file(temp_dir, input_file, output_file, model_type=model_type)
    + process_file(Path(temp_dir), input_file, output_file, model_type=model_type)
♻️ Duplicate comments (1)
nemo_skills/dataset/bfcl_v4/utils.py (1)
15-21: Avoid duplication with bfcl_v3.utils where possible. If v4 logic mirrors v3, consider reusing/importing shared helpers to reduce drift; for example, see the sketch below.
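A minimal re-export sketch of that idea (assuming the v3 helpers really are identical; the helper names come from the code-graph section above):

```python
# nemo_skills/dataset/bfcl_v4/utils.py -- sketch of the reuse idea
from nemo_skills.dataset.bfcl_v3.utils import is_multi_turn, load_file  # noqa: F401

# Only v4-specific helpers (memory / web-search / format-sensitivity handling)
# would then need to live in this module.
```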
🧹 Nitpick comments (6)
nemo_skills/dataset/bfcl_v3/prepare.py (2)
160-160: Use removesuffix() or Path.stem for extension handling (safer than replace). replace(".json", "") can misfire if ".json" appears elsewhere. Prefer removesuffix (Py3.9+) or Path.stem.

    - split_dirname = os.path.join(output_dir, filename.removeprefix("BFCL_v4_").replace(".json", ""))
    + name = filename.removeprefix("BFCL_v4_").removesuffix(".json")
    + split_dirname = os.path.join(output_dir, name)

Alternatively:

    - split_dirname = os.path.join(output_dir, filename.removeprefix("BFCL_v4_").replace(".json", ""))
    + name = Path(filename).stem
    + name = name.removeprefix("BFCL_v4_")
    + split_dirname = os.path.join(output_dir, name)

176-179: Log full traceback on git failures. Use logging.exception for context.

    - except subprocess.CalledProcessError as e:
    -     LOG.error(f"Git command failed: {e}")
    -     LOG.error("Make sure git is installed and the repository URL is correct")
    + except subprocess.CalledProcessError:
    +     LOG.exception("Git command failed")
    +     LOG.error("Make sure git is installed and the repository URL is correct")

nemo_skills/dataset/bfcl_v4/utils.py (1)
107-118: Guard against unknown nested item types in _cast_to_openai_type. Direct dict indexing may KeyError for unseen types; fall back to "string".

    - elif "items" in properties[key]:
    -     properties[key]["items"]["type"] = mapping[properties[key]["items"]["type"]]
    + elif "items" in properties[key]:
    +     item_type = properties[key]["items"].get("type", "string")
    +     properties[key]["items"]["type"] = mapping.get(item_type, "string")
          if properties[key]["items"]["type"] == "array" and "items" in properties[key]["items"]:
    -         properties[key]["items"]["items"]["type"] = mapping[properties[key]["items"]["items"]["type"]]
    +         inner_type = properties[key]["items"]["items"].get("type", "string")
    +         properties[key]["items"]["items"]["type"] = mapping.get(inner_type, "string")
          elif properties[key]["items"]["type"] == "object" and "properties" in properties[key]["items"]:
              properties[key]["items"]["properties"] = _cast_to_openai_type(
                  properties[key]["items"]["properties"], mapping
              )
60-61: Use dict.get for config with a safe default.
Avoid KeyError when show_snippet is omitted.
- self.show_snippet = initial_config["show_snippet"]
+ self.show_snippet = initial_config.get("show_snippet", True)
280-281: Narrow the exception in fetch_url_content and keep trace.
Catch requests exceptions, not bare Exception; include traceback via logging if available.
- except Exception as e:
-     return {"error": f"An error occurred while fetching {url}: {str(e)}"}
+ except requests.RequestException as e:
+     return {"error": f"requests error while fetching {url}: {e!s}"}
nemo_skills/dataset/bfcl_v4/prepare.py (1)
195-197: Remove unnecessary f-string; keep traceback via exception().
Minor cleanup and better diagnostics.
- except subprocess.CalledProcessError as e:
-     LOG.exception(f"Git command failed")
+ except subprocess.CalledProcessError:
+     LOG.exception("Git command failed")
      LOG.error("Make sure git is installed and the repository URL is correct")
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- nemo_skills/dataset/bfcl_v3/prepare.py (4 hunks)
- nemo_skills/dataset/bfcl_v4/__init__.py (1 hunks)
- nemo_skills/dataset/bfcl_v4/constants.py (1 hunks)
- nemo_skills/dataset/bfcl_v4/prepare.py (1 hunks)
- nemo_skills/dataset/bfcl_v4/utils.py (1 hunks)
- nemo_skills/inference/eval/bfcl_web_search.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- nemo_skills/dataset/bfcl_v4/constants.py
🧰 Additional context used
🧬 Code graph analysis (2)
nemo_skills/dataset/bfcl_v4/prepare.py (3)
nemo_skills/dataset/bfcl_v4/utils.py (2)
- func_doc_language_specific_pre_processing (32-83)
- convert_to_tool (121-133)
nemo_skills/utils.py (1)
- get_logger_name (131-135)
nemo_skills/dataset/bfcl_v3/utils.py (2)
- is_multi_turn (51-55)
- load_file (39-48)
nemo_skills/dataset/bfcl_v3/prepare.py (1)
nemo_skills/dataset/bfcl_v4/prepare.py (1)
- download_and_process_bfcl_data (140-197)
🪛 Ruff (0.14.1)
nemo_skills/dataset/bfcl_v4/prepare.py
154-154: subprocess call: check for execution of untrusted input
(S603)
155-155: Starting a process with a partial executable path
(S607)
163-165: Avoid specifying long messages outside the exception class
(TRY003)
195-195: Local variable e is assigned to but never used
Remove assignment to unused variable e
(F841)
196-196: f-string without any placeholders
Remove extraneous f prefix
(F541)
197-197: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
nemo_skills/dataset/bfcl_v3/prepare.py
133-133: Starting a process with a partial executable path
(S607)
nemo_skills/inference/eval/bfcl_web_search.py
53-53: Standard pseudo-random generators are not suitable for cryptographic purposes
(S311)
55-55: Standard pseudo-random generators are not suitable for cryptographic purposes
(S311)
57-57: Unused method argument: long_context
(ARG002)
155-155: Standard pseudo-random generators are not suitable for cryptographic purposes
(S311)
165-165: Standard pseudo-random generators are not suitable for cryptographic purposes
(S311)
178-178: Use explicit conversion flag
Replace with conversion flag
(RUF010)
184-184: Comment contains ambiguous – (EN DASH). Did you mean - (HYPHEN-MINUS)?
(RUF003)
225-225: Avoid specifying long messages outside the exception class
(TRY003)
278-278: Abstract raise to an inner function
(TRY301)
278-278: Avoid specifying long messages outside the exception class
(TRY003)
280-280: Do not catch blind exception: Exception
(BLE001)
281-281: Use explicit conversion flag
Replace with conversion flag
(RUF010)
285-285: Docstring contains ambiguous ‑ (NON-BREAKING HYPHEN). Did you mean - (HYPHEN-MINUS)?
(RUF002)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: pre-commit
- GitHub Check: unit-tests
🔇 Additional comments (3)
nemo_skills/dataset/bfcl_v3/prepare.py (1)
117-117: Confirm intent: modifying bfcl_v3 prepare to process v4 files.
Defaulting file_prefix to "BFCL_v4" inside bfcl_v3/prepare.py blurs version boundaries. If unintentional, revert to v3 defaults and move v4 logic to bfcl_v4. If intentional, please confirm downstream callers expect v4 outputs here.
nemo_skills/dataset/bfcl_v4/__init__.py (1)
15-48: BFCLv4 dataset manifest looks good; ensure split parity with evaluator.
Constants and BENCHMARKS map are fine. Please verify SPLITS matches bfcl_eval.constants.category_mapping.ALL_SCORING_CATEGORIES to avoid drift with prepare/evaluator.
nemo_skills/dataset/bfcl_v4/prepare.py (1)
154-156: Review is based on incorrect assumption; bfcl_v3 does not implement the suggested pattern.
The review claims the fix follows "same rationale as bfcl_v3," but inspection of nemo_skills/dataset/bfcl_v3/prepare.py (line 133) shows it contains only `git clone --depth=1` without any subsequent `git checkout`. The suggested pattern does not exist as a reference implementation. Additionally, the proposed diff uses `PLACEHOLDER_SHA` rather than an actual commit identifier, and there is no documented pinning strategy or specific commit/tag identified in the codebase for either dataset version.
Likely an incorrect or invalid review comment.
-    # v1.3 corresponds the release version for BFCL v3
     subprocess.run(
-        ["git", "clone", "-b", "v1.3", "--depth=1", repo_url, temp_dir], check=True, capture_output=True
+        ["git", "clone", "--depth=1", repo_url, temp_dir], check=True, capture_output=True
🛠️ Refactor suggestion | 🟠 Major
🧩 Analysis chain
Pin the BFCL repo clone to a known commit/tag; avoid floating HEAD.
Shallow clone of HEAD risks future breakage. Pin via branch/tag or checkout a SHA; also consider resolving git via shutil.which to appease S607.
- subprocess.run(
- ["git", "clone", "--depth=1", repo_url, temp_dir], check=True, capture_output=True
- )
+ subprocess.run(["git", "clone", "--depth=1", repo_url, temp_dir], check=True, capture_output=True)
+ # Optional: pin to a known commit (replace PLACEHOLDER_SHA)
+ subprocess.run(["git", "-C", temp_dir, "checkout", "PLACEHOLDER_SHA"], check=True, capture_output=True)If you prefer a tag/branch:
- subprocess.run(["git", "clone", "--depth=1", repo_url, temp_dir], check=True, capture_output=True)
+ subprocess.run(["git", "clone", "--depth=1", "--branch", "v1.3", repo_url, temp_dir], check=True, capture_output=True)
---
🏁 Script executed:
```shell
#!/bin/bash
# Examine the actual code at prepare.py around line 133
rg -n "git.*clone.*depth=1" --type py -C 10
```
Length of output: 3814
🏁 Script executed:
#!/bin/bash
# Search for BFCL version/commit references in the codebase
rg -i "bfcl.*commit|bfcl.*sha|bfcl.*tag|bfcl.*version" --type py --type yaml --type json --type toml
Length of output: 0
🏁 Script executed:
#!/bin/bash
# Look for repo_url definition and how it's used in prepare.py
rg -n "repo_url" nemo_skills/dataset/bfcl_v3/prepare.py -B 5 -A 5
Length of output: 1713
🏁 Script executed:
#!/bin/bash
# Check if there's any documentation or configuration for BFCL in the project
fd -t f -e md -e txt -e yaml -e json -e toml | xargs rg -l -i "bfcl" 2>/dev/null | head -20
Length of output: 183
---
🏁 Script executed:
```shell
#!/bin/bash
# Find all call sites of download_and_process_bfcl_data
rg -n "download_and_process_bfcl_data" --type py -B 3 -A 3
```
Length of output: 2226
🏁 Script executed:
#!/bin/bash
# Search for where repo URLs or BFCL repo references are defined
rg -n "github.com.*bfcl|https.*bfcl" --type py --type yaml --type json
Length of output: 1039
🏁 Script executed:
#!/bin/bash
# Check if shutil.which is used elsewhere for subprocess security
rg -n "shutil.which|S607" --type py -A 2
Length of output: 266
🏁 Script executed:
#!/bin/bash
# Look for main entry points and argument parsing
fd -t f "main.py|cli.py|__main__.py" --type py | xargs rg -l "bfcl" 2>/dev/null
Length of output: 1190
---
🏁 Script executed:
```shell
#!/bin/bash
# Find REPO_URL definition in bfcl_v3/prepare.py
rg -n "REPO_URL\s*=" nemo_skills/dataset/bfcl_v3/prepare.py -A 2 -B 2
```
Length of output: 140
🏁 Script executed:
#!/bin/bash
# Check all references to REPO_URL in both bfcl_v3 and bfcl_v4 prepare files
rg -n "REPO_URL" nemo_skills/dataset/bfcl_v3/prepare.py nemo_skills/dataset/bfcl_v4/prepare.py
Length of output: 486
🏁 Script executed:
#!/bin/bash
# Examine context around subprocess.run in prepare.py to understand if git is pre-checked
rg -n "subprocess.run" nemo_skills/dataset/bfcl_v3/prepare.py -B 5 -A 2
Length of output: 364
Apply S607 security check and pin BFCL repository to a stable commit.
The shallow clone without pinning creates two addressable risks: (1) subprocess call without validating git availability violates S607 (Bandit security rule already applied in ojbench/prepare.py); (2) floating HEAD on shallow clone breaks if upstream repo structure changes.
Both issues are fixable. For (1), add shutil.which("git") validation before the subprocess call. For (2), either pin to a commit via git -C checkout after clone, or use --branch during clone (requires knowing a stable branch name from the upstream repo).
Consider implementing both:
- Validate git executable is available
- Pin to a known commit SHA (preferred for reproducibility) or stable tag/branch
🧰 Tools
🪛 Ruff (0.14.1)
133-133: Starting a process with a partial executable path
(S607)
🤖 Prompt for AI Agents
In nemo_skills/dataset/bfcl_v3/prepare.py around line 133, the subprocess git
clone call does not validate git is available and performs a shallow clone
without pinning, risking S607 bandit failure and a floating HEAD; fix by first
checking shutil.which("git") and raising a clear exception if not found, then
perform the clone and immediately pin to a stable commit or tag (preferred:
define a BFCL_COMMIT_SHA constant and run a git -C <temp_dir> checkout
<BFCL_COMMIT_SHA> after clone) or alternatively use --branch <stable-branch> in
the clone command; ensure subprocess calls use check=True and capture_output for
errors and include helpful error messages when git commands fail.
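A rough sketch combining both suggestions; `BFCL_COMMIT_SHA` is a placeholder (no specific commit is identified in this thread) and the function name is illustrative:

```python
import shutil
import subprocess

BFCL_COMMIT_SHA = "PLACEHOLDER_SHA"  # placeholder only; a real pin still needs to be chosen

def clone_bfcl_repo(repo_url: str, temp_dir: str) -> None:
    git = shutil.which("git")  # resolve the executable path instead of relying on a PATH lookup (S607)
    if git is None:
        raise RuntimeError("git executable not found on PATH")
    # Full clone so an arbitrary commit can be checked out afterwards;
    # pinning to a tag/branch would instead allow `--depth=1 --branch <tag>`.
    subprocess.run([git, "clone", repo_url, temp_dir], check=True, capture_output=True)
    subprocess.run([git, "-C", temp_dir, "checkout", BFCL_COMMIT_SHA], check=True, capture_output=True)
```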
| """ | ||
| This function retrieves the dataset entry for a given test category. | ||
| The input should not be a test category goup, but a specific test category. | ||
| If `contain_prereq` is True, it will include the pre-requisite entries for the memory test categories. | ||
| If `include_language_specific_hint` is True, it will include the language-specific hint for the function description (for Java, JavaScript, and Python). | ||
| """ |
Docstring parameter typo.
The text says contain_prereq; the parameter is include_prereq.
- If `contain_prereq` is True, it will include the pre-requisite entries for the memory test categories.
+ If `include_prereq` is True, it will include the pre-requisite entries for the memory test categories.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| """ | |
| This function retrieves the dataset entry for a given test category. | |
| The input should not be a test category goup, but a specific test category. | |
| If `contain_prereq` is True, it will include the pre-requisite entries for the memory test categories. | |
| If `include_language_specific_hint` is True, it will include the language-specific hint for the function description (for Java, JavaScript, and Python). | |
| """ | |
| """ | |
| This function retrieves the dataset entry for a given test category. | |
| The input should not be a test category goup, but a specific test category. | |
| If `include_prereq` is True, it will include the pre-requisite entries for the memory test categories. | |
| If `include_language_specific_hint` is True, it will include the language-specific hint for the function description (for Java, JavaScript, and Python). | |
| """ |
🤖 Prompt for AI Agents
nemo_skills/dataset/bfcl_v4/prepare.py around lines 86 to 91: the docstring
incorrectly references the parameter name `contain_prereq`; update it to
`include_prereq` so the description matches the actual function signature, and
scan the docstring for any other mismatched parameter names (e.g.,
`include_language_specific_hint`) and correct them for consistency.
    if include_language_specific_hint:
        all_entries = add_language_specific_hint_to_function_doc(all_entries)

    all_entries = populate_initial_settings_for_memory_test_cases(
        all_entries, str(target_folder)
    )
    all_entries = populate_initial_settings_for_web_search_test_cases(
        all_entries
    )

    # Convert function calls to tools format and add them to the system prompt
    for instance in all_entries:
        if "function" in instance:
            # Add the tools to the system prompt
            instance["function"] = func_doc_language_specific_pre_processing(instance["function"], test_category)
            instance["tools"] = convert_to_tool(instance["function"])
Avoid double-adding language hints to function descriptions.
Both add_language_specific_hint_to_function_doc and func_doc_language_specific_pre_processing append hints, leading to duplicates. Keep one source of truth.
- if include_language_specific_hint:
- all_entries = add_language_specific_hint_to_function_doc(all_entries)
+ if include_language_specific_hint:
+ # Defer language hinting to func_doc_language_specific_pre_processing to avoid duplication.
+ pass
Or drop the description-append in func_doc_language_specific_pre_processing and leave casting only. Pick one consistently.
Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In nemo_skills/dataset/bfcl_v4/prepare.py around lines 120-136, the language
hint is being appended twice because add_language_specific_hint_to_function_doc
is called earlier and func_doc_language_specific_pre_processing also appends
hints; to fix, remove the hint-appending behavior from
func_doc_language_specific_pre_processing so it only normalizes/casts the
function doc (or alternatively skip calling
add_language_specific_hint_to_function_doc here), and ensure only one place is
responsible for adding the language-specific hint to function descriptions to
prevent duplicates.
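One way the single-source-of-truth option could look, as a hedged sketch: the hint helper is stubbed (here with substring checks so split names like simple_java still match, which is an assumption), and the real pre-processing function also performs type casting, which is omitted:

```python
def _get_language_specific_hint(test_category: str) -> str:
    # Stubbed hint lookup; substring matching is an assumption, not the current code.
    if "javascript" in test_category:
        return " Note that the provided function is in JavaScript syntax."
    if "java" in test_category:
        return " Note that the provided function is in Java 8 SDK syntax."
    return " Note that the provided function is in Python 3 syntax."

def append_language_hint(functions: list[dict], test_category: str) -> list[dict]:
    hint = _get_language_specific_hint(test_category)
    for func in functions:
        if not func["description"].endswith(hint):
            func["description"] += hint  # idempotent: a second pass cannot double-append
    return functions
```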
def _get_language_specific_hint(test_category):
    if test_category == "java":
        return " Note that the provided function is in Java 8 SDK syntax."
    elif test_category == "javascript":
        return " Note that the provided function is in JavaScript syntax."
    else:
        return " Note that the provided function is in Python 3 syntax."
Language detection will misclassify categories (e.g., simple_java) as Python.
test_category equals split names like "simple_java"/"simple_javascript", not bare "java"/"javascript". Hints and type coercion won’t apply.
-def _get_language_specific_hint(test_category):
- if test_category == "java":
+def _detect_lang(test_category: str) -> str:
+ t = test_category.lower()
+ if "javascript" in t:
+ return "javascript"
+ if "java" in t:
+ return "java"
+ return "python"
+
+def _get_language_specific_hint(test_category):
+ lang = _detect_lang(test_category)
+ if lang == "java":
return " Note that the provided function is in Java 8 SDK syntax."
- elif test_category == "javascript":
+ elif lang == "javascript":
return " Note that the provided function is in JavaScript syntax."
else:
return " Note that the provided function is in Python 3 syntax."
@@
-def func_doc_language_specific_pre_processing(function, test_category):
+def func_doc_language_specific_pre_processing(function, test_category):
+ lang = _detect_lang(test_category)
@@
- if test_category == "java":
+ if lang == "java":
@@
- elif test_category == "javascript":
+ elif lang == "javascript":Also applies to: 32-83
🤖 Prompt for AI Agents
In nemo_skills/dataset/bfcl_v4/utils.py around lines 23 to 30 (and similarly
lines 32-83), the function compares test_category to exact strings
("java"/"javascript") but actual categories are named like "simple_java" or
"simple_javascript", so detections fall through to Python; update the
conditional logic to detect language tokens within the category string (e.g.,
check for "java" and "javascript" substrings or use a regex/startswith pattern)
and return the appropriate language hint; apply the same substring-based
detection wherever test_category is used in lines 32-83 to ensure correct hints
and type coercion for categories like "simple_java" and "simple_javascript".
backoff = 10  # initial back-off in seconds

# Infinite retry loop with exponential backoff
while True:
    try:
        wait_time = backoff + random.uniform(0, backoff)
        search_results = DDGS(timeout=60).text(
            query=keywords,
            region=region,
            max_results=max_results,
            backend="duckduckgo"
        )

    except DDGSException as e:
        if "No results found" in str(e):
            wait_time = backoff + random.uniform(0, backoff)
            error_block = (
                "*" * 100
                + f"\n❗️❗️ [WebSearchAPI] Hit rate limit on DuckDuckGo requests. This is a common behaviour. If unable to run eval due to repeated rate limits, try to decrease job parallelism. Retrying in {wait_time:.1f} seconds…"
                + "*" * 100
            )
            print(error_block)
            time.sleep(wait_time)
            backoff = min(backoff * 2, 120)  # cap the back-off
            continue
        else:
            error_block = (
                "*" * 100
                + f"\n❗️❗️ [WebSearchAPI] Error from DuckDuckGo: {str(e)}. This is not a rate-limit error, so it will not be retried."
                + "*" * 100
            )
            print(error_block)
            return {"error": str(e)}

    break  # Success – no rate-limit error detected

if not search_results:
Make retries finite and handle DDGS generator/results correctly.
- DDGS.text may return a generator; slicing and truthiness checks fail.
- Infinite loop risks hanging; handle Ratelimit/Timeout explicitly and cap retries.
- backoff = 10 # initial back-off in seconds
-
- # Infinite retry loop with exponential backoff
- while True:
+ backoff = 10 # initial back-off in seconds
+ max_retries = 6
+ attempts = 0
+ while attempts < max_retries:
try:
wait_time = backoff + random.uniform(0, backoff)
- search_results = DDGS(timeout=60).text(
+ search_results = list(DDGS(timeout=60).text(
query=keywords,
region=region,
max_results=max_results,
backend="duckduckgo"
- )
+ ))
- except DDGSException as e:
- if "No results found" in str(e):
- wait_time = backoff + random.uniform(0, backoff)
- error_block = (
- "*" * 100
- + f"\n❗️❗️ [WebSearchAPI] Hit rate limit on DuckDuckGo requests. This is a common behaviour. If unable to run eval due to repeated rate limits, try to decrease job parallelism. Retrying in {wait_time:.1f} seconds…"
- + "*" * 100
- )
- print(error_block)
- time.sleep(wait_time)
- backoff = min(backoff * 2, 120) # cap the back-off
- continue
- else:
- error_block = (
- "*" * 100
- + f"\n❗️❗️ [WebSearchAPI] Error from DuckDuckGo: {str(e)}. This is not a rate-limit error, so it will not be retried."
- + "*" * 100
- )
- print(error_block)
- return {"error": str(e)}
+ except (RatelimitException, TimeoutException) as e:
+ attempts += 1
+ print("*" * 100 + f"\n[WebSearchAPI] Rate limited/timed out. Retrying in {wait_time:.1f}s ({attempts}/{max_retries})..." + "*" * 100)
+ time.sleep(wait_time)
+ backoff = min(backoff * 2, 120)
+ continue
+ except DDGSException as e:
+ print("*" * 100 + f"\n[WebSearchAPI] DuckDuckGo error: {e!s}. Not retriable." + "*" * 100)
+ return {"error": str(e)}
break # Success – no rate-limit error detected
+ if attempts >= max_retries:
+ return {"error": "Exceeded max retries while querying DuckDuckGo due to rate limits/timeouts."}
- if not search_results:
+ if not search_results:
return {
"error": "Failed to retrieve the search results from server. Please try again later."
}
# Convert the search results to the desired format
- results = []
- for result in search_results[:max_results]:
+ results = []
+ for result in search_results[:max_results]:
if self.show_snippet:
results.append(
{
"title": result["title"],
"href": result["href"],
"body": result["body"],
}
)
else:
results.append(
{
"title": result["title"],
"href": result["href"],
}
)
return results
Also applies to: 191-210
🧰 Tools
🪛 Ruff (0.14.1)
155-155: Standard pseudo-random generators are not suitable for cryptographic purposes
(S311)
165-165: Standard pseudo-random generators are not suitable for cryptographic purposes
(S311)
178-178: Use explicit conversion flag
Replace with conversion flag
(RUF010)
184-184: Comment contains ambiguous – (EN DASH). Did you mean - (HYPHEN-MINUS)?
(RUF003)
🤖 Prompt for AI Agents
In nemo_skills/inference/eval/bfcl_web_search.py around lines 150-186 (and
similarly 191-210), the code uses an infinite retry loop and treats DDGS.text as
if it were directly truthy/sliceable; change to a finite retry loop (e.g.,
max_retries with attempt counter), explicitly catch DDGSException and timeout
errors, and on each try convert the DDGS.text result to a list (results =
list(DDGS(...).text(...))) before checking emptiness or slicing; when
rate-limit/timeout errors occur, log and sleep with exponential backoff capped
(e.g., backoff = min(backoff*2, 120)) and retry until max_retries is reached,
then return a clear error; for non-retryable errors return immediately; ensure
checks use if not results rather than truthiness of the generator.
Actionable comments posted: 1
♻️ Duplicate comments (1)
nemo_skills/inference/eval/bfcl_web_search.py (1)
152-209: CRITICAL: Past review comments on infinite retry loop and generator handling were NOT addressed.
The issues flagged in the previous review are still present:
- Infinite loop (line 152): `while True:` with no max retry limit can hang indefinitely under persistent rate-limiting.
- Generator handling (line 155): `DDGS().text()` returns a generator, not a list.
- Truthiness check fails (line 185): `if not search_results:` will not work correctly on a generator (generators are always truthy until exhausted).
- Slicing fails (line 192): `search_results[:max_results]` will raise `TypeError` because generators don't support slicing.
These are runtime blockers that will cause failures in production.
The previous review provided a comprehensive fix. Please apply it:
- backoff = 10 # initial back-off in seconds - - # Infinite retry loop with exponential backoff - while True: + backoff = 10 # initial back-off in seconds + max_retries = 6 + attempts = 0 + + while attempts < max_retries: try: wait_time = backoff + random.uniform(0, backoff) - search_results = DDGS(timeout=60).text( + search_results = list(DDGS(timeout=60).text( query=keywords, region=region, max_results=max_results, backend="duckduckgo" - ) + )) except DDGSException as e: if "No results found" in str(e): + attempts += 1 wait_time = backoff + random.uniform(0, backoff) error_block = ( "*" * 100 - + f"\n❗️❗️ [WebSearchAPI] Hit rate limit on DuckDuckGo requests. This is a common behaviour. If unable to run eval due to repeated rate limits, try to decrease job parallelism. Retrying in {wait_time:.1f} seconds…" + + f"\n❗️❗️ [WebSearchAPI] Hit rate limit on DuckDuckGo requests ({attempts}/{max_retries}). This is a common behaviour. If unable to run eval due to repeated rate limits, try to decrease job parallelism. Retrying in {wait_time:.1f} seconds..." + "*" * 100 ) print(error_block) time.sleep(wait_time) backoff = min(backoff * 2, 120) # cap the back-off continue else: error_block = ( "*" * 100 + f"\n❗️❗️ [WebSearchAPI] Error from DuckDuckGo: {str(e)}. This is not a rate-limit error, so it will not be retried." + "*" * 100 ) print(error_block) return {"error": str(e)} - break # Success – no rate-limit error detected + break # Success - no rate-limit error detected + + if attempts >= max_retries: + return {"error": "Exceeded max retries while querying DuckDuckGo due to rate limits."} if not search_results: return { "error": "Failed to retrieve the search results from server. Please try again later." } # Convert the search results to the desired format results = [] for result in search_results[:max_results]: if self.show_snippet: results.append( { "title": result["title"], "href": result["href"], "body": result["body"], } ) else: results.append( { "title": result["title"], "href": result["href"], } ) return results
🧹 Nitpick comments (2)
requirements/main.txt (1)
40-40: Verify CPU-only FAISS variant is intentional.
The addition of `faiss-cpu` suggests CPU-only deployment. If GPU acceleration is available or expected in production, `faiss-gpu` or the base `faiss` package may be more appropriate. Confirm this choice aligns with deployment targets.
Please clarify: Is the CPU-only variant intentional, or should this be `faiss` (which auto-selects based on environment)?
nemo_skills/inference/eval/bfcl_web_search.py (1)
228-249: Consider more robust browser emulation for production use.
The custom headers help avoid 403 Forbidden errors, but some sites may still detect this as a bot. For more reliable fetching, consider:
- Option 1: Use `requests-html`, which includes a browser-like session with better default headers.
- Option 2: For sites with heavy JavaScript, consider `playwright` or `selenium` for full browser automation.
This is noted by the TODO comment on line 228. The current approach is acceptable for initial implementation, but may need enhancement based on real-world evaluation results.
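For reference, a small sketch of the header-based approach the current code takes (the header values and helper name here are illustrative, not the PR's exact ones):

```python
import requests
from bs4 import BeautifulSoup

BROWSER_LIKE_HEADERS = {
    # Illustrative desktop UA; any reasonably current browser string behaves similarly.
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
}

def fetch_page_text(url: str, timeout: int = 30) -> dict:
    """Fetch a URL with browser-like headers and return plain text or an error dict."""
    try:
        resp = requests.get(url, headers=BROWSER_LIKE_HEADERS, timeout=timeout)
        resp.raise_for_status()
    except requests.RequestException as e:
        return {"error": f"requests error while fetching {url}: {e!s}"}
    soup = BeautifulSoup(resp.text, "html.parser")
    return {"text": soup.get_text(separator="\n", strip=True)}
```

Heavier JavaScript-driven sites would still need a real browser (playwright/selenium), as noted above.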
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- nemo_skills/inference/eval/bfcl_web_search.py (1 hunks)
- requirements/main.txt (2 hunks)
🧰 Additional context used
🪛 Ruff (0.14.1)
nemo_skills/inference/eval/bfcl_web_search.py
52-52: Standard pseudo-random generators are not suitable for cryptographic purposes
(S311)
54-54: Standard pseudo-random generators are not suitable for cryptographic purposes
(S311)
56-56: Unused method argument: long_context
(ARG002)
154-154: Standard pseudo-random generators are not suitable for cryptographic purposes
(S311)
164-164: Standard pseudo-random generators are not suitable for cryptographic purposes
(S311)
177-177: Use explicit conversion flag
Replace with conversion flag
(RUF010)
183-183: Comment contains ambiguous – (EN DASH). Did you mean - (HYPHEN-MINUS)?
(RUF003)
224-224: Avoid specifying long messages outside the exception class
(TRY003)
277-277: Abstract raise to an inner function
(TRY301)
277-277: Avoid specifying long messages outside the exception class
(TRY003)
279-279: Do not catch blind exception: Exception
(BLE001)
280-280: Use explicit conversion flag
Replace with conversion flag
(RUF010)
284-284: Docstring contains ambiguous ‑ (NON-BREAKING HYPHEN). Did you mean - (HYPHEN-MINUS)?
(RUF002)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: unit-tests
- GitHub Check: pre-commit
🔇 Additional comments (5)
requirements/main.txt (2)
16-16: Good practice: explanatory comments for new dependencies.
The inline comments ("Needed for BFCLv4") are helpful for maintainability and context. This aligns well with other similar comments in the file (e.g., "Requires patching…" on line 31).
Also applies to: 38-40
16-16: Remove unused faiss-cpu dependency; version pinning concern is invalid.
The original review comment's version pinning concern misrepresents project practice: the codebase intentionally leaves most dependencies unpinned, pinning only those with known issues (e.g., litellm==1.77.5). The new packages follow this existing pattern correctly.
However, verification reveals a critical issue:
litellm==1.77.5). The new packages follow this existing pattern correctly.However, verification reveals a critical issue:
faiss-cpuis added to requirements but never imported or used anywhere in the codebase. Onlyddgsandsentence_transformersare actively used for BFCLv4. Either removefaiss-cpuor confirm if supporting code is missing.Likely an incorrect or invalid review comment.
nemo_skills/inference/eval/bfcl_web_search.py (3)
46-54: Random instances are appropriate for non-cryptographic error simulation.
The S311 static analysis warnings about `random.Random` are false positives here. These instances with fixed seeds are explicitly used to simulate deterministic errors for testing purposes, not for any security-sensitive operations.
279-280: Broad exception handling is appropriate for this utility function.
Catching `Exception` here is reasonable for a general-purpose URL fetching utility that needs to gracefully handle network errors, timeouts, parsing failures, etc., and return a consistent error dict format. The BLE001 static analysis warning can be safely ignored in this context.
282-298: LGTM: Error simulation helper is well-implemented.
The helper method correctly generates realistic error messages by parsing the URL and formatting templates with appropriate context. While currently unused (as noted in comments), this provides valuable testing infrastructure for future use.
# limitations under the License.

bs4
ddgs  # Needed for BFCLv4
🧩 Analysis chain
Verify new dependencies are actually used in BFCLv4 code.
The added packages (ddgs, sentence_transformers, faiss-cpu) include helpful explanatory comments indicating BFCLv4 use, but this should be verified to ensure they're not accidentally included.
Also applies to: 38-40
🏁 Script executed:
#!/bin/bash
# Verify new packages are imported in the codebase
echo "=== Searching for ddgs imports ==="
rg "import\s+ddgs|from\s+ddgs" --type py
echo ""
echo "=== Searching for sentence_transformers imports ==="
rg "import\s+sentence_transformers|from\s+sentence_transformers" --type py
echo ""
echo "=== Searching for faiss imports ==="
rg "import\s+faiss|from\s+faiss" --type pyLength of output: 382
I need to verify the faiss package more thoroughly and check the actual requirements file.
Now let me check the requirements file and search for faiss more broadly:
🏁 Script executed:
#!/bin/bash
# Check requirements/main.txt for the new packages
echo "=== Checking requirements/main.txt around lines 16 and 38-40 ==="
cat requirements/main.txt | head -45 | tail -30
echo ""
echo "=== Broader search for faiss (case-insensitive) ==="
rg -i "faiss" --type py
echo ""
echo "=== Search for any Python files importing from faiss ==="
rg "faiss\." --type py | head -20Length of output: 876
Remove faiss-cpu dependency or verify if it's indirectly required.
Verification shows that ddgs and sentence_transformers are genuinely used in the codebase:
- ddgs is imported in nemo_skills/inference/eval/bfcl_web_search.py
- sentence_transformers is imported in nemo_skills/inference/retrieve_similar.py
However, faiss-cpu shows no imports or usage anywhere in the codebase despite the comment claiming it's "Needed for BFCLv4". Either remove this unused dependency or clarify if it's an indirect transitive dependency that shouldn't be explicitly listed.
🤖 Prompt for AI Agents
In requirements/main.txt around line 16, the faiss-cpu entry appears unused per
the review; either remove the faiss-cpu dependency from the requirements file or
verify and document why it must remain: search the repo for any direct
imports/usages of faiss (e.g., grep for "faiss" and check runtime code paths),
check whether a package like sentence_transformers or ddgs pulls it in
transitively (pipdeptree or pip install . and pip check), and if it is only a
transitive dependency remove the explicit listing (or replace with a comment
noting it is provided transitively); update the requirements file accordingly
and add a brief note in the file or PR explaining the change.
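A quick, stdlib-only way to check whether either package declares faiss directly; note this inspects only direct requirements of installed distributions, not the full transitive tree:

```python
from importlib.metadata import PackageNotFoundError, requires

for pkg in ("sentence-transformers", "ddgs"):
    try:
        deps = requires(pkg) or []  # declared requirements of the installed distribution
    except PackageNotFoundError:
        print(f"{pkg}: not installed in this environment")
        continue
    hits = [d for d in deps if "faiss" in d.lower()]
    print(f"{pkg}: {hits or 'no direct faiss requirement'}")
```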
Summary by CodeRabbit
New Features
Refactor
Chores