
Conversation

Collaborator

@Glorf Glorf commented Oct 8, 2025

Summary by CodeRabbit

  • New Features

    • BFCL v4 support: dataset downloader/prep pipeline, expanded splits (multi-turn, live, language-specific), memory and web-search handling.
    • New evaluation outputs: agentic (memory + web-search) and hallucination scores.
    • Built-in web-search backend with query and URL-fetch capabilities.
  • Refactor

    • Migration of dataset, evaluation, and inference flows to BFCL v4 naming and score structure; multi-turn and memory flows enhanced.
  • Chores

    • Updated constants, default evaluator model, ignored dataset paths, and added runtime dependencies.

Signed-off-by: Michal Bien <[email protected]>
@Glorf Glorf requested a review from shtoshni October 8, 2025 08:56
Contributor

coderabbitai bot commented Oct 8, 2025

Walkthrough

Introduces BFCL v4: new dataset metadata and splits, scoring extended with memory/web-search/format_sensitivity and agentic/hallucination components, data prepare/utils for v4 ingestion and OpenAI-style tools, evaluator/inference memory & web-search integrations, new WebSearchAPI, and related infra updates.

Changes

Cohort / File(s) Summary of Changes
Dataset v4 metadata
nemo_skills/dataset/bfcl_v4/__init__.py
New module defining DATASET_GROUP, SPLITS (many BFCL v4 splits), IS_BENCHMARK_GROUP, SCORE_MODULE, and BENCHMARKS mapping.
Scoring / bfcl_v4 logic
nemo_skills/dataset/bfcl_v4/bfcl_score.py
New constants MEMORY, WEB_SEARCH, FORMAT_SENSITIVITY; GLOBAL_MAX_K validation; get_accuracy_dict(..., optional=False) extended; added calculate_agentic_accuracy, calculate_hallucination_measurement, and compute_score(metrics: dict) to return consolidated bfcl_v4 output including agentic and hallucination components.
v4 constants
nemo_skills/dataset/bfcl_v4/constants.py
New path/version constants: VERSION_PREFIX, MULTI_TURN_FUNC_DOC_PATH, DATA_FOLDER_PATH (Path objects).
Preparation pipeline (v4)
nemo_skills/dataset/bfcl_v4/prepare.py
New prepare module: shallow-clone BFCL repo, download_and_process_bfcl_data, load_dataset_entry, process_multi_turn_test_case; per-category processing, language hints, multi-turn handling, conversion to tool-format, and writing per-split test.jsonl and __init__.py.
Function/tool utilities (v4)
nemo_skills/dataset/bfcl_v4/utils.py
New utilities: _get_language_specific_hint, func_doc_language_specific_pre_processing, _cast_to_openai_type, and convert_to_tool to produce OpenAI-style function/tool descriptors and normalize types.
Evaluator updates (bfcl)
nemo_skills/evaluation/evaluator/bfcl.py
Default model updated; test-category extraction now strips bfcl_v3. and bfcl_v4.; output/score filenames/paths updated to BFCL_v4 with category directory structure; memory-prereq samples filtered during BFCL format conversion.
v3 → naming & score adjustments
nemo_skills/dataset/bfcl_v3/__init__.py, .../constants.py, .../prepare.py, .../bfcl_score.py
Renamed splits (simple → simple_python, java → simple_java, javascript → simple_javascript), added live/multi-turn splits, updated TEST_COLLECTION_MAPPING, adjusted prepare.py defaults to BFCL_v4 and filters, and restructured the bfcl_score.py return shape.
Inference / execution changes
nemo_skills/inference/eval/bfcl.py, .../bfcl_utils.py
Added memory handling (is_memory_prereq/is_memory, MemoryAPI integration), BFCLGenerationTask.load_data to order/prioritize memory prereqs, deep-copy config loading, BACKEND_PATH_PREFIX, expanded CLASS_FILE_PATH_MAPPING (WebSearchAPI, MemoryAPI variants), and memory flush behavior.
Web search backend
nemo_skills/inference/eval/bfcl_web_search.py
New WebSearchAPI class with _load_scenario, search_engine_query (DDGS with retries/backoff), fetch_url_content (raw/markdown/truncate), and error-message simulation (ERROR_TEMPLATES).
Requirements & infra
requirements/main.txt, .gitignore, dockerfiles/Dockerfile.nemo-skills
Added dependencies ddgs, sentence_transformers, faiss-cpu; .gitignore adds nemo_skills/dataset/bfcl_v4/*/; Dockerfile pinned Gorilla repo commit updated.
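
For readers unfamiliar with the target format mentioned in the "Function/tool utilities (v4)" row above, this is the general shape of an OpenAI-style tool descriptor that a converter like convert_to_tool produces; the function name and fields below are illustrative examples, not entries from the BFCL data.

example_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical function, not from the dataset
        "description": "Get the current weather for a given city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City to look up."},
                "units": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}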

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Prep as prepare.py
  participant Git as GitHub
  participant FS as Local FS
  participant Utils as utils.convert_to_tool

  Note over Prep,Git: BFCL v4 download & per-category prepare
  Prep->>Git: shallow clone(repo_url, depth=1)
  Git-->>Prep: repo_root
  Prep->>Prep: discover BFCL_v4 JSON files (filter memory/web_search/format_sensitivity)
  loop per category file
    Prep->>Prep: load_dataset_entry(target_folder, category)
    Prep->>Utils: func_doc_language_specific_pre_processing + convert_to_tool
    Utils-->>Prep: tool descriptors
    Prep->>FS: write split dir, `__init__.py`, and `test.jsonl`
  end
sequenceDiagram
  autonumber
  participant Eval as Evaluator
  participant Score as bfcl_v4.bfcl_score
  participant Metrics as Metrics Store

  Note over Eval,Score: Scoring includes agentic (memory/web_search) & hallucination components

  Eval->>Score: compute_score(metrics)
  activate Score
  Score->>Score: validate GLOBAL_MAX_K
  Score->>Score: get_accuracy_dict(metrics, category, optional=True)
  Score->>Score: calculate_agentic_accuracy(metrics)
  Score->>Metrics: read MEMORY & WEB_SEARCH category metrics
  Metrics-->>Score: per-category accuracies
  Score->>Score: aggregate -> overall, agentic, hallucination sub-scores
  Score-->>Eval: bfcl_v4 result (overall + sub-scores)
  deactivate Score

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

I hop through files with eager paws,
Cloning repos and mapping laws.
Memory, web-search, tools in tune,
v4 splits sprout beneath the moon.
A rabbit cheers: prepare and run! 🐇✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 60.00%, which is insufficient; the required threshold is 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Description Check (✅ Passed): Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check (✅ Passed): The title "BFCLv4 support" directly and clearly summarizes the primary objective of this pull request, which consistently introduces support for Berkeley Function Call Leaderboard v4: new dataset modules, scoring functions, preparation scripts, web search and memory APIs, and infrastructure updates.
✨ Finishing touches
  • 📝 Generate docstrings
  • 🧪 Generate unit tests (beta)
    • Create PR with unit tests
    • Post copyable unit tests in a comment
    • Commit unit tests in branch mbien/bfcl_v4

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

Signed-off-by: Michal Bien <[email protected]>
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 94a7619 and 02d6300.

📒 Files selected for processing (6)
  • nemo_skills/dataset/bfcl_v4/__init__.py (1 hunks)
  • nemo_skills/dataset/bfcl_v4/bfcl_score.py (4 hunks)
  • nemo_skills/dataset/bfcl_v4/constants.py (2 hunks)
  • nemo_skills/dataset/bfcl_v4/prepare.py (7 hunks)
  • nemo_skills/dataset/bfcl_v4/utils.py (2 hunks)
  • nemo_skills/evaluation/evaluator/bfcl.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
nemo_skills/dataset/bfcl_v4/prepare.py (1)
nemo_skills/dataset/bfcl_v4/utils.py (7)
  • convert_to_tool (172-184)
  • func_doc_language_specific_pre_processing (83-134)
  • is_multi_turn (39-43)
  • load_file (62-71)
  • clean_up_memory_prereq_entries (187-217)
  • populate_initial_settings_for_memory_test_cases (220-239)
  • populate_initial_settings_for_web_search_test_cases (242-258)
🪛 Ruff (0.13.3)
nemo_skills/dataset/bfcl_v4/prepare.py

108-108: Undefined name test_cases_to_generate

(F821)


145-145: Starting a process with a partial executable path

(S607)


169-169: Using .strip() with multi-character strings is misleading

(B005)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: unit-tests
  • GitHub Check: pre-commit
🔇 Additional comments (15)
nemo_skills/evaluation/evaluator/bfcl.py (1)

44-44: LGTM - Consistent v3 to v4 migration.

The file naming and category extraction updates correctly align with BFCL v4 conventions across all three locations.

Also applies to: 51-51, 79-79

nemo_skills/dataset/bfcl_v4/__init__.py (2)

35-40: LGTM - New module exports added.

The new splits for web search, memory, and format sensitivity are properly integrated into the SPLITS list.


45-47: LGTM - Version references updated.

Both SCORE_MODULE and BENCHMARKS correctly reference bfcl_v4.

nemo_skills/dataset/bfcl_v4/utils.py (4)

27-43: LGTM - Helper predicates added.

The new category detection functions (is_memory, is_memory_prereq, is_web_search, is_multi_turn) provide clear, reusable predicates for identifying test categories.
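
As a rough sketch of what such predicates can look like (the function names come from the summary above; the string patterns are assumptions, not the PR's actual logic):

def is_memory(test_category: str) -> bool:
    # Assumed naming scheme: memory categories look like "memory_kv", "memory_vector", ...
    return test_category.startswith("memory")


def is_memory_prereq(test_entry_id: str) -> bool:
    # Assumed: prerequisite entries carry a "_prereq" marker in their id
    return "_prereq" in test_entry_id


def is_web_search(test_category: str) -> bool:
    return test_category.startswith("web_search")


def is_multi_turn(test_category: str) -> bool:
    return "multi_turn" in test_category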


46-59: LGTM - Test category extraction added.

The extract_test_category_from_id function properly handles both normal and format_sensitivity test entry IDs, with clear logic for removing the "_prereq" suffix when requested.
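
A minimal sketch of that kind of extraction, assuming ids of the form "<category>_<index>" (the real parsing rules in the PR may differ):

def extract_test_category_from_id(test_entry_id: str, remove_prereq: bool = True) -> str:
    category = test_entry_id.rsplit("_", 1)[0]  # drop the trailing numeric index
    if remove_prereq and category.endswith("_prereq"):
        category = category[: -len("_prereq")]
    return category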


187-217: LGTM - Memory prereq cleanup logic added.

The clean_up_memory_prereq_entries function correctly:

  1. Removes orphaned memory-prerequisite test cases
  2. Filters out already-generated entries from dependency lists

220-258: LGTM - Initial settings population added.

Both populate_initial_settings_for_memory_test_cases and populate_initial_settings_for_web_search_test_cases provide clear configuration injection for their respective test categories.

nemo_skills/dataset/bfcl_v4/prepare.py (2)

95-95: LGTM - Signature simplified.

Removing the model_type parameter simplifies the interface. Ensure downstream callers are updated accordingly.


169-169: LGTM - Correct prefix removal.

Using .lstrip("BFCL_v4_") correctly removes the prefix. The static analysis warning about multi-character strings is a false positive in this case, as the intent is to remove the exact prefix string.
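
For context on the B005 warning itself: str.lstrip treats its argument as a set of characters rather than a literal prefix, so it only behaves like prefix removal when the remaining text never starts with one of those characters; str.removeprefix (Python 3.9+) avoids the ambiguity. A small illustration, using one made-up split name:

"BFCL_v4_live_simple".lstrip("BFCL_v4_")        # -> "live_simple" (works: 'l' is not in the strip set)
"BFCL_v4_vector_db".lstrip("BFCL_v4_")          # -> "ector_db" (hypothetical name; 'v' is in the strip set)
"BFCL_v4_live_simple".removeprefix("BFCL_v4_")  # -> "live_simple" (unambiguous prefix removal)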

nemo_skills/dataset/bfcl_v4/bfcl_score.py (4)

51-62: LGTM - New agentic categories added.

The MEMORY, WEB_SEARCH, and FORMAT_SENSITIVITY constants properly define the new test categories for BFCL v4.


94-98: LGTM - Optional category lookup added.

The optional parameter allows graceful handling of missing categories like format_sensitivity, which may not always be present.


186-203: LGTM - Agentic accuracy calculation added.

The calculate_agentic_accuracy function correctly computes combined accuracy for memory and web search categories, with optional format_sensitivity inclusion.
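
A minimal sketch of that aggregation, assuming per-category accuracy dicts with "accuracy" and "num_entries" keys; the real calculate_agentic_accuracy may weight entries differently:

def agentic_accuracy_sketch(memory_acc, web_search_acc, format_sensitivity_acc=None):
    # Illustrative only: unweighted mean of memory and web-search accuracy,
    # with format_sensitivity reported when it was computed.
    result = {
        "memory": memory_acc,
        "web_search": web_search_acc,
        "overall_agentic": {
            "accuracy": (memory_acc["accuracy"] + web_search_acc["accuracy"]) / 2,
            "num_entries": memory_acc["num_entries"] + web_search_acc["num_entries"],
        },
    }
    if format_sensitivity_acc is not None:
        result["format_sensitivity"] = format_sensitivity_acc
    return result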


209-229: LGTM - Agentic metrics integrated.

The compute_score function correctly integrates agentic accuracy metrics (memory and web search) into the overall accuracy calculation and result structure.

nemo_skills/dataset/bfcl_v4/constants.py (2)

18-68: LGTM - Comprehensive category definitions.

The category constants are well-organized with clear derivations:

  • MEMORY_CATEGORY derived from ALL_AVAILABLE_MEMORY_BACKENDS
  • SINGLE_TURN_CATEGORY composed from NON_LIVE_CATEGORY and LIVE_CATEGORY
  • AGENTIC_CATEGORY combines MEMORY_CATEGORY and WEB_SEARCH_CATEGORY
  • Clear separation between scoring and non-scoring categories
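
A sketch of how such derived lists can be composed; the constant names match the summary above, but the member values here are made up:

ALL_AVAILABLE_MEMORY_BACKENDS = ["kv", "vector", "rec_sum"]         # assumed backends
MEMORY_CATEGORY = [f"memory_{backend}" for backend in ALL_AVAILABLE_MEMORY_BACKENDS]
WEB_SEARCH_CATEGORY = ["web_search_base", "web_search_no_snippet"]  # assumed members
NON_LIVE_CATEGORY = ["simple_python", "multiple", "parallel"]       # assumed members
LIVE_CATEGORY = ["live_simple", "live_multiple"]                    # assumed members

SINGLE_TURN_CATEGORY = NON_LIVE_CATEGORY + LIVE_CATEGORY
AGENTIC_CATEGORY = MEMORY_CATEGORY + WEB_SEARCH_CATEGORY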

70-97: LGTM - Test collection mapping updated.

The TEST_COLLECTION_MAPPING uses the new category constants consistently and adds new top-level groups ("memory", "web_search", "agentic") for v4.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
nemo_skills/dataset/bfcl_v4/bfcl_score.py (1)

231-232: Fix typo in dictionary key.

The key "format_sensistivity" contains a typo and should be "format_sensitivity" to match the constant and source key.

Apply this diff to fix the typo:

     if "format_sensitivity" in agentic_accuracy:
-        res["format_sensistivity"] = agentic_accuracy["format_sensitivity"]
+        res["format_sensitivity"] = agentic_accuracy["format_sensitivity"]
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 02d6300 and 9d99897.

📒 Files selected for processing (2)
  • nemo_skills/dataset/bfcl_v4/__init__.py (2 hunks)
  • nemo_skills/dataset/bfcl_v4/bfcl_score.py (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • nemo_skills/dataset/bfcl_v4/__init__.py
🧰 Additional context used
🪛 GitHub Actions: Lint and Format
nemo_skills/dataset/bfcl_v4/bfcl_score.py

[warning] 1-1: ruff-format reformatted this file during hooks. Please review formatting changes.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: unit-tests
🔇 Additional comments (2)
nemo_skills/dataset/bfcl_v4/bfcl_score.py (2)

51-63: LGTM! New agentic constants are well-defined.

The new constants for memory, web search, and format sensitivity categories align with the BFCL v4 update.


186-203: LGTM! Agentic accuracy calculation is sound.

The function correctly aggregates memory and web search accuracies, and conditionally includes format sensitivity. However, this depends on fixing the critical bug in get_accuracy_dict (lines 94-98) for the optional parameter to work correctly.

Comment on lines 94 to 98
 def get_accuracy_dict(metrics, category, optional=False):
     # reporting aggregation for pass@1[avg-of-k] (for highest k) if available
-    category_dict = metrics[f"bfcl_v3.{category}"]
+    if optional and f"bfcl_v4.{category}" not in metrics:
+        category_dict = {}
+    category_dict = metrics[f"bfcl_v4.{category}"]
Contributor

⚠️ Potential issue | 🔴 Critical

Critical bug: unconditional key access after optional check.

Line 98 unconditionally accesses metrics[f"bfcl_v4.{category}"] even when optional=True and the key is missing. This causes a KeyError instead of using the empty dict set on line 97.

Apply this diff to fix the logic:

 def get_accuracy_dict(metrics, category, optional=False):
     # reporting aggregation for pass@1[avg-of-k] (for highest k) if available
     if optional and f"bfcl_v4.{category}" not in metrics:
         category_dict = {}
-    category_dict = metrics[f"bfcl_v4.{category}"]
+    else:
+        category_dict = metrics[f"bfcl_v4.{category}"]
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-def get_accuracy_dict(metrics, category, optional=False):
-    # reporting aggregation for pass@1[avg-of-k] (for highest k) if available
-    category_dict = metrics[f"bfcl_v3.{category}"]
-    if optional and f"bfcl_v4.{category}" not in metrics:
-        category_dict = {}
-    category_dict = metrics[f"bfcl_v4.{category}"]
+def get_accuracy_dict(metrics, category, optional=False):
+    # reporting aggregation for pass@1[avg-of-k] (for highest k) if available
+    if optional and f"bfcl_v4.{category}" not in metrics:
+        category_dict = {}
+    else:
+        category_dict = metrics[f"bfcl_v4.{category}"]
🤖 Prompt for AI Agents
In nemo_skills/dataset/bfcl_v4/bfcl_score.py around lines 94 to 98, the function
sets category_dict = {} when optional is True and the metrics key is missing but
then unconditionally accesses metrics[f"bfcl_v4.{category}"] causing a KeyError;
update the control flow to only read metrics[...] when the key exists (e.g.,
return the empty dict immediately when optional is True and key missing, or use
an else branch to assign category_dict from metrics), ensuring no unconditional
key access occurs.

@shtoshni shtoshni marked this pull request as draft October 9, 2025 12:05
Glorf added 3 commits October 10, 2025 12:31
Signed-off-by: Michal Bien <[email protected]>
Update gorilla clone

Fix preparation code

Fixes for bfclv3 in v4 codebase

Fix multi-turn files position

Fixed pre-processing

Use BFCLv3 snapshot for legacy runs

Cleanup

Populate initial configs

Back to new structure for bfclv3

Structure the scores better

Better score compute for v4

Stabilize API

Custom web search without SerpAPI

Fixing the memory output parsing

Initialize prereqs

Fix parsing and scoring

Fix dataloading

Give full generated outputs for agentic eval

Signed-off-by: Michal Bien <[email protected]>
@Glorf Glorf changed the title from [DRAFT] BFCLv4 support to BFCLv4 support Oct 15, 2025
@Glorf Glorf marked this pull request as ready for review October 15, 2025 12:11
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 11

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
nemo_skills/inference/eval/bfcl.py (1)

169-177: Normalize client parser output for no-tool-call case to list-of-strings.

Avoid returning a string (later extend will split it). Keep generation as list.

-        except Exception:
-            generation = parsed_response["content"] if isinstance(parsed_response["content"], str) else ""
-            tool_call_ids = []
+        except (KeyError, TypeError, ValueError):
+            text = parsed_response["content"] if isinstance(parsed_response["content"], str) else ""
+            generation = [text] if text else []
+            tool_call_ids = []
♻️ Duplicate comments (2)
nemo_skills/dataset/bfcl_v4/bfcl_score.py (1)

95-101: Fix optional access to avoid KeyError when category missing.

Unconditionally indexing metrics[...] defeats the optional guard.

-def get_accuracy_dict(metrics, category, optional=False):
+def get_accuracy_dict(metrics, category, optional=False):
     # reporting aggregation for pass@1[avg-of-k] (for highest k) if available
-    if optional and f"bfcl_v4.{category}" not in metrics:
-        category_dict = {}
-    category_dict = metrics[f"bfcl_v4.{category}"]
+    key = f"bfcl_v4.{category}"
+    if optional and key not in metrics:
+        return {"accuracy": 0, "num_entries": 0}
+    category_dict = metrics[key]
nemo_skills/dataset/bfcl_v4/prepare.py (1)

155-157: Version pinning concern remains unresolved.

As noted in a previous review, shallow-cloning HEAD without version pinning risks future breakage if the upstream repository structure changes. This concern from the prior review has not been addressed in the current implementation.

Consider one of these approaches:

  1. Clone with an explicit tag: ["git", "clone", "--depth=1", "--branch", "v4", repo_url, temp_dir]
  2. Add a post-clone checkout: subprocess.run(["git", "-C", temp_dir, "checkout", KNOWN_COMMIT], check=True)

Store the commit hash or tag as a constant at the module level for easy updates.
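
A minimal sketch combining the two approaches (REPO_URL and PINNED_COMMIT are placeholders to be replaced with a verified revision; fetching a specific SHA with --depth=1 typically works against GitHub):

import subprocess

REPO_URL = "https://github.com/ShishirPatil/gorilla.git"
PINNED_COMMIT = "<known-good-commit-sha>"  # placeholder, pin to a verified revision


def clone_pinned(temp_dir: str) -> None:
    # Shallow clone, then pin to a known revision so upstream changes
    # cannot silently break the prepare pipeline.
    subprocess.run(["git", "clone", "--depth=1", REPO_URL, temp_dir], check=True)
    subprocess.run(["git", "-C", temp_dir, "fetch", "--depth=1", "origin", PINNED_COMMIT], check=True)
    subprocess.run(["git", "-C", temp_dir, "checkout", PINNED_COMMIT], check=True)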

🧹 Nitpick comments (8)
nemo_skills/inference/eval/bfcl.py (5)

403-404: Prevent accidental char-wise extend when generation is string.

Even with parser fixes, defensively normalize to list.

-                current_turn_response.extend(model_response["generation"])
+                gen = model_response["generation"]
+                current_turn_response.extend(gen if isinstance(gen, list) else [gen])

256-263: Use pathlib and robust key extraction for MemoryAPI path fix.

Avoid string replace and single-element list indexing; use pathlib and next(iter(...)).

-        for datapoint in data:
-            if "initial_config" in datapoint and list(datapoint["initial_config"].keys())[0].startswith("MemoryAPI"):
-                datapoint["initial_config"][list(datapoint["initial_config"].keys())[0]]["model_result_dir"] = self.cfg.output_file.replace("/output.jsonl", "")
+        for datapoint in data:
+            if "initial_config" in datapoint:
+                api_key = next(iter(datapoint["initial_config"]), None)
+                if api_key and api_key.startswith("MemoryAPI"):
+                    datapoint["initial_config"][api_key]["model_result_dir"] = str(Path(self.cfg.output_file).parent)

Add this import at the top of the file:

+from pathlib import Path

270-273: Avoid asyncio.run in environments with a running event loop.

This will raise at runtime (e.g., Jupyter/async runners). Consider refactoring to async load_data and invoking from the existing loop, or provide a helper that schedules coroutines when a loop is running.

Would you like a small refactor to make prereq processing async-safe?
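
One possible helper, sketched here as an assumption rather than a drop-in fix: run the coroutine directly when no loop is active, and fall back to a worker thread when one is.

import asyncio
import concurrent.futures


def run_coroutine_sync(coro):
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No running loop in this thread: asyncio.run is safe.
        return asyncio.run(coro)
    # A loop is already running (e.g. Jupyter): run the coroutine in a fresh
    # loop on a worker thread and block for the result.
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(asyncio.run, coro).result()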


430-436: Use iterator for first (and only) instance.

Satisfy RUF015 and avoid list materialization.

Based on static analysis hints

-                    memory_instance: "MemoryAPI" = list(involved_instances.values())[0]
+                    memory_instance: "MemoryAPI" = next(iter(involved_instances.values()))

451-455: Same iterator fix when flushing memory.

Based on static analysis hints

-                    memory_instance: "MemoryAPI" = list(involved_instances.values())[0]
+                    memory_instance: "MemoryAPI" = next(iter(involved_instances.values()))
nemo_skills/dataset/bfcl_v4/utils.py (1)

96-103: Guard nested type mapping to avoid KeyError on unseen types.

Use mapping.get(..., "string") for robustness; recurse safely.

-                properties[key]["items"]["type"] = mapping[properties[key]["items"]["type"]]
+                item_type = properties[key]["items"].get("type")
+                properties[key]["items"]["type"] = mapping.get(item_type, "string")
                 if properties[key]["items"]["type"] == "array" and "items" in properties[key]["items"]:
-                    properties[key]["items"]["items"]["type"] = mapping[properties[key]["items"]["items"]["type"]]
+                    nested_item_type = properties[key]["items"]["items"].get("type")
+                    properties[key]["items"]["items"]["type"] = mapping.get(nested_item_type, "string")
                 elif properties[key]["items"]["type"] == "object" and "properties" in properties[key]["items"]:
                     properties[key]["items"]["properties"] = _cast_to_openai_type(
                         properties[key]["items"]["properties"], mapping
                     )
nemo_skills/inference/eval/bfcl_web_search.py (1)

265-267: Narrow exception handling in fetch_url_content.

Catching broad Exception hides actionable errors; prefer requests.exceptions.RequestException.

-        except Exception as e:
+        except requests.exceptions.RequestException as e:
             return {"error": f"An error occurred while fetching {url}: {str(e)}"}
nemo_skills/dataset/bfcl_v4/prepare.py (1)

201-208: Remove unused function parameter.

The args parameter is not used in the function body. If the --model_type argument parsed on line 213 was intended for future use, consider removing both the argument parsing and the parameter until it's actually needed.

Apply this diff:

-def main(args):
+def main():
     LOG.warning(
         "Currently processing according to the OpenAI model style which works for most models, including Qwen/Llama-Nemotron/DeepSeek."
     )
 
     download_and_process_bfcl_data(
         REPO_URL, DATA_FOLDER_PATH, output_dir=os.path.join(os.path.dirname(__file__)),
     )

And remove the unused argument parsing:

 if __name__ == "__main__":
-    parser = argparse.ArgumentParser()
-    parser.add_argument("--model_type", type=str, default=None, required=False)
-    args = parser.parse_args()
-
-    main(args)
+    main()
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9d99897 and 5f5b3f8.

📒 Files selected for processing (15)
  • .gitignore (1 hunks)
  • dockerfiles/Dockerfile.nemo-skills (1 hunks)
  • nemo_skills/dataset/bfcl_v3/__init__.py (1 hunks)
  • nemo_skills/dataset/bfcl_v3/bfcl_score.py (3 hunks)
  • nemo_skills/dataset/bfcl_v3/constants.py (4 hunks)
  • nemo_skills/dataset/bfcl_v3/prepare.py (4 hunks)
  • nemo_skills/dataset/bfcl_v4/__init__.py (1 hunks)
  • nemo_skills/dataset/bfcl_v4/bfcl_score.py (1 hunks)
  • nemo_skills/dataset/bfcl_v4/constants.py (1 hunks)
  • nemo_skills/dataset/bfcl_v4/prepare.py (1 hunks)
  • nemo_skills/dataset/bfcl_v4/utils.py (1 hunks)
  • nemo_skills/evaluation/evaluator/bfcl.py (3 hunks)
  • nemo_skills/inference/eval/bfcl.py (6 hunks)
  • nemo_skills/inference/eval/bfcl_utils.py (3 hunks)
  • nemo_skills/inference/eval/bfcl_web_search.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (6)
nemo_skills/dataset/bfcl_v3/prepare.py (1)
nemo_skills/dataset/bfcl_v4/prepare.py (1)
  • download_and_process_bfcl_data (140-198)
nemo_skills/inference/eval/bfcl.py (2)
nemo_skills/inference/generate.py (3)
  • setup_prompt (365-378)
  • load_data (418-435)
  • process_single_datapoint (537-562)
nemo_skills/inference/eval/bfcl_utils.py (3)
  • convert_to_function_call (80-91)
  • is_empty_execute_response (184-189)
  • execute_multi_turn_func_call (94-181)
nemo_skills/evaluation/evaluator/bfcl.py (1)
nemo_skills/utils.py (2)
  • get_logger_name (130-134)
  • nested_dataclass (49-82)
nemo_skills/inference/eval/bfcl_utils.py (1)
nemo_skills/inference/eval/bfcl_web_search.py (1)
  • _load_scenario (42-45)
nemo_skills/dataset/bfcl_v4/prepare.py (4)
nemo_skills/dataset/bfcl_v4/utils.py (2)
  • func_doc_language_specific_pre_processing (17-68)
  • convert_to_tool (106-118)
nemo_skills/utils.py (1)
  • get_logger_name (130-134)
nemo_skills/dataset/bfcl_v3/utils.py (2)
  • is_multi_turn (51-55)
  • load_file (39-48)
nemo_skills/dataset/bfcl_v3/prepare.py (1)
  • download_and_process_bfcl_data (117-178)
nemo_skills/dataset/bfcl_v4/bfcl_score.py (1)
nemo_skills/dataset/bfcl_v3/bfcl_score.py (6)
  • calculate_combined_accuracy (52-74)
  • get_accuracy_dict (77-112)
  • calculate_non_live_single_turn_accuracy (115-137)
  • calculate_live_single_turn_accuracy (140-156)
  • calculate_multi_turn_accuracy (159-165)
  • compute_score (168-190)
🪛 GitHub Actions: Copyright check
nemo_skills/inference/eval/bfcl_web_search.py

[error] 1-1: Missing copyright notice in the first 10 lines of this Python file.

nemo_skills/dataset/bfcl_v4/utils.py

[error] 1-1: Missing copyright notice in the first 10 lines of this Python file.

🪛 GitHub Actions: Lint and Format
nemo_skills/dataset/bfcl_v3/prepare.py

[error] 1-1: Code adjustments via pre-commit (git clone invocation formatting) in prepare.py.

nemo_skills/inference/eval/bfcl.py

[error] 1-1: Code adjustments via pre-commit (imports cleanup) in bfcl.py.

nemo_skills/dataset/bfcl_v4/constants.py

[error] 1-1: Code style cleanup: fixed spacing around VERSION_PREFIX constant.

nemo_skills/inference/eval/bfcl_web_search.py

[error] 1-1: Code adjustments via pre-commit (import cleanup) in bfcl_web_search.py.

nemo_skills/dataset/bfcl_v4/utils.py

[error] 1-1: Ruff check noted missing newline at end of file; newline added during formatting.

nemo_skills/evaluation/evaluator/bfcl.py

[error] 1-1: Code style adjustments via pre-commit (imports/formatting) in bfcl.py.

nemo_skills/dataset/bfcl_v3/bfcl_score.py

[error] 1-1: Ruff check reformatted code in bfcl_score.py. 9 issues fixed across the worktree.


[error] 1-1: Code adjustments via pre-commit (dictionary literal formatting) in bfcl_score.py.

nemo_skills/inference/eval/bfcl_utils.py

[error] 1-1: Code adjustments via pre-commit (string literal formatting) in bfcl_utils.py.

nemo_skills/dataset/bfcl_v4/prepare.py

[error] 1-1: Code adjustments via pre-commit (imports/order and formatting) applied in bfcl_v4/prepare.py.


[error] 1-1: Code adjustments via pre-commit (path construction and calls) in prepare.py.

nemo_skills/dataset/bfcl_v4/bfcl_score.py

[error] 1-1: Ruff check reformatted code in bfcl_score.py. Several formatting changes applied.


[error] 1-1: Code adjustments via pre-commit (dictionary formatting) in bfcl_score.py.

🪛 Ruff (0.14.0)
nemo_skills/dataset/bfcl_v3/prepare.py

133-133: Starting a process with a partial executable path

(S607)


160-160: Using .strip() with multi-character strings is misleading

(B005)

nemo_skills/inference/eval/bfcl.py

219-219: Do not catch blind exception: Exception

(BLE001)


261-261: Prefer next(iter(datapoint["initial_config"].keys())) over single element slice

Replace with next(iter(datapoint["initial_config"].keys()))

(RUF015)


262-262: Prefer next(iter(datapoint["initial_config"].keys())) over single element slice

Replace with next(iter(datapoint["initial_config"].keys()))

(RUF015)


412-412: Do not catch blind exception: Exception

(BLE001)


430-430: Prefer next(iter(involved_instances.values())) over single element slice

Replace with next(iter(involved_instances.values()))

(RUF015)


453-453: Prefer next(iter(involved_instances.values())) over single element slice

Replace with next(iter(involved_instances.values()))

(RUF015)

nemo_skills/inference/eval/bfcl_web_search.py

38-38: Standard pseudo-random generators are not suitable for cryptographic purposes

(S311)


40-40: Standard pseudo-random generators are not suitable for cryptographic purposes

(S311)


42-42: Unused method argument: long_context

(ARG002)


140-140: Standard pseudo-random generators are not suitable for cryptographic purposes

(S311)


150-150: Standard pseudo-random generators are not suitable for cryptographic purposes

(S311)


163-163: Use explicit conversion flag

Replace with conversion flag

(RUF010)


169-169: Comment contains ambiguous – (EN DASH). Did you mean - (HYPHEN-MINUS)?

(RUF003)


210-210: Avoid specifying long messages outside the exception class

(TRY003)


263-263: Abstract raise to an inner function

(TRY301)


263-263: Avoid specifying long messages outside the exception class

(TRY003)


265-265: Do not catch blind exception: Exception

(BLE001)


266-266: Use explicit conversion flag

Replace with conversion flag

(RUF010)


270-270: Docstring contains ambiguous ‑ (NON-BREAKING HYPHEN). Did you mean - (HYPHEN-MINUS)?

(RUF002)

nemo_skills/dataset/bfcl_v4/prepare.py

155-155: subprocess call: check for execution of untrusted input

(S603)


156-156: Starting a process with a partial executable path

(S607)


164-166: Avoid specifying long messages outside the exception class

(TRY003)


197-197: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


198-198: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


201-201: Unused function argument: args

(ARG001)

nemo_skills/dataset/bfcl_v4/bfcl_score.py

126-130: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: unit-tests
🔇 Additional comments (20)
.gitignore (1)

42-42: LGTM!

The ignore pattern correctly excludes BFCL v4 dataset directories, consistent with the new BFCL v4 support added in this PR.

nemo_skills/dataset/bfcl_v3/__init__.py (1)

18-35: LGTM!

The split renaming (e.g., "simple" → "simple_python") and expansion with new live/multi-turn entries align with the BFCL v4 migration pattern and improve clarity.

nemo_skills/dataset/bfcl_v4/constants.py (1)

22-23: LGTM!

The Path constants are well-defined and provide centralized location references for BFCL v4 data handling.

nemo_skills/dataset/bfcl_v3/constants.py (1)

19-92: LGTM!

The systematic renaming of split identifiers (e.g., "simple" → "simple_python") across all test collection mappings is consistent with the BFCL v3 __init__.py updates.

nemo_skills/dataset/bfcl_v3/prepare.py (3)

117-117: LGTM!

Updating the default file_prefix to "BFCL_v4" aligns with the BFCL v4 migration pattern throughout the PR.


149-150: LGTM!

The filtering logic appropriately excludes "format_sensitivity", "memory", and "web_search" patterns from the BFCL v4 data processing, which aligns with the specialized handling of these categories introduced elsewhere in the PR.


133-133: Default branch “main” contains BFCL v4; removing -b v1.3 is safe.

nemo_skills/dataset/bfcl_v3/bfcl_score.py (3)

17-19: LGTM!

The updated SIMPLE_AST list entries align with the split renaming across BFCL v3 modules.


136-136: LGTM!

Renaming the key to "non_live_irrelevance" improves clarity and consistency with other output keys in the function.


183-189: LGTM!

Nesting the results under a "bfcl_v3" top-level key clearly differentiates BFCL v3 scores from BFCL v4 scores in the overall output structure.

nemo_skills/evaluation/evaluator/bfcl.py (4)

31-31: Verify the new default model configuration.

The default model was updated from "o3-mini-2025-01-31-FC" to "o4-mini-2025-04-16-FC". Ensure this model exists and is compatible with the BFCL evaluation pipeline.


46-46: LGTM!

Adding support for both "bfcl_v3." and "bfcl_v4." prefixes ensures backward compatibility while enabling BFCL v4 evaluation.


82-88: LGTM!

The updated file naming to use "BFCL_v4" and the addition of memory prerequisite filtering align with the BFCL v4 evaluation flow. The filtering logic correctly excludes memory prereq samples that are only needed during inference but not for evaluation scoring.


24-24: Ensure bfcl_eval.utils exports the imported utilities
Confirm that get_directory_structure_by_category and is_memory_prereq are present in the installed bfcl_eval package (and update the dependency version or import paths if not).

nemo_skills/inference/eval/bfcl_utils.py (4)

37-37: LGTM!

The Path import supports the new Path conversion logic added later in the file.


49-66: LGTM!

The introduction of BACKEND_PATH_PREFIX and the updated CLASS_FILE_PATH_MAPPING with f-strings provides a cleaner, more maintainable approach to constructing module paths. The addition of WebSearchAPI and MemoryAPI variants aligns with the BFCL v4 support being added.


117-118: LGTM!

The instance naming changes improve robustness:

  • Preserving the original class_name case maintains consistency
  • The regex substitution re.sub(r'[-./]', '_', instance_name) ensures the instance name is a valid Python identifier by replacing potentially problematic characters
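
A quick illustration of that sanitization (the input string is made up):

import re

instance_name = "WebSearchAPI.multi_turn-travel/booking_12"  # hypothetical example
safe_name = re.sub(r"[-./]", "_", instance_name)
# safe_name == "WebSearchAPI_multi_turn_travel_booking_12"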

125-129: LGTM!

The deep copy of class_initial_config prevents unintended mutations of the original configuration dictionary. Converting model_result_dir to a Path object before passing to _load_scenario ensures type consistency.

nemo_skills/dataset/bfcl_v4/__init__.py (1)

15-48: BFCL v4 dataset metadata looks good.

SPLITS, group flags, and SCORE_MODULE align with scoring code.

nemo_skills/dataset/bfcl_v4/prepare.py (1)

1-1: Address pre-commit formatting failures.

The pipeline indicates that pre-commit hooks will apply automatic adjustments for import ordering, formatting, and path construction. Ensure you run the pre-commit hooks locally or allow the automated fixes to be applied before merging.

Comment on lines 387 to 396
                if self.cfg.remove_thinking:
                    # If no tool calling was used, apply reasoning cleanup to both the message and generation
                    trimmed_response_text = self._remove_thinking_from_message_content(
                        self.message_parser.get_response_text(model_response["message"])
                    )
                    # If no tool calling was used, apply reasoning cleanup to both the message and generation
                    if model_response["message"].content == model_response["generation"]:
                        model_response["generation"] = [trimmed_response_text]

                    self.message_parser.set_response_text(model_response["message"], trimmed_response_text)
Contributor

⚠️ Potential issue | 🟠 Major

Fix remove_thinking gating; use tool-call presence, not equality check.

Equality between message content and generation is unreliable (types differ across parsers). Gate by tool-call absence; set generation to trimmed text list.

-                if self.cfg.remove_thinking:
-                    # If no tool calling was used, apply reasoning cleanup to both the message and generation
+                if self.cfg.remove_thinking:
                     trimmed_response_text = self._remove_thinking_from_message_content(
                         self.message_parser.get_response_text(model_response["message"])
                     )
-                    # If no tool calling was used, apply reasoning cleanup to both the message and generation
-                    if model_response["message"].content == model_response["generation"]:
-                        model_response["generation"] = [trimmed_response_text]
+                    # If no tool calling was used, apply reasoning cleanup to both the message and generation
+                    if not model_response.get("tool_calls"):
+                        model_response["generation"] = [trimmed_response_text]
 
                     self.message_parser.set_response_text(model_response["message"], trimmed_response_text)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-                if self.cfg.remove_thinking:
-                    # If no tool calling was used, apply reasoning cleanup to both the message and generation
-                    trimmed_response_text = self._remove_thinking_from_message_content(
-                        self.message_parser.get_response_text(model_response["message"])
-                    )
-                    # If no tool calling was used, apply reasoning cleanup to both the message and generation
-                    if model_response["message"].content == model_response["generation"]:
-                        model_response["generation"] = [trimmed_response_text]
-                    self.message_parser.set_response_text(model_response["message"], trimmed_response_text)
+                if self.cfg.remove_thinking:
+                    trimmed_response_text = self._remove_thinking_from_message_content(
+                        self.message_parser.get_response_text(model_response["message"])
+                    )
+                    # If no tool calling was used, apply reasoning cleanup to both the message and generation
+                    if not model_response.get("tool_calls"):
+                        model_response["generation"] = [trimmed_response_text]
+                    self.message_parser.set_response_text(model_response["message"], trimmed_response_text)
🤖 Prompt for AI Agents
In nemo_skills/inference/eval/bfcl.py around lines 387 to 396, the current
gating uses equality between message.content and generation to decide whether to
replace generation, which is unreliable across parsers; instead check for
absence of a tool call on the message (e.g., if
model_response["message"].get("tool_call") is falsy) and when no tool call is
present set model_response["generation"] = [trimmed_response_text] (ensuring it
becomes a list of the trimmed text), and still call
self.message_parser.set_response_text(...) as before.


 RUN git clone https://github.com/ShishirPatil/gorilla.git /opt/gorilla
-RUN cd /opt/gorilla && git checkout d2177992bbba9aa228b53c0645bf8f5613a5a7c6
+RUN cd /opt/gorilla && git checkout d85e9ad3a3009be3d609c5550799389a16599f88
Contributor

I think we want two checkouts. Maybe checkout in /opt/bfcl_v3/gorilla and /opt/bfcl_v4/gorilla.

Collaborator Author

Ok, let's do that!

Collaborator Author

Wait, actually, we cannot easily pip install both versions of bfcl_eval. I made the changes with a minimal set of modifications and am now testing whether the original BFCLv3 scores are reproducible; they should be, since neither the logic nor the datasets changed (v4 is a superset of v3).


 SPLITS = [
-    "simple",  # Simple function calls
+    "simple_python",  # Simple function calls
Contributor

Can you revert the changes made to bfcl_v3?

 TEST_COLLECTION_MAPPING = {
     "all": [
-        "simple",
+        "simple_python",
Contributor

Needs to be reverted

Contributor

All the changes in this file I mean

        non_live_ast_accuracy_list + [non_live_irrelevance_accuracy], weighted=False
    )

    return {
Contributor

Needs to be reverted



-def download_and_process_bfcl_data(repo_url, subfolder_path, output_dir, file_prefix="BFCL_v3", model_type="nemotron"):
+def download_and_process_bfcl_data(repo_url, subfolder_path, output_dir, file_prefix="BFCL_v4", model_type="nemotron"):
Contributor

This file also needs to be reverted

@@ -0,0 +1,118 @@
import json
Contributor

Is it the same as bfcl_v3/utils.py? If yes, you can just refer to that file, and avoid duplication.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
nemo_skills/evaluation/evaluator/bfcl.py (2)

63-67: Avoid shell=True; pass args and env explicitly.

Prevents shell injection via model/test_category and makes OPENAI_API_KEY scoping explicit.

Apply:

-            cmd = f"OPENAI_API_KEY=dummy bfcl evaluate --model {eval_config.model} --test-category {test_category}"
-
-            LOG.info(f"Running BFCL evaluation: {cmd}")
-            subprocess.run(cmd, shell=True, check=True, timeout=eval_config.timeout)
+            LOG.info("Running BFCL evaluation")
+            args = ["bfcl", "evaluate", "--model", eval_config.model, "--test-category", test_category]
+            env = {**os.environ, "OPENAI_API_KEY": "dummy"}
+            subprocess.run(args, check=True, timeout=eval_config.timeout, env=env)

Also add:

import os

near the imports at the top.


122-133: Merging with zip() drops lines and can misalign IDs; also marks unevaluated items as correct.

Because memory prereqs are filtered out upstream, bfcl_fmted_file has fewer rows than generation_file. zip() truncates and mislabels is_correct. Merge by id; set is_correct only for evaluated ids.

-        for gen_line, bfcl_line in zip(gen_f, bfcl_f):
-            gen_instance = json.loads(gen_line)
-            # Add the bfcl result to the generation instance
-            gen_instance.update(json.loads(bfcl_line))
-
-            if gen_instance["id"] in wrong_instance_ids:
-                gen_instance["is_correct"] = False
-            else:
-                gen_instance["is_correct"] = True
-
-            fout.write(json.dumps(gen_instance) + "\n")
+        # Index BFCL-evaluated entries by id
+        bfcl_by_id = {}
+        for bfcl_line in bfcl_f:
+            obj = json.loads(bfcl_line)
+            bfcl_by_id[obj.get("id")] = obj
+
+        for gen_line in gen_f:
+            gen_instance = json.loads(gen_line)
+            _id = gen_instance.get("id")
+
+            if _id in bfcl_by_id:
+                # Merge back evaluated fields
+                gen_instance.update(bfcl_by_id[_id])
+                gen_instance["is_correct"] = _id not in wrong_instance_ids
+            else:
+                # Not evaluated (e.g., memory prereq); preserve record and omit correctness
+                gen_instance.pop("is_correct", None)
+
+            fout.write(json.dumps(gen_instance) + "\n")
nemo_skills/inference/eval/bfcl.py (1)

202-219: Normalize client parser outputs; never return bare strings; narrow exceptions.

generation can be a bare string, causing .extend() to split into characters. Also catches broad Exception.

-        try:
-            generation = [
-                {func_call["name"]: json.dumps(func_call["arguments"])} for func_call in model_response["tool_calls"]
-            ]
-            tool_call_ids = [idx for idx in range(len(generation))]
-        except Exception:
-            generation = parsed_response["content"] if isinstance(parsed_response["content"], str) else ""
-            tool_call_ids = []
+        try:
+            tool_calls = model_response.get("tool_calls") or []
+            if tool_calls:
+                generation = [
+                    {tc["name"]: json.dumps(tc["arguments"])} for tc in tool_calls if isinstance(tc, dict)
+                ]
+                tool_call_ids = [str(i) for i in range(len(generation))]
+            else:
+                content = parsed_response.get("content", "")
+                generation = [content] if isinstance(content, str) else [""]
+                tool_call_ids = []
+        except (KeyError, TypeError, ValueError):
+            content = parsed_response.get("content", "")
+            generation = [content] if isinstance(content, str) else [""]
+            tool_call_ids = []
♻️ Duplicate comments (3)
nemo_skills/dataset/bfcl_v4/bfcl_score.py (1)

95-100: Fix optional handling to avoid KeyError and return a neutral metric when missing.

Current code sets an empty dict then unconditionally indexes metrics[…], causing KeyError and later lookup failures.

-def get_accuracy_dict(metrics, category, optional=False):
+def get_accuracy_dict(metrics, category, optional=False):
     # reporting aggregation for pass@1[avg-of-k] (for highest k) if available
-    if optional and f"bfcl_v4.{category}" not in metrics:
-        category_dict = {}
-    category_dict = metrics[f"bfcl_v4.{category}"]
+    key = f"bfcl_v4.{category}"
+    if key not in metrics:
+        if optional:
+            return {"accuracy": 0, "num_entries": 0}
+        raise KeyError(f"Missing metrics for category: {key}")
+    category_dict = metrics[key]
nemo_skills/inference/eval/bfcl.py (2)

442-451: Fix remove_thinking gating; key off tool-call absence, not equality.

-                if self.cfg.remove_thinking:
-                    # If no tool calling was used, apply reasoning cleanup to both the message and generation
+                if self.cfg.remove_thinking:
                     trimmed_response_text = self._remove_thinking_from_message_content(
                         self.message_parser.get_response_text(model_response["message"])
                     )
-                    # If no tool calling was used, apply reasoning cleanup to both the message and generation
-                    if model_response["message"].content == model_response["generation"]:
-                        model_response["generation"] = [trimmed_response_text]
+                    # If no tool calling was used, apply reasoning cleanup to both the message and generation
+                    if not model_response.get("tool_calls"):
+                        model_response["generation"] = [trimmed_response_text]
 
                     self.message_parser.set_response_text(model_response["message"], trimmed_response_text)

This avoids cross-parser type comparisons and ensures generation is a list. Based on learnings


248-259: Normalize server parser outputs; avoid string-vs-list shape bugs and narrow except.

-        try:
-            tool_calls = output_dict["message"].tool_calls
-            generation = [{func_call.function.name: func_call.function.arguments} for func_call in tool_calls]
-            tool_call_ids = [func_call.id for func_call in tool_calls]
-        except Exception:
-            tool_calls = []
-            generation = output_dict["message"].content
-            tool_call_ids = []
-
-        # Use model output if not a tool call
-        output_dict["generation"] = generation if generation else [output_dict["message"].content]
+        try:
+            tool_calls = output_dict["message"].tool_calls or []
+            if tool_calls:
+                generation = [{fc.function.name: fc.function.arguments} for fc in tool_calls]
+                tool_call_ids = [fc.id for fc in tool_calls]
+            else:
+                generation = [output_dict["message"].content]
+                tool_call_ids = []
+        except (AttributeError, KeyError, TypeError):
+            tool_calls = []
+            generation = [getattr(output_dict["message"], "content", "")]
+            tool_call_ids = []
+
+        # Always lists
+        output_dict["generation"] = generation

Ruff BLE001 suggests avoiding broad Exception; the diff applies that. Based on learnings

🧹 Nitpick comments (10)
nemo_skills/evaluation/evaluator/bfcl.py (2)

51-55: Hardcoded BFCL repo path; make it configurable.

Bakes in /opt/gorilla/... which breaks non-standard deployments. Move base path into BFCLEvaluatorConfig (or cfg) with a sane default.

Would introducing eval_config.bfcl_repo_root (defaulting to /opt/gorilla/berkeley-function-call-leaderboard) work for your runners?
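
A rough sketch of the suggested field, assuming a plain dataclass here for brevity (the real BFCLEvaluatorConfig uses the project's nested_dataclass decorator and has additional fields):

from dataclasses import dataclass


@dataclass
class BFCLEvaluatorConfig:
    model: str = "o4-mini-2025-04-16-FC"
    bfcl_repo_root: str = "/opt/gorilla/berkeley-function-call-leaderboard"  # proposed field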


45-47: Minor: update comment to v4.

The comment still references bfcl_v3 structure; this code supports both prefixes.

nemo_skills/dataset/bfcl_v4/bfcl_score.py (2)

121-130: TRY003: shorten or centralize long exception message.

Either compress the text or define a small custom exception with the message in the class docstring.
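
A hypothetical example of that pattern, with an assumed exception name: the explanation lives in the class docstring, and the raise site stays short.

class InconsistentMaxKError(ValueError):
    """Raised when pass@k aggregation sees differing max-k values across categories."""


def check_max_k(global_max_k, category_max_k):
    if global_max_k is not None and global_max_k != category_max_k:
        raise InconsistentMaxKError(f"{category_max_k} != {global_max_k}")
    return category_max_k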


58-64: Unused constant FORMAT_SENSITIVITY.

If not reported in v4 output, remove; otherwise, integrate it into agentic or separate reporting.

Is format_sensitivity still a tracked metric in your pipeline?

nemo_skills/inference/eval/bfcl.py (6)

52-57: Avoid hard dependency on bfcl_eval; make imports lazy/TYPE_CHECKING.

Top-level imports will break module import for users without bfcl_eval (earlier code intentionally delayed such imports). Move to local/lazy imports and keep only type hints under TYPE_CHECKING.

-from bfcl_eval.utils import is_memory_prereq, is_memory
-from bfcl_eval.model_handler.utils import add_memory_instruction_system_prompt
-from bfcl_eval.eval_checker.multi_turn_eval.func_source_code.memory_api_metaclass import (
-    MemoryAPI,
-)
+from typing import TYPE_CHECKING
+if TYPE_CHECKING:
+    from bfcl_eval.eval_checker.multi_turn_eval.func_source_code.memory_api_metaclass import MemoryAPI

Additionally, import where used:

  • Inside load_data():
+        from bfcl_eval.utils import is_memory_prereq  # local import to keep bfcl_eval optional
  • Inside _generate_single_data_point_multi_turn() before the memory block:
+        from bfcl_eval.utils import is_memory
+        from bfcl_eval.model_handler.utils import add_memory_instruction_system_prompt

288-307: Make path handling robust and simplify key access; optional: event-loop guard.

  • Use Path().parent instead of string replace.
  • Use next(iter(...)) instead of list(...)[0] (RUF015).
  • Confirm load_data will never run inside an active event loop before calling asyncio.run.
-        # First, fix the target paths to point to the actual target paths for memory stores
-        for datapoint in data:
-            if "initial_config" in datapoint and list(datapoint["initial_config"].keys())[0].startswith("MemoryAPI"):
-                datapoint["initial_config"][list(datapoint["initial_config"].keys())[0]]["model_result_dir"] = self.cfg.output_file.replace("/output.jsonl", "")
+        # First, fix the target paths to point to the actual target paths for memory stores
+        from pathlib import Path
+        for datapoint in data:
+            if "initial_config" in datapoint:
+                key = next(iter(datapoint["initial_config"]))
+                if key.startswith("MemoryAPI"):
+                    datapoint["initial_config"][key]["model_result_dir"] = str(Path(self.cfg.output_file).parent)

Please confirm this method is not invoked under a running event loop (e.g., Jupyter) since it calls asyncio.run.


386-408: Harden memory-instance setup; avoid assert in prod and use RUF015 pattern.

  • Replace assert with explicit check and error.
  • Use next(iter(...)) for instance extraction.
  • Optionally define memory_instance in outer scope for later flush.
-        if is_memory(test_category):
+        memory_instance = None
+        if is_memory(test_category):
             # Execute no function call, but just to get a reference to all the instances to get the initial state for logging purpose
             _, involved_instances = execute_multi_turn_func_call(
                 [],
                 initial_config,
                 involved_classes,
                 test_entry_id=test_entry_id,
                 long_context=("long_context" in test_category or "composite" in test_category),
             )
-
-            assert (
-                len(involved_instances) == 1
-            ), "Memory category should only involve one class."
-
-            memory_instance: "MemoryAPI" = list(involved_instances.values())[0]
+            if len(involved_instances) != 1:
+                raise ValueError("Memory category should only involve one class.")
+            memory_instance = next(iter(involved_instances.values()))
             data_point["question"] = add_memory_instruction_system_prompt(
                 data_point["question"],
                 test_category,
                 data_point["scenario"],
                 memory_instance,
             )

457-459: Defensive: ensure generation is a list before extend.

Protect against accidental bare strings.

-                current_turn_response.extend(model_response["generation"])
+                gens = model_response["generation"]
+                if not isinstance(gens, list):
+                    gens = [gens]
+                current_turn_response.extend(gens)

464-469: Narrow except when decoding tool calls; avoid BLE001; keep log level low.

Catch only expected parse errors from convert_to_function_call/json.

-                except Exception:
-                    LOG.info("No tools to execute in this turn. Proceed to next turn.")
+                except (json.JSONDecodeError, KeyError, TypeError, ValueError):
+                    LOG.debug("No tools to execute in this turn. Proceed to next turn.")
                     break

483-487: Ensure tool_call_id is a string for downstream compatibility.

Tokenizer/chat templates typically expect string IDs.

-                    tool_message = {
+                    tool_message = {
                         "role": "tool",
                         "content": execution_result,
-                        "tool_call_id": tool_call_id,
+                        "tool_call_id": str(tool_call_id),
                     }
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5f5b3f8 and 21f16f7.

📒 Files selected for processing (4)
  • dockerfiles/Dockerfile.nemo-skills (1 hunks)
  • nemo_skills/dataset/bfcl_v4/bfcl_score.py (1 hunks)
  • nemo_skills/evaluation/evaluator/bfcl.py (3 hunks)
  • nemo_skills/inference/eval/bfcl.py (7 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • dockerfiles/Dockerfile.nemo-skills
🧰 Additional context used
🧬 Code graph analysis (3)
nemo_skills/inference/eval/bfcl.py (2)
nemo_skills/inference/generate.py (3)
  • setup_prompt (368-381)
  • load_data (427-444)
  • process_single_datapoint (546-571)
nemo_skills/inference/eval/bfcl_utils.py (3)
  • execute_multi_turn_func_call (94-181)
  • convert_to_function_call (80-91)
  • is_empty_execute_response (184-189)
nemo_skills/evaluation/evaluator/bfcl.py (1)
nemo_skills/utils.py (2)
  • get_logger_name (131-135)
  • nested_dataclass (50-83)
nemo_skills/dataset/bfcl_v4/bfcl_score.py (1)
nemo_skills/dataset/bfcl_v3/bfcl_score.py (6)
  • calculate_combined_accuracy (52-74)
  • get_accuracy_dict (77-112)
  • calculate_non_live_single_turn_accuracy (115-137)
  • calculate_live_single_turn_accuracy (140-156)
  • calculate_multi_turn_accuracy (159-165)
  • compute_score (168-190)
🪛 Ruff (0.14.1)
nemo_skills/inference/eval/bfcl.py

252-252: Do not catch blind exception: Exception

(BLE001)


294-294: Prefer next(iter(datapoint["initial_config"].keys())) over single element slice

Replace with next(iter(datapoint["initial_config"].keys()))

(RUF015)


295-295: Prefer next(iter(datapoint["initial_config"].keys())) over single element slice

Replace with next(iter(datapoint["initial_config"].keys()))

(RUF015)


400-400: Prefer next(iter(involved_instances.values())) over single element slice

Replace with next(iter(involved_instances.values()))

(RUF015)


466-466: Do not catch blind exception: Exception

(BLE001)


511-511: Prefer next(iter(involved_instances.values())) over single element slice

Replace with next(iter(involved_instances.values()))

(RUF015)

nemo_skills/dataset/bfcl_v4/bfcl_score.py

126-130: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: pre-commit
  • GitHub Check: unit-tests
🔇 Additional comments (2)
nemo_skills/dataset/bfcl_v4/bfcl_score.py (1)

206-213: Good: overall uses numeric accuracies.
This fixes the prior bug of adding dicts to floats.

Please confirm the weights (0.4/0.3/0.1/0.1/0.1) match the latest BFCL v4 spec for your run.

nemo_skills/inference/eval/bfcl.py (1)

20-20: OK to add asyncio import.

No issues.

Comment on lines +199 to +205
def compute_score(metrics: dict):
non_live_single_turn_accuracy = calculate_non_live_single_turn_accuracy(metrics)
live_single_turn_accuracy = calculate_live_single_turn_accuracy(metrics)
multi_turn_accuracy = calculate_multi_turn_accuracy(metrics)
agentic_accuracy = calculate_agentic_accuracy(metrics)
hallucination_accuracy = calculate_hallucination_measurement(metrics)


⚠️ Potential issue | 🟠 Major

Reset GLOBAL_MAX_K per compute to avoid cross-call contamination.

Without a reset, state left over from prior calls can spuriously raise ValueError on later datasets.

 def compute_score(metrics: dict):
+    global GLOBAL_MAX_K
+    GLOBAL_MAX_K = None
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def compute_score(metrics: dict):
non_live_single_turn_accuracy = calculate_non_live_single_turn_accuracy(metrics)
live_single_turn_accuracy = calculate_live_single_turn_accuracy(metrics)
multi_turn_accuracy = calculate_multi_turn_accuracy(metrics)
agentic_accuracy = calculate_agentic_accuracy(metrics)
hallucination_accuracy = calculate_hallucination_measurement(metrics)
def compute_score(metrics: dict):
global GLOBAL_MAX_K
GLOBAL_MAX_K = None
non_live_single_turn_accuracy = calculate_non_live_single_turn_accuracy(metrics)
live_single_turn_accuracy = calculate_live_single_turn_accuracy(metrics)
multi_turn_accuracy = calculate_multi_turn_accuracy(metrics)
agentic_accuracy = calculate_agentic_accuracy(metrics)
hallucination_accuracy = calculate_hallucination_measurement(metrics)
🤖 Prompt for AI Agents
In nemo_skills/dataset/bfcl_v4/bfcl_score.py around lines 199 to 205, reset the
module-level GLOBAL_MAX_K at the start of compute_score to avoid cross-call
contamination; specifically, add a line that reinitializes GLOBAL_MAX_K to its
default (e.g., None or the original initial value) before calling the individual
calculate_* functions so each compute_score invocation starts with a clean
GLOBAL_MAX_K and cannot spuriously raise ValueError due to state left over from
previous calls.

Comment on lines +86 to 89
# Remove memory prereq from eval as those are only needed for inference
if is_memory_prereq(sample["id"]):
continue
if sample.get("result", None) is None:

⚠️ Potential issue | 🟡 Minor

Guard against missing ids when filtering memory prereqs.

If a line lacks "id", this raises KeyError. Use get() and skip safely.

-            if is_memory_prereq(sample["id"]):
+            sid = sample.get("id")
+            if sid is not None and is_memory_prereq(sid):
                 continue
🤖 Prompt for AI Agents
In nemo_skills/evaluation/evaluator/bfcl.py around lines 86 to 89, the code
accesses sample["id"] directly which can raise KeyError for samples missing an
"id"; change the check to safely retrieve the id first (e.g., id_val =
sample.get("id")) and only call is_memory_prereq when id_val is present,
skipping the sample otherwise (for example: if id_val and
is_memory_prereq(id_val): continue), keeping the existing check for
sample.get("result", None) afterward.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
nemo_skills/dataset/bfcl_v3/prepare.py (1)

168-169: Fix Path type passed to process_file to avoid TypeError.

process_multi_turn_test_case uses Path arithmetic ("/") with repo_root_dir; passing a str will fail. Pass a Path.

-                process_file(temp_dir, input_file, output_file, model_type=model_type)
+                process_file(Path(temp_dir), input_file, output_file, model_type=model_type)
♻️ Duplicate comments (1)
nemo_skills/dataset/bfcl_v4/utils.py (1)

15-21: Avoid duplication with bfcl_v3.utils where possible.

If v4 logic mirrors v3, consider reusing/importing shared helpers to reduce drift.

🧹 Nitpick comments (6)
nemo_skills/dataset/bfcl_v3/prepare.py (2)

160-160: Use removesuffix() or Path.stem for extension handling (safer than replace).

replace(".json","") can misfire if ".json" appears elsewhere. Prefer removesuffix (Py3.9+) or Path.stem.

-                split_dirname = os.path.join(output_dir, filename.removeprefix("BFCL_v4_").replace(".json", ""))
+                name = filename.removeprefix("BFCL_v4_").removesuffix(".json")
+                split_dirname = os.path.join(output_dir, name)

Alternatively:

-                split_dirname = os.path.join(output_dir, filename.removeprefix("BFCL_v4_").replace(".json", ""))
+                name = Path(filename).stem
+                name = name.removeprefix("BFCL_v4_")
+                split_dirname = os.path.join(output_dir, name)

176-179: Log full traceback on git failures.

Use logging.exception for context.

-        except subprocess.CalledProcessError as e:
-            LOG.error(f"Git command failed: {e}")
-            LOG.error("Make sure git is installed and the repository URL is correct")
+        except subprocess.CalledProcessError:
+            LOG.exception("Git command failed")
+            LOG.error("Make sure git is installed and the repository URL is correct")
nemo_skills/dataset/bfcl_v4/utils.py (1)

107-118: Guard against unknown nested item types in _cast_to_openai_type.

Direct dict indexing may KeyError for unseen types; fall back to "string".

-            elif "items" in properties[key]:
-                properties[key]["items"]["type"] = mapping[properties[key]["items"]["type"]]
+            elif "items" in properties[key]:
+                item_type = properties[key]["items"].get("type", "string")
+                properties[key]["items"]["type"] = mapping.get(item_type, "string")
                 if properties[key]["items"]["type"] == "array" and "items" in properties[key]["items"]:
-                    properties[key]["items"]["items"]["type"] = mapping[properties[key]["items"]["items"]["type"]]
+                    inner_type = properties[key]["items"]["items"].get("type", "string")
+                    properties[key]["items"]["items"]["type"] = mapping.get(inner_type, "string")
                 elif properties[key]["items"]["type"] == "object" and "properties" in properties[key]["items"]:
                     properties[key]["items"]["properties"] = _cast_to_openai_type(
                         properties[key]["items"]["properties"], mapping
                     )
nemo_skills/inference/eval/bfcl_web_search.py (2)

60-61: Use dict.get for config with a safe default.

Avoid KeyError when show_snippet is omitted.

-        self.show_snippet = initial_config["show_snippet"]
+        self.show_snippet = initial_config.get("show_snippet", True)

280-281: Narrow the exception in fetch_url_content and keep trace.

Catch requests exceptions, not bare Exception; include traceback via logging if available.

-        except Exception as e:
-            return {"error": f"An error occurred while fetching {url}: {str(e)}"}
+        except requests.RequestException as e:
+            return {"error": f"requests error while fetching {url}: {e!s}"}
nemo_skills/dataset/bfcl_v4/prepare.py (1)

195-197: Remove unnecessary f-string; keep traceback via exception().

Minor cleanup and better diagnostics.

-        except subprocess.CalledProcessError as e:
-            LOG.exception(f"Git command failed")
+        except subprocess.CalledProcessError:
+            LOG.exception("Git command failed")
             LOG.error("Make sure git is installed and the repository URL is correct")
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 21f16f7 and 01c8c34.

📒 Files selected for processing (6)
  • nemo_skills/dataset/bfcl_v3/prepare.py (4 hunks)
  • nemo_skills/dataset/bfcl_v4/__init__.py (1 hunks)
  • nemo_skills/dataset/bfcl_v4/constants.py (1 hunks)
  • nemo_skills/dataset/bfcl_v4/prepare.py (1 hunks)
  • nemo_skills/dataset/bfcl_v4/utils.py (1 hunks)
  • nemo_skills/inference/eval/bfcl_web_search.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • nemo_skills/dataset/bfcl_v4/constants.py
🧰 Additional context used
🧬 Code graph analysis (2)
nemo_skills/dataset/bfcl_v4/prepare.py (3)
nemo_skills/dataset/bfcl_v4/utils.py (2)
  • func_doc_language_specific_pre_processing (32-83)
  • convert_to_tool (121-133)
nemo_skills/utils.py (1)
  • get_logger_name (131-135)
nemo_skills/dataset/bfcl_v3/utils.py (2)
  • is_multi_turn (51-55)
  • load_file (39-48)
nemo_skills/dataset/bfcl_v3/prepare.py (1)
nemo_skills/dataset/bfcl_v4/prepare.py (1)
  • download_and_process_bfcl_data (140-197)
🪛 Ruff (0.14.1)
nemo_skills/dataset/bfcl_v4/prepare.py

154-154: subprocess call: check for execution of untrusted input

(S603)


155-155: Starting a process with a partial executable path

(S607)


163-165: Avoid specifying long messages outside the exception class

(TRY003)


195-195: Local variable e is assigned to but never used

Remove assignment to unused variable e

(F841)


196-196: f-string without any placeholders

Remove extraneous f prefix

(F541)


197-197: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

nemo_skills/dataset/bfcl_v3/prepare.py

133-133: Starting a process with a partial executable path

(S607)

nemo_skills/inference/eval/bfcl_web_search.py

53-53: Standard pseudo-random generators are not suitable for cryptographic purposes

(S311)


55-55: Standard pseudo-random generators are not suitable for cryptographic purposes

(S311)


57-57: Unused method argument: long_context

(ARG002)


155-155: Standard pseudo-random generators are not suitable for cryptographic purposes

(S311)


165-165: Standard pseudo-random generators are not suitable for cryptographic purposes

(S311)


178-178: Use explicit conversion flag

Replace with conversion flag

(RUF010)


184-184: Comment contains ambiguous (EN DASH). Did you mean - (HYPHEN-MINUS)?

(RUF003)


225-225: Avoid specifying long messages outside the exception class

(TRY003)


278-278: Abstract raise to an inner function

(TRY301)


278-278: Avoid specifying long messages outside the exception class

(TRY003)


280-280: Do not catch blind exception: Exception

(BLE001)


281-281: Use explicit conversion flag

Replace with conversion flag

(RUF010)


285-285: Docstring contains ambiguous (NON-BREAKING HYPHEN). Did you mean - (HYPHEN-MINUS)?

(RUF002)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: pre-commit
  • GitHub Check: unit-tests
🔇 Additional comments (3)
nemo_skills/dataset/bfcl_v3/prepare.py (1)

117-117: Confirm intent: modifying bfcl_v3 prepare to process v4 files.

Defaulting file_prefix to "BFCL_v4" inside bfcl_v3/prepare.py blurs version boundaries. If unintentional, revert to v3 defaults and move v4 logic to bfcl_v4. If intentional, please confirm downstream callers expect v4 outputs here.

nemo_skills/dataset/bfcl_v4/__init__.py (1)

15-48: BFCLv4 dataset manifest looks good; ensure split parity with evaluator.

Constants and BENCHMARKS map are fine. Please verify SPLITS matches bfcl_eval.constants.category_mapping.ALL_SCORING_CATEGORIES to avoid drift with prepare/evaluator.
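
One lightweight way to catch drift early is a parity test. A sketch only, assuming SPLITS is an iterable of split names and that bfcl_eval exposes ALL_SCORING_CATEGORIES at the path referenced above:

```python
def test_bfcl_v4_splits_match_bfcl_eval_categories():
    # Both import paths are assumptions based on the names mentioned in this review.
    from bfcl_eval.constants.category_mapping import ALL_SCORING_CATEGORIES
    from nemo_skills.dataset.bfcl_v4 import SPLITS

    missing = set(ALL_SCORING_CATEGORIES) - set(SPLITS)
    extra = set(SPLITS) - set(ALL_SCORING_CATEGORIES)
    assert not missing and not extra, f"SPLITS drift: missing={missing}, extra={extra}"
```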

nemo_skills/dataset/bfcl_v4/prepare.py (1)

154-156: Review is based on incorrect assumption; bfcl_v3 does not implement the suggested pattern.

The review claims the fix follows "same rationale as bfcl_v3," but inspection of nemo_skills/dataset/bfcl_v3/prepare.py (line 133) shows it contains only git clone --depth=1 without any subsequent git checkout. The suggested pattern does not exist as a reference implementation. Additionally, the proposed diff uses PLACEHOLDER_SHA rather than an actual commit identifier, and there is no documented pinning strategy or specific commit/tag identified in the codebase for either dataset version.

Likely an incorrect or invalid review comment.

-            # v1.3 corresponds the release version for BFCL v3
             subprocess.run(
-                ["git", "clone", "-b", "v1.3", "--depth=1", repo_url, temp_dir], check=True, capture_output=True
+                ["git", "clone", "--depth=1", repo_url, temp_dir], check=True, capture_output=True

🛠️ Refactor suggestion | 🟠 Major

🧩 Analysis chain

Pin the BFCL repo clone to a known commit/tag; avoid floating HEAD.

Shallow clone of HEAD risks future breakage. Pin via branch/tag or checkout a SHA; also consider resolving git via shutil.which to appease S607.

-            subprocess.run(
-                ["git", "clone", "--depth=1", repo_url, temp_dir], check=True, capture_output=True
-            )
+            subprocess.run(["git", "clone", "--depth=1", repo_url, temp_dir], check=True, capture_output=True)
+            # Optional: pin to a known commit (replace PLACEHOLDER_SHA)
+            subprocess.run(["git", "-C", temp_dir, "checkout", "PLACEHOLDER_SHA"], check=True, capture_output=True)

If you prefer a tag/branch:

-            subprocess.run(["git", "clone", "--depth=1", repo_url, temp_dir], check=True, capture_output=True)
+            subprocess.run(["git", "clone", "--depth=1", "--branch", "v1.3", repo_url, temp_dir], check=True, capture_output=True)


---

🏁 Script executed:

```shell
#!/bin/bash
# Examine the actual code at prepare.py around line 133
rg -n "git.*clone.*depth=1" --type py -C 10

Length of output: 3814


🏁 Script executed:

```shell
#!/bin/bash
# Search for BFCL version/commit references in the codebase
rg -i "bfcl.*commit|bfcl.*sha|bfcl.*tag|bfcl.*version" --type py --type yaml --type json --type toml
```

Length of output: 0


🏁 Script executed:

```shell
#!/bin/bash
# Look for repo_url definition and how it's used in prepare.py
rg -n "repo_url" nemo_skills/dataset/bfcl_v3/prepare.py -B 5 -A 5
```

Length of output: 1713


🏁 Script executed:

```shell
#!/bin/bash
# Check if there's any documentation or configuration for BFCL in the project
fd -t f -e md -e txt -e yaml -e json -e toml | xargs rg -l -i "bfcl" 2>/dev/null | head -20
```

Length of output: 183



---

🏁 Script executed:

```shell
#!/bin/bash
# Find all call sites of download_and_process_bfcl_data
rg -n "download_and_process_bfcl_data" --type py -B 3 -A 3

Length of output: 2226


🏁 Script executed:

```shell
#!/bin/bash
# Search for where repo URLs or BFCL repo references are defined
rg -n "github.com.*bfcl|https.*bfcl" --type py --type yaml --type json
```

Length of output: 1039


🏁 Script executed:

```shell
#!/bin/bash
# Check if shutil.which is used elsewhere for subprocess security
rg -n "shutil.which|S607" --type py -A 2
```

Length of output: 266


🏁 Script executed:

```shell
#!/bin/bash
# Look for main entry points and argument parsing
fd -t f "main.py|cli.py|__main__.py" --type py | xargs rg -l "bfcl" 2>/dev/null
```

Length of output: 1190



---

🏁 Script executed:

```shell
#!/bin/bash
# Find REPO_URL definition in bfcl_v3/prepare.py
rg -n "REPO_URL\s*=" nemo_skills/dataset/bfcl_v3/prepare.py -A 2 -B 2

Length of output: 140


🏁 Script executed:

```shell
#!/bin/bash
# Check all references to REPO_URL in both bfcl_v3 and bfcl_v4 prepare files
rg -n "REPO_URL" nemo_skills/dataset/bfcl_v3/prepare.py nemo_skills/dataset/bfcl_v4/prepare.py
```

Length of output: 486


🏁 Script executed:

```shell
#!/bin/bash
# Examine context around subprocess.run in prepare.py to understand if git is pre-checked
rg -n "subprocess.run" nemo_skills/dataset/bfcl_v3/prepare.py -B 5 -A 2
```

Length of output: 364


Apply S607 security check and pin BFCL repository to a stable commit.

The shallow clone without pinning creates two addressable risks: (1) subprocess call without validating git availability violates S607 (Bandit security rule already applied in ojbench/prepare.py); (2) floating HEAD on shallow clone breaks if upstream repo structure changes.

Both issues are fixable. For (1), add shutil.which("git") validation before the subprocess call. For (2), either pin to a commit via git -C checkout after clone, or use --branch during clone (requires knowing a stable branch name from the upstream repo).

Consider implementing both (a sketch follows the list):

  • Validate git executable is available
  • Pin to a known commit SHA (preferred for reproducibility) or stable tag/branch
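
For illustration only, a sketch combining both steps; BFCL_PIN is a placeholder, since no specific commit or tag is identified in this PR:

```python
import shutil
import subprocess

# Placeholder: replace with a known-good upstream tag or branch.
BFCL_PIN = "PLACEHOLDER_TAG_OR_BRANCH"


def clone_bfcl_repo(repo_url: str, temp_dir: str) -> None:
    git = shutil.which("git")
    if git is None:
        raise RuntimeError("git executable not found on PATH; install git to prepare BFCL data")
    # Shallow clone of a pinned ref instead of floating HEAD.
    subprocess.run(
        [git, "clone", "--depth=1", "--branch", BFCL_PIN, repo_url, temp_dir],
        check=True,
        capture_output=True,
    )
```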
🧰 Tools
🪛 Ruff (0.14.1)

133-133: Starting a process with a partial executable path

(S607)

🤖 Prompt for AI Agents
In nemo_skills/dataset/bfcl_v3/prepare.py around line 133, the subprocess git
clone call does not validate git is available and performs a shallow clone
without pinning, risking S607 bandit failure and a floating HEAD; fix by first
checking shutil.which("git") and raising a clear exception if not found, then
perform the clone and immediately pin to a stable commit or tag (preferred:
define a BFCL_COMMIT_SHA constant and run a git -C <temp_dir> checkout
<BFCL_COMMIT_SHA> after clone) or alternatively use --branch <stable-branch> in
the clone command; ensure subprocess calls use check=True and capture_output for
errors and include helpful error messages when git commands fail.

Comment on lines +86 to +91
"""
This function retrieves the dataset entry for a given test category.
The input should not be a test category goup, but a specific test category.
If `contain_prereq` is True, it will include the pre-requisite entries for the memory test categories.
If `include_language_specific_hint` is True, it will include the language-specific hint for the function description (for Java, JavaScript, and Python).
"""

⚠️ Potential issue | 🟡 Minor

Docstring parameter typo.

The text says contain_prereq; the parameter is include_prereq.

-    If `contain_prereq` is True, it will include the pre-requisite entries for the memory test categories.
+    If `include_prereq` is True, it will include the pre-requisite entries for the memory test categories.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
"""
This function retrieves the dataset entry for a given test category.
The input should not be a test category goup, but a specific test category.
If `contain_prereq` is True, it will include the pre-requisite entries for the memory test categories.
If `include_language_specific_hint` is True, it will include the language-specific hint for the function description (for Java, JavaScript, and Python).
"""
"""
This function retrieves the dataset entry for a given test category.
The input should not be a test category goup, but a specific test category.
If `include_prereq` is True, it will include the pre-requisite entries for the memory test categories.
If `include_language_specific_hint` is True, it will include the language-specific hint for the function description (for Java, JavaScript, and Python).
"""
🤖 Prompt for AI Agents
nemo_skills/dataset/bfcl_v4/prepare.py around lines 86 to 91: the docstring
incorrectly references the parameter name `contain_prereq`; update it to
`include_prereq` so the description matches the actual function signature, and
scan the docstring for any other mismatched parameter names (e.g.,
`include_language_specific_hint`) and correct them for consistency.

Comment on lines +120 to +136
if include_language_specific_hint:
all_entries = add_language_specific_hint_to_function_doc(all_entries)

all_entries = populate_initial_settings_for_memory_test_cases(
all_entries, str(target_folder)
)
all_entries = populate_initial_settings_for_web_search_test_cases(
all_entries
)

# Convert function calls to tools format and add them to the system prompt
for instance in all_entries:
if "function" in instance:
# Add the tools to the system prompt
instance["function"] = func_doc_language_specific_pre_processing(instance["function"], test_category)
instance["tools"] = convert_to_tool(instance["function"])


⚠️ Potential issue | 🟠 Major

Avoid double-adding language hints to function descriptions.

Both add_language_specific_hint_to_function_doc and func_doc_language_specific_pre_processing append hints, leading to duplicates. Keep one source of truth.

-    if include_language_specific_hint:
-        all_entries = add_language_specific_hint_to_function_doc(all_entries)
+    if include_language_specific_hint:
+        # Defer language hinting to func_doc_language_specific_pre_processing to avoid duplication.
+        pass

Or drop the description-append in func_doc_language_specific_pre_processing and leave casting only. Pick one consistently.

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In nemo_skills/dataset/bfcl_v4/prepare.py around lines 120-136, the language
hint is being appended twice because add_language_specific_hint_to_function_doc
is called earlier and func_doc_language_specific_pre_processing also appends
hints; to fix, remove the hint-appending behavior from
func_doc_language_specific_pre_processing so it only normalizes/casts the
function doc (or alternatively skip calling
add_language_specific_hint_to_function_doc here), and ensure only one place is
responsible for adding the language-specific hint to function descriptions to
prevent duplicates.

Comment on lines +23 to +30
def _get_language_specific_hint(test_category):
if test_category == "java":
return " Note that the provided function is in Java 8 SDK syntax."
elif test_category == "javascript":
return " Note that the provided function is in JavaScript syntax."
else:
return " Note that the provided function is in Python 3 syntax."


⚠️ Potential issue | 🟠 Major

Language detection will misclassify categories (e.g., simple_java) as Python.

test_category equals split names like "simple_java"/"simple_javascript", not bare "java"/"javascript". Hints and type coercion won’t apply.

-def _get_language_specific_hint(test_category):
-    if test_category == "java":
+def _detect_lang(test_category: str) -> str:
+    t = test_category.lower()
+    if "javascript" in t:
+        return "javascript"
+    if "java" in t:
+        return "java"
+    return "python"
+
+def _get_language_specific_hint(test_category):
+    lang = _detect_lang(test_category)
+    if lang == "java":
         return " Note that the provided function is in Java 8 SDK syntax."
-    elif test_category == "javascript":
+    elif lang == "javascript":
         return " Note that the provided function is in JavaScript syntax."
     else:
         return " Note that the provided function is in Python 3 syntax."
@@
-def func_doc_language_specific_pre_processing(function, test_category):
+def func_doc_language_specific_pre_processing(function, test_category):
+    lang = _detect_lang(test_category)
@@
-        if test_category == "java":
+        if lang == "java":
@@
-        elif test_category == "javascript":
+        elif lang == "javascript":

Also applies to: 32-83

🤖 Prompt for AI Agents
In nemo_skills/dataset/bfcl_v4/utils.py around lines 23 to 30 (and similarly
lines 32-83), the function compares test_category to exact strings
("java"/"javascript") but actual categories are named like "simple_java" or
"simple_javascript", so detections fall through to Python; update the
conditional logic to detect language tokens within the category string (e.g.,
check for "java" and "javascript" substrings or use a regex/startswith pattern)
and return the appropriate language hint; apply the same substring-based
detection wherever test_category is used in lines 32-83 to ensure correct hints
and type coercion for categories like "simple_java" and "simple_javascript".

Comment on lines +150 to +186
backoff = 10 # initial back-off in seconds

# Infinite retry loop with exponential backoff
while True:
try:
wait_time = backoff + random.uniform(0, backoff)
search_results = DDGS(timeout=60).text(
query=keywords,
region=region,
max_results=max_results,
backend="duckduckgo"
)

except DDGSException as e:
if "No results found" in str(e):
wait_time = backoff + random.uniform(0, backoff)
error_block = (
"*" * 100
+ f"\n❗️❗️ [WebSearchAPI] Hit rate limit on DuckDuckGo requests. This is a common behaviour. If unable to run eval due to repeated rate limits, try to decrease job parallelism. Retrying in {wait_time:.1f} seconds…"
+ "*" * 100
)
print(error_block)
time.sleep(wait_time)
backoff = min(backoff * 2, 120) # cap the back-off
continue
else:
error_block = (
"*" * 100
+ f"\n❗️❗️ [WebSearchAPI] Error from DuckDuckGo: {str(e)}. This is not a rate-limit error, so it will not be retried."
+ "*" * 100
)
print(error_block)
return {"error": str(e)}

break # Success – no rate-limit error detected

if not search_results:

⚠️ Potential issue | 🟠 Major

Make retries finite and handle DDGS generator/results correctly.

  • DDGS.text may return a generator; slicing and truthiness checks fail.
  • Infinite loop risks hanging; handle Ratelimit/Timeout explicitly and cap retries.
-        backoff = 10  # initial back-off in seconds
-
-        # Infinite retry loop with exponential backoff
-        while True:
+        backoff = 10  # initial back-off in seconds
+        max_retries = 6
+        attempts = 0
+        while attempts < max_retries:
             try:
                 wait_time = backoff + random.uniform(0, backoff)
-                search_results = DDGS(timeout=60).text(
+                search_results = list(DDGS(timeout=60).text(
                     query=keywords,
                     region=region,
                     max_results=max_results,
                     backend="duckduckgo"
-                )
+                ))
 
-            except DDGSException as e:
-                if "No results found" in str(e):
-                    wait_time = backoff + random.uniform(0, backoff)
-                    error_block = (
-                        "*" * 100
-                        + f"\n❗️❗️ [WebSearchAPI] Hit rate limit on DuckDuckGo requests. This is a common behaviour. If unable to run eval due to repeated rate limits, try to decrease job parallelism. Retrying in {wait_time:.1f} seconds…"
-                        + "*" * 100
-                    )
-                    print(error_block)
-                    time.sleep(wait_time)
-                    backoff = min(backoff * 2, 120)  # cap the back-off
-                    continue
-                else:
-                    error_block = (
-                        "*" * 100
-                        + f"\n❗️❗️ [WebSearchAPI] Error from DuckDuckGo: {str(e)}. This is not a rate-limit error, so it will not be retried."
-                        + "*" * 100
-                    )
-                    print(error_block)
-                return {"error": str(e)}
+            except (RatelimitException, TimeoutException) as e:
+                attempts += 1
+                print("*" * 100 + f"\n[WebSearchAPI] Rate limited/timed out. Retrying in {wait_time:.1f}s ({attempts}/{max_retries})..." + "*" * 100)
+                time.sleep(wait_time)
+                backoff = min(backoff * 2, 120)
+                continue
+            except DDGSException as e:
+                print("*" * 100 + f"\n[WebSearchAPI] DuckDuckGo error: {e!s}. Not retriable." + "*" * 100)
+                return {"error": str(e)}
             break  # Success – no rate-limit error detected
+        if attempts >= max_retries:
+            return {"error": "Exceeded max retries while querying DuckDuckGo due to rate limits/timeouts."}
 
-        if not search_results:
+        if not search_results:
             return {
                 "error": "Failed to retrieve the search results from server. Please try again later."
             }
 
         # Convert the search results to the desired format
-        results = []
-        for result in search_results[:max_results]:
+        results = []
+        for result in search_results[:max_results]:
             if self.show_snippet:
                 results.append(
                     {
                         "title": result["title"],
                         "href": result["href"],
                         "body": result["body"],
                     }
                 )
             else:
                 results.append(
                     {
                         "title": result["title"],
                         "href": result["href"],
                     }
                 )
 
         return results

Also applies to: 191-210

🧰 Tools
🪛 Ruff (0.14.1)

155-155: Standard pseudo-random generators are not suitable for cryptographic purposes

(S311)


165-165: Standard pseudo-random generators are not suitable for cryptographic purposes

(S311)


178-178: Use explicit conversion flag

Replace with conversion flag

(RUF010)


184-184: Comment contains ambiguous (EN DASH). Did you mean - (HYPHEN-MINUS)?

(RUF003)

🤖 Prompt for AI Agents
In nemo_skills/inference/eval/bfcl_web_search.py around lines 150-186 (and
similarly 191-210), the code uses an infinite retry loop and treats DDGS.text as
if it were directly truthy/sliceable; change to a finite retry loop (e.g.,
max_retries with attempt counter), explicitly catch DDGSException and timeout
errors, and on each try convert the DDGS.text result to a list (results =
list(DDGS(...).text(...))) before checking emptiness or slicing; when
rate-limit/timeout errors occur, log and sleep with exponential backoff capped
(e.g., backoff = min(backoff*2, 120)) and retry until max_retries is reached,
then return a clear error; for non-retryable errors return immediately; ensure
checks use if not results rather than truthiness of the generator.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
nemo_skills/inference/eval/bfcl_web_search.py (1)

152-209: CRITICAL: Past review comments on infinite retry loop and generator handling were NOT addressed.

The issues flagged in the previous review are still present:

  1. Infinite loop (line 152): while True: with no max retry limit can hang indefinitely under persistent rate-limiting.
  2. Generator handling (line 155): DDGS().text() returns a generator, not a list.
  3. Truthiness check fails (line 185): if not search_results: will not work correctly on a generator (generators are always truthy until exhausted).
  4. Slicing fails (line 192): search_results[:max_results] will raise TypeError because generators don't support slicing.

These are runtime blockers that will cause failures in production.

The previous review provided a comprehensive fix. Please apply it:

-        backoff = 10  # initial back-off in seconds
-
-        # Infinite retry loop with exponential backoff
-        while True:
+        backoff = 10  # initial back-off in seconds
+        max_retries = 6
+        attempts = 0
+        
+        while attempts < max_retries:
             try:
                 wait_time = backoff + random.uniform(0, backoff)
-                search_results = DDGS(timeout=60).text(
+                search_results = list(DDGS(timeout=60).text(
                     query=keywords,
                     region=region,
                     max_results=max_results,
                     backend="duckduckgo"
-                )
+                ))
 
             except DDGSException as e:
                 if "No results found" in str(e):
+                    attempts += 1
                     wait_time = backoff + random.uniform(0, backoff)
                     error_block = (
                         "*" * 100
-                        + f"\n❗️❗️ [WebSearchAPI] Hit rate limit on DuckDuckGo requests. This is a common behaviour. If unable to run eval due to repeated rate limits, try to decrease job parallelism. Retrying in {wait_time:.1f} seconds…"
+                        + f"\n❗️❗️ [WebSearchAPI] Hit rate limit on DuckDuckGo requests ({attempts}/{max_retries}). This is a common behaviour. If unable to run eval due to repeated rate limits, try to decrease job parallelism. Retrying in {wait_time:.1f} seconds..."
                         + "*" * 100
                     )
                     print(error_block)
                     time.sleep(wait_time)
                     backoff = min(backoff * 2, 120)  # cap the back-off
                     continue
                 else:
                     error_block = (
                         "*" * 100
                         + f"\n❗️❗️ [WebSearchAPI] Error from DuckDuckGo: {str(e)}. This is not a rate-limit error, so it will not be retried."
                         + "*" * 100
                     )
                     print(error_block)
                 return {"error": str(e)}
 
-            break  # Success – no rate-limit error detected
+            break  # Success - no rate-limit error detected
+        
+        if attempts >= max_retries:
+            return {"error": "Exceeded max retries while querying DuckDuckGo due to rate limits."}
 
         if not search_results:
             return {
                 "error": "Failed to retrieve the search results from server. Please try again later."
             }
 
         # Convert the search results to the desired format
         results = []
         for result in search_results[:max_results]:
             if self.show_snippet:
                 results.append(
                     {
                         "title": result["title"],
                         "href": result["href"],
                         "body": result["body"],
                     }
                 )
             else:
                 results.append(
                     {
                         "title": result["title"],
                         "href": result["href"],
                     }
                 )
 
         return results
🧹 Nitpick comments (2)
requirements/main.txt (1)

40-40: Verify CPU-only FAISS variant is intentional.

The addition of faiss-cpu suggests CPU-only deployment. If GPU acceleration is available or expected in production, faiss-gpu or the base faiss package may be more appropriate. Confirm this choice aligns with deployment targets.

Please clarify: Is the CPU-only variant intentional, or should this be faiss (which auto-selects based on environment)?

nemo_skills/inference/eval/bfcl_web_search.py (1)

228-249: Consider more robust browser emulation for production use.

The custom headers help avoid 403 Forbidden errors, but some sites may still detect this as a bot. For more reliable fetching, consider:

  • Option 1: Use requests-html which includes a browser-like session with better default headers.
  • Option 2: For sites with heavy JavaScript, consider playwright or selenium for full browser automation.

This is noted by the TODO comment on line 228. The current approach is acceptable for initial implementation, but may need enhancement based on real-world evaluation results.
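
If the playwright route is taken later, a minimal sketch of a hypothetical helper (not part of this PR; requires `pip install playwright` and `playwright install chromium`):

```python
from playwright.sync_api import sync_playwright


def fetch_rendered_page(url: str, timeout_ms: int = 30000) -> str:
    """Return the rendered HTML of a page, including JavaScript-generated content."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            page.goto(url, timeout=timeout_ms, wait_until="networkidle")
            return page.content()
        finally:
            browser.close()
```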

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 01c8c34 and 70805d2.

📒 Files selected for processing (2)
  • nemo_skills/inference/eval/bfcl_web_search.py (1 hunks)
  • requirements/main.txt (2 hunks)
🧰 Additional context used
🪛 Ruff (0.14.1)
nemo_skills/inference/eval/bfcl_web_search.py

52-52: Standard pseudo-random generators are not suitable for cryptographic purposes

(S311)


54-54: Standard pseudo-random generators are not suitable for cryptographic purposes

(S311)


56-56: Unused method argument: long_context

(ARG002)


154-154: Standard pseudo-random generators are not suitable for cryptographic purposes

(S311)


164-164: Standard pseudo-random generators are not suitable for cryptographic purposes

(S311)


177-177: Use explicit conversion flag

Replace with conversion flag

(RUF010)


183-183: Comment contains ambiguous (EN DASH). Did you mean - (HYPHEN-MINUS)?

(RUF003)


224-224: Avoid specifying long messages outside the exception class

(TRY003)


277-277: Abstract raise to an inner function

(TRY301)


277-277: Avoid specifying long messages outside the exception class

(TRY003)


279-279: Do not catch blind exception: Exception

(BLE001)


280-280: Use explicit conversion flag

Replace with conversion flag

(RUF010)


284-284: Docstring contains ambiguous (NON-BREAKING HYPHEN). Did you mean - (HYPHEN-MINUS)?

(RUF002)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: unit-tests
  • GitHub Check: pre-commit
🔇 Additional comments (5)
requirements/main.txt (2)

16-16: Good practice: explanatory comments for new dependencies.

The inline comments ("Needed for BFCLv4") are helpful for maintainability and context. This aligns well with other similar comments in the file (e.g., "Requires patching…" on line 31).

Also applies to: 38-40


16-16: Remove unused faiss-cpu dependency; version pinning concern is invalid.

The original review comment's version pinning concern misrepresents project practice—the codebase intentionally leaves most dependencies unpinned, pinning only those with known issues (e.g., litellm==1.77.5). The new packages follow this existing pattern correctly.

However, verification reveals a critical issue: faiss-cpu is added to requirements but never imported or used anywhere in the codebase. Only ddgs and sentence_transformers are actively used for BFCLv4. Either remove faiss-cpu or confirm if supporting code is missing.

Likely an incorrect or invalid review comment.

nemo_skills/inference/eval/bfcl_web_search.py (3)

46-54: Random instances are appropriate for non-cryptographic error simulation.

The S311 static analysis warnings about random.Random are false positives here. These instances with fixed seeds are explicitly used to simulate deterministic errors for testing purposes, not for any security-sensitive operations.


279-280: Broad exception handling is appropriate for this utility function.

Catching Exception here is reasonable for a general-purpose URL fetching utility that needs to gracefully handle network errors, timeouts, parsing failures, etc., and return a consistent error dict format. The BLE001 static analysis warning can be safely ignored in this context.


282-298: LGTM: Error simulation helper is well-implemented.

The helper method correctly generates realistic error messages by parsing the URL and formatting templates with appropriate context. While currently unused (as noted in comments), this provides valuable testing infrastructure for future use.

# limitations under the License.

bs4
ddgs # Needed for BFCLv4

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Verify new dependencies are actually used in BFCLv4 code.

The added packages (ddgs, sentence_transformers, faiss-cpu) include helpful explanatory comments indicating BFCLv4 use, but this should be verified to ensure they're not accidentally included.

Also applies to: 38-40


🏁 Script executed:

```shell
#!/bin/bash
# Verify new packages are imported in the codebase
echo "=== Searching for ddgs imports ==="
rg "import\s+ddgs|from\s+ddgs" --type py

echo ""
echo "=== Searching for sentence_transformers imports ==="
rg "import\s+sentence_transformers|from\s+sentence_transformers" --type py

echo ""
echo "=== Searching for faiss imports ==="
rg "import\s+faiss|from\s+faiss" --type py
```

Length of output: 382


I need to verify the faiss package more thoroughly and check the actual requirements file.

Now let me check the requirements file and search for faiss more broadly:


🏁 Script executed:

#!/bin/bash
# Check requirements/main.txt for the new packages
echo "=== Checking requirements/main.txt around lines 16 and 38-40 ==="
cat requirements/main.txt | head -45 | tail -30

echo ""
echo "=== Broader search for faiss (case-insensitive) ==="
rg -i "faiss" --type py

echo ""
echo "=== Search for any Python files importing from faiss ==="
rg "faiss\." --type py | head -20

Length of output: 876


Remove faiss-cpu dependency or verify if it's indirectly required.

Verification shows that ddgs and sentence_transformers are genuinely used in the codebase:

  • ddgs is imported in nemo_skills/inference/eval/bfcl_web_search.py
  • sentence_transformers is imported in nemo_skills/inference/retrieve_similar.py

However, faiss-cpu shows no imports or usage anywhere in the codebase despite the comment claiming it's "Needed for BFCLv4". Either remove this unused dependency or clarify if it's an indirect transitive dependency that shouldn't be explicitly listed.

🤖 Prompt for AI Agents
In requirements/main.txt around line 16, the faiss-cpu entry appears unused per
the review; either remove the faiss-cpu dependency from the requirements file or
verify and document why it must remain: search the repo for any direct
imports/usages of faiss (e.g., grep for "faiss" and check runtime code paths),
check whether a package like sentence_transformers or ddgs pulls it in
transitively (pipdeptree or pip install . and pip check), and if it is only a
transitive dependency remove the explicit listing (or replace with a comment
noting it is provided transitively); update the requirements file accordingly
and add a brief note in the file or PR explaining the change.
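
A standard-library sketch of the transitive-dependency check described above (illustrative only; pipdeptree gives the same answer with more detail):

```python
from importlib import metadata


def packages_requiring(package_name: str) -> list[str]:
    """List installed distributions that declare `package_name` among their requirements."""
    dependents = []
    for dist in metadata.distributions():
        for requirement in dist.requires or []:
            if requirement.lower().replace("_", "-").startswith(package_name.lower()):
                dependents.append(dist.metadata["Name"])
    return dependents


# An empty list here, combined with no direct `import faiss` in the repo,
# suggests the explicit faiss-cpu entry can be dropped.
print(packages_requiring("faiss-cpu"))
```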
