diff --git a/docs/about/concepts/architecture.md b/docs/about/concepts/architecture.md index 7900089a..4aa6ea10 100644 --- a/docs/about/concepts/architecture.md +++ b/docs/about/concepts/architecture.md @@ -100,7 +100,7 @@ graph LR Use the launcher to handle both model deployment and evaluation: ```bash -nv-eval run \ +nemo-evaluator-launcher run \ --config-dir examples \ --config-name local_llama_3_1_8b_instruct \ -o deployment.checkpoint_path=/path/to/model \ @@ -112,7 +112,7 @@ nv-eval run \ Point the launcher to an existing API endpoint: ```bash -nv-eval run \ +nemo-evaluator-launcher run \ --config-dir examples \ --config-name local_llama_3_1_8b_instruct \ -o target.api_endpoint.url=http://localhost:8080/v1/completions \ diff --git a/docs/about/concepts/evaluation-model.md b/docs/about/concepts/evaluation-model.md index adcb15d4..4e5b65ac 100644 --- a/docs/about/concepts/evaluation-model.md +++ b/docs/about/concepts/evaluation-model.md @@ -14,7 +14,7 @@ NeMo Evaluator supports several evaluation approaches through containerized harn - **Function Calling**: Models generate structured outputs for tool use and API interaction scenarios. - **Safety & Security**: Evaluation against adversarial prompts and safety benchmarks to test model alignment and robustness. -One or more evaluation harnesses implement each approach. To discover available tasks for each approach, use `nv-eval ls tasks`. +One or more evaluation harnesses implement each approach. To discover available tasks for each approach, use `nemo-evaluator-launcher ls tasks`. ## Endpoint Compatibility @@ -25,7 +25,7 @@ NeMo Evaluator targets OpenAI-compatible API endpoints. The platform supports th - **`vlm`**: Vision-language model endpoints supporting image inputs. - **`embedding`**: Embedding generation endpoints for retrieval evaluation. -Each evaluation task specifies which endpoint types it supports. Verify compatibility using `nv-eval ls tasks`. +Each evaluation task specifies which endpoint types it supports. Verify compatibility using `nemo-evaluator-launcher ls tasks`. ## Metrics diff --git a/docs/about/concepts/framework-definition-file.md b/docs/about/concepts/framework-definition-file.md index cb936518..3b700f87 100644 --- a/docs/about/concepts/framework-definition-file.md +++ b/docs/about/concepts/framework-definition-file.md @@ -3,7 +3,7 @@ # Framework Definition Files ::::{note} -**Who needs this?** This documentation is for framework developers and organizations creating custom evaluation frameworks. If you're running existing evaluation tasks using {ref}`nv-eval ` (NeMo Evaluator Launcher CLI) or {ref}`eval-factory ` (NeMo Evaluator CLI), you don't need to create FDFs—they're already provided by framework packages. +**Who needs this?** This documentation is for framework developers and organizations creating custom evaluation frameworks. If you're running existing evaluation tasks using {ref}`nemo-evaluator-launcher ` (NeMo Evaluator Launcher CLI) or {ref}`nemo-evaluator ` (NeMo Evaluator CLI), you don't need to create FDFs—they're already provided by framework packages. :::: A Framework Definition File (FDF) is a YAML configuration file that serves as the single source of truth for integrating evaluation frameworks into the NeMo Evaluator ecosystem. FDFs define how evaluation frameworks are configured, executed, and integrated with the Eval Factory system. 
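The paragraph above describes what an FDF contains; a short sketch makes the shape concrete. The following is a minimal, hypothetical illustration: the field names (`framework`, `evaluations`, `defaults`, `command`) are assumptions for this example and may not match the exact FDF schema shipped by framework packages.

```yaml
# Hypothetical FDF sketch; field names are illustrative assumptions, not the real schema
framework:
  name: my_custom_harness          # assumed identifier surfaced to the launcher
  description: Example third-party evaluation harness
evaluations:
  - name: example_task             # assumed task name that would appear in `ls tasks`
    description: Minimal example benchmark definition
    defaults:
      command: my-harness-cli --task example_task --endpoint $API_URL   # assumed command template
```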
diff --git a/docs/about/key-features.md b/docs/about/key-features.md index 30cb6b49..cf2727ff 100644 --- a/docs/about/key-features.md +++ b/docs/about/key-features.md @@ -17,9 +17,9 @@ Run evaluations anywhere with unified configuration and monitoring: ```bash # Single command, multiple backends -nv-eval run --config-dir examples --config-name local_llama_3_1_8b_instruct -nv-eval run --config-dir examples --config-name slurm_llama_3_1_8b_instruct -nv-eval run --config-dir examples --config-name lepton_vllm_llama_3_1_8b_instruct +nemo-evaluator-launcher run --config-dir examples --config-name local_llama_3_1_8b_instruct +nemo-evaluator-launcher run --config-dir examples --config-name slurm_llama_3_1_8b_instruct +nemo-evaluator-launcher run --config-dir examples --config-name lepton_vllm_llama_3_1_8b_instruct ``` ### 100+ Benchmarks Across 17 Harnesses @@ -27,14 +27,14 @@ Access comprehensive benchmark suite with single CLI: ```bash # Discover available benchmarks -nv-eval ls tasks +nemo-evaluator-launcher ls tasks # Run academic benchmarks -nv-eval run --config-dir examples --config-name local_llama_3_1_8b_instruct \ +nemo-evaluator-launcher run --config-dir examples --config-name local_llama_3_1_8b_instruct \ -o 'evaluation.tasks=["mmlu_pro", "gsm8k", "arc_challenge"]' # Run safety evaluation -nv-eval run --config-dir examples --config-name local_llama_3_1_8b_instruct \ +nemo-evaluator-launcher run --config-dir examples --config-name local_llama_3_1_8b_instruct \ -o 'evaluation.tasks=["aegis_v2", "garak"]' ``` @@ -43,13 +43,13 @@ First-class integration with MLOps platforms: ```bash # Export to MLflow -nv-eval export --dest mlflow +nemo-evaluator-launcher export --dest mlflow # Export to Weights & Biases -nv-eval export --dest wandb +nemo-evaluator-launcher export --dest wandb # Export to Google Sheets -nv-eval export --dest gsheets +nemo-evaluator-launcher export --dest gsheets ``` ## **Core Evaluation Engine (NeMo Evaluator Core)** @@ -313,7 +313,7 @@ Built-in safety assessment through specialized containers: ```bash # Run safety evaluation suite -nv-eval run \ +nemo-evaluator-launcher run \ --config-dir examples \ --config-name local_llama_3_1_8b_instruct \ -o 'evaluation.tasks=["aegis_v2", "garak"]' @@ -331,10 +331,10 @@ Monitor evaluation progress across all backends: ```bash # Check evaluation status -nv-eval status +nemo-evaluator-launcher status # Kill running evaluations -nv-eval kill +nemo-evaluator-launcher kill ``` ### Result Export and Analysis @@ -342,11 +342,11 @@ Export evaluation results to MLOps platforms for downstream analysis: ```bash # Export to MLflow for experiment tracking -nv-eval export --dest mlflow +nemo-evaluator-launcher export --dest mlflow # Export to Weights & Biases for visualization -nv-eval export --dest wandb +nemo-evaluator-launcher export --dest wandb # Export to Google Sheets for sharing -nv-eval export --dest gsheets +nemo-evaluator-launcher export --dest gsheets ``` diff --git a/docs/conf.py b/docs/conf.py index 77da8bf5..2b9a842b 100644 --- a/docs/conf.py +++ b/docs/conf.py @@ -24,7 +24,7 @@ import sys # Add custom extensions directory to Python path -sys.path.insert(0, os.path.abspath('_extensions')) +sys.path.insert(0, os.path.abspath("_extensions")) project = "NeMo Evaluator SDK" copyright = "2025, NVIDIA Corporation" @@ -43,7 +43,7 @@ "sphinx_copybutton", # For copy button in code blocks, "sphinx_design", # For grid layout "sphinx.ext.ifconfig", # For conditional content - "content_gating", # Unified content gating extension + 
"content_gating", # Unified content gating extension "myst_codeblock_substitutions", # Our custom MyST substitutions in code blocks "json_output", # Generate JSON output for each page "search_assets", # Enhanced search assets extension @@ -54,26 +54,41 @@ templates_path = ["_templates"] exclude_patterns = [ - "_build", - "Thumbs.db", + "_build", + "Thumbs.db", ".DS_Store", - "_extensions/*/README.md", # Exclude README files in extension directories - "_extensions/README.md", # Exclude main extensions README - "_extensions/*/__pycache__", # Exclude Python cache directories - "_extensions/*/*/__pycache__", # Exclude nested Python cache directories + "_extensions/*/README.md", # Exclude README files in extension directories + "_extensions/README.md", # Exclude main extensions README + "_extensions/*/__pycache__", # Exclude Python cache directories + "_extensions/*/*/__pycache__", # Exclude nested Python cache directories ] # -- Options for Intersphinx ------------------------------------------------- # Cross-references to external NVIDIA documentation intersphinx_mapping = { - "ctk": ("https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest", None), - "gpu-op": ("https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest", None), + "ctk": ( + "https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest", + None, + ), + "gpu-op": ( + "https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest", + None, + ), "ngr-tk": ("https://docs.nvidia.com/nemo/guardrails/latest", None), - "nim-cs": ("https://docs.nvidia.com/nim/llama-3-1-nemoguard-8b-contentsafety/latest/", None), - "nim-tc": ("https://docs.nvidia.com/nim/llama-3-1-nemoguard-8b-topiccontrol/latest/", None), + "nim-cs": ( + "https://docs.nvidia.com/nim/llama-3-1-nemoguard-8b-contentsafety/latest/", + None, + ), + "nim-tc": ( + "https://docs.nvidia.com/nim/llama-3-1-nemoguard-8b-topiccontrol/latest/", + None, + ), "nim-jd": ("https://docs.nvidia.com/nim/nemoguard-jailbreakdetect/latest/", None), "nim-llm": ("https://docs.nvidia.com/nim/large-language-models/latest/", None), - "driver-linux": ("https://docs.nvidia.com/datacenter/tesla/driver-installation-guide", None), + "driver-linux": ( + "https://docs.nvidia.com/datacenter/tesla/driver-installation-guide", + None, + ), "nim-op": ("https://docs.nvidia.com/nim-operator/latest", None), } @@ -83,7 +98,7 @@ # -- Options for JSON Output ------------------------------------------------- # Configure the JSON output extension for comprehensive search indexes json_output_settings = { - 'enabled': True, + "enabled": True, } # -- Options for AI Assistant ------------------------------------------------- @@ -103,8 +118,8 @@ "deflist", # Supports definition lists with term: definition format "fieldlist", # Enables field lists for metadata like :author: Name "tasklist", # Adds support for GitHub-style task lists with [ ] and [x] - "attrs_inline", # Enables inline attributes for markdown - "substitution", # Enables substitution for markdown + "attrs_inline", # Enables inline attributes for markdown + "substitution", # Enables substitution for markdown ] myst_heading_anchors = 5 # Generates anchor links for headings up to level 5 @@ -121,18 +136,14 @@ "support_email": "update-me", "min_python_version": "3.8", "recommended_cuda": "12.0+", - "docker_compose_latest": "25.08.1" + "docker_compose_latest": "25.09.1", } # Enable figure numbering numfig = True # Optional: customize numbering format -numfig_format = { - 'figure': 'Figure %s', - 'table': 'Table %s', - 
'code-block': 'Listing %s' -} +numfig_format = {"figure": "Figure %s", "table": "Table %s", "code-block": "Listing %s"} # Optional: number within sections numfig_secnum_depth = 1 # Gives you "Figure 1.1, 1.2, 2.1, etc." @@ -141,9 +152,10 @@ # Suppress expected warnings for conditional content builds suppress_warnings = [ "toc.not_included", # Expected when video docs are excluded from GA builds - "toc.no_title", # Expected for helm docs that include external README files - "docutils", # Expected for autodoc2-generated content with regex patterns and complex syntax - "ref.python", # Expected for ambiguous autodoc2 cross-references (e.g., multiple 'Params' classes) + "toc.no_title", # Expected for helm docs that include external README files + "docutils", # Expected for autodoc2-generated content with regex patterns and complex syntax + "ref.python", # Expected for ambiguous autodoc2 cross-references (e.g., multiple 'Params' classes) + "myst.xref_missing", # Expected for Pydantic BaseModel docstrings that reference Pydantic's own documentation ] # -- Options for Autodoc2 --------------------------------------------------- @@ -153,9 +165,7 @@ # Conditional autodoc2 configuration - only enable if packages exist # Note: We point to the parent package rather than individual subpackages because # the subpackages have relative imports between them (e.g., api imports from core) -autodoc2_packages_list = [ - "../packages/nemo-evaluator/src/nemo_evaluator/" -] +autodoc2_packages_list = ["../packages/nemo-evaluator/src/nemo_evaluator/"] # Check if any of the packages actually exist before enabling autodoc2 autodoc2_packages = [] @@ -168,37 +178,37 @@ if autodoc2_packages: if "autodoc2" not in extensions: extensions.append("autodoc2") - + autodoc2_render_plugin = "myst" # Use MyST for rendering docstrings autodoc2_output_dir = "apidocs" # Output directory for autodoc2 (relative to docs/) - + # ==================== GOOD DEFAULTS FOR CLEANER DOCS ==================== - + # Hide implementation details - good defaults for cleaner docs autodoc2_hidden_objects = [ - "dunder", # Hide __methods__ like __init__, __str__, etc. - "private", # Hide _private methods and variables + "dunder", # Hide __methods__ like __init__, __str__, etc. 
+ "private", # Hide _private methods and variables "inherited", # Hide inherited methods to reduce clutter ] - + # Enable module summaries for better organization autodoc2_module_summary = True - + # Sort by name for consistent organization autodoc2_sort_names = True - + # Enhanced docstring processing for better formatting autodoc2_docstrings = "all" # Include all docstrings for comprehensive docs - + # Include class inheritance information - useful for users autodoc2_class_inheritance = True - + # Handle class docstrings properly (merge __init__ with class doc) autodoc2_class_docstring = "merge" - + # Better type annotation handling autodoc2_type_guard_imports = True - + # Replace common type annotations for better readability autodoc2_replace_annotations = [ ("typing.Union", "Union"), @@ -207,23 +217,25 @@ ("typing.Dict", "Dict"), ("typing.Any", "Any"), ] - + # Don't require __all__ to be defined - document all public members autodoc2_module_all_regexes = [] # Empty list means don't require __all__ - + # Skip common test and internal modules - customize for your project autodoc2_skip_module_regexes = [ - r".*\.tests?.*", # Skip test modules - r".*\.test_.*", # Skip test files - r".*\._.*", # Skip private modules - r".*\.conftest", # Skip conftest files + r".*\.tests?.*", # Skip test modules + r".*\.test_.*", # Skip test files + r".*\._.*", # Skip private modules + r".*\.conftest", # Skip conftest files ] - + # Load index template from external file for better maintainability - template_path = os.path.join(os.path.dirname(__file__), "_templates", "autodoc2_index.rst") + template_path = os.path.join( + os.path.dirname(__file__), "_templates", "autodoc2_index.rst" + ) with open(template_path) as f: autodoc2_index_template = f.read() - + # This is a workaround that uses the parser located in autodoc2_docstrings_parser.py to allow autodoc2 to # render google style docstrings. # Related Issue: https://github.com/sphinx-extensions2/sphinx-autodoc2/issues/33 @@ -272,10 +284,10 @@ }, } -# Add our static files directory +# Add our static files directory # html_static_path = ["_static"] html_extra_path = ["project.json", "versions1.json"] -# Note: JSON output configuration has been moved to the consolidated -# json_output_settings dictionary above for better organization and new features! \ No newline at end of file +# Note: JSON output configuration has been moved to the consolidated +# json_output_settings dictionary above for better organization and new features! 
diff --git a/docs/deployment/bring-your-own-endpoint/hosted-services.md b/docs/deployment/bring-your-own-endpoint/hosted-services.md index 873328bb..529e0b43 100644 --- a/docs/deployment/bring-your-own-endpoint/hosted-services.md +++ b/docs/deployment/bring-your-own-endpoint/hosted-services.md @@ -69,14 +69,14 @@ For multi-model comparison, run separate evaluations for each model and compare ```bash # Evaluate first model -nv-eval run \ +nemo-evaluator-launcher run \ --config-dir examples \ --config-name local_llama_3_1_8b_instruct \ -o target.api_endpoint.model_id=meta/llama-3.1-8b-instruct \ -o execution.output_dir=./results/llama-3.1-8b # Evaluate second model -nv-eval run \ +nemo-evaluator-launcher run \ --config-dir examples \ --config-name local_llama_3_1_8b_instruct \ -o target.api_endpoint.model_id=meta/llama-3.1-70b-instruct \ @@ -121,11 +121,11 @@ results = evaluate(target_cfg=target, eval_cfg=config) ### NVIDIA Build CLI Usage -Use `nv-eval` (recommended) or `nemo-evaluator-launcher`: +Use the `nemo-evaluator-launcher` CLI: ```bash # Basic evaluation -nv-eval run \ +nemo-evaluator-launcher run \ --config-dir examples \ --config-name local_llama_3_1_8b_instruct \ -o target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions \ @@ -133,7 +133,7 @@ nv-eval run \ -o target.api_endpoint.api_key=${NGC_API_KEY} # Large model evaluation with limited samples -nv-eval run \ +nemo-evaluator-launcher run \ --config-dir examples \ --config-name local_llama_3_1_8b_instruct \ -o target.api_endpoint.model_id=meta/llama-3.1-405b-instruct \ @@ -213,11 +213,11 @@ evaluation: ### OpenAI CLI Usage -Use `nv-eval` (recommended) or `nemo-evaluator-launcher`: +Use the `nemo-evaluator-launcher` CLI: ```bash # GPT-4 evaluation -nv-eval run \ +nemo-evaluator-launcher run \ --config-dir examples \ --config-name local_llama_3_1_8b_instruct \ -o target.api_endpoint.url=https://api.openai.com/v1/chat/completions \ @@ -225,7 +225,7 @@ nv-eval run \ -o target.api_endpoint.api_key=${OPENAI_API_KEY} # Cost-effective GPT-3.5 evaluation -nv-eval run \ +nemo-evaluator-launcher run \ --config-dir examples \ --config-name local_llama_3_1_8b_instruct \ -o target.api_endpoint.model_id=gpt-3.5-turbo \ diff --git a/docs/deployment/bring-your-own-endpoint/manual-deployment.md b/docs/deployment/bring-your-own-endpoint/manual-deployment.md index ec1ae81f..a808c2f3 100644 --- a/docs/deployment/bring-your-own-endpoint/manual-deployment.md +++ b/docs/deployment/bring-your-own-endpoint/manual-deployment.md @@ -29,7 +29,7 @@ Once your manual deployment is running, use the launcher to evaluate: ```bash # Basic evaluation against manual deployment -nv-eval run \ +nemo-evaluator-launcher run \ --config-dir examples \ --config-name local_llama_3_1_8b_instruct \ -o target.api_endpoint.url=http://localhost:8080/v1/completions \ diff --git a/docs/deployment/index.md b/docs/deployment/index.md index 00d5e79e..798666dc 100644 --- a/docs/deployment/index.md +++ b/docs/deployment/index.md @@ -26,7 +26,7 @@ Let NeMo Evaluator Launcher handle both model deployment and evaluation orchestr ```bash # Launcher deploys model AND runs evaluation -nv-eval run \ +nemo-evaluator-launcher run \ --config-dir examples \ --config-name slurm_llama_3_1_8b_instruct \ -o deployment.checkpoint_path=/shared/models/llama-3.1-8b @@ -51,7 +51,7 @@ You handle model deployment, NeMo Evaluator handles evaluation: **Launcher users with existing endpoints:** ```bash # Point launcher 
to your deployed model -nv-eval run \ +nemo-evaluator-launcher run \ --config-dir examples \ --config-name local_llama_3_1_8b_instruct \ -o target.api_endpoint.url=http://localhost:8080/v1/completions @@ -223,7 +223,7 @@ Choose from these approaches when managing your own deployment: ### With Launcher ```bash # Point to any existing endpoint -nv-eval run \ +nemo-evaluator-launcher run \ --config-dir examples \ --config-name local_llama_3_1_8b_instruct \ -o target.api_endpoint.url=http://your-endpoint:8080/v1/completions \ diff --git a/docs/deployment/launcher-orchestrated/index.md b/docs/deployment/launcher-orchestrated/index.md index 7963be5f..06943c1c 100644 --- a/docs/deployment/launcher-orchestrated/index.md +++ b/docs/deployment/launcher-orchestrated/index.md @@ -18,7 +18,7 @@ The launcher supports multiple deployment backends and execution environments. ```bash # Deploy model and run evaluation in one command (Slurm example) -nv-eval run \ +nemo-evaluator-launcher run \ --config-dir examples \ --config-name slurm_llama_3_1_8b_instruct \ -o deployment.checkpoint_path=/path/to/your/model diff --git a/docs/deployment/launcher-orchestrated/lepton.md b/docs/deployment/launcher-orchestrated/lepton.md index f3a56e13..f989b62a 100644 --- a/docs/deployment/launcher-orchestrated/lepton.md +++ b/docs/deployment/launcher-orchestrated/lepton.md @@ -17,7 +17,7 @@ Lepton launcher-orchestrated deployment: ```bash # Deploy and evaluate on Lepton AI -nv-eval run \ +nemo-evaluator-launcher run \ --config-dir examples \ --config-name lepton_vllm_llama_3_1_8b_instruct \ -o deployment.checkpoint_path=meta-llama/Llama-3.1-8B-Instruct \ @@ -185,10 +185,10 @@ Use NeMo Evaluator Launcher commands to monitor your evaluations: ```bash # Check status using invocation ID -nv-eval status +nemo-evaluator-launcher status # Kill running evaluations and cleanup endpoints -nv-eval kill +nemo-evaluator-launcher kill ``` ### Monitor Lepton Resources @@ -217,7 +217,7 @@ After evaluation completes, export results using the export command: ```bash # Export results to MLflow -nv-eval export --dest mlflow +nemo-evaluator-launcher export --dest mlflow ``` Refer to the {ref}`exporters-overview` for additional export options and configurations. diff --git a/docs/deployment/launcher-orchestrated/local.md b/docs/deployment/launcher-orchestrated/local.md index e1da9200..3064df6b 100644 --- a/docs/deployment/launcher-orchestrated/local.md +++ b/docs/deployment/launcher-orchestrated/local.md @@ -21,7 +21,7 @@ Local execution: ```bash # Run evaluation against existing endpoint -nv-eval run \ +nemo-evaluator-launcher run \ --config-dir examples \ --config-name local_llama_3_1_8b_instruct ``` @@ -118,24 +118,75 @@ evaluation: For detailed adapter configuration options, refer to {ref}`adapters`. + +### Advanced Settings + +If you are deploying the model locally with Docker, you can use a dedicated Docker network. +This provides a secure connection between the deployment and evaluation containers. 
+ +```shell +docker network create my-custom-network + +docker run --gpus all --network my-custom-network --name my-phi-container vllm/vllm-openai:latest \ + --model microsoft/Phi-4-mini-instruct +``` + +Then use the same network in the evaluator config: + +```yaml +defaults: + - execution: local + - deployment: none + - _self_ + +execution: + output_dir: my_phi_test + extra_docker_args: "--network my-custom-network" + +target: + api_endpoint: + model_id: microsoft/Phi-4-mini-instruct + url: http://my-phi-container:8000/v1/chat/completions + api_key_name: null + +evaluation: + tasks: + - name: simple_evals.mmlu_pro + overrides: + config.params.limit_samples: 10 # TEST ONLY: Limits to 10 samples for quick testing + config.params.parallelism: 1 +``` + +Alternatively you can expose ports and use the host network: + +```shell +docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest \ + --model microsoft/Phi-4-mini-instruct +``` + +```yaml +execution: + extra_docker_args: "--network host" +``` + ## Command-Line Usage ### Basic Commands ```bash # Run evaluation -nv-eval run \ +nemo-evaluator-launcher run \ --config-dir examples \ --config-name local_llama_3_1_8b_instruct # Dry run to preview configuration -nv-eval run \ +nemo-evaluator-launcher run \ --config-dir examples \ --config-name local_llama_3_1_8b_instruct \ --dry-run # Override endpoint URL -nv-eval run \ +nemo-evaluator-launcher run \ --config-dir examples \ --config-name local_llama_3_1_8b_instruct \ -o target.api_endpoint.url=http://localhost:8080/v1/chat/completions @@ -145,19 +196,19 @@ nv-eval run \ ```bash # Check job status -nv-eval status +nemo-evaluator-launcher status # Check entire invocation -nv-eval status +nemo-evaluator-launcher status # Kill running job -nv-eval kill +nemo-evaluator-launcher kill # List available tasks -nv-eval ls tasks +nemo-evaluator-launcher ls tasks # List recent runs -nv-eval ls runs +nemo-evaluator-launcher ls runs ``` ## Requirements @@ -227,7 +278,7 @@ Check logs in the output directory: tail -f //logs/stdout.log # Kill and restart if needed -nv-eval kill +nemo-evaluator-launcher kill ``` **Tasks fail with errors:** @@ -240,7 +291,7 @@ nv-eval kill ```bash # Validate configuration before running -nv-eval run \ +nemo-evaluator-launcher run \ --config-dir examples \ --config-name local_llama_3_1_8b_instruct \ --dry-run diff --git a/docs/deployment/launcher-orchestrated/slurm.md b/docs/deployment/launcher-orchestrated/slurm.md index 10b0957d..040973b5 100644 --- a/docs/deployment/launcher-orchestrated/slurm.md +++ b/docs/deployment/launcher-orchestrated/slurm.md @@ -17,7 +17,7 @@ Slurm launcher-orchestrated deployment: ```bash # Deploy and evaluate on Slurm cluster -nv-eval run \ +nemo-evaluator-launcher run \ --config-dir examples \ --config-name slurm_llama_3_1_8b_instruct \ -o deployment.checkpoint_path=/shared/models/llama-3.1-8b-instruct \ @@ -145,12 +145,12 @@ evaluation: ```bash # Submit job with configuration -nv-eval run \ +nemo-evaluator-launcher run \ --config-dir examples \ --config-name slurm_llama_3_1_8b_instruct # Submit with configuration overrides -nv-eval run \ +nemo-evaluator-launcher run \ --config-dir examples \ --config-name slurm_llama_3_1_8b_instruct \ -o execution.walltime="04:00:00" \ @@ -161,17 +161,17 @@ nv-eval run \ ```bash # Check job status -nv-eval status +nemo-evaluator-launcher status # List all runs (optionally filter by executor) -nv-eval ls runs --executor slurm +nemo-evaluator-launcher ls runs --executor slurm ``` ### Managing Jobs ```bash # Cancel job 
-nv-eval kill +nemo-evaluator-launcher kill ``` ### Native Slurm Commands @@ -248,7 +248,7 @@ sinfo -p gpu ```bash # Check job status -nv-eval status +nemo-evaluator-launcher status # View Slurm job details scontrol show job diff --git a/docs/evaluation/_snippets/commands/list_tasks.sh b/docs/evaluation/_snippets/commands/list_tasks.sh index 1d506040..31b20964 100755 --- a/docs/evaluation/_snippets/commands/list_tasks.sh +++ b/docs/evaluation/_snippets/commands/list_tasks.sh @@ -3,12 +3,12 @@ # [snippet-start] # List all available benchmarks -nv-eval ls tasks +nemo-evaluator-launcher ls tasks # Output as JSON for programmatic filtering -nv-eval ls tasks --json +nemo-evaluator-launcher ls tasks --json # Filter for specific task types (example: academic benchmarks) -nv-eval ls tasks | grep -E "(mmlu|gsm8k|arc)" +nemo-evaluator-launcher ls tasks | grep -E "(mmlu|gsm8k|arc)" # [snippet-end] diff --git a/docs/evaluation/benchmarks.md b/docs/evaluation/benchmarks.md index 0312e2cc..d92f877a 100644 --- a/docs/evaluation/benchmarks.md +++ b/docs/evaluation/benchmarks.md @@ -30,7 +30,7 @@ Recommended suite for comprehensive model evaluation: - `truthfulqa` - Factual accuracy vs. plausibility ```bash -nv-eval run \ +nemo-evaluator-launcher run \ --config-dir examples \ --config-name local_academic_suite \ -o 'evaluation.tasks=["mmlu_pro", "arc_challenge", "hellaswag", "truthfulqa"]' @@ -90,7 +90,7 @@ nv-eval run \ **Example Usage:** ```bash # Run academic benchmark suite -nv-eval run \ +nemo-evaluator-launcher run \ --config-dir examples \ --config-name local_llama_3_1_8b_instruct \ -o 'evaluation.tasks=["mmlu_pro", "gsm8k", "arc_challenge"]' @@ -136,7 +136,7 @@ for task in academic_tasks: **Example Usage:** ```bash # Run code generation evaluation -nv-eval run \ +nemo-evaluator-launcher run \ --config-dir examples \ --config-name local_llama_3_1_8b_instruct \ -o 'evaluation.tasks=["humaneval", "mbpp"]' @@ -165,7 +165,7 @@ nv-eval run \ **Example Usage:** ```bash # Run comprehensive safety evaluation -nv-eval run \ +nemo-evaluator-launcher run \ --config-dir examples \ --config-name local_llama_3_1_8b_instruct \ -o 'evaluation.tasks=["aegis_v2", "garak"]' @@ -271,10 +271,10 @@ For a complete list of available tasks in each container: ```bash # List tasks in any container -docker run --rm nvcr.io/nvidia/eval-factory/simple-evals:{{ docker_compose_latest }} eval-factory ls +docker run --rm nvcr.io/nvidia/eval-factory/simple-evals:{{ docker_compose_latest }} nemo-evaluator ls # Or use the launcher for unified access -nv-eval ls tasks +nemo-evaluator-launcher ls tasks ``` ## Integration Patterns @@ -283,11 +283,11 @@ NeMo Evaluator provides multiple integration options to fit your workflow: ```bash # Launcher CLI (recommended for most users) -nv-eval ls tasks -nv-eval run --config-dir examples --config-name local_mmlu_evaluation +nemo-evaluator-launcher ls tasks +nemo-evaluator-launcher run --config-dir examples --config-name local_mmlu_evaluation # Container direct execution -docker run --rm nvcr.io/nvidia/eval-factory/simple-evals:{{ docker_compose_latest }} eval-factory ls +docker run --rm nvcr.io/nvidia/eval-factory/simple-evals:{{ docker_compose_latest }} nemo-evaluator ls # Python API (for programmatic control) # See the Python API documentation for details diff --git a/docs/evaluation/custom-tasks.md b/docs/evaluation/custom-tasks.md index 2cef7ba5..55604483 100644 --- a/docs/evaluation/custom-tasks.md +++ b/docs/evaluation/custom-tasks.md @@ -47,7 +47,7 @@ Custom tasks require explicit harness 
specification using the format: - `"bigcode-evaluation-harness.humaneval"` - BigCode harness task :::{note} -These examples demonstrate accessing tasks from upstream evaluation harnesses. Pre-configured tasks with optimized settings are available through the launcher CLI (`nv-eval ls tasks`). Custom task configuration is useful when you need non-standard parameters or when evaluating tasks not yet integrated into the pre-configured catalog. +These examples demonstrate accessing tasks from upstream evaluation harnesses. Pre-configured tasks with optimized settings are available through the launcher CLI (`nemo-evaluator-launcher ls tasks`). Custom task configuration is useful when you need non-standard parameters or when evaluating tasks not yet integrated into the pre-configured catalog. ::: ## lambada_openai (Log-Probability Task) diff --git a/docs/evaluation/index.md b/docs/evaluation/index.md index 0774ffd5..6654ee2f 100644 --- a/docs/evaluation/index.md +++ b/docs/evaluation/index.md @@ -27,7 +27,7 @@ Before you run evaluations, ensure you have: **For researchers and data scientists**: Evaluate your model on standard academic benchmarks in 3 steps. **Step 1: Choose Your Approach** -- **Launcher CLI** (Recommended): `nv-eval run --config-dir examples --config-name local_llama_3_1_8b_instruct` +- **Launcher CLI** (Recommended): `nemo-evaluator-launcher run --config-dir examples --config-name local_llama_3_1_8b_instruct` - **Python API**: Direct programmatic control with `evaluate()` function **Step 2: Select Benchmarks** @@ -39,14 +39,14 @@ Common academic suites: Discover all available tasks: ```bash -nv-eval ls tasks +nemo-evaluator-launcher ls tasks ``` **Step 3: Run Evaluation** Using Launcher CLI: ```bash -nv-eval run \ +nemo-evaluator-launcher run \ --config-dir examples \ --config-name local_llama_3_1_8b_instruct \ -o 'evaluation.tasks=["mmlu_pro", "gsm8k", "arc_challenge"]' \ @@ -119,7 +119,7 @@ Programmatic evaluation using Python API for integration into ML pipelines and c ::: :::{grid-item-card} {octicon}`package;1.5em;sd-mr-1` Container Workflows -:link: ../libraries/nemo-evaluator/workflows/using_containers +:link: ../libraries/nemo-evaluator/containers/index :link-type: doc Direct container access for specialized use cases and custom evaluation environments. 
::: diff --git a/docs/evaluation/run-evals/code-generation.md b/docs/evaluation/run-evals/code-generation.md index 6963de98..d761aae0 100644 --- a/docs/evaluation/run-evals/code-generation.md +++ b/docs/evaluation/run-evals/code-generation.md @@ -46,10 +46,10 @@ Verify your setup before running code evaluation: ```bash # List available code generation tasks -nv-eval ls tasks | grep -E "(mbpp|humaneval)" +nemo-evaluator-launcher ls tasks | grep -E "(mbpp|humaneval)" # Run MBPP evaluation -nv-eval run \ +nemo-evaluator-launcher run \ --config-dir examples \ --config-name local_llama_3_1_8b_instruct \ -o 'evaluation.tasks=["mbpp"]' \ @@ -57,7 +57,7 @@ nv-eval run \ -o target.api_endpoint.api_key=${YOUR_API_KEY} # Run multiple code generation benchmarks -nv-eval run \ +nemo-evaluator-launcher run \ --config-dir examples \ --config-name local_llama_3_1_8b_instruct \ -o 'evaluation.tasks=["mbpp", "humaneval"]' @@ -116,7 +116,7 @@ docker run --rm -it --gpus all nvcr.io/nvidia/eval-factory/bigcode-evaluation-ha export MY_API_KEY=your_api_key_here # Run code generation evaluation -eval-factory run_eval \ +nemo-evaluator run_eval \ --eval_type mbpp \ --model_id meta/llama-3.1-8b-instruct \ --model_url https://integrate.api.nvidia.com/v1/chat/completions \ @@ -143,10 +143,10 @@ Use the launcher CLI to discover all available code generation tasks: ```bash # List all available benchmarks -nv-eval ls tasks +nemo-evaluator-launcher ls tasks # Filter for code generation tasks -nv-eval ls tasks | grep -E "(mbpp|humaneval)" +nemo-evaluator-launcher ls tasks | grep -E "(mbpp|humaneval)" ``` ## Available Tasks diff --git a/docs/evaluation/run-evals/function-calling.md b/docs/evaluation/run-evals/function-calling.md index da855fa2..c34d11c9 100644 --- a/docs/evaluation/run-evals/function-calling.md +++ b/docs/evaluation/run-evals/function-calling.md @@ -34,10 +34,10 @@ Ensure you have: ```bash # List available function calling tasks -nv-eval ls tasks | grep -E "(bfcl|function)" +nemo-evaluator-launcher ls tasks | grep -E "(bfcl|function)" # Run BFCL AST prompting evaluation -nv-eval run \ +nemo-evaluator-launcher run \ --config-dir examples \ --config-name local_llama_3_1_8b_instruct \ -o 'evaluation.tasks=["bfclv3_ast_prompting"]' \ @@ -100,7 +100,7 @@ docker run --rm -it --gpus all nvcr.io/nvidia/eval-factory/bfcl:{{ docker_compos export MY_API_KEY=your_api_key_here # Run function calling evaluation -eval-factory run_eval \ +nemo-evaluator run_eval \ --eval_type bfclv3_ast_prompting \ --model_id meta/llama-3.1-8b-instruct \ --model_url https://integrate.api.nvidia.com/v1/chat/completions \ @@ -126,10 +126,10 @@ Use the launcher CLI to discover all available function calling tasks: ```bash # List all available benchmarks -nv-eval ls tasks +nemo-evaluator-launcher ls tasks # Filter for function calling tasks -nv-eval ls tasks | grep -E "(bfcl|function)" +nemo-evaluator-launcher ls tasks | grep -E "(bfcl|function)" ``` ## Available Function Calling Tasks diff --git a/docs/evaluation/run-evals/log-probability.md b/docs/evaluation/run-evals/log-probability.md index b1ca6404..2e99d401 100644 --- a/docs/evaluation/run-evals/log-probability.md +++ b/docs/evaluation/run-evals/log-probability.md @@ -50,11 +50,11 @@ Verify your completions endpoint before running log-probability evaluation: ```bash # List available log-probability tasks -nv-eval ls tasks | grep -E "(arc|hellaswag|winogrande|truthfulqa)" +nemo-evaluator-launcher ls tasks | grep -E "(arc|hellaswag|winogrande|truthfulqa)" # Run ARC Challenge evaluation 
with existing endpoint # Note: Configure tokenizer parameters in your YAML config file -nv-eval run \ +nemo-evaluator-launcher run \ --config-dir examples \ --config-name local_llama_3_1_8b_instruct \ -o target.api_endpoint.url=http://0.0.0.0:8080/v1/completions \ @@ -122,7 +122,7 @@ export MY_API_KEY=your_api_key_here export HF_TOKEN=your_hf_token_here # Run log-probability evaluation using eval-factory (nemo-evaluator CLI) -eval-factory run_eval \ +nemo-evaluator run_eval \ --eval_type adlr_arc_challenge_llama \ --model_id megatron_model \ --model_url http://0.0.0.0:8080/v1/completions \ @@ -152,13 +152,13 @@ Use the launcher CLI to discover all available log-probability tasks: ```bash # List all available benchmarks -nv-eval ls tasks +nemo-evaluator-launcher ls tasks # Filter for log-probability tasks -nv-eval ls tasks | grep -E "(arc|hellaswag|winogrande|truthfulqa)" +nemo-evaluator-launcher ls tasks | grep -E "(arc|hellaswag|winogrande|truthfulqa)" # Get detailed information about a specific task (if supported) -nv-eval ls tasks --task adlr_arc_challenge_llama +nemo-evaluator-launcher ls tasks --task adlr_arc_challenge_llama ``` ## How Log-Probability Evaluation Works diff --git a/docs/evaluation/run-evals/safety-security.md b/docs/evaluation/run-evals/safety-security.md index 26825258..16f960f0 100644 --- a/docs/evaluation/run-evals/safety-security.md +++ b/docs/evaluation/run-evals/safety-security.md @@ -35,10 +35,10 @@ Ensure you have: ```bash # List available safety tasks -nv-eval ls tasks | grep -E "(safety|aegis|garak)" +nemo-evaluator-launcher ls tasks | grep -E "(safety|aegis|garak)" # Run Aegis safety evaluation -nv-eval run \ +nemo-evaluator-launcher run \ --config-dir examples \ --config-name local_llama_3_1_8b_instruct \ -o 'evaluation.tasks=["aegis_v2"]' \ @@ -46,7 +46,7 @@ nv-eval run \ -o target.api_endpoint.api_key=${YOUR_API_KEY} # Run safety and security evaluation -nv-eval run \ +nemo-evaluator-launcher run \ --config-dir examples \ --config-name local_llama_3_1_8b_instruct \ -o 'evaluation.tasks=["aegis_v2", "garak"]' @@ -111,7 +111,7 @@ export MY_API_KEY=your_api_key_here export HF_TOKEN=your_hf_token_here # Run safety evaluation -eval-factory run_eval \ +nemo-evaluator run_eval \ --eval_type aegis_v2 \ --model_id meta/llama-3.1-8b-instruct \ --model_url https://integrate.api.nvidia.com/v1/chat/completions \ @@ -156,13 +156,13 @@ Use the launcher CLI to discover all available safety and security tasks: ```bash # List all available benchmarks -nv-eval ls tasks +nemo-evaluator-launcher ls tasks # Filter for safety and security tasks -nv-eval ls tasks | grep -E "(safety|aegis|garak)" +nemo-evaluator-launcher ls tasks | grep -E "(safety|aegis|garak)" # Get detailed information about a specific task (if supported) -nv-eval ls tasks --task aegis_v2 +nemo-evaluator-launcher ls tasks --task aegis_v2 ``` ## Available Safety Tasks diff --git a/docs/evaluation/run-evals/text-gen.md b/docs/evaluation/run-evals/text-gen.md index f53dbc21..d8952c85 100644 --- a/docs/evaluation/run-evals/text-gen.md +++ b/docs/evaluation/run-evals/text-gen.md @@ -53,10 +53,10 @@ For log-probability methods, see the [Log-Probability Evaluation guide](../run-e ```bash # List available text generation tasks -nv-eval ls tasks +nemo-evaluator-launcher ls tasks # Run MMLU Pro evaluation -nv-eval run \ +nemo-evaluator-launcher run \ --config-dir examples \ --config-name local_llama_3_1_8b_instruct \ -o 'evaluation.tasks=["mmlu_pro"]' \ @@ -64,7 +64,7 @@ nv-eval run \ -o 
target.api_endpoint.api_key=${YOUR_API_KEY} # Run multiple text generation benchmarks -nv-eval run \ +nemo-evaluator-launcher run \ --config-dir examples \ --config-name local_text_generation_suite \ -o 'evaluation.tasks=["mmlu_pro", "arc_challenge", "hellaswag", "truthfulqa"]' @@ -123,7 +123,7 @@ docker run --rm -it --gpus all nvcr.io/nvidia/eval-factory/simple-evals:{{ docke export MY_API_KEY=your_api_key_here # Run evaluation -eval-factory run_eval \ +nemo-evaluator run_eval \ --eval_type mmlu_pro \ --model_id meta/llama-3.1-8b-instruct \ --model_url https://integrate.api.nvidia.com/v1/chat/completions \ @@ -183,7 +183,7 @@ Run these commands to discover the complete list of available benchmarks across ``` :::{note} -Task availability depends on installed frameworks. Use `nv-eval ls tasks` to see the complete list for your environment. +Task availability depends on installed frameworks. Use `nemo-evaluator-launcher ls tasks` to see the complete list for your environment. ::: ## Task Naming and Framework Specification @@ -240,10 +240,10 @@ Or use the CLI for programmatic access: ```bash # List all tasks with framework information -nv-eval ls tasks +nemo-evaluator-launcher ls tasks # Filter for specific tasks -nv-eval ls tasks | grep mmlu +nemo-evaluator-launcher ls tasks | grep mmlu ``` This helps you: @@ -431,13 +431,13 @@ api_endpoint=ApiEndpoint( **Symptoms**: Task name not recognized **Solutions**: -- Verify task name with `nv-eval ls tasks` +- Verify task name with `nemo-evaluator-launcher ls tasks` - Check if evaluation framework is installed - Use framework-qualified names for ambiguous tasks (e.g., `lm-evaluation-harness.mmlu`) ```bash # Discover available tasks -nv-eval ls tasks | grep mmlu +nemo-evaluator-launcher ls tasks | grep mmlu ``` :::: diff --git a/docs/get-started/_snippets/README.md b/docs/get-started/_snippets/README.md index 4a81b43b..df79ea26 100644 --- a/docs/get-started/_snippets/README.md +++ b/docs/get-started/_snippets/README.md @@ -87,7 +87,7 @@ export NGC_API_KEY="your-api-key-here" export MY_API_KEY="your-api-key" # For container versions -export DOCKER_TAG="25.08.1" +export DOCKER_TAG="25.09.1" ``` ## Testing Snippets diff --git a/scripts/snippets/arc_challenge.py b/docs/get-started/_snippets/arc_challenge.py similarity index 100% rename from scripts/snippets/arc_challenge.py rename to docs/get-started/_snippets/arc_challenge.py diff --git a/scripts/snippets/bfcl.py b/docs/get-started/_snippets/bfcl.py similarity index 100% rename from scripts/snippets/bfcl.py rename to docs/get-started/_snippets/bfcl.py diff --git a/scripts/snippets/bigcode.py b/docs/get-started/_snippets/bigcode.py similarity index 100% rename from scripts/snippets/bigcode.py rename to docs/get-started/_snippets/bigcode.py diff --git a/docs/get-started/_snippets/container_run.sh b/docs/get-started/_snippets/container_run.sh index ca558b30..c9f0dace 100755 --- a/docs/get-started/_snippets/container_run.sh +++ b/docs/get-started/_snippets/container_run.sh @@ -2,7 +2,7 @@ # Run evaluation using NGC containers directly # Set container version (or use environment variable) -DOCKER_TAG="${DOCKER_TAG:-25.08.1}" +DOCKER_TAG="${DOCKER_TAG:-25.09.1}" export MY_API_KEY="${MY_API_KEY:-your-api-key}" # [snippet-start] @@ -11,7 +11,7 @@ docker run --rm --gpus all \ -v $(pwd)/results:/workspace/results \ -e MY_API_KEY="${MY_API_KEY}" \ nvcr.io/nvidia/eval-factory/simple-evals:${DOCKER_TAG} \ - eval-factory run_eval \ + nemo-evaluator run_eval \ --eval_type mmlu_pro \ --model_url 
https://integrate.api.nvidia.com/v1/chat/completions \ --model_id meta/llama-3.1-8b-instruct \ diff --git a/scripts/snippets/deploy.sh b/docs/get-started/_snippets/deploy.sh similarity index 100% rename from scripts/snippets/deploy.sh rename to docs/get-started/_snippets/deploy.sh diff --git a/scripts/snippets/garak.py b/docs/get-started/_snippets/garak.py similarity index 100% rename from scripts/snippets/garak.py rename to docs/get-started/_snippets/garak.py diff --git a/docs/get-started/_snippets/install_containers.sh b/docs/get-started/_snippets/install_containers.sh index 66197576..2067530b 100755 --- a/docs/get-started/_snippets/install_containers.sh +++ b/docs/get-started/_snippets/install_containers.sh @@ -2,7 +2,7 @@ # Pull pre-built evaluation containers from NVIDIA NGC # Set container version (or use environment variable) -DOCKER_TAG="${DOCKER_TAG:-25.08.1}" +DOCKER_TAG="${DOCKER_TAG:-25.09.1}" # [snippet-start] # Pull evaluation containers (no local installation needed) diff --git a/docs/get-started/_snippets/install_launcher.sh b/docs/get-started/_snippets/install_launcher.sh index 8ac45c5e..9dcd75cc 100755 --- a/docs/get-started/_snippets/install_launcher.sh +++ b/docs/get-started/_snippets/install_launcher.sh @@ -11,9 +11,9 @@ pip install nemo-evaluator-launcher[all] # [snippet-end] # Verify installation -if command -v nv-eval &> /dev/null; then +if command -v nemo-evaluator-launcher &> /dev/null; then echo "✓ NeMo Evaluator Launcher installed successfully" - nv-eval --version + nemo-evaluator-launcher --version else echo "✗ Installation failed" exit 1 diff --git a/scripts/snippets/lambada.py b/docs/get-started/_snippets/lambada.py similarity index 100% rename from scripts/snippets/lambada.py rename to docs/get-started/_snippets/lambada.py diff --git a/docs/get-started/_snippets/launcher_basic.sh b/docs/get-started/_snippets/launcher_basic.sh index f5880529..497a05bc 100755 --- a/docs/get-started/_snippets/launcher_basic.sh +++ b/docs/get-started/_snippets/launcher_basic.sh @@ -6,7 +6,7 @@ export NGC_API_KEY="${NGC_API_KEY:-your-api-key-here}" # [snippet-start] # Run evaluation against a hosted endpoint -nv-eval run \ +nemo-evaluator-launcher run \ --config-dir examples \ --config-name local_llama_3_1_8b_instruct \ -o target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions \ @@ -14,5 +14,5 @@ nv-eval run \ -o execution.output_dir=./results # [snippet-end] -echo "Evaluation started. Use 'nv-eval status ' to check progress." +echo "Evaluation started. Use 'nemo-evaluator-launcher status ' to check progress." diff --git a/docs/get-started/_snippets/launcher_full_example.sh b/docs/get-started/_snippets/launcher_full_example.sh index b625120c..8d2194aa 100755 --- a/docs/get-started/_snippets/launcher_full_example.sh +++ b/docs/get-started/_snippets/launcher_full_example.sh @@ -6,7 +6,7 @@ export NGC_API_KEY="${NGC_API_KEY:-nvapi-your-key-here}" # [snippet-start] # Run a quick test evaluation with limited samples -nv-eval run \ +nemo-evaluator-launcher run \ --config-dir examples \ --config-name local_llama_3_1_8b_instruct \ -o target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions \ @@ -19,6 +19,6 @@ nv-eval run \ # Note: Replace with actual ID from output echo "" echo "Evaluation started! Next steps:" -echo "1. Monitor progress: nv-eval status " +echo "1. Monitor progress: nemo-evaluator-launcher status " echo "2. 
View results: ls -la ./results//" diff --git a/scripts/snippets/safety.py b/docs/get-started/_snippets/safety.py similarity index 100% rename from scripts/snippets/safety.py rename to docs/get-started/_snippets/safety.py diff --git a/scripts/snippets/simple_evals.py b/docs/get-started/_snippets/simple_evals.py similarity index 100% rename from scripts/snippets/simple_evals.py rename to docs/get-started/_snippets/simple_evals.py diff --git a/docs/get-started/_snippets/verify_launcher.sh b/docs/get-started/_snippets/verify_launcher.sh index 0731e228..37202c94 100755 --- a/docs/get-started/_snippets/verify_launcher.sh +++ b/docs/get-started/_snippets/verify_launcher.sh @@ -3,10 +3,10 @@ # [snippet-start] # Verify installation -nv-eval --version +nemo-evaluator-launcher --version # Test basic functionality - list available tasks -nv-eval ls tasks | head -10 +nemo-evaluator-launcher ls tasks | head -10 # [snippet-end] echo "✓ Launcher installed successfully" diff --git a/docs/get-started/install.md b/docs/get-started/install.md index 416f95e2..458e4768 100644 --- a/docs/get-started/install.md +++ b/docs/get-started/install.md @@ -119,7 +119,7 @@ docker run --rm --gpus all \ -v $(pwd)/results:/workspace/results \ -e MY_API_KEY=your-api-key \ nvcr.io/nvidia/eval-factory/simple-evals:{{ docker_compose_latest }} \ - eval-factory run_eval \ + nemo-evaluator run_eval \ --eval_type mmlu_pro \ --model_url https://integrate.api.nvidia.com/v1/chat/completions \ --model_id meta/llama-3.1-8b-instruct \ @@ -132,7 +132,7 @@ Quick verification: ```bash # Test container access docker run --rm nvcr.io/nvidia/eval-factory/simple-evals:{{ docker_compose_latest }} \ - eval-factory ls | head -5 + nemo-evaluator ls | head -5 echo " Container access verified" ``` diff --git a/docs/get-started/quickstart/container.md b/docs/get-started/quickstart/container.md index 51a0db35..1827c82c 100644 --- a/docs/get-started/quickstart/container.md +++ b/docs/get-started/quickstart/container.md @@ -24,7 +24,7 @@ export MY_API_KEY=nvapi-your-key-here export HF_TOKEN=hf_your-token-here # If using Hugging Face models # 4. Run evaluation -eval-factory run_eval \ +nemo-evaluator run_eval \ --eval_type mmlu_pro \ --model_id meta/llama-3.1-8b-instruct \ --model_url https://integrate.api.nvidia.com/v1/chat/completions \ @@ -52,7 +52,7 @@ docker run --rm -it --gpus all \ nvcr.io/nvidia/eval-factory/simple-evals:{{ docker_compose_latest }} bash # 3. 
Inside container - run evaluation -eval-factory run_eval \ +nemo-evaluator run_eval \ --eval_type mmlu_pro \ --model_id meta/llama-3.1-8b-instruct \ --model_url https://integrate.api.nvidia.com/v1/chat/completions \ @@ -76,7 +76,7 @@ docker run --rm --gpus all \ -v $(pwd)/results:/workspace/results \ -e MY_API_KEY="${MY_API_KEY}" \ nvcr.io/nvidia/eval-factory/simple-evals:{{ docker_compose_latest }} \ - eval-factory run_eval \ + nemo-evaluator run_eval \ --eval_type mmlu_pro \ --model_url https://integrate.api.nvidia.com/v1/chat/completions \ --model_id meta/llama-3.1-8b-instruct \ @@ -138,7 +138,7 @@ services: environment: - MY_API_KEY=${NGC_API_KEY} command: | - eval-factory run_eval + nemo-evaluator run_eval --eval_type mmlu_pro --model_id meta/llama-3.1-8b-instruct --model_url https://integrate.api.nvidia.com/v1/chat/completions @@ -164,7 +164,7 @@ for benchmark in "${BENCHMARKS[@]}"; do -e MY_API_KEY=$API_KEY \ -e HF_TOKEN=$HF_TOKEN \ nvcr.io/nvidia/eval-factory/simple-evals:{{ docker_compose_latest }} \ - eval-factory run_eval \ + nemo-evaluator run_eval \ --eval_type $benchmark \ --model_id meta/llama-3.1-8b-instruct \ --model_url https://integrate.api.nvidia.com/v1/chat/completions \ diff --git a/docs/get-started/quickstart/index.md b/docs/get-started/quickstart/index.md index 12305a33..162d912c 100644 --- a/docs/get-started/quickstart/index.md +++ b/docs/get-started/quickstart/index.md @@ -33,6 +33,14 @@ Unified CLI experience with automated container management, built-in orchestrati Programmatic control with full adapter features, custom configurations, and direct API access for integration into existing workflows. ::: +:::{grid-item-card} {octicon}`code;1.5em;sd-mr-1` NeMo Framework Container +:link: gs-quickstart-nemo-fw +:link-type: ref +**For NeMo Framework Users** + +End-to-end training and evaluation of large language models (LLMs). +::: + :::{grid-item-card} {octicon}`container;1.5em;sd-mr-1` Container Direct :link: gs-quickstart-container :link-type: ref @@ -86,13 +94,13 @@ curl -X POST "https://integrate.api.nvidia.com/v1/chat/completions" \ }' # 2. Run a dry-run to validate configuration -nv-eval run \ +nemo-evaluator-launcher run \ --config-dir examples \ --config-name local_llama_3_1_8b_instruct \ --dry-run # 3. 
Run a minimal test with very few samples -nv-eval run \ +nemo-evaluator-launcher run \ --config-dir examples \ --config-name local_llama_3_1_8b_instruct \ -o +config.params.limit_samples=1 \ @@ -134,7 +142,7 @@ docker pull nvcr.io/nvidia/eval-factory/simple-evals:{{ docker_compose_latest }} export NEMO_EVALUATOR_LOG_LEVEL=DEBUG # Check available evaluation types -nv-eval ls tasks +nemo-evaluator-launcher ls tasks ``` ::: @@ -149,7 +157,7 @@ find ./results -name "*.yml" -type f cat ./results///artifacts/results.yml # Or export and view processed results -nv-eval export --dest local --format json +nemo-evaluator-launcher export --dest local --format json cat ./results//processed_results.json ``` ::: @@ -167,10 +175,10 @@ After completing your quickstart: ```bash # List all available tasks -nv-eval ls tasks +nemo-evaluator-launcher ls tasks # Run with limited samples for quick testing -nv-eval run --config-dir examples --config-name local_limit_samples +nemo-evaluator-launcher run --config-dir examples --config-name local_limit_samples ``` ::: @@ -179,16 +187,16 @@ nv-eval run --config-dir examples --config-name local_limit_samples ```bash # Export to MLflow -nv-eval export --dest mlflow +nemo-evaluator-launcher export --dest mlflow # Export to Weights & Biases -nv-eval export --dest wandb +nemo-evaluator-launcher export --dest wandb # Export to Google Sheets -nv-eval export --dest gsheets +nemo-evaluator-launcher export --dest gsheets # Export to local files -nv-eval export --dest local --format json +nemo-evaluator-launcher export --dest local --format json ``` ::: @@ -197,10 +205,10 @@ nv-eval export --dest local --format json ```bash # Run on Slurm cluster -nv-eval run --config-dir examples --config-name slurm_llama_3_1_8b_instruct +nemo-evaluator-launcher run --config-dir examples --config-name slurm_llama_3_1_8b_instruct # Run on Lepton AI -nv-eval run --config-dir examples --config-name lepton_vllm_llama_3_1_8b_instruct +nemo-evaluator-launcher run --config-dir examples --config-name lepton_vllm_llama_3_1_8b_instruct ``` ::: @@ -210,10 +218,10 @@ nv-eval run --config-dir examples --config-name lepton_vllm_llama_3_1_8b_instruc | Task | Command | |------|---------| -| List benchmarks | `nv-eval ls tasks` | -| Run evaluation | `nv-eval run --config-dir examples --config-name ` | -| Check status | `nv-eval status ` | -| Export results | `nv-eval export --dest local --format json` | +| List benchmarks | `nemo-evaluator-launcher ls tasks` | +| Run evaluation | `nemo-evaluator-launcher run --config-dir examples --config-name ` | +| Check status | `nemo-evaluator-launcher status ` | +| Export results | `nemo-evaluator-launcher export --dest local --format json` | | Dry run | Add `--dry-run` to any run command | | Test with limited samples | Add `-o +config.params.limit_samples=3` | @@ -223,5 +231,6 @@ nv-eval run --config-dir examples --config-name lepton_vllm_llama_3_1_8b_instruc NeMo Evaluator Launcher NeMo Evaluator Core +NeMo Framework Container Container Direct ``` diff --git a/docs/get-started/quickstart/launcher.md b/docs/get-started/quickstart/launcher.md index ab525c41..7156162a 100644 --- a/docs/get-started/quickstart/launcher.md +++ b/docs/get-started/quickstart/launcher.md @@ -17,7 +17,7 @@ The NeMo Evaluator Launcher provides the simplest way to run evaluations with au pip install nemo-evaluator-launcher # 2. List available benchmarks -nv-eval ls tasks +nemo-evaluator-launcher ls tasks # 3. 
Run evaluation against a hosted endpoint ``` @@ -30,7 +30,7 @@ nv-eval ls tasks ```bash # 4. Check status and results -nv-eval status +nemo-evaluator-launcher status ``` ## Complete Working Example @@ -77,7 +77,7 @@ Here's a complete example using NVIDIA Build (build.nvidia.com): ## Next Steps -- Explore different evaluation types: `nv-eval ls tasks` +- Explore different evaluation types: `nemo-evaluator-launcher ls tasks` - Try advanced configurations in the `examples/` directory - Export results to your preferred tracking platform - Scale to cluster execution with Slurm or cloud providers diff --git a/docs/get-started/quickstart/nemo-fw.md b/docs/get-started/quickstart/nemo-fw.md new file mode 100644 index 00000000..b978e5c3 --- /dev/null +++ b/docs/get-started/quickstart/nemo-fw.md @@ -0,0 +1,118 @@ +(gs-quickstart-nemo-fw)= +# Evaluate checkpoints trained by NeMo Framework + +The NeMo Framework is NVIDIA’s GPU-accelerated, end-to-end training platform for large language models (LLMs), multimodal models, and speech models. It enables seamless scaling of both pretraining and post-training workloads, from a single GPU to clusters with thousands of nodes, supporting Hugging Face/PyTorch and Megatron models. NeMo includes a suite of libraries and curated training recipes to help users build models from start to finish. + +The NeMo Evaluator is integrated within NeMo Framework, offering streamlined deployment and advanced evaluation capabilities for models trained using NeMo, leveraging state-of-the-art evaluation harnesses. + + +## Prerequisites + +- Docker with GPU support +- NeMo Framework docker container + +## Quick Start + +### 1. Start NeMo Framework Container + +For optimal performance and user experience, use the latest version of the [NeMo Framework container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo/tags). Please fetch the most recent `$TAG` and run the following command to start a container: + +```bash +docker run --rm -it -w /workdir -v $(pwd):/workdir \ + --entrypoint bash \ + --gpus all \ + nvcr.io/nvidia/nemo:${TAG} +``` + +### 2. Deploy a Model + +```bash +# Deploy a NeMo checkpoint +python \ + /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_inframework.py \ + --nemo_checkpoint "/path/to/your/checkpoint" \ + --model_id megatron_model \ + --port 8080 \ + --host 0.0.0.0 +``` + +### 3. 
Evaluate the Model + +```python +from nemo_evaluator.api import evaluate +from nemo_evaluator.api.api_dataclasses import ApiEndpoint, EvaluationConfig, EvaluationTarget + +# Configure evaluation +api_endpoint = ApiEndpoint( + url="http://0.0.0.0:8080/v1/completions/", + type="completions", + model_id="megatron_model" +) +target = EvaluationTarget(api_endpoint=api_endpoint) +config = EvaluationConfig(type="gsm8k", output_dir="results") + +# Run evaluation +results = evaluate(target_cfg=target, eval_cfg=config) +print(results) +``` + + +## Key Features + +- **Multi-Backend Deployment**: Supports PyTriton and multi-instance evaluations using the Ray Serve deployment backend +- **Production-Ready**: Supports high-performance inference with CUDA graphs and flash decoding +- **Multi-GPU and Multi-Node Support**: Enables distributed inference across multiple GPUs and compute nodes +- **OpenAI-Compatible API**: Provides RESTful endpoints aligned with OpenAI API specifications +- **Comprehensive Evaluation**: Includes state-of-the-art evaluation harnesses for academic benchmarks, reasoning benchmarks, code generation, and safety testing +- **Adapter System**: Benefits from NeMo Evaluator's Adapter System for customizable request and response processing + +## Advanced Usage Patterns + +### Evaluate LLMs Using Log-Probabilities + +```{literalinclude} ../_snippets/arc_challenge.py +:language: python +:start-after: "## Run the evaluation" +``` + +### Multi-Instance Deployment with Ray + +Deploy multiple instances of your model: + +```shell +python \ + /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_inframework.py \ + --nemo_checkpoint "meta-llama/Llama-3.1-8B" \ + --model_id "megatron_model" \ + --port 8080 \ # Ray server port + --num_gpus 4 \ # Total GPUs available + --num_replicas 2 \ # Number of model replicas + --tensor_model_parallel_size 2 \ # Tensor parallelism per replica + --pipeline_model_parallel_size 1 \ # Pipeline parallelism per replica + --context_parallel_size 1 # Context parallelism per replica +``` + +Run evaluations with increased parallelism: + +```python +from nemo_evaluator.api import check_endpoint, evaluate +from nemo_evaluator.api.api_dataclasses import EvaluationConfig, ApiEndpoint, EvaluationTarget, ConfigParams + +# Configure the evaluation target +api_endpoint = ApiEndpoint( + url="http://0.0.0.0:8080/v1/completions/", + type="completions", + model_id="megatron_model", +) +eval_target = EvaluationTarget(api_endpoint=api_endpoint) +eval_params = ConfigParams(top_p=0, temperature=0, parallelism=2) +eval_config = EvaluationConfig(type='mmlu', params=eval_params, output_dir="results") + +if __name__ == "__main__": + check_endpoint( + endpoint_url=eval_target.api_endpoint.url, + endpoint_type=eval_target.api_endpoint.type, + model_name=eval_target.api_endpoint.model_id, + ) + evaluate(target_cfg=eval_target, eval_cfg=eval_config) +``` diff --git a/docs/libraries/index.md b/docs/libraries/index.md index f04d1a20..0a533b70 100644 --- a/docs/libraries/index.md +++ b/docs/libraries/index.md @@ -14,7 +14,7 @@ Select a library for your evaluation workflow: **Start here** - Unified CLI and Python API for running evaluations across local, cluster, and hosted environments. 
::: -:::{grid-item-card} {octicon}`code;1.5em;sd-mr-1` NeMo Evaluator +:::{grid-item-card} {octicon}`beaker;1.5em;sd-mr-1` NeMo Evaluator :link: nemo-evaluator/index :link-type: doc diff --git a/docs/libraries/nemo-evaluator-launcher/cli.md b/docs/libraries/nemo-evaluator-launcher/cli.md index 843082a3..b1ada874 100644 --- a/docs/libraries/nemo-evaluator-launcher/cli.md +++ b/docs/libraries/nemo-evaluator-launcher/cli.md @@ -1,15 +1,12 @@ -# NeMo Evaluator Launcher CLI Reference (nv-eval) +# NeMo Evaluator Launcher CLI Reference (nemo-evaluator-launcher) -The NeMo Evaluator Launcher provides a command-line interface for running evaluations, managing jobs, and exporting results. The CLI is available through two commands: - -- `nv-eval` (short alias, recommended) -- `nemo-evaluator-launcher` (full command name) +The NeMo Evaluator Launcher provides a command-line interface for running evaluations, managing jobs, and exporting results. The CLI is available through `nemo-evaluator-launcher` command. ## Global Options ```bash -nv-eval --help # Show help -nv-eval --version # Show version information +nemo-evaluator-launcher --help # Show help +nemo-evaluator-launcher --version # Show version information ``` ## Commands Overview @@ -42,10 +39,10 @@ Execute evaluations using Hydra configuration management. ```bash # Using example configurations -nv-eval run --config-dir examples --config-name local_llama_3_1_8b_instruct +nemo-evaluator-launcher run --config-dir examples --config-name local_llama_3_1_8b_instruct # With output directory override -nv-eval run --config-dir examples --config-name local_llama_3_1_8b_instruct \ +nemo-evaluator-launcher run --config-dir examples --config-name local_llama_3_1_8b_instruct \ -o execution.output_dir=/path/to/results ``` @@ -53,10 +50,10 @@ nv-eval run --config-dir examples --config-name local_llama_3_1_8b_instruct \ ```bash # Using custom config directory -nv-eval run --config-dir my_configs --config-name my_evaluation +nemo-evaluator-launcher run --config-dir my_configs --config-name my_evaluation # Multiple overrides (Hydra syntax) -nv-eval run --config-dir examples --config-name local_llama_3_1_8b_instruct \ +nemo-evaluator-launcher run --config-dir examples --config-name local_llama_3_1_8b_instruct \ -o execution.output_dir=results \ -o target.api_endpoint.model_id=my-model \ -o +config.params.limit_samples=10 @@ -67,7 +64,7 @@ nv-eval run --config-dir examples --config-name local_llama_3_1_8b_instruct \ Preview the full resolved configuration without executing: ```bash -nv-eval run --config-dir examples --config-name local_llama_3_1_8b_instruct --dry-run +nemo-evaluator-launcher run --config-dir examples --config-name local_llama_3_1_8b_instruct --dry-run ``` ### Test Runs @@ -75,7 +72,7 @@ nv-eval run --config-dir examples --config-name local_llama_3_1_8b_instruct --dr Run with limited samples for testing: ```bash -nv-eval run --config-dir examples --config-name local_llama_3_1_8b_instruct \ +nemo-evaluator-launcher run --config-dir examples --config-name local_llama_3_1_8b_instruct \ -o +config.params.limit_samples=10 ``` @@ -84,14 +81,14 @@ nv-eval run --config-dir examples --config-name local_llama_3_1_8b_instruct \ **Local Execution:** ```bash -nv-eval run --config-dir examples --config-name local_llama_3_1_8b_instruct \ +nemo-evaluator-launcher run --config-dir examples --config-name local_llama_3_1_8b_instruct \ -o execution.output_dir=./local_results ``` **Slurm Execution:** ```bash -nv-eval run --config-dir examples --config-name 
slurm_llama_3_1_8b_instruct \ +nemo-evaluator-launcher run --config-dir examples --config-name slurm_llama_3_1_8b_instruct \ -o execution.output_dir=/shared/results ``` @@ -99,10 +96,10 @@ nv-eval run --config-dir examples --config-name slurm_llama_3_1_8b_instruct \ ```bash # With model deployment -nv-eval run --config-dir examples --config-name lepton_nim_llama_3_1_8b_instruct +nemo-evaluator-launcher run --config-dir examples --config-name lepton_nim_llama_3_1_8b_instruct # Using existing endpoint -nv-eval run --config-dir examples --config-name lepton_none_llama_3_1_8b_instruct +nemo-evaluator-launcher run --config-dir examples --config-name lepton_none_llama_3_1_8b_instruct ``` ## status - Check Job Status @@ -113,13 +110,13 @@ Check the status of running or completed evaluations. ```bash # Check status of specific invocation (returns all jobs in that invocation) -nv-eval status abc12345 +nemo-evaluator-launcher status abc12345 # Check status of specific job -nv-eval status abc12345.0 +nemo-evaluator-launcher status abc12345.0 # Output as JSON -nv-eval status abc12345 --json +nemo-evaluator-launcher status abc12345 --json ``` ### Output Formats @@ -165,10 +162,10 @@ Stop running evaluations. ```bash # Kill entire invocation -nv-eval kill abc12345 +nemo-evaluator-launcher kill abc12345 # Kill specific job -nv-eval kill abc12345.0 +nemo-evaluator-launcher kill abc12345.0 ``` The command outputs JSON with the results of the kill operation. @@ -181,10 +178,10 @@ List available tasks or runs. ```bash # List all available evaluation tasks -nv-eval ls tasks +nemo-evaluator-launcher ls tasks # List tasks with JSON output -nv-eval ls tasks --json +nemo-evaluator-launcher ls tasks --json ``` **Output Format:** @@ -210,17 +207,17 @@ winogrande completions ```bash # List recent evaluation runs -nv-eval ls runs +nemo-evaluator-launcher ls runs # Limit number of results -nv-eval ls runs --limit 10 +nemo-evaluator-launcher ls runs --limit 10 # Filter by executor -nv-eval ls runs --executor local +nemo-evaluator-launcher ls runs --executor local # Filter by date -nv-eval ls runs --since "2024-01-01" -nv-eval ls runs --since "2024-01-01T12:00:00" +nemo-evaluator-launcher ls runs --since "2024-01-01" +nemo-evaluator-launcher ls runs --since "2024-01-01T12:00:00" ``` **Output Format:** @@ -239,47 +236,46 @@ Export evaluation results to various destinations. 
```bash # Export to local files (JSON format) -nv-eval export abc12345 --dest local --format json +nemo-evaluator-launcher export abc12345 --dest local --format json # Export to specific directory -nv-eval export abc12345 --dest local --format json --output-dir ./results +nemo-evaluator-launcher export abc12345 --dest local --format json --output-dir ./results # Specify custom filename -nv-eval export abc12345 --dest local --format json --output-filename results.json +nemo-evaluator-launcher export abc12345 --dest local --format json --output-filename results.json ``` ### Export Options ```bash # Available destinations -nv-eval export abc12345 --dest local # Local file system -nv-eval export abc12345 --dest mlflow # MLflow tracking -nv-eval export abc12345 --dest wandb # Weights & Biases -nv-eval export abc12345 --dest gsheets # Google Sheets -nv-eval export abc12345 --dest jet # JET (internal) +nemo-evaluator-launcher export abc12345 --dest local # Local file system +nemo-evaluator-launcher export abc12345 --dest mlflow # MLflow tracking +nemo-evaluator-launcher export abc12345 --dest wandb # Weights & Biases +nemo-evaluator-launcher export abc12345 --dest gsheets # Google Sheets # Format options (for local destination only) -nv-eval export abc12345 --dest local --format json -nv-eval export abc12345 --dest local --format csv +nemo-evaluator-launcher export abc12345 --dest local --format json +nemo-evaluator-launcher export abc12345 --dest local --format csv # Include logs when exporting -nv-eval export abc12345 --dest local --format json --copy-logs +nemo-evaluator-launcher export abc12345 --dest local --format json --copy-logs # Filter metrics by name -nv-eval export abc12345 --dest local --format json --log-metrics score --log-metrics accuracy +nemo-evaluator-launcher export abc12345 --dest local --format json --log-metrics score --log-metrics accuracy # Copy all artifacts (not just required ones) -nv-eval export abc12345 --dest local --only-required False +nemo-evaluator-launcher export abc12345 --dest local --only-required False ``` ### Exporting Multiple Invocations ```bash # Export several runs together -nv-eval export abc12345 def67890 ghi11111 --dest local --format json +nemo-evaluator-launcher export abc12345 def67890 ghi11111 --dest local --format json # Export several runs with custom output -nv-eval export abc12345 def67890 --dest local --format csv \ +nemo-evaluator-launcher export abc12345 def67890 --dest local --format csv \ --output-dir ./all-results --output-filename combined.csv ``` @@ -293,10 +289,10 @@ Display version and build information. ```bash # Show version -nv-eval version +nemo-evaluator-launcher version # Alternative -nv-eval --version +nemo-evaluator-launcher --version ``` ## Environment Variables @@ -310,12 +306,9 @@ The CLI respects environment variables for logging and task-specific authenticat * - Variable - Description - Default -* - `NEMO_EVALUATOR_LOG_LEVEL` +* - `LOG_LEVEL` - Logging level for the launcher (DEBUG, INFO, WARNING, ERROR, CRITICAL) - `WARNING` -* - `LOG_LEVEL` - - Alternative log level variable - - Uses `NEMO_EVALUATOR_LOG_LEVEL` if set * - `LOG_DISABLE_REDACTION` - Disable credential redaction in logs (set to 1, true, or yes) - Not set @@ -331,7 +324,7 @@ export HF_TOKEN="hf_..." # For Hugging Face datasets export API_KEY="nvapi-..." 
# For NVIDIA API endpoints # Run evaluation -nv-eval run --config-dir examples --config-name local_llama_3_1_8b_instruct +nemo-evaluator-launcher run --config-dir examples --config-name local_llama_3_1_8b_instruct ``` The specific environment variables required depend on the tasks and endpoints you're using. Refer to the example configuration files for details on which variables are needed. @@ -342,8 +335,13 @@ The NeMo Evaluator Launcher includes several example configuration files that de - `local_llama_3_1_8b_instruct.yaml` - Local execution with an existing endpoint - `local_limit_samples.yaml` - Local execution with limited samples for testing +- `local_nvidia_nemotron_nano_9b_v2.yaml` - Local execution with Nvidia Nemotron Nano 9B v2 +- `local_auto_export_llama_3_1_8b_instruct.yaml` - Local execution with auto-export for Llama 3.1 8B +- `local_custom_config_seed_oss_36b_instruct.yaml` - Local execution with advanced interceptors - `slurm_llama_3_1_8b_instruct.yaml` - Slurm execution with model deployment +- `slurm_llama_3_1_8b_instruct_hf.yaml` - Slurm execution with deployment using Hugging Face model handle - `slurm_no_deployment_llama_3_1_8b_instruct.yaml` - Slurm execution with existing endpoint +- `slurm_no_deployment_llama_nemotron_super_v1_nemotron_benchmarks.yaml` - Slurm execution with Llama-3.3-Nemotron-Super - `lepton_nim_llama_3_1_8b_instruct.yaml` - Lepton AI execution with NIM deployment - `lepton_vllm_llama_3_1_8b_instruct.yaml` - Lepton AI execution with vLLM deployment - `lepton_none_llama_3_1_8b_instruct.yaml` - Lepton AI execution with existing endpoint @@ -356,7 +354,7 @@ cp examples/local_llama_3_1_8b_instruct.yaml my_config.yaml # Edit the configuration as needed # Then run with your config -nv-eval run --config-dir . --config-name my_config +nemo-evaluator-launcher run --config-dir . --config-name my_config ``` Refer to the {ref}`configuration documentation ` for detailed information on all available configuration options. 
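For orientation, a pared-down custom configuration assembled from the pieces shown above might look like the following sketch. Treat it as an illustration only: the endpoint URL, model ID, and task names are placeholders borrowed from the examples in this guide, and the configuration documentation linked above remains the authoritative schema.

```yaml
# my_config.yaml: a minimal sketch of a custom launcher configuration
defaults:
  - execution: local        # where to run: local, lepton, slurm
  - deployment: none        # evaluate an existing OpenAI-compatible endpoint
  - _self_

execution:
  output_dir: ./results

target:
  api_endpoint:
    url: https://integrate.api.nvidia.com/v1/chat/completions
    model_id: meta/llama-3.1-8b-instruct
    api_key_name: NGC_API_KEY   # name of the environment variable holding the key

evaluation:
  tasks:
    - name: mmlu_pro
    - name: gsm8k
```

Run it with `nemo-evaluator-launcher run --config-dir . --config-name my_config`, exactly as in the copy-and-edit workflow above.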
@@ -369,7 +367,7 @@ Refer to the {ref}`configuration documentation ` for det ```bash # Validate configuration without running -nv-eval run --config-dir examples --config-name my_config --dry-run +nemo-evaluator-launcher run --config-dir examples --config-name my_config --dry-run ``` **Permission Errors:** @@ -379,7 +377,7 @@ nv-eval run --config-dir examples --config-name my_config --dry-run ls -la examples/my_config.yaml # Use absolute paths -nv-eval run --config-dir /absolute/path/to/configs --config-name my_config +nemo-evaluator-launcher run --config-dir /absolute/path/to/configs --config-name my_config ``` **Network Issues:** @@ -395,12 +393,12 @@ curl -X POST http://localhost:8000/v1/chat/completions \ ```bash # Set log level to DEBUG for detailed output -export NEMO_EVALUATOR_LOG_LEVEL=DEBUG -nv-eval run --config-dir examples --config-name local_llama_3_1_8b_instruct +export LOG_LEVEL=DEBUG +nemo-evaluator-launcher run --config-dir examples --config-name local_llama_3_1_8b_instruct # Or use single-letter shorthand -export NEMO_EVALUATOR_LOG_LEVEL=D -nv-eval run --config-dir examples --config-name local_llama_3_1_8b_instruct +export LOG_LEVEL=D +nemo-evaluator-launcher run --config-dir examples --config-name local_llama_3_1_8b_instruct # Logs are written to ~/.nemo-evaluator/logs/ ``` @@ -409,12 +407,12 @@ nv-eval run --config-dir examples --config-name local_llama_3_1_8b_instruct ```bash # Command-specific help -nv-eval run --help -nv-eval export --help -nv-eval ls --help +nemo-evaluator-launcher run --help +nemo-evaluator-launcher export --help +nemo-evaluator-launcher ls --help # General help -nv-eval --help +nemo-evaluator-launcher --help ``` ## See Also diff --git a/docs/libraries/nemo-evaluator-launcher/configuration/deployment/generic.md b/docs/libraries/nemo-evaluator-launcher/configuration/deployment/generic.md index 485520f7..22eebb6f 100644 --- a/docs/libraries/nemo-evaluator-launcher/configuration/deployment/generic.md +++ b/docs/libraries/nemo-evaluator-launcher/configuration/deployment/generic.md @@ -1,4 +1,4 @@ -(deployment-gemeric)= +(deployment-generic)= # Generic Deployment diff --git a/docs/libraries/nemo-evaluator-launcher/configuration/deployment/index.md b/docs/libraries/nemo-evaluator-launcher/configuration/deployment/index.md index 19f4f1f7..29be3a9e 100644 --- a/docs/libraries/nemo-evaluator-launcher/configuration/deployment/index.md +++ b/docs/libraries/nemo-evaluator-launcher/configuration/deployment/index.md @@ -20,7 +20,7 @@ Choose the deployment type for your evaluation: Use existing API endpoints. No model deployment needed. ::: -:::{grid-item-card} {octicon}`rocket;1.5em;sd-mr-1` vLLM +:::{grid-item-card} {octicon}`broadcast;1.5em;sd-mr-1` vLLM :link: vllm :link-type: doc @@ -34,13 +34,30 @@ Deploy models using the vLLM serving framework. Deploy models using the SGLang serving framework. ::: -:::{grid-item-card} {octicon}`shield;1.5em;sd-mr-1` NIM +:::{grid-item-card} {octicon}`cpu;1.5em;sd-mr-1` NIM :link: nim :link-type: doc Deploy models using NVIDIA Inference Microservices. ::: + +:::{grid-item-card} {octicon}`server;1.5em;sd-mr-1` TRT-LLM +:link: trtllm +:link-type: doc + + +Deploy models using NVIDIA TensorRT LLM. +::: + +:::{grid-item-card} {octicon}`package;1.5em;sd-mr-1` Generic +:link: generic +:link-type: doc + + +Deploy models using a fully custom setup. 
+::: + :::: ## Quick Reference @@ -58,5 +75,7 @@ deployment: vLLM SGLang NIM +TensorRT-LLM +Generic None (External) ``` diff --git a/docs/libraries/nemo-evaluator-launcher/configuration/deployment/trtllm.md b/docs/libraries/nemo-evaluator-launcher/configuration/deployment/trtllm.md new file mode 100644 index 00000000..448303f0 --- /dev/null +++ b/docs/libraries/nemo-evaluator-launcher/configuration/deployment/trtllm.md @@ -0,0 +1,76 @@ +(deployment-trtllm)= + +# TensorRT LLM (TRT-LLM) Deployment + +Configure TRT-LLM as the deployment backend for serving models during evaluation. + +## Configuration Parameters + +### Basic Settings + +```yaml +deployment: + type: trtllm + image: nvcr.io/nvidia/tensorrt-llm/release:1.0.0 + checkpoint_path: /path/to/model + served_model_name: your-model-name + port: 8000 +``` + +### Parallelism Configuration + +```yaml +deployment: + tensor_parallel_size: 4 + pipeline_parallel_size: 1 +``` + +- **tensor_parallel_size**: Number of GPUs to split the model across (default: 4) +- **pipeline_parallel_size**: Number of pipeline stages (default: 1) + +### Extra Arguments and Endpoints + +```yaml +deployment: + extra_args: "--ep_size 2" + + endpoints: + chat: /v1/chat/completions + completions: /v1/completions + health: /health +``` + +The `extra_args` field passes extra arguments to the `trtllm-serve serve ` command. + +## Complete Example + +```yaml +defaults: + - execution: slurm/default + - deployment: trtllm + - _self_ + +deployment: + checkpoint_path: /path/to/checkpoint + served_model_name: llama-3.1-8b-instruct + tensor_parallel_size: 1 + extra_args: "" + +execution: + account: your-account + output_dir: /path/to/output + walltime: 02:00:00 + +evaluation: + tasks: + - name: ifeval + - name: gpqa_diamond +``` + + + +Use `nemo-evaluator-launcher run --dry-run` to check your configuration before running. diff --git a/docs/libraries/nemo-evaluator-launcher/configuration/executors/lepton.md b/docs/libraries/nemo-evaluator-launcher/configuration/executors/lepton.md index 194e4369..e790eeb9 100644 --- a/docs/libraries/nemo-evaluator-launcher/configuration/executors/lepton.md +++ b/docs/libraries/nemo-evaluator-launcher/configuration/executors/lepton.md @@ -79,9 +79,7 @@ graph TD Lepton executor configurations require: - **Execution backend**: `execution: lepton/default` -- **Deployment type**: One of `vllm`, `sglang`, `nim`, or `none` - **Lepton platform settings**: Node groups, resource shapes, secrets, and storage mounts -- **Evaluation tasks**: List of tasks to run Refer to the complete working examples in the `examples/` directory: diff --git a/docs/libraries/nemo-evaluator-launcher/configuration/executors/local.md b/docs/libraries/nemo-evaluator-launcher/configuration/executors/local.md index 8b46f167..5f8033c4 100644 --- a/docs/libraries/nemo-evaluator-launcher/configuration/executors/local.md +++ b/docs/libraries/nemo-evaluator-launcher/configuration/executors/local.md @@ -62,7 +62,30 @@ The Local executor uses Docker volume mounts for data persistence: ### Docker Volumes - **Results Mount**: Each task's artifacts directory mounts as `/results` in evaluation containers -- **No Custom Mounts**: Local executor doesn't support custom volume mounts +- **Custom Mounts**: Use to `extra_docker_args` field to define custom volume mounts (see [Advanced configuration](#advanced-configuration) ) + +## Advanced configuration + +You can customize your local executor by specifying `extra_docker_args`. 
+This parameter allows you to pass any flag to the `docker run` command that is executed by the NeMo Evaluator Launcher. +You can use it to mount additional volumes, set environment variables or customize your network settings. + +For example, if you would like your job to use a specific docker network, you can specify: + +```yaml +execution: + extra_docker_args: "--network my-custom-network" +``` + +Replace `my-custom-network` with `host` to access the host network. + +To mount additional custom volumes, do: + +```yaml +execution: + extra_docker_args: "--volume /my/local/path:/my/container/path" +``` + ## Rerunning Evaluations @@ -91,7 +114,6 @@ bash run.sh ## Key Features - **Docker-based execution**: Isolated, reproducible runs -- **OpenAI-compatible endpoint support**: Works with any OpenAI-compatible endpoint - **Script generation**: Reusable scripts for rerunning evaluations - **Real-time logs**: Status tracking via log files diff --git a/docs/libraries/nemo-evaluator-launcher/configuration/index.md b/docs/libraries/nemo-evaluator-launcher/configuration/index.md index 0f7afd19..4993e898 100644 --- a/docs/libraries/nemo-evaluator-launcher/configuration/index.md +++ b/docs/libraries/nemo-evaluator-launcher/configuration/index.md @@ -27,7 +27,7 @@ Every configuration has four main sections: ```yaml defaults: - execution: local # Where to run: local, lepton, slurm - - deployment: none # How to deploy: none, vllm, sglang, nim + - deployment: none # How to deploy: none, vllm, sglang, nim, trtllm, generic - _self_ execution: @@ -58,7 +58,7 @@ Choose how to serve your model for evaluation: Use existing API endpoints like NVIDIA API Catalog, OpenAI, or custom deployments. No model deployment needed. ::: -:::{grid-item-card} {octicon}`rocket;1.5em;sd-mr-1` vLLM +:::{grid-item-card} {octicon}`broadcast;1.5em;sd-mr-1` vLLM :link: deployment/vllm :link-type: doc @@ -72,13 +72,29 @@ High-performance LLM serving with advanced parallelism strategies. Best for prod Fast serving framework optimized for structured generation and high-throughput inference with efficient memory usage. ::: -:::{grid-item-card} {octicon}`shield;1.5em;sd-mr-1` NIM +:::{grid-item-card} {octicon}`cpu;1.5em;sd-mr-1` NIM :link: deployment/nim :link-type: doc NVIDIA-optimized inference microservices with automatic scaling, optimization, and enterprise-grade features. ::: +:::{grid-item-card} {octicon}`server;1.5em;sd-mr-1` TRT-LLM +:link: deployment/trtllm +:link-type: doc + + +NVIDIA TensorRT LLM. +::: + +:::{grid-item-card} {octicon}`package;1.5em;sd-mr-1` Generic +:link: deployment/generic +:link-type: doc + + +Deploy models using a fully custom setup. 
+::: + :::: ## Execution Platforms diff --git a/docs/libraries/nemo-evaluator-launcher/exporters/gsheets.md b/docs/libraries/nemo-evaluator-launcher/exporters/gsheets.md index 76f04f17..767f5f91 100644 --- a/docs/libraries/nemo-evaluator-launcher/exporters/gsheets.md +++ b/docs/libraries/nemo-evaluator-launcher/exporters/gsheets.md @@ -19,11 +19,12 @@ Export results from a specific evaluation run to Google Sheets: ```bash # Export results using default spreadsheet name -nv-eval export 8abcd123 --dest gsheets +nemo-evaluator-launcher export 8abcd123 --dest gsheets -# Export with custom spreadsheet name and service account -nv-eval export 8abcd123 --dest gsheets \ - --config '{"spreadsheet_name": "My Evaluation Results", "service_account_file": "/path/to/service-account.json"}' +# Export with custom spreadsheet name and ID +nemo-evaluator-launcher export 8abcd123 --dest gsheets \ + -o export.gsheets.spreadsheet_name="My Results" \ + -o export.gsheets.spreadsheet_id=1ABC...XYZ ``` ::: @@ -58,6 +59,28 @@ export_results( ::: +:::{tab-item} YAML Config + +Configure Google Sheets export in your evaluation YAML file for automatic export on completion: + +```yaml +execution: + auto_export: + destinations: ["gsheets"] + + # Export-related env vars (optional for GSheets) + env_vars: + export: + PATH: "/path/to/conda/env/bin:$PATH" + +export: + gsheets: + spreadsheet_name: "LLM Evaluation Results" + spreadsheet_id: "1ABC...XYZ" # optional: use existing sheet + service_account_file: "/path/to/service-account.json" + log_metrics: ["accuracy", "pass@1"] +``` + :::: ## Key Configuration @@ -76,8 +99,12 @@ export_results( - Uses default credentials if omitted * - `spreadsheet_name` - str, optional - - Target spreadsheet name + - Target spreadsheet name. Used to open existing sheets or name new ones. - Default: "NeMo Evaluator Launcher Results" +* - `spreadsheet_id` + - str, optional + - Target spreadsheet ID. Find it in the spreadsheet URL: `https://docs.google.com/spreadsheets/d//edit` + - Required if your service account can't create sheets due to quota limits. * - `log_metrics` - list[str], optional - Filter metrics to log diff --git a/docs/libraries/nemo-evaluator-launcher/exporters/index.md b/docs/libraries/nemo-evaluator-launcher/exporters/index.md index 26414ecc..bd1dcdfe 100644 --- a/docs/libraries/nemo-evaluator-launcher/exporters/index.md +++ b/docs/libraries/nemo-evaluator-launcher/exporters/index.md @@ -11,7 +11,7 @@ Exporters move evaluation results and artifacts from completed runs to external :::{tab-item} CLI ```bash -nv-eval export [ ...] \ +nemo-evaluator-launcher export [ ...] 
\ --dest \ [options] ``` diff --git a/docs/libraries/nemo-evaluator-launcher/exporters/local.md b/docs/libraries/nemo-evaluator-launcher/exporters/local.md index d55a1470..ba80a8c6 100644 --- a/docs/libraries/nemo-evaluator-launcher/exporters/local.md +++ b/docs/libraries/nemo-evaluator-launcher/exporters/local.md @@ -16,16 +16,16 @@ Export artifacts and generate summary reports locally: ```bash # Basic export to current directory -nv-eval export 8abcd123 --dest local +nemo-evaluator-launcher export 8abcd123 --dest local # Export with JSON summary to custom directory -nv-eval export 8abcd123 --dest local --format json --output-dir ./evaluation-results/ +nemo-evaluator-launcher export 8abcd123 --dest local --format json --output-dir ./evaluation-results/ # Export multiple runs with CSV summary and logs included -nv-eval export 8abcd123 9def4567 --dest local --format csv --copy-logs --output-dir ./results +nemo-evaluator-launcher export 8abcd123 9def4567 --dest local --format csv --copy-logs --output-dir ./results # Export only specific metrics to a custom filename -nv-eval export 8abcd123 --dest local --format json --log-metrics accuracy --log-metrics bleu --output-filename model_metrics.json +nemo-evaluator-launcher export 8abcd123 --dest local --format json --log-metrics accuracy --log-metrics bleu --output-filename model_metrics.json ``` ::: @@ -75,6 +75,21 @@ export_results( ::: +:::{tab-item} YAML Config + +Configure local export in your evaluation YAML file for automatic export on completion: + +```yaml +execution: + auto_export: + destinations: ["local"] + +export: + local: + format: "json" + output_dir: "./results" +``` + :::: ## Key Configuration diff --git a/docs/libraries/nemo-evaluator-launcher/exporters/mlflow.md b/docs/libraries/nemo-evaluator-launcher/exporters/mlflow.md index 57a54580..5f05abc3 100644 --- a/docs/libraries/nemo-evaluator-launcher/exporters/mlflow.md +++ b/docs/libraries/nemo-evaluator-launcher/exporters/mlflow.md @@ -21,19 +21,26 @@ Configure MLflow export to run automatically after evaluation completes. 
Add MLf execution: auto_export: destinations: ["mlflow"] - configs: - mlflow: - tracking_uri: "http://mlflow.example.com:5000" - experiment_name: "llm-evaluation" - description: "Llama 3.1 8B evaluation" - log_metrics: ["accuracy", "f1"] - tags: - model_family: "llama" - version: "3.1" - extra_metadata: - hardware: "A100" - batch_size: 32 - log_artifacts: true + + # Export-related env vars (placeholders expanded at runtime) + env_vars: + export: + MLFLOW_TRACKING_URI: MLFLOW_TRACKING_URI # or set tracking_uri under export.mflow + PATH: "/path/to/conda/env/bin:$PATH" + +export: + mlflow: + tracking_uri: "http://mlflow.example.com:5000" + experiment_name: "llm-evaluation" + description: "Llama 3.1 8B evaluation" + log_metrics: ["accuracy", "f1"] + tags: + model_family: "llama" + version: "3.1" + extra_metadata: + hardware: "A100" + batch_size: 32 + log_artifacts: true target: api_endpoint: @@ -117,6 +124,26 @@ export_results( ::: +:::{tab-item} Manual Export (CLI) + +Export results after evaluation completes: + +```shell +# Default export +nemo-evaluator-launcher export 8abcd123 --dest mlflow + +# With overrides +nemo-evaluator-launcher export 8abcd123 --dest mlflow \ + -o export.mlflow.tracking_uri=http://mlflow:5000 \ + -o export.mlflow.experiment_name=my-exp + +# With metric filtering +nemo-evaluator-launcher export 8abcd123 --dest mlflow --log-metrics accuracy pass@1 +``` + +::: + + :::: ## Configuration Parameters @@ -132,7 +159,7 @@ export_results( * - `tracking_uri` - str - MLflow tracking server URI - - Required + - Required if env var `MLFLOW_TRACKING_URI` is not set * - `experiment_name` - str - MLflow experiment name @@ -155,7 +182,7 @@ export_results( - None * - `skip_existing` - bool - - Skip export if run exists for invocation + - Skip export if run exists for invocation. Useful to avoid creating duplicate runs when re-exporting. 
- `false` * - `log_metrics` - list[str] @@ -165,4 +192,12 @@ export_results( - bool - Upload evaluation artifacts - `true` +* - `log_logs` + - bool + - Upload execution logs + - `false` +* - `only_required` + - bool + - Copy only required artifacts + - `true` ``` diff --git a/docs/libraries/nemo-evaluator-launcher/exporters/wandb.md b/docs/libraries/nemo-evaluator-launcher/exporters/wandb.md index 0ae7f761..41be62dd 100644 --- a/docs/libraries/nemo-evaluator-launcher/exporters/wandb.md +++ b/docs/libraries/nemo-evaluator-launcher/exporters/wandb.md @@ -19,10 +19,10 @@ Basic export to W&B using credentials and project settings from your evaluation ```bash # Export to W&B (uses config from evaluation run) -nv-eval export 8abcd123 --dest wandb +nemo-evaluator-launcher export 8abcd123 --dest wandb # Filter metrics to export specific measurements -nv-eval export 8abcd123 --dest wandb --log-metrics accuracy f1_score +nemo-evaluator-launcher export 8abcd123 --dest wandb --log-metrics accuracy f1_score ``` ```{note} @@ -91,20 +91,31 @@ Configure W&B export in your evaluation YAML file for automatic export on comple execution: auto_export: destinations: ["wandb"] - configs: - wandb: - entity: "myorg" - project: "llm-benchmarks" - name: "llama-3.1-8b-instruct-v1" - group: "baseline-evals" - tags: ["llama-3.1", "baseline"] - description: "Baseline evaluation" - log_mode: "multi_task" - log_metrics: ["accuracy"] - log_artifacts: true - extra_metadata: - hardware: "H100" - checkpoint: "path/to/checkpoint" + + # Export-related env vars (placeholders expanded at runtime) + env_vars: + export: + WANDB_API_KEY: WANDB_API_KEY + PATH: "/path/to/conda/env/bin:$PATH" + +export: + wandb: + entity: "myorg" + project: "llm-benchmarks" + name: "llama-3.1-8b-instruct-v1" + group: "baseline-evals" + tags: ["llama-3.1", "baseline"] + description: "Baseline evaluation" + log_mode: "multi_task" + log_metrics: ["accuracy"] + log_artifacts: true + log_logs: true + only_required: false + + extra_metadata: + hardware: "H100" + checkpoint: "path/to/checkpoint" + ``` ::: @@ -157,6 +168,14 @@ execution: - bool - Whether to upload evaluation artifacts (results files, configs) to W&B - `true` +* - `log_logs` + - bool + - Upload execution logs + - `false` +* - `only_required` + - bool + - Copy only required artifacts + - `true` * - `extra_metadata` - dict - Additional metadata stored in run config (e.g., hardware, hyperparameters) diff --git a/docs/libraries/nemo-evaluator-launcher/index.md b/docs/libraries/nemo-evaluator-launcher/index.md index 70472703..10d79bc8 100644 --- a/docs/libraries/nemo-evaluator-launcher/index.md +++ b/docs/libraries/nemo-evaluator-launcher/index.md @@ -13,7 +13,7 @@ The *Orchestration Layer* empowers you to run AI model evaluations at scale. Use ::::{grid} 1 2 2 2 :gutter: 1 1 1 2 -:::{grid-item-card} {octicon}`rocket;1.5em;sd-mr-1` Quickstart +:::{grid-item-card} {octicon}`play;1.5em;sd-mr-1` Quickstart :link: quickstart :link-type: doc @@ -34,7 +34,7 @@ Complete configuration schema, examples, and advanced patterns for all use cases ::::{grid} 1 2 2 2 :gutter: 1 1 1 2 -:::{grid-item-card} {octicon}`server;1.5em;sd-mr-1` Executors +:::{grid-item-card} {octicon}`iterations;1.5em;sd-mr-1` Executors :link: configuration/executors/index :link-type: doc @@ -48,7 +48,7 @@ Execute evaluations on your local machine, HPC cluster (Slurm), or cloud platfor Export results to MLflow, Weights & Biases, Google Sheets, or local files with one command. 
::: -:::{grid-item-card} {octicon}`workflow;1.5em;sd-mr-1` Local Executor +:::{grid-item-card} {octicon}`device-desktop;1.5em;sd-mr-1` Local Executor :link: configuration/executors/local :link-type: doc @@ -140,5 +140,5 @@ Executors Configuration Exporters Python API -CLI Reference (nv-eval) +CLI Reference (nemo-evaluator-launcher) ::: diff --git a/docs/libraries/nemo-evaluator-launcher/quickstart.md b/docs/libraries/nemo-evaluator-launcher/quickstart.md index 0ecc5bb8..11b4024f 100644 --- a/docs/libraries/nemo-evaluator-launcher/quickstart.md +++ b/docs/libraries/nemo-evaluator-launcher/quickstart.md @@ -42,7 +42,7 @@ Hosted endpoints (fastest): ```bash # Using the short alias (recommended) - nv-eval run --config-dir examples \ + nemo-evaluator-launcher run --config-dir examples \ --config-name local_llama_3_1_8b_instruct \ -o target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions \ -o target.api_endpoint.api_key_name=NGC_API_KEY @@ -62,10 +62,10 @@ View all available evaluation benchmarks: ```bash # List all available tasks/benchmarks -nv-eval ls tasks +nemo-evaluator-launcher ls tasks # Alternative: list recent runs -nv-eval ls runs +nemo-evaluator-launcher ls runs ``` ### 2. Run Evaluations @@ -119,7 +119,7 @@ Run this configuration (requires Docker and a model endpoint): ```bash # Using short alias (recommended) -nv-eval run --config-dir examples --config-name local_llama_3_1_8b_instruct \ +nemo-evaluator-launcher run --config-dir examples --config-name local_llama_3_1_8b_instruct \ -o execution.output_dir= # Or using full command name @@ -157,7 +157,7 @@ For other backends: ```bash # Using short alias - nv-eval run --config-dir my_configs --config-name my_evaluation + nemo-evaluator-launcher run --config-dir my_configs --config-name my_evaluation # Or using full command nemo-evaluator-launcher run --config-dir my_configs --config-name my_evaluation @@ -169,7 +169,7 @@ You can override configuration values from the command line (`-o` can be used mu ```bash # Using short alias (recommended) -nv-eval run --config-dir examples --config-name local_llama_3_1_8b_instruct \ +nemo-evaluator-launcher run --config-dir examples --config-name local_llama_3_1_8b_instruct \ -o execution.output_dir=my_results \ -o target.api_endpoint.model_id=model/another/one @@ -185,7 +185,7 @@ Monitor the status of your evaluation jobs: ```bash # Check status using short alias -nv-eval status +nemo-evaluator-launcher status # Or using full command nemo-evaluator-launcher status @@ -193,9 +193,9 @@ nemo-evaluator-launcher status You can check: -- Individual job status: `nv-eval status ` -- All jobs in an invocation: `nv-eval status ` -- Kill running jobs: `nv-eval kill ` +- Individual job status: `nemo-evaluator-launcher status ` +- All jobs in an invocation: `nemo-evaluator-launcher status ` +- Kill running jobs: `nemo-evaluator-launcher kill ` The status command returns JSON output with job status information. @@ -209,16 +209,16 @@ Export evaluation results to various destinations: ```bash # Export to local files (JSON/CSV) -nv-eval export --dest local --format json +nemo-evaluator-launcher export --dest local --format json # Export to MLflow -nv-eval export --dest mlflow +nemo-evaluator-launcher export --dest mlflow # Export to Weights & Biases -nv-eval export --dest wandb +nemo-evaluator-launcher export --dest wandb # Export to Google Sheets -nv-eval export --dest gsheets +nemo-evaluator-launcher export --dest gsheets ``` ### 5. 
Troubleshooting @@ -227,13 +227,13 @@ View the full resolved configuration without running: ```bash # Dry run to see full config -nv-eval run --config-dir examples --config-name local_llama_3_1_8b_instruct --dry-run +nemo-evaluator-launcher run --config-dir examples --config-name local_llama_3_1_8b_instruct --dry-run ``` Test a small subset before running full benchmarks: ```bash # Add global override to limit all tasks to 10 samples for testing -nv-eval run --config-dir examples --config-name local_llama_3_1_8b_instruct \ +nemo-evaluator-launcher run --config-dir examples --config-name local_llama_3_1_8b_instruct \ -o +evaluation.overrides.config.params.limit_samples=10 ``` diff --git a/docs/libraries/nemo-evaluator/api.md b/docs/libraries/nemo-evaluator/api.md index d327b805..a07d73d0 100644 --- a/docs/libraries/nemo-evaluator/api.md +++ b/docs/libraries/nemo-evaluator/api.md @@ -585,7 +585,7 @@ target: To use the above, save it as `config.yaml` and run: ```bash -eval-factory run_eval \ +nemo-evaluator run_eval \ --eval_type mmlu_pro \ --model_id meta/llama-3.1-8b-instruct \ --model_url https://integrate.api.nvidia.com/v1/chat/completions \ diff --git a/docs/libraries/nemo-evaluator/cli.md b/docs/libraries/nemo-evaluator/cli.md index 1e167f62..b66aa244 100644 --- a/docs/libraries/nemo-evaluator/cli.md +++ b/docs/libraries/nemo-evaluator/cli.md @@ -1,13 +1,13 @@ (nemo-evaluator-cli)= -# NeMo Evaluator CLI Reference (eval-factory) +# NeMo Evaluator CLI Reference (nemo-evaluator) This document provides a comprehensive reference for the `nemo-evaluator` command-line interface, which is the primary way to interact with NeMo Evaluator from the terminal. ## Prerequisites - **Container way**: Use evaluation containers mentioned in {ref}`nemo-evaluator-containers` -- **Python way**: +- **Package way**: ```bash pip install nemo-evaluator @@ -20,7 +20,7 @@ This document provides a comprehensive reference for the `nemo-evaluator` comman ## Overview -The CLI provides a unified interface for managing evaluations and frameworks. It's built on top of the Python API and provides both interactive and non-interactive modes. +The CLI provides a unified interface for managing evaluations and frameworks. It's built on top of the Python API and provides full feature parity with it. ## Command Structure @@ -35,15 +35,14 @@ eval-factory [command] [options] List all available evaluation types and frameworks. ```bash -eval-factory ls +nemo-evaluator ls ``` **Output Example:** ``` -mmlu_pro: +nvidia-simple-evals: * mmlu_pro -gsm8k: - * gsm8k +... human_eval: * human_eval ``` @@ -53,12 +52,12 @@ human_eval: Execute an evaluation with the specified configuration. 
```bash -eval-factory run_eval [options] +nemo-evaluator run_eval [options] ``` To see the list of options, run: ```bash -eval-factory run_eval --help +nemo-evaluator run_eval --help ``` **Required Options:** @@ -78,7 +77,7 @@ eval-factory run_eval --help **Example Usage:** ```bash # Basic evaluation -eval-factory run_eval \ +nemo-evaluator run_eval \ --eval_type mmlu_pro \ --model_id "meta/llama-3.1-8b-instruct" \ --model_url "https://integrate.api.nvidia.com/v1/chat/completions" \ @@ -87,7 +86,7 @@ eval-factory run_eval \ --output_dir ./results # With parameter overrides -eval-factory run_eval \ +nemo-evaluator run_eval \ --eval_type mmlu_pro \ --model_id "meta/llama-3.1-8b-instruct" \ --model_url "https://integrate.api.nvidia.com/v1/chat/completions" \ @@ -97,7 +96,7 @@ eval-factory run_eval \ --overrides "config.params.limit_samples=100,config.params.temperature=0.1" # Dry run to see configuration -eval-factory run_eval \ +nemo-evaluator run_eval \ --eval_type mmlu_pro \ --model_id "meta/llama-3.1-8b-instruct" \ --model_url "https://integrate.api.nvidia.com/v1/chat/completions" \ @@ -110,12 +109,12 @@ eval-factory run_eval \ For execution with run configuration: ```bash # Using YAML configuration file -eval-factory run_eval \ +nemo-evaluator run_eval \ --eval_type mmlu_pro \ --output_dir ./results \ --run_config ./config/eval_config.yml ``` -To check the structure of the run configuration, see the [Run Configuration](#run-configuration) section below. +To check the structure of the run configuration, refer to the [Run Configuration](#run-configuration) section below. (run-configuration)= @@ -151,7 +150,7 @@ target: Run configurations can be specified in YAML files and executed with following syntax: ```bash -eval-factory run_eval \ +nemo-evaluator run_eval \ --run_config config.yml \ --output_dir `mktemp -d` ``` @@ -194,7 +193,7 @@ Enable debug mode for detailed error information: export NEMO_EVALUATOR_LOG_LEVEL=DEBUG # Or use deprecated debug flag -eval-factory run_eval --debug [options] +nemo-evaluator run_eval --debug [options] ``` ## Examples @@ -203,10 +202,10 @@ eval-factory run_eval --debug [options] ```bash # 1. List available evaluations -eval-factory ls +nemo-evaluator ls # 2. Run evaluation -eval-factory run_eval \ +nemo-evaluator run_eval \ --eval_type mmlu_pro \ --model_id "meta/llama-3.1-8b-instruct" \ --model_url "https://integrate.api.nvidia.com/v1/chat/completions" \ @@ -232,7 +231,7 @@ for model in "${models[@]}"; do for eval_type in "${eval_types[@]}"; do echo "Running $eval_type on $model..." - eval-factory run_eval \ + nemo-evaluator run_eval \ --eval_type "$eval_type" \ --model_id "$model" \ --model_url "https://integrate.api.nvidia.com/v1/chat/completions" \ @@ -263,7 +262,7 @@ nemo-evaluator-example my_custom_eval . 
# Edit framework.yml to configure your evaluation # Edit output.py to implement result parsing # Test your framework -eval-factory run_eval \ +nemo-evaluator run_eval \ --eval_type my_custom_eval \ --model_id "test-model" \ --model_url "https://api.example.com/v1/chat/completions" \ diff --git a/docs/libraries/nemo-evaluator/containers/code-generation.md b/docs/libraries/nemo-evaluator/containers/code-generation.md index 6a9e206f..7cf65405 100644 --- a/docs/libraries/nemo-evaluator/containers/code-generation.md +++ b/docs/libraries/nemo-evaluator/containers/code-generation.md @@ -37,59 +37,6 @@ docker pull nvcr.io/nvidia/eval-factory/bigcode-evaluation-harness:{{ docker_com --- -## BFCL Container - -**NGC Catalog**: [bfcl](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/bfcl) - -Container for Berkeley Function-Calling Leaderboard evaluation framework. - -**Use Cases:** -- Tool usage evaluation -- Multi-turn interactions -- Native support for function/tool calling -- Function calling evaluation - -**Pull Command:** -```bash -docker pull nvcr.io/nvidia/eval-factory/bfcl:{{ docker_compose_latest }} -``` - -**Default Parameters:** - -| Parameter | Value | -|-----------|-------| -| `limit_samples` | `None` | -| `parallelism` | `10` | -| `native_calling` | `False` | -| `custom_dataset` | `{'path': None, 'format': None, 'data_template_path': None}` | - ---- - -## ToolTalk Container - -**NGC Catalog**: [tooltalk](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/tooltalk) - -Container for evaluating AI models' ability to use tools and APIs effectively. - -**Use Cases:** -- Tool usage evaluation -- API interaction assessment -- Function calling evaluation -- External tool integration testing - -**Pull Command:** -```bash -docker pull nvcr.io/nvidia/eval-factory/tooltalk:{{ docker_compose_latest }} -``` - -**Default Parameters:** - -| Parameter | Value | -|-----------|-------| -| `limit_samples` | `None` | - ---- - ## LiveCodeBench Container **NGC Catalog**: [livecodebench](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/livecodebench) @@ -134,7 +81,7 @@ docker pull nvcr.io/nvidia/eval-factory/livecodebench:{{ docker_compose_latest } **NGC Catalog**: [scicode](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/scicode) -SciCode is a challenging benchmark designed to evaluate the capabilities of language models in generating code for solving realistic scientific research problems with diverse coverage across 16 subdomains from 6 domains. +SciCode is a challenging benchmark designed to evaluate the capabilities of language models in generating code for solving realistic scientific research problems with diverse coverage across 16 subdomains from six domains. 
**Use Cases:** - Scientific research code generation @@ -163,4 +110,4 @@ docker pull nvcr.io/nvidia/eval-factory/scicode:{{ docker_compose_latest }} | `n_samples` | `1` | | `eval_threads` | `None` | -**Supported Domains:** Physics, Math, Material Science, Biology, Chemistry (16 subdomains from 5 domains) +**Supported Domains:** Physics, Math, Material Science, Biology, Chemistry (16 subdomains from five domains) diff --git a/docs/libraries/nemo-evaluator/containers/index.md b/docs/libraries/nemo-evaluator/containers/index.md index 75a07e95..6be2184b 100644 --- a/docs/libraries/nemo-evaluator/containers/index.md +++ b/docs/libraries/nemo-evaluator/containers/index.md @@ -17,87 +17,87 @@ NeMo Evaluator provides a collection of specialized containers for different eva - Key Benchmarks * - **agentic_eval** - Agentic AI evaluation framework - - [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/agentic_eval) + - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/agentic_eval) - `{{ docker_compose_latest }}` - agentic_eval_answer_accuracy, agentic_eval_goal_accuracy_with_reference, agentic_eval_goal_accuracy_without_reference, agentic_eval_topic_adherence, agentic_eval_tool_call_accuracy * - **bfcl** - Function calling evaluation - - [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/bfcl) + - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/bfcl) - `{{ docker_compose_latest }}` - bfclv2, bfclv2_ast, bfclv2_ast_prompting, bfclv3, bfclv3_ast, bfclv3_ast_prompting * - **bigcode-evaluation-harness** - Code generation evaluation - - [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/bigcode-evaluation-harness) + - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/bigcode-evaluation-harness) - `{{ docker_compose_latest }}` - humaneval, humanevalplus, mbpp, mbppplus * - **garak** - Security and robustness testing - - [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/garak) + - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/garak) - `{{ docker_compose_latest }}` - garak * - **helm** - Holistic evaluation framework - - [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/helm) + - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/helm) - `{{ docker_compose_latest }}` - aci_bench, ehr_sql, head_qa, med_dialog_healthcaremagic * - **hle** - Academic knowledge and problem solving - - [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/hle) + - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/hle) - `{{ docker_compose_latest }}` - hle * - **ifbench** - Instruction following evaluation - - [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/ifbench) + - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/ifbench) - `{{ docker_compose_latest }}` - ifbench * - **livecodebench** - Live coding evaluation - - [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/livecodebench) + - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/livecodebench) - `{{ docker_compose_latest }}` - livecodebench_0724_0125, livecodebench_0824_0225 * - **lm-evaluation-harness** - Language model benchmarks - - 
[Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/lm-evaluation-harness) + - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/lm-evaluation-harness) - `{{ docker_compose_latest }}` - mmlu, gsm8k, hellaswag, arc_challenge, truthfulqa * - **mmath** - Multilingual math reasoning - - [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/mmath) + - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/mmath) - `{{ docker_compose_latest }}` - mmath_ar, mmath_en, mmath_es, mmath_fr, mmath_zh * - **mtbench** - Multi-turn conversation evaluation - - [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/mtbench) + - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/mtbench) - `{{ docker_compose_latest }}` - mtbench, mtbench-cor1 * - **rag_retriever_eval** - RAG system evaluation - - [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/rag_retriever_eval) + - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/rag_retriever_eval) - `{{ docker_compose_latest }}` - RAG, Retriever * - **safety-harness** - Safety and bias evaluation - - [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/safety-harness) + - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/safety-harness) - `{{ docker_compose_latest }}` - aegis_v2 * - **scicode** - Coding for scientific research - - [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/scicode) + - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/scicode) - `{{ docker_compose_latest }}` - scicode, scicode_background * - **simple-evals** - Basic evaluation tasks - - [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/simple-evals) + - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/simple-evals) - `{{ docker_compose_latest }}` - mmlu, mmlu_pro, gpqa_diamond, humaneval, math_test_500 * - **tooltalk** - Tool usage evaluation - - [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/tooltalk) + - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/tooltalk) - `{{ docker_compose_latest }}` - tooltalk * - **vlmevalkit** - Vision-language model evaluation - - [Link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/vlmevalkit) + - [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/vlmevalkit) - `{{ docker_compose_latest }}` - ai2d_judge, chartqa, ocrbench, slidevqa ``` @@ -116,7 +116,7 @@ NeMo Evaluator provides a collection of specialized containers for different eva Containers for evaluating large language models across academic benchmarks and custom tasks. ::: -:::{grid-item-card} {octicon}`code;1.5em;sd-mr-1` Code Generation +:::{grid-item-card} {octicon}`file-code;1.5em;sd-mr-1` Code Generation :link: code-generation :link-type: doc @@ -130,7 +130,7 @@ Specialized containers for evaluating code generation and programming capabiliti Multimodal evaluation containers for vision-language understanding and reasoning. 
::: -:::{grid-item-card} {octicon}`shield;1.5em;sd-mr-1` Safety & Security +:::{grid-item-card} {octicon}`shield-check;1.5em;sd-mr-1` Safety & Security :link: safety-security :link-type: doc @@ -162,7 +162,7 @@ docker run --gpus all -it nvcr.io/nvidia/eval-factory/: - NVIDIA GPU (for GPU-accelerated evaluation) - Sufficient disk space for models and datasets -For detailed usage instructions, see {ref}`container-workflows` guide. +For detailed usage instructions, refer to the {ref}`cli-workflows` guide. :::{toctree} :caption: Container Reference diff --git a/docs/libraries/nemo-evaluator/containers/specialized-tools.md b/docs/libraries/nemo-evaluator/containers/specialized-tools.md index 9a22d196..9760647f 100644 --- a/docs/libraries/nemo-evaluator/containers/specialized-tools.md +++ b/docs/libraries/nemo-evaluator/containers/specialized-tools.md @@ -28,3 +28,56 @@ docker pull nvcr.io/nvidia/eval-factory/agentic_eval:{{ docker_compose_latest }} - `agentic_eval_goal_accuracy_without_reference` - `agentic_eval_topic_adherence` - `agentic_eval_tool_call_accuracy` + +--- + +## BFCL Container + +**NGC Catalog**: [bfcl](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/bfcl) + +Container for Berkeley Function-Calling Leaderboard evaluation framework. + +**Use Cases:** +- Tool usage evaluation +- Multi-turn interactions +- Native support for function/tool calling +- Function calling evaluation + +**Pull Command:** +```bash +docker pull nvcr.io/nvidia/eval-factory/bfcl:{{ docker_compose_latest }} +``` + +**Default Parameters:** + +| Parameter | Value | +|-----------|-------| +| `limit_samples` | `None` | +| `parallelism` | `10` | +| `native_calling` | `False` | +| `custom_dataset` | `{'path': None, 'format': None, 'data_template_path': None}` | + +--- + +## ToolTalk Container + +**NGC Catalog**: [tooltalk](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/eval-factory/containers/tooltalk) + +Container for evaluating AI models' ability to use tools and APIs effectively. + +**Use Cases:** +- Tool usage evaluation +- API interaction assessment +- Function calling evaluation +- External tool integration testing + +**Pull Command:** +```bash +docker pull nvcr.io/nvidia/eval-factory/tooltalk:{{ docker_compose_latest }} +``` + +**Default Parameters:** + +| Parameter | Value | +|-----------|-------| +| `limit_samples` | `None` | diff --git a/docs/libraries/nemo-evaluator/extending/framework-definition-file/advanced-features.md b/docs/libraries/nemo-evaluator/extending/framework-definition-file/advanced-features.md index fc154410..4a4e204f 100644 --- a/docs/libraries/nemo-evaluator/extending/framework-definition-file/advanced-features.md +++ b/docs/libraries/nemo-evaluator/extending/framework-definition-file/advanced-features.md @@ -78,11 +78,11 @@ config: **CLI overrides**: ```bash -eval-factory run_eval --overrides config.params.temperature=1.0 +nemo-evaluator run_eval --overrides config.params.temperature=1.0 # Overrides all previous values ``` -For more information on how to use these overrides, see {ref}`parameter-overrides` documentation. +For more information on how to use these overrides, refer to the {ref}`parameter-overrides` documentation. 
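The same parameter can also be pinned in a run configuration file instead of on the command line. The fragment below is a minimal sketch under the assumption that the run configuration mirrors the `config.params` structure used by the override syntax; refer to the run-configuration section of the CLI reference for the authoritative schema.

```yaml
# run_config.yml: sketch of overriding the same parameter via a run configuration
config:
  params:
    temperature: 1.0   # replaces the FDF default; a CLI --overrides value would still win
```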
## Dynamic Configuration diff --git a/docs/libraries/nemo-evaluator/extending/framework-definition-file/defaults-section.md b/docs/libraries/nemo-evaluator/extending/framework-definition-file/defaults-section.md index d83afd2f..3430d853 100644 --- a/docs/libraries/nemo-evaluator/extending/framework-definition-file/defaults-section.md +++ b/docs/libraries/nemo-evaluator/extending/framework-definition-file/defaults-section.md @@ -2,7 +2,7 @@ # Defaults Section -The `defaults` section defines the default configuration and execution command that will be used across all evaluations unless overridden. Overriding is supported either through `--overrides` flag (see {ref}`parameter-overrides`) or {ref}`run-configuration`. +The `defaults` section defines the default configuration and execution command that will be used across all evaluations unless overridden. Overriding is supported either through `--overrides` flag (refer to {ref}`parameter-overrides`) or {ref}`run-configuration`. ## Command Template @@ -66,10 +66,9 @@ defaults: max_retries: 5 # Maximum API retry attempts request_timeout: 60 # Request timeout in seconds extra: # Framework-specific parameters - n_samples: null # Number of evaluation samples + n_samples: null # Number of sampled responses per input downsampling_ratio: null # Data downsampling ratio add_system_prompt: false # Include system prompt - args: null # Additional CLI arguments ``` ## Parameter Categories @@ -96,11 +95,10 @@ Task-specific configuration options: ### Extra Parameters -Custom parameters specific to your framework: -- `n_samples`: Framework-specific sampling configuration -- `downsampling_ratio`: Data subset selection -- `add_system_prompt`: Framework-specific prompt handling -- `args`: Additional CLI arguments passed directly to your framework +Custom parameters specific to your framework. Use it for: +- specifying number of sampled responses per input query +- judge configuration +- configuring few-shot settings ## Target Configuration diff --git a/docs/libraries/nemo-evaluator/extending/framework-definition-file/evaluations-section.md b/docs/libraries/nemo-evaluator/extending/framework-definition-file/evaluations-section.md index 8d4912ea..051f1fd0 100644 --- a/docs/libraries/nemo-evaluator/extending/framework-definition-file/evaluations-section.md +++ b/docs/libraries/nemo-evaluator/extending/framework-definition-file/evaluations-section.md @@ -8,16 +8,16 @@ The `evaluations` section defines the specific evaluation types available in you ```yaml evaluations: - - name: example_task_1 # Evaluation identifier + - name: example_task_1 # Evaluation name description: Basic functionality demo # Human-readable description defaults: config: - type: "example_task_1" # Evaluation type identifier + type: "example_task_1" # Evaluation identifier supported_endpoint_types: # Supported endpoints for this task - chat - completions params: - task: "example_task_1" # Task-specific identifier + task: "example_task_1" # Task identifier used by the harness temperature: 0.0 # Task-specific temperature max_new_tokens: 1024 # Task-specific token limit extra: @@ -31,11 +31,11 @@ evaluations: **Type**: String **Required**: Yes -Unique identifier for the evaluation type. This is used to reference the evaluation in CLI commands and configurations. +Name for the evaluation type. 
**Example**: ```yaml -name: humaneval +name: HumanEval ``` ### description @@ -55,7 +55,7 @@ description: Evaluates code generation capabilities using the HumanEval benchmar **Type**: String **Required**: Yes -Internal type identifier used by the framework. This typically matches the `name` field but may differ based on your framework's conventions. +Unique configuration identifier used by the framework. This is used to reference the evaluation in CLI commands and configurations. This typically matches the `name` field but may differ based on your framework's conventions. **Example**: ```yaml diff --git a/docs/libraries/nemo-evaluator/extending/framework-definition-file/fdf-troubleshooting.md b/docs/libraries/nemo-evaluator/extending/framework-definition-file/fdf-troubleshooting.md index a334dee8..116a48aa 100644 --- a/docs/libraries/nemo-evaluator/extending/framework-definition-file/fdf-troubleshooting.md +++ b/docs/libraries/nemo-evaluator/extending/framework-definition-file/fdf-troubleshooting.md @@ -48,6 +48,7 @@ Verify conditional statements are properly formatted: - Incorrect parameter paths in overrides - Type mismatches between default and override values - Missing parameter definitions in defaults section +- Incorrect indentation in the YAML config **Solutions**: @@ -70,6 +71,23 @@ temperature: 0.7 # Float temperature: "0.7" # String ``` +Make sure to use the correct indentation: +```yaml +# Correct +defaults: + config: + params: + limit_samples: null + max_new_tokens: 4096 # max_new_tokens belongs to params + +# Incorrect +defaults: + config: + params: + limit_samples: null + max_new_tokens: 4096 # max_new_tokens is outside of params +``` + :::: ::::{dropdown} Type Mismatches @@ -142,11 +160,11 @@ Enable debug logging to see how your FDF is processed. Use the `--debug` flag or ```bash # Using debug flag -eval-factory run_eval --eval_type your_evaluation --debug +nemo-evaluator run_eval --eval_type your_evaluation --debug # Or set log level environment variable export LOG_LEVEL=DEBUG -eval-factory run_eval --eval_type your_evaluation +nemo-evaluator run_eval --eval_type your_evaluation ``` ### Debug Output diff --git a/docs/libraries/nemo-evaluator/extending/framework-definition-file/framework-section.md b/docs/libraries/nemo-evaluator/extending/framework-definition-file/framework-section.md index ad020e6f..94fa3e3a 100644 --- a/docs/libraries/nemo-evaluator/extending/framework-definition-file/framework-section.md +++ b/docs/libraries/nemo-evaluator/extending/framework-definition-file/framework-section.md @@ -8,7 +8,7 @@ The `framework` section contains basic identification and metadata for your eval ```yaml framework: - name: example-evaluation-framework # Internal framework identifier + name: example-evaluation-framework # Internal framework identifier pkg_name: example_evaluation_framework # Python package name full_name: Example Evaluation Framework # Human-readable display name description: A comprehensive example... # Detailed description diff --git a/docs/libraries/nemo-evaluator/extending/framework-definition-file/index.md b/docs/libraries/nemo-evaluator/extending/framework-definition-file/index.md index f95b68f4..8f293560 100644 --- a/docs/libraries/nemo-evaluator/extending/framework-definition-file/index.md +++ b/docs/libraries/nemo-evaluator/extending/framework-definition-file/index.md @@ -19,11 +19,11 @@ Before creating an FDF, you should: **Creating your first FDF?** Follow this sequence: -1. 
Start with the {ref}`create-framework-definition-file` tutorial for a hands-on walkthrough -2. {ref}`framework-section` - Define framework metadata -3. {ref}`defaults-section` - Configure command templates and parameters -4. {ref}`evaluations-section` - Define evaluation tasks -5. {ref}`integration` - Integrate with Eval Factory + +1. {ref}`framework-section` - Define framework metadata +2. {ref}`defaults-section` - Configure command templates and parameters +3. {ref}`evaluations-section` - Define evaluation tasks +4. {ref}`integration` - Integrate with Eval Factory **Need help?** Refer to {ref}`fdf-troubleshooting` for debugging common issues. @@ -84,7 +84,7 @@ evaluations: Define framework metadata including name, package information, and repository URL. ::: -:::{grid-item-card} {octicon}`gear;1.5em;sd-mr-1` Defaults Section +:::{grid-item-card} {octicon}`list-unordered;1.5em;sd-mr-1` Defaults Section :link: defaults-section :link-type: ref Configure default parameters, command templates, and target endpoint settings. @@ -96,7 +96,7 @@ Configure default parameters, command templates, and target endpoint settings. Define specific evaluation types with task-specific configurations and parameters. ::: -:::{grid-item-card} {octicon}`rocket;1.5em;sd-mr-1` Advanced Features +:::{grid-item-card} {octicon}`telescope;1.5em;sd-mr-1` Advanced Features :link: advanced-features :link-type: ref Use conditionals, parameter inheritance, and dynamic configuration in your FDF. @@ -108,7 +108,7 @@ Use conditionals, parameter inheritance, and dynamic configuration in your FDF. Learn how to integrate your FDF with the Eval Factory system. ::: -:::{grid-item-card} {octicon}`tools;1.5em;sd-mr-1` Troubleshooting +:::{grid-item-card} {octicon}`question;1.5em;sd-mr-1` Troubleshooting :link: fdf-troubleshooting :link-type: ref Debug common issues with template errors, parameters, and validation. diff --git a/docs/libraries/nemo-evaluator/extending/framework-definition-file/integration.md b/docs/libraries/nemo-evaluator/extending/framework-definition-file/integration.md index 0c12f1e7..6e53c788 100644 --- a/docs/libraries/nemo-evaluator/extending/framework-definition-file/integration.md +++ b/docs/libraries/nemo-evaluator/extending/framework-definition-file/integration.md @@ -64,19 +64,16 @@ Once your FDF is properly located and validated, the Eval Factory system automat After successful integration, you can use your framework with the Eval Factory CLI: ```bash -# List available frameworks -eval-factory list_frameworks - -# List evaluations for your framework -eval-factory list_evals --framework your_framework +# List available frameworks and tasks +nemo-evaluator ls # Run an evaluation -eval-factory run_eval --framework your_framework --eval_type your_evaluation +nemo-evaluator run_eval --eval_type your_evaluation --model_id my-model ... 
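+
+# Illustrative invocation with the common flags documented elsewhere in these docs
+# (hypothetical eval_type and example endpoint; adjust the values for your framework):
+nemo-evaluator run_eval \
+  --eval_type your_evaluation \
+  --model_id meta/llama-3.1-8b-instruct \
+  --model_url https://integrate.api.nvidia.com/v1/chat/completions \
+  --model_type chat \
+  --api_key_name MY_API_KEY \
+  --output_dir ./results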
``` ## Package Configuration -Ensure your `setup.py` includes the FDF in package data: +Ensure your `setup.py` or `pyproject.toml` includes the FDF in package data: ```python from setuptools import setup, find_packages @@ -85,12 +82,17 @@ setup( name="your-framework", packages=find_packages(), package_data={ - "": ["core_evals/**/framework.yml", "core_evals/**/*.py"], + "core_evals": ["**/*.yml"], }, include_package_data=True, ) ``` +```toml +[tool.setuptools.package-data] +core_evals = ["**/*.yml"] +``` + ## Best Practices - Follow the exact directory structure and naming conventions diff --git a/docs/libraries/nemo-evaluator/extending/index.md b/docs/libraries/nemo-evaluator/extending/index.md index bad4600e..d645d6ac 100644 --- a/docs/libraries/nemo-evaluator/extending/index.md +++ b/docs/libraries/nemo-evaluator/extending/index.md @@ -38,7 +38,7 @@ The primary extension mechanism uses YAML configuration files to define: ## Start with Extensions -**New to FDFs?** Start with the {ref}`create-framework-definition-file` tutorial for a hands-on walkthrough. + **Building a production framework?** Follow these steps: @@ -47,7 +47,7 @@ The primary extension mechanism uses YAML configuration files to define: 3. **Test Integration**: Validate that your framework works with NeMo Evaluator workflows 4. **Container Packaging**: Package your framework as a container for distribution -For detailed reference documentation, see {ref}`framework-definition-file`. +For detailed reference documentation, refer to {ref}`framework-definition-file`. :::{toctree} :caption: Extending NeMo Evaluator diff --git a/docs/libraries/nemo-evaluator/index.md b/docs/libraries/nemo-evaluator/index.md index 7db2c2f9..8f35cd87 100644 --- a/docs/libraries/nemo-evaluator/index.md +++ b/docs/libraries/nemo-evaluator/index.md @@ -55,7 +55,7 @@ Comprehensive logging setup for evaluation runs, debugging, and audit trails. Add custom benchmarks and frameworks by defining configuration and interfaces. ::: -:::{grid-item-card} {octicon}`code;1.5em;sd-mr-1` API Reference +:::{grid-item-card} {octicon}`book;1.5em;sd-mr-1` API Reference :link: api :link-type: doc diff --git a/docs/libraries/nemo-evaluator/interceptors/caching.md b/docs/libraries/nemo-evaluator/interceptors/caching.md index 1cd3e1ac..eab643b4 100644 --- a/docs/libraries/nemo-evaluator/interceptors/caching.md +++ b/docs/libraries/nemo-evaluator/interceptors/caching.md @@ -2,43 +2,16 @@ # Caching -The caching interceptor stores and retrieves responses to improve performance, reduce API costs, and enable reproducible evaluations. - ## Overview The `CachingInterceptor` implements a sophisticated caching system that can store responses based on request content, enabling faster re-runs of evaluations and reducing costs when using paid APIs. 
## Configuration -### Interceptor Configuration - -Configure the caching interceptor through the interceptors list in AdapterConfig: - -```python -from nemo_evaluator.adapters.adapter_config import AdapterConfig, InterceptorConfig - -adapter_config = AdapterConfig( - interceptors=[ - InterceptorConfig( - name="caching", - enabled=True, - config={ - "cache_dir": "./evaluation_cache", - "reuse_cached_responses": True, - "save_requests": True, - "save_responses": True, - "max_saved_requests": 1000, - "max_saved_responses": 1000 - } - ) - ] -) -``` - ### CLI Configuration ```bash ---overrides 'target.api_endpoint.adapter_config.interceptors=[{"name":"caching","enabled":true,"config":{"cache_dir":"./cache","reuse_cached_responses":true}}]' +--overrides 'target.api_endpoint.adapter_config.use_caching=True,target.api_endpoint.adapter_config.caching_dir=./cache,target.api_endpoint.adapter_config.reuse_cached_responses=True' ``` ### YAML Configuration @@ -57,6 +30,9 @@ target: save_responses: true max_saved_requests: 1000 max_saved_responses: 1000 + - name: "endpoint" + enabled: true + config: {} ``` ## Configuration Options @@ -95,8 +71,8 @@ cache_key = hashlib.sha256(data_str.encode("utf-8")).hexdigest() The caching interceptor stores data in three separate disk-backed key-value stores within the configured cache directory: -- **Response Cache** (`{cache_dir}/responses/`): Stores raw response content (bytes) keyed by cache key -- **Headers Cache** (`{cache_dir}/headers/`): Stores response headers (dictionary) keyed by cache key +- **Response Cache** (`{cache_dir}/responses/`): Stores raw response content (bytes) keyed by cache key (when `save_responses=True` or `reuse_cached_responses=True`) +- **Headers Cache** (`{cache_dir}/headers/`): Stores response headers (dictionary) keyed by cache key (when `save_requests=True`) - **Request Cache** (`{cache_dir}/requests/`): Stores request data (dictionary) keyed by cache key (when `save_requests=True`) Each cache uses a SHA256 hash of the request data as the lookup key. When a cache hit occurs, the interceptor retrieves both the response content and headers using the same cache key. diff --git a/docs/libraries/nemo-evaluator/interceptors/endpoint.md b/docs/libraries/nemo-evaluator/interceptors/endpoint.md new file mode 100644 index 00000000..365f179d --- /dev/null +++ b/docs/libraries/nemo-evaluator/interceptors/endpoint.md @@ -0,0 +1,34 @@ +(interceptor-endpoint)= +# Endpoint Interceptor + +## Overview + +**Required interceptor** that handles the actual API communication. This interceptor must be present in every configuration as it performs the final request to the target API endpoint. + +**Important**: This interceptor should always be the last in the interceptor chain. + + +## Configuration + +### CLI Configuration + +```bash +# The endpoint interceptor is automatically enabled and requires no additional CLI configuration +``` + +### YAML Configuration + + +```yaml +target: + api_endpoint: + adapter_config: + interceptors: + - name: "endpoint" + enabled: true + config: {} +``` + +## Configuration Options + +The Endpoint Interceptor is configured automatically. diff --git a/docs/libraries/nemo-evaluator/interceptors/index.md b/docs/libraries/nemo-evaluator/interceptors/index.md index 8f3f5797..bddeeb8d 100644 --- a/docs/libraries/nemo-evaluator/interceptors/index.md +++ b/docs/libraries/nemo-evaluator/interceptors/index.md @@ -48,6 +48,29 @@ graph LR Cache requests and responses to improve performance and reduce API calls. 
::: +:::{grid-item-card} {octicon}`sign-in;1.5em;sd-mr-1` Request Logging +:link: request-logging +:link-type: doc + +Logs requests for debugging, analysis, and audit purposes. +::: + + +:::{grid-item-card} {octicon}`sign-out;1.5em;sd-mr-1` Response Logging +:link: response-logging +:link-type: doc + +Logs responses for debugging, analysis, and audit purposes. +::: + +:::{grid-item-card} {octicon}`alert;1.5em;sd-mr-1` Raising on Client Errors +:link: raise-client-error +:link-type: doc + +Allows to fail fast on non-retryable client errors +::: + + :::: ## Specialized Interceptors @@ -62,14 +85,14 @@ Cache requests and responses to improve performance and reduce API calls. Modify system messages and prompts in requests. ::: -:::{grid-item-card} {octicon}`gear;1.5em;sd-mr-1` Payload Modification +:::{grid-item-card} {octicon}`pencil;1.5em;sd-mr-1` Payload Modification :link: payload-modification :link-type: doc Add, remove, or modify request parameters. ::: -:::{grid-item-card} {octicon}`brain;1.5em;sd-mr-1` Reasoning +:::{grid-item-card} {octicon}`comment-discussion;1.5em;sd-mr-1` Reasoning :link: reasoning :link-type: doc @@ -83,6 +106,13 @@ Handle reasoning tokens and track reasoning metrics. Track evaluation progress and status updates. ::: +:::{grid-item-card} {octicon}`meter;1.5em;sd-mr-1` Response Statistics +:link: response-stats +:link-type: doc + +Collects statistics from API responses for metrics collection and analysis. +::: + :::: ## Process Post-Evaluation Results @@ -104,9 +134,13 @@ Run additional processing, reporting, or cleanup after evaluations complete. :hidden: Caching +Request Logging +Response Logging +Raising on Client Errors System Messages Payload Modification Reasoning Progress Tracking +Response Statistics Post-Evaluation Hooks ::: diff --git a/docs/libraries/nemo-evaluator/interceptors/payload-modification.md b/docs/libraries/nemo-evaluator/interceptors/payload-modification.md index e8e46cac..3be1ef1c 100644 --- a/docs/libraries/nemo-evaluator/interceptors/payload-modification.md +++ b/docs/libraries/nemo-evaluator/interceptors/payload-modification.md @@ -2,14 +2,16 @@ # Payload Modification -Adds, removes, or modifies request parameters before sending them to the model endpoint. +## Overview + +`PayloadParamsModifierInterceptor` adds, removes, or modifies request parameters before sending them to the model endpoint. ## Configuration ### CLI Configuration ```bash ---overrides 'target.api_endpoint.adapter_config.interceptors=[{"name":"payload_modifier","enabled":true,"config":{"params_to_add":{"temperature":0.7},"params_to_remove":["top_k"]}}]' +--overrides 'target.api_endpoint.adapter_config.params_to_add={"temperature":0.7},target.api_endpoint.adapter_config.params_to_remove=["max_tokens"]' ``` ### YAML Configuration @@ -29,6 +31,9 @@ target: - "top_k" params_to_rename: old_param: "new_param" + - name: "endpoint" + enabled: true + config: {} ``` ## Configuration Options diff --git a/docs/libraries/nemo-evaluator/interceptors/post-evaluation-hooks.md b/docs/libraries/nemo-evaluator/interceptors/post-evaluation-hooks.md index c8e7733e..9823b7b8 100644 --- a/docs/libraries/nemo-evaluator/interceptors/post-evaluation-hooks.md +++ b/docs/libraries/nemo-evaluator/interceptors/post-evaluation-hooks.md @@ -2,7 +2,7 @@ Run processing or reporting tasks after evaluations complete. -Post-evaluation hooks execute after the main evaluation finishes. The built-in `post_eval_report` hook generates HTML and JSON reports from cached request-response pairs. 
+Post-evaluation hooks execute after the main evaluation finishes. The built-in `PostEvalReportHook` hook generates HTML and JSON reports from cached request-response pairs. ## Report Generation @@ -11,18 +11,21 @@ Generate HTML and JSON reports with evaluation request-response examples. ### YAML Configuration ```yaml -post_eval_hooks: - - name: "post_eval_report" - enabled: true - config: - report_types: ["html", "json"] - html_report_size: 10 +target: + api_endpoint: + adapter_config: + post_eval_hooks: + - name: "post_eval_report" + enabled: true + config: + report_types: ["html", "json"] + html_report_size: 10 ``` ### CLI Configuration ```bash ---overrides 'target.api_endpoint.adapter_config.post_eval_hooks=[{"name":"post_eval_report","enabled":true,"config":{"report_types":["html","json"]}}]' +--overrides 'target.api_endpoint.adapter_config.generate_html_report=True' ``` ## Configuration Options diff --git a/docs/libraries/nemo-evaluator/interceptors/progress-tracking.md b/docs/libraries/nemo-evaluator/interceptors/progress-tracking.md index c7a69464..558ccbbe 100644 --- a/docs/libraries/nemo-evaluator/interceptors/progress-tracking.md +++ b/docs/libraries/nemo-evaluator/interceptors/progress-tracking.md @@ -1,41 +1,35 @@ # Progress Tracking -Tracks evaluation progress by counting processed samples and optionally sending updates to a webhook endpoint. +## Overview +`ProgressTrackingInterceptor` tracks evaluation progress by counting processed samples and optionally sending updates to a webhook endpoint. ## Configuration +### CLI Configuration + +```bash +--overrides 'target.api_endpoint.adapter_config.use_progress_tracking=True,target.api_endpoint.adapter_config.progress_tracking_url=http://monitoring:3828/progress' +``` + ### YAML Configuration ```yaml -interceptors: - - name: "progress_tracking" - enabled: true - config: - progress_tracking_url: "http://monitoring:3828/progress" - progress_tracking_interval: 10 - request_method: "PATCH" - output_dir: "/tmp/output" +target: + api_endpoint: + adapter_config: + interceptors: + - name: "progress_tracking" + enabled: true + config: + progress_tracking_url: "http://monitoring:3828/progress" + progress_tracking_interval: 10 + request_method: "PATCH" + output_dir: "/tmp/output" + - name: "endpoint" + enabled: true + config: {} ``` -### Python Configuration - -```python -from nemo_evaluator.adapters.adapter_config import AdapterConfig, InterceptorConfig - -adapter_config = AdapterConfig( - interceptors=[ - InterceptorConfig( - name="progress_tracking", - config={ - "progress_tracking_url": "http://monitoring:3828/progress", - "progress_tracking_interval": 10, - "request_method": "PATCH", - "output_dir": "/tmp/output" - } - ) - ] -) -``` ## Configuration Options diff --git a/docs/libraries/nemo-evaluator/interceptors/raise-client-error.md b/docs/libraries/nemo-evaluator/interceptors/raise-client-error.md new file mode 100644 index 00000000..892f59b6 --- /dev/null +++ b/docs/libraries/nemo-evaluator/interceptors/raise-client-error.md @@ -0,0 +1,155 @@ +(interceptor-raise-client-error)= +# Raise Client Error Interceptor + +## Overview + +The Raise `RaiseClientErrorInterceptor` handles non-retryable client errors by raising exceptions instead of continuing the benchmark evaluation. By default, it will raise exceptions on 4xx HTTP status codes (excluding 408 Request Timeout and 429 Too Many Requests, which are typically retryable). 
+ +This interceptor is useful when you want to fail fast on client errors that indicate configuration issues, authentication problems, or other non-recoverable errors rather than continuing the evaluation with failed requests. + +## Configuration + +### CLI Configuration + +```bash +--overrides 'target.api_endpoint.adapter_config.use_raise_client_errors=True' +``` + +### YAML Configuration + +::::{tab-set} + +:::{tab-item} Default Configuration +Raises on 4xx status codes except 408 (Request Timeout) and 429 (Too Many Requests). + +```yaml +target: + api_endpoint: + adapter_config: + interceptors: + - name: "raise_client_errors" + enabled: true + config: + # Default configuration - raises on 4xx except 408, 429 + exclude_status_codes: [408, 429] + status_code_range_start: 400 + status_code_range_end: 499 + - name: "endpoint" + enabled: true + config: {} +``` +::: + +:::{tab-item} Specific Status Codes +Raises only on specific status codes rather than a range. + +```yaml +target: + api_endpoint: + adapter_config: + interceptors: + - name: "raise_client_errors" + enabled: true + config: + # Custom configuration - only specific status codes + status_codes: [400, 401, 403, 404] + - name: "endpoint" + enabled: true + config: {} +``` +::: + +:::{tab-item} Custom Exclusions +Uses a status code range with custom exclusions, including 404 Not Found. + +```yaml +target: + api_endpoint: + adapter_config: + interceptors: + - name: "raise_client_errors" + enabled: true + config: + # Custom range with different exclusions + status_code_range_start: 400 + status_code_range_end: 499 + exclude_status_codes: [408, 429, 404] # Also exclude 404 not found + - name: "endpoint" + enabled: true + config: {} +``` +::: + +:::: + +## Configuration Options + +| Parameter | Type | Default | Description | +|-----------|------|---------|-------------| +| `exclude_status_codes` | `List[int]` | `[408, 429]` | Status codes to exclude from raising client errors when present in the status code range | +| `status_codes` | `List[int]` | `None` | Specific list of status codes that should raise exceptions. If provided, this takes precedence over range settings | +| `status_code_range_start` | `int` | `400` | Start of the status code range (inclusive) for which to raise exceptions | +| `status_code_range_end` | `int` | `499` | End of the status code range (inclusive) for which to raise exceptions | + +## Behavior + +### Default Behavior +- Raises exceptions on HTTP status codes 400-499 +- Excludes 408 (Request Timeout) and 429 (Too Many Requests) as these are typically retryable +- Logs critical errors before raising the exception + +### Configuration Logic +1. If `status_codes` is specified, only those exact status codes will trigger exceptions +2. If `status_codes` is not specified, the range defined by `status_code_range_start` and `status_code_range_end` is used +3. `exclude_status_codes` are always excluded from raising exceptions +4. Cannot have the same status code in both `status_codes` and `exclude_status_codes` + +### Error Handling +- Raises `FatalErrorException` when a matching status code is encountered +- Logs critical error messages with status code and URL information +- Stops the evaluation process immediately + +## Examples + +::::{tab-set} + +:::{tab-item} Auth Failures Only +Raises exceptions only on authentication and authorization failures. 
+ +```yaml +config: + status_codes: [401, 403] +``` +::: + +:::{tab-item} All Client Errors Except Rate Limiting +Raises on all 4xx errors except timeout and rate limit errors. + +```yaml +config: + status_code_range_start: 400 + status_code_range_end: 499 + exclude_status_codes: [408, 429] +``` +::: + +:::{tab-item} Strict Mode - All Client Errors +Raises exceptions on any 4xx status code without exclusions. + +```yaml +config: + status_code_range_start: 400 + status_code_range_end: 499 + exclude_status_codes: [] +``` +::: + +:::: + +## Common Use Cases + +- **API Configuration Validation**: Fail immediately on authentication errors (401, 403) +- **Input Validation**: Stop evaluation on bad request errors (400) +- **Resource Existence**: Fail on not found errors (404) for critical resources +- **Development/Testing**: Use strict mode to catch all client-side issues +- **Production**: Use default settings to allow retryable errors while catching configuration issues diff --git a/docs/libraries/nemo-evaluator/interceptors/reasoning.md b/docs/libraries/nemo-evaluator/interceptors/reasoning.md index 1e913175..7706c7ea 100644 --- a/docs/libraries/nemo-evaluator/interceptors/reasoning.md +++ b/docs/libraries/nemo-evaluator/interceptors/reasoning.md @@ -2,8 +2,6 @@ # Reasoning -The reasoning interceptor processes chain-of-thought reasoning from model responses by removing reasoning tokens from content and tracking reasoning statistics. - ## Overview The `ResponseReasoningInterceptor` handles models that generate explicit reasoning steps, typically enclosed in special tokens. It removes reasoning content from the final response and tracks reasoning metrics for analysis. @@ -12,28 +10,10 @@ The `ResponseReasoningInterceptor` handles models that generate explicit reasoni ### Python Configuration -```python -from nemo_evaluator.adapters.adapter_config import AdapterConfig, InterceptorConfig - -adapter_config = AdapterConfig( - interceptors=[ - InterceptorConfig( - name="reasoning", - config={ - "start_reasoning_token": "", - "end_reasoning_token": "", - "add_reasoning": True, - "enable_reasoning_tracking": True - } - ) - ] -) -``` - ### CLI Configuration ```bash ---overrides 'target.api_endpoint.adapter_config.interceptors=[{"name":"reasoning","config":{"start_reasoning_token":"","end_reasoning_token":""}}]' +--overrides 'target.api_endpoint.adapter_config.use_reasoning=True,target.api_endpoint.adapter_config.end_reasoning_token="",target.api_endpoint.adapter_config.start_reasoning_token=""' ``` ### YAML Configuration @@ -49,6 +29,9 @@ target: end_reasoning_token: "" add_reasoning: true enable_reasoning_tracking: true + - name: "endpoint" + enabled: true + config: {} ``` ## Configuration Options @@ -111,48 +94,31 @@ The interceptor automatically tracks the following statistics: | `avg_updated_content_tokens` | Average token count in updated content | | `max_reasoning_words` | Maximum word count in reasoning content | | `max_reasoning_tokens` | Maximum token count in reasoning content | +| `max_original_content_words` | | +| `max_updated_content_words` | | | `max_updated_content_tokens` | Maximum token count in updated content | | `total_reasoning_words` | Total word count across all reasoning content | | `total_reasoning_tokens` | Total token count across all reasoning content | +| `total_original_content_words` | Total word count in original content (before processing) | +| `total_updated_content_words` | Total word count in updated content (after processing) | +| `total_updated_content_tokens` | 
Total token count in updated content | These statistics are saved to `eval_factory_metrics.json` under the `reasoning` key after evaluation completes. ## Example: Custom Reasoning Tokens -```python -from nemo_evaluator.adapters.adapter_config import AdapterConfig, InterceptorConfig - -# For models using different reasoning tokens -adapter_config = AdapterConfig( - interceptors=[ - InterceptorConfig( - name="reasoning", - config={ - "start_reasoning_token": "[REASONING]", - "end_reasoning_token": "[/REASONING]" - } - ) - ] -) -``` - -## Example: Combined with Other Interceptors - -```python -from nemo_evaluator.adapters.adapter_config import AdapterConfig, InterceptorConfig - -adapter_config = AdapterConfig( - interceptors=[ - InterceptorConfig(name="request_logging", config={"max_requests": 50}), - InterceptorConfig(name="response_logging", config={"max_responses": 50}), - InterceptorConfig( - name="reasoning", - config={ - "start_reasoning_token": "", - "end_reasoning_token": "", - "enable_reasoning_tracking": True - } - ) - ] -) +```yaml +target: + api_endpoint: + adapter_config: + interceptors: + - name: reasoning + config: + start_reasoning_token: "[REASONING]" + end_reasoning_token: "[/REASONING]" + add_reasoning: true + enable_reasoning_tracking: true + - name: "endpoint" + enabled: true + config: {} ``` diff --git a/docs/libraries/nemo-evaluator/interceptors/request-logging.md b/docs/libraries/nemo-evaluator/interceptors/request-logging.md new file mode 100644 index 00000000..86930aee --- /dev/null +++ b/docs/libraries/nemo-evaluator/interceptors/request-logging.md @@ -0,0 +1,39 @@ +(interceptor-request-logging)= +# Request Logging Interceptor + +## Overview + +The `RequestLoggingInterceptor` captures and logs incoming API requests for debugging, analysis, and audit purposes. This interceptor is essential for troubleshooting evaluation issues and understanding request patterns. + +## Configuration + +### CLI Configuration + +```bash +--overrides 'target.api_endpoint.adapter_config.use_request_logging=True,target.api_endpoint.adapter_config.max_saved_requests=1000' +``` + +### YAML Configuration + + +```yaml +target: + api_endpoint: + adapter_config: + interceptors: + - name: "request_logging" + enabled: true + config: + max_requests: 1000 + - name: "endpoint" + enabled: true + config: {} +``` + +## Configuration Options + +| Parameter | Description | Default | Type | +|--------------------|------------------------------------------------------------------------|-----------|---------| +| log_request_body | Whether to log the request body | `True` | bool | +| log_request_headers| Whether to log the request headers | `True` | bool | +| max_requests | Maximum number of requests to log (None for unlimited) | `2` | int/None| diff --git a/docs/libraries/nemo-evaluator/interceptors/response-logging.md b/docs/libraries/nemo-evaluator/interceptors/response-logging.md new file mode 100644 index 00000000..2e29624a --- /dev/null +++ b/docs/libraries/nemo-evaluator/interceptors/response-logging.md @@ -0,0 +1,40 @@ +(interceptor-response-logging)= +# Response Logging Interceptor + +## Overview + +The `ResponseLoggingInterceptor` captures and logs API responses for analysis and debugging. Use this interceptor to examine model outputs and identify response patterns. 
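+
+Response logging is commonly paired with request logging so that both sides of every call are captured. The following is a minimal sketch combining the two interceptors (both are documented in this reference; adjust the limits to your needs):
+
+```yaml
+target:
+  api_endpoint:
+    adapter_config:
+      interceptors:
+        - name: "request_logging"
+          enabled: true
+          config:
+            max_requests: 1000
+        - name: "response_logging"
+          enabled: true
+          config:
+            max_responses: 1000
+        - name: "endpoint"
+          enabled: true
+          config: {}
+```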
+ + +## Configuration + +### CLI Configuration + +```bash +--overrides 'target.api_endpoint.adapter_config.use_response_logging=True,target.api_endpoint.adapter_config.max_saved_responses=1000' +``` + +### YAML Configuration + + +```yaml +target: + api_endpoint: + adapter_config: + interceptors: + - name: "response_logging" + enabled: true + config: + max_responses: 1000 + - name: "endpoint" + enabled: true + config: {} +``` + +## Configuration Options + +| Parameter | Description | Default | Type | +|---------------------|--------------------------------------------------------------|-----------|---------------| +| `log_response_body` | Whether to log the response body contents. | `True` | `bool` | +| `log_response_headers`| Whether to log the response HTTP headers. | `True` | `bool` | +| `max_responses` | Maximum number of responses to log (None for unlimited). | `None` | `int` (optional)| \ No newline at end of file diff --git a/docs/libraries/nemo-evaluator/interceptors/response-stats.md b/docs/libraries/nemo-evaluator/interceptors/response-stats.md new file mode 100644 index 00000000..be0d2758 --- /dev/null +++ b/docs/libraries/nemo-evaluator/interceptors/response-stats.md @@ -0,0 +1,201 @@ +(interceptor-response-stats)= +# Response Stats Interceptor + +## Overview + +The `ResponseStatsInterceptor` collects comprehensive aggregated statistics from API responses for metrics collection and analysis. It tracks detailed metrics about token usage, response patterns, performance characteristics, and API behavior throughout the evaluation process. + +This interceptor is essential for understanding API performance, cost analysis, and monitoring evaluation runs. It provides both real-time aggregated statistics and detailed per-request tracking capabilities. 
+ +**Key Statistics Tracked:** + +- Token usage (prompt, completion, total) with averages and maximums +- Response status codes and counts +- Finish reasons and stop reasons +- Tool calls and function calls counts +- Response latency (average and maximum) +- Total response count and successful responses +- Inference run times and timing analysis + +## Configuration + +### CLI Configuration + +```bash +--overrides 'target.api_endpoint.adapter_config.tracking_requests_stats=True,target.api_endpoint.adapter_config.response_stats_cache=/tmp/response_stats_interceptor,target.api_endpoint.adapter_config.logging_aggregated_stats_interval=100' +``` + +### YAML Configuration + +```yaml +target: + api_endpoint: + adapter_config: + interceptors: + - name: "response_stats" + enabled: true + config: + # Default configuration - collect all statistics + collect_token_stats: true + collect_finish_reasons: true + collect_tool_calls: true + save_individuals: true + cache_dir: "/tmp/response_stats_interceptor" + logging_aggregated_stats_interval: 100 + - name: "endpoint" + enabled: true + config: {} +``` + +```yaml +target: + api_endpoint: + adapter_config: + interceptors: + - name: "response_stats" + enabled: true + config: + # Minimal configuration - only basic stats + collect_token_stats: false + collect_finish_reasons: false + collect_tool_calls: false + save_individuals: false + logging_aggregated_stats_interval: 50 + - name: "endpoint" + enabled: true + config: {} +``` + +```yaml +target: + api_endpoint: + adapter_config: + interceptors: + - name: "response_stats" + enabled: true + config: + # Custom configuration with periodic saving + collect_token_stats: true + collect_finish_reasons: true + collect_tool_calls: true + stats_file_saving_interval: 100 + save_individuals: true + cache_dir: "/custom/stats/cache" + logging_aggregated_stats_interval: 25 + - name: "endpoint" + enabled: true + config: {} +``` + +## Configuration Options + +| Parameter | Type | Default | Description | +|-----------|------|---------|-------------| +| `collect_token_stats` | `bool` | `true` | Whether to collect token statistics (prompt, completion, total tokens) | +| `collect_finish_reasons` | `bool` | `true` | Whether to collect and track finish reasons from API responses | +| `collect_tool_calls` | `bool` | `true` | Whether to collect tool call and function call statistics | +| `stats_file_saving_interval` | `int` | `None` | How often (every N responses) to save stats to file. If None, only saves via post_eval_hook | +| `save_individuals` | `bool` | `true` | Whether to save individual request statistics. If false, only saves aggregated stats | +| `cache_dir` | `str` | `"/tmp/response_stats_interceptor"` | Custom cache directory for storing response statistics | +| `logging_aggregated_stats_interval` | `int` | `100` | How often (every N responses) to log aggregated statistics to console | + +## Behavior + +### Statistics Collection +The interceptor automatically collects statistics from successful API responses (HTTP 200) and tracks basic information for all responses regardless of status code. 
+ +**For Successful Responses (200):** +- Parses JSON response body +- Extracts token usage from `usage` field +- Collects finish reasons from `choices[].finish_reason` +- Counts tool calls and function calls +- Calculates running averages and maximums + +**For All Responses:** +- Tracks status code distribution +- Measures response latency +- Records response timestamps +- Maintains response counts + +### Data Storage +- **Aggregated Stats**: Continuously updated running statistics stored in cache +- **Individual Stats**: Per-request details stored with request IDs (if enabled) +- **Metrics File**: Final statistics saved to `eval_factory_metrics.json` +- **Thread Safety**: All operations are thread-safe using locks + +### Timing Analysis +- Tracks inference run times across multiple evaluation runs +- Calculates time from first to last request per run +- Estimates time to first request from adapter initialization +- Provides detailed timing breakdowns for performance analysis + +## Statistics Output + +### Aggregated Statistics +```json +{ + "response_stats": { + "description": "Response statistics saved during processing", + "avg_prompt_tokens": 150.5, + "avg_total_tokens": 200.3, + "avg_completion_tokens": 49.8, + "avg_latency_ms": 1250.2, + "max_prompt_tokens": 300, + "max_total_tokens": 450, + "max_completion_tokens": 150, + "max_latency_ms": 3000, + "count": 1000, + "successful_count": 995, + "tool_calls_count": 50, + "function_calls_count": 25, + "finish_reason": { + "stop": 800, + "length": 150, + "tool_calls": 45 + }, + "status_codes": { + "200": 995, + "429": 3, + "500": 2 + }, + "inference_time": 45.6, + "run_id": 0 + } +} +``` + +### Individual Request Statistics (if enabled) +```json +{ + "request_id": "req_123", + "timestamp": 1698765432.123, + "status_code": 200, + "prompt_tokens": 150, + "total_tokens": 200, + "completion_tokens": 50, + "finish_reason": "stop", + "tool_calls_count": 0, + "function_calls_count": 0, + "run_id": 0 +} +``` + + +## Common Use Cases + +- **Cost Analysis**: Track token usage patterns to estimate API costs +- **Performance Monitoring**: Monitor response times and throughput +- **Quality Assessment**: Analyze finish reasons and response patterns +- **Tool Usage Analysis**: Track function and tool call frequencies +- **Debugging**: Individual request tracking for troubleshooting +- **Capacity Planning**: Understand API usage patterns and limits +- **A/B Testing**: Compare statistics across different configurations +- **Production Monitoring**: Real-time visibility into API behavior + +## Integration Notes + +- **Post-Evaluation Hook**: Automatically saves final statistics after evaluation completes +- **Cache Persistence**: Statistics survive across runs and can be aggregated +- **Thread Safety**: Safe for concurrent request processing +- **Memory Efficient**: Uses running averages to avoid storing all individual values +- **Caching Strategy**: Handles cache hits by skipping statistics collection to avoid double-counting diff --git a/docs/libraries/nemo-evaluator/interceptors/system-messages.md b/docs/libraries/nemo-evaluator/interceptors/system-messages.md index 3d99da9e..cde0e45c 100644 --- a/docs/libraries/nemo-evaluator/interceptors/system-messages.md +++ b/docs/libraries/nemo-evaluator/interceptors/system-messages.md @@ -3,29 +3,15 @@ # System Messages -The system message interceptor injects custom system prompts into evaluation requests, enabling consistent prompting and role-specific behavior across evaluations. 
- ## Overview The `SystemMessageInterceptor` modifies incoming requests to include custom system messages. This interceptor works with chat-format requests, replacing any existing system messages with the configured message. ## Configuration -### Python Configuration - -```python -from nemo_evaluator.adapters.adapter_config import AdapterConfig, InterceptorConfig - -adapter_config = AdapterConfig( - interceptors=[ - InterceptorConfig( - name="system_message", - config={ - "system_message": "You are a helpful AI assistant." - } - ) - ] -) +### CLI Configuration +```bash +--overrides 'target.api_endpoint.adapter_config.use_system_prompt=True,target.api_endpoint.adapter_config.custom_system_prompt="You are a helpful assistant."' ``` ### YAML Configuration @@ -38,6 +24,9 @@ target: - name: system_message config: system_message: "You are a helpful AI assistant." + - name: "endpoint" + enabled: true + config: {} ``` ## Configuration Options @@ -92,27 +81,3 @@ If an existing system message is present, the interceptor replaces it: ] } ``` - -## Usage Example - -```python -from nemo_evaluator.adapters.adapter_config import AdapterConfig, InterceptorConfig - -# System message with other interceptors -adapter_config = AdapterConfig( - interceptors=[ - InterceptorConfig( - name="system_message", - config={ - "system_message": "You are an expert problem solver." - } - ), - InterceptorConfig( - name="caching", - config={ - "cache_dir": "./cache" - } - ) - ] -) -``` diff --git a/docs/libraries/nemo-evaluator/logging.md b/docs/libraries/nemo-evaluator/logging.md index 119ea7dc..254ec8f4 100644 --- a/docs/libraries/nemo-evaluator/logging.md +++ b/docs/libraries/nemo-evaluator/logging.md @@ -60,7 +60,7 @@ Specify a custom log directory using the `NEMO_EVALUATOR_LOG_DIR` environment va export NEMO_EVALUATOR_LOG_DIR=/path/to/logs/ # Run evaluation (logs will be written to the specified directory) -eval-factory run_eval ... +nemo-evaluator run_eval ... ``` If `NEMO_EVALUATOR_LOG_DIR` is not set, logs appear in the console (stderr) without file output. @@ -119,7 +119,7 @@ export LOG_LEVEL=DEBUG export LOG_LEVEL=DEBUG # Run evaluation with logging -eval-factory run_eval --eval_type mmlu_pro --model_id gpt-4 ... +nemo-evaluator run_eval --eval_type mmlu_pro --model_id gpt-4 ... ``` ### Custom log directory @@ -129,7 +129,7 @@ eval-factory run_eval --eval_type mmlu_pro --model_id gpt-4 ... export NEMO_EVALUATOR_LOG_DIR=./my_logs/ # Run evaluation with logging to custom directory -eval-factory run_eval --eval_type mmlu_pro ... +nemo-evaluator run_eval --eval_type mmlu_pro ... ``` ### Environment verification diff --git a/docs/libraries/nemo-evaluator/workflows/cli.md b/docs/libraries/nemo-evaluator/workflows/cli.md new file mode 100644 index 00000000..4377b73f --- /dev/null +++ b/docs/libraries/nemo-evaluator/workflows/cli.md @@ -0,0 +1,246 @@ +(cli-workflows)= + +# CLI Workflows + +This document explains how to use evaluation containers within NeMo Evaluator workflows, focusing on command execution and configuration. + +## Overview + +Evaluation containers provide consistent, reproducible environments for running AI model evaluations. For a comprehensive list of all available containers, refer to {ref}`nemo-evaluator-containers`. 
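+
+For example, you can start one of the NGC evaluation containers interactively and run every command shown below inside it (the `simple-evals` image is used here as an illustration; substitute the container that ships your benchmark):
+
+```bash
+docker run --rm -it nvcr.io/nvidia/eval-factory/simple-evals:{{ docker_compose_latest }} bash
+```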
+ +## Basic CLI + +### Using YAML Configuration + +Define your config: + +```yaml +config: + type: mmlu_pro + output_dir: /workspace/results + params: + limit_samples: 10 +target: + api_endpoint: + url: https://integrate.api.nvidia.com/v1/chat/completions + model_id: meta/llama-3.1-8b-instruct + type: chat + api_key: NGC_API_KEY +``` + +Run evaluation: + +```bash +export HF_TOKEN=hf_xxx +export MY_API_KEY=nvapi-xxx + +nemo-evaluator run_eval \ + --run_config /workspace/my_config.yml +``` + +### Using CLI overrides + +Provide all arguments through CLI: + +```bash +export HF_TOKEN=hf_xxx +export MY_API_KEY=nvapi-xxx + +nemo-evaluator run_eval \ + --eval_type mmlu_pro \ + --model_id meta/llama-3.1-8b-instruct \ + --model_url https://integrate.api.nvidia.com/v1/chat/completions \ + --model_type chat \ + --api_key_name NGC_API_KEY \ + --output_dir /workspace/results \ + --overrides 'config.params.limit_samples=10' +``` + +## Interceptor Configuration + +The adapter system uses interceptors to modify requests and responses. Configure interceptors using the `--overrides` parameter. + +For detailed interceptor configuration, refer to {ref}`nemo-evaluator-interceptors`. + +:::{note} +Always remember to include `endpoint` Interceptor at the and of your custom Interceptors chain. +::: + + +### Enable Request Logging + +```yaml +config: + type: mmlu_pro + output_dir: /workspace/results + params: + limit_samples: 10 +target: + api_endpoint: + url: https://integrate.api.nvidia.com/v1/chat/completions + model_id: meta/llama-3.1-8b-instruct + type: chat + api_key: NGC_API_KEY + adapter_config: + interceptors: + - name: "request_logging" + enabled: true + config: + max_requests: 1000 + - name: "endpoint" + enabled: true + config: {} +``` + +```bash +export HF_TOKEN=hf_xxx +export MY_API_KEY=nvapi-xxx + +nemo-evaluator run_eval \ + --run_config /workspace/my_config.yml +``` + + +### Enable Caching + +```yaml +config: + type: mmlu_pro + output_dir: /workspace/results + params: + limit_samples: 10 +target: + api_endpoint: + url: https://integrate.api.nvidia.com/v1/chat/completions + model_id: meta/llama-3.1-8b-instruct + type: chat + api_key: NGC_API_KEY + adapter_config: + interceptors: + - name: "caching" + enabled: true + config: + cache_dir: "./evaluation_cache" + reuse_cached_responses: true + save_requests: true + save_responses: true + max_saved_requests: 1000 + max_saved_responses: 1000 + - name: "endpoint" + enabled: true + config: {} +``` + +```bash +export HF_TOKEN=hf_xxx +export MY_API_KEY=nvapi-xxx + +nemo-evaluator run_eval \ + --run_config /workspace/my_config.yml +``` + +### Multiple Interceptors + +```yaml +config: + type: mmlu_pro + output_dir: /workspace/results + params: + limit_samples: 10 +target: + api_endpoint: + url: https://integrate.api.nvidia.com/v1/chat/completions + model_id: meta/llama-3.1-8b-instruct + type: chat + api_key: NGC_API_KEY + adapter_config: + interceptors: + - name: "caching" + enabled: true + config: + cache_dir: "./evaluation_cache" + reuse_cached_responses: true + save_requests: true + save_responses: true + max_saved_requests: 1000 + max_saved_responses: 1000 + - name: "request_logging" + enabled: true + config: + max_requests: 1000 + - name: "reasoning" + config: + start_reasoning_token: "" + end_reasoning_token: "" + add_reasoning: true + enable_reasoning_tracking: true + - name: "endpoint" + enabled: true + config: {} +``` + +```bash +export HF_TOKEN=hf_xxx +export MY_API_KEY=nvapi-xxx + +nemo-evaluator run_eval \ + --run_config /workspace/my_config.yml +``` 
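+
+### Post-Evaluation Report Hook
+
+Post-evaluation hooks can be attached to the same `adapter_config` alongside interceptors, for example to generate HTML and JSON reports from cached request-response pairs. The fragment below is a sketch of the `target` section based on the post-evaluation hooks reference; refer to {ref}`nemo-evaluator-interceptors` for the full set of options:
+
+```yaml
+target:
+  api_endpoint:
+    adapter_config:
+      interceptors:
+        - name: "caching"
+          enabled: true
+          config:
+            cache_dir: "./evaluation_cache"
+            save_requests: true
+            save_responses: true
+        - name: "endpoint"
+          enabled: true
+          config: {}
+      post_eval_hooks:
+        - name: "post_eval_report"
+          enabled: true
+          config:
+            report_types: ["html", "json"]
+            html_report_size: 10
+```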
+ +### Legacy Configuration Support + +Provide Interceptor configuration with `--overrides` flag: + +```bash +nemo-evaluator run_eval \ + --eval_type mmlu_pro \ + --model_id meta/llama-3.1-8b-instruct \ + --model_url https://integrate.api.nvidia.com/v1/chat/completions \ + --model_type chat \ + --api_key_name MY_API_KEY \ + --output_dir ./results \ + --overrides 'target.api_endpoint.adapter_config.use_request_logging=True,target.api_endpoint.adapter_config.max_saved_requests=1000,target.api_endpoint.adapter_config.use_caching=True,target.api_endpoint.adapter_config.caching_dir=./cache,target.api_endpoint.adapter_config.reuse_cached_responses=True' +``` + +:::{note} +Legacy parameters will be automatically converted to the modern interceptor-based configuration. For new projects, use the YAML interceptor configutation shown above. +::: + +## Troubleshooting + +### Port Conflicts + +If you encounter adapter server port conflicts: + +```bash +export ADAPTER_PORT=3828 +export ADAPTER_HOST=localhost +``` + +:::{note} +You can manually set the port, or rely on NeMo Evaluator's dynamic port binding feature. +::: + +### API Key Issues + +Verify your API key environment variable: + +```bash +echo $MY_API_KEY +``` + +## Environment Variables + +### Adapter Server Configuration + +```bash +export ADAPTER_PORT=3828 # Default: 3825 +export ADAPTER_HOST=localhost +``` + +### API Key Management + +```bash +export MY_API_KEY=your_api_key_here +export HF_TOKEN=your_hf_token_here +``` diff --git a/docs/libraries/nemo-evaluator/workflows/index.md b/docs/libraries/nemo-evaluator/workflows/index.md index c47340a3..f8a430c0 100644 --- a/docs/libraries/nemo-evaluator/workflows/index.md +++ b/docs/libraries/nemo-evaluator/workflows/index.md @@ -1,20 +1,20 @@ (workflows-overview)= -# Container Workflows +# Workflows -Learn how to use NeMo Evaluator through different workflow patterns. Whether you prefer programmatic control through Python APIs or direct container usage, these guides provide practical examples for integrating evaluations into your ML pipelines. +Learn how to use NeMo Evaluator through different workflow patterns. Whether you prefer programmatic control through Python APIs or CLI, these guides provide practical examples for integrating evaluations into your ML pipelines. ::::{grid} 1 2 2 2 :gutter: 1 1 1 2 -:::{grid-item-card} {octicon}`container;1.5em;sd-mr-1` Using Containers -:link: using_containers +:::{grid-item-card} {octicon}`command-palette;1.5em;sd-mr-1` CLI +:link: cli :link-type: doc -Run evaluations using the pre-built NGC containers directly with Docker or container orchestration platforms. +Run evaluations using the pre-built NGC containers and command line interface. ::: -:::{grid-item-card} {octicon}`code;1.5em;sd-mr-1` Python API +:::{grid-item-card} {octicon}`file-code;1.5em;sd-mr-1` Python API :link: python-api :link-type: doc @@ -26,14 +26,14 @@ Use the NeMo Evaluator Python API to integrate evaluations directly into your ex ## Choose Your Workflow - **Python API**: Integrate evaluations directly into your existing Python applications when you need dynamic configuration management or programmatic control -- **Container Usage**: Use pre-built containers when you work with CI/CD systems, container orchestration platforms, or need complete control over the container environment +- **CLI**: Use CLI when you work with CI/CD systems, container orchestration platforms, or other non-interactive workflows. 
-Both approaches use the same underlying evaluation containers and produce identical, reproducible results. Choose based on your integration requirements and preferred level of abstraction. +Both approaches use the same underlying evaluation package and produce identical, reproducible results. Choose based on your integration requirements and preferred level of abstraction. :::{toctree} -:caption: Container Workflows +:caption: Workflows :hidden: -Using Containers +CLI Python API ::: diff --git a/docs/libraries/nemo-evaluator/workflows/python-api.md b/docs/libraries/nemo-evaluator/workflows/python-api.md index adc6740e..4dc87ba7 100644 --- a/docs/libraries/nemo-evaluator/workflows/python-api.md +++ b/docs/libraries/nemo-evaluator/workflows/python-api.md @@ -20,11 +20,19 @@ The Python API is built on top of NeMo Evaluator and provides: |--------------|----------| | nvidia-bfcl | https://pypi.org/project/nvidia-bfcl/ | | nvidia-bigcode-eval | https://pypi.org/project/nvidia-bigcode-eval/ | -| nvidia-crfm-helm | https://pypi.org/project/nvidia-crfm-helm/ | +| nvidia-compute-eval | https://pypi.org/project/nvidia-compute-eval/ | | nvidia-eval-factory-garak | https://pypi.org/project/nvidia-eval-factory-garak/ | +| nvidia-genai-perf-eval | https://pypi.org/project/nvidia-genai-perf-eval/ | +| nvidia-crfm-helm | https://pypi.org/project/nvidia-crfm-helm/ | +| nvidia-hle | https://pypi.org/project/nvidia-hle/ | +| nvidia-ifbench | https://pypi.org/project/nvidia-ifbench/ | +| nvidia-livecodebench | https://pypi.org/project/nvidia-livecodebench/ | | nvidia-lm-eval | https://pypi.org/project/nvidia-lm-eval/ | +| nvidia-mmath | https://pypi.org/project/nvidia-mmath/ | | nvidia-mtbench-evaluator | https://pypi.org/project/nvidia-mtbench-evaluator/ | +| nvidia-eval-factory-nemo-skills | https://pypi.org/project/nvidia-eval-factory-nemo-skills/ | | nvidia-safety-harness | https://pypi.org/project/nvidia-safety-harness/ | +| nvidia-scicode | https://pypi.org/project/nvidia-scicode/ | | nvidia-simple-evals | https://pypi.org/project/nvidia-simple-evals/ | | nvidia-tooltalk | https://pypi.org/project/nvidia-tooltalk/ | | nvidia-vlmeval | https://pypi.org/project/nvidia-vlmeval/ | @@ -138,7 +146,10 @@ adapter_config = AdapterConfig( # Enable progress tracking InterceptorConfig( name="progress_tracking" - ) + ), + InterceptorConfig( + name="endpoint" + ), ] ) diff --git a/docs/libraries/nemo-evaluator/workflows/using_containers.md b/docs/libraries/nemo-evaluator/workflows/using_containers.md deleted file mode 100644 index 124f81b0..00000000 --- a/docs/libraries/nemo-evaluator/workflows/using_containers.md +++ /dev/null @@ -1,123 +0,0 @@ -(container-workflows)= - -# Container Workflows - -This document explains how to use evaluation containers within NeMo Evaluator workflows, focusing on command execution and configuration. - -## Overview - -Evaluation containers provide consistent, reproducible environments for running AI model evaluations. For a comprehensive list of all available containers, see {ref}`nemo-evaluator-containers`. 
- -## Basic Container Usage - -### Running an Evaluation - -```bash -docker run --rm -it nvcr.io/nvidia/eval-factory/simple-evals:{{ docker_compose_latest }} bash - -export HF_TOKEN=hf_xxx -export MY_API_KEY=nvapi-xxx - -eval-factory run_eval \ - --eval_type mmlu_pro \ - --model_id meta/llama-3.1-8b-instruct \ - --model_url https://integrate.api.nvidia.com/v1/chat/completions \ - --model_type chat \ - --api_key_name MY_API_KEY \ - --output_dir /workspace/results \ - --overrides 'config.params.limit_samples=10' -``` - -## Interceptor Configuration - -The adapter system uses interceptors to modify requests and responses. Configure interceptors using the `--overrides` parameter. - -### Enable Request Logging - -```bash -eval-factory run_eval \ - --eval_type mmlu_pro \ - --model_id meta/llama-3.1-8b-instruct \ - --model_url https://integrate.api.nvidia.com/v1/chat/completions \ - --model_type chat \ - --api_key_name MY_API_KEY \ - --output_dir ./results \ - --overrides 'target.api_endpoint.adapter_config.interceptors=[{"name":"request_logging","config":{"max_requests":100}}]' -``` - -### Enable Caching - -```bash -eval-factory run_eval \ - --eval_type mmlu_pro \ - --model_id meta/llama-3.1-8b-instruct \ - --model_url https://integrate.api.nvidia.com/v1/chat/completions \ - --model_type chat \ - --api_key_name MY_API_KEY \ - --output_dir ./results \ - --overrides 'target.api_endpoint.adapter_config.interceptors=[{"name":"caching","config":{"cache_dir":"./cache","reuse_cached_responses":true}}]' -``` - -### Multiple Interceptors - -Combine multiple interceptors in a single command: - -```bash -eval-factory run_eval \ - --eval_type mmlu_pro \ - --model_id meta/llama-3.1-8b-instruct \ - --model_url https://integrate.api.nvidia.com/v1/chat/completions \ - --model_type chat \ - --api_key_name MY_API_KEY \ - --output_dir ./results \ - --overrides 'target.api_endpoint.adapter_config.interceptors=[{"name":"request_logging"},{"name":"caching","config":{"cache_dir":"./cache"}},{"name":"reasoning","config":{"start_reasoning_token":"","end_reasoning_token":""}}]' -``` - -For detailed interceptor configuration, see {ref}`nemo-evaluator-interceptors`. - -## Legacy Configuration Support - -Legacy parameter names are still supported for backward compatibility: - -```bash ---overrides 'target.api_endpoint.adapter_config.use_request_logging=true,target.api_endpoint.adapter_config.use_caching=true' -``` - -:::{note} -Legacy parameters will be automatically converted to the modern interceptor-based configuration. For new projects, use the interceptor syntax shown above. 
-::: - -## Troubleshooting - -### Port Conflicts - -If you encounter adapter server port conflicts: - -```bash -export ADAPTER_PORT=3828 -export ADAPTER_HOST=localhost -``` - -### API Key Issues - -Verify your API key environment variable: - -```bash -echo $MY_API_KEY -``` - -## Environment Variables - -### Adapter Server Configuration - -```bash -export ADAPTER_PORT=3828 # Default: 3825 -export ADAPTER_HOST=localhost -``` - -### API Key Management - -```bash -export MY_API_KEY=your_api_key_here -export HF_TOKEN=your_hf_token_here -``` diff --git a/docs/references/evaluation-utils.md b/docs/references/evaluation-utils.md index 11322699..d44336bd 100644 --- a/docs/references/evaluation-utils.md +++ b/docs/references/evaluation-utils.md @@ -74,10 +74,10 @@ To filter tasks using the CLI: ```bash # List all tasks -nv-eval ls tasks +nemo-evaluator-launcher ls tasks # Filter for specific tasks -nv-eval ls tasks | grep mmlu +nemo-evaluator-launcher ls tasks | grep mmlu ``` #### Check Installation Status @@ -212,7 +212,7 @@ When a task name is provided by more than one framework (for example, both `lm-e ```bash # Use explicit framework.task format in your configuration overrides -nv-eval run --config-dir examples --config-name local_llama_3_1_8b_instruct \ +nemo-evaluator-launcher run --config-dir examples --config-name local_llama_3_1_8b_instruct \ -o 'evaluation.tasks=["lm-evaluation-harness.mmlu"]' ``` @@ -235,13 +235,13 @@ tasks = get_tasks_list() ```bash # List all tasks -nv-eval ls tasks +nemo-evaluator-launcher ls tasks # List recent evaluation runs -nv-eval ls runs +nemo-evaluator-launcher ls runs # Get detailed help -nv-eval --help +nemo-evaluator-launcher --help ``` --- diff --git a/docs/troubleshooting/index.md b/docs/troubleshooting/index.md index 4f2449e6..b995a5fa 100644 --- a/docs/troubleshooting/index.md +++ b/docs/troubleshooting/index.md @@ -20,16 +20,16 @@ Before diving into specific problem areas, run these basic checks to verify your ```bash # Verify launcher installation and basic functionality -nv-eval --version +nemo-evaluator-launcher --version # List available tasks -nv-eval ls tasks +nemo-evaluator-launcher ls tasks # Validate configuration without running -nv-eval run --config-dir examples --config-name local_llama_3_1_8b_instruct --dry-run +nemo-evaluator-launcher run --config-dir examples --config-name local_llama_3_1_8b_instruct --dry-run # Check recent runs -nv-eval ls runs +nemo-evaluator-launcher ls runs ``` ::: diff --git a/docs/troubleshooting/runtime-issues/index.md b/docs/troubleshooting/runtime-issues/index.md index ac77b27b..0324af57 100644 --- a/docs/troubleshooting/runtime-issues/index.md +++ b/docs/troubleshooting/runtime-issues/index.md @@ -12,7 +12,7 @@ When evaluations fail during execution, start with these diagnostic steps: ```bash # Validate configuration before running -nv-eval run --config-dir examples --config-name local_llama_3_1_8b_instruct --dry-run +nemo-evaluator-launcher run --config-dir examples --config-name local_llama_3_1_8b_instruct --dry-run # Test minimal configuration python -c " diff --git a/docs/troubleshooting/runtime-issues/launcher.md b/docs/troubleshooting/runtime-issues/launcher.md index bd3b5edd..be92a645 100644 --- a/docs/troubleshooting/runtime-issues/launcher.md +++ b/docs/troubleshooting/runtime-issues/launcher.md @@ -12,7 +12,7 @@ Troubleshooting guide for NeMo Evaluator Launcher-specific problems including co ```bash # Validate configuration without running -nv-eval run --config-dir examples --config-name 
local_llama_3_1_8b_instruct --dry-run +nemo-evaluator-launcher run --config-dir examples --config-name local_llama_3_1_8b_instruct --dry-run ``` **Common Issues**: @@ -25,7 +25,7 @@ Error: Missing required field 'execution.output_dir' ``` **Fix**: Add output directory to config or override: ```bash -nv-eval run --config-dir examples --config-name local_llama_3_1_8b_instruct \ +nemo-evaluator-launcher run --config-dir examples --config-name local_llama_3_1_8b_instruct \ -o execution.output_dir=./results ``` @@ -39,7 +39,7 @@ Error: Unknown task 'invalid_task'. Available tasks: hellaswag, arc_challenge, . ``` **Fix**: List available tasks and use correct names: ```bash -nv-eval ls tasks +nemo-evaluator-launcher ls tasks ``` :::: @@ -84,7 +84,7 @@ defaults: 3. **Use Absolute Paths**: ```bash -nv-eval run --config-dir /absolute/path/to/configs --config-name my_config +nemo-evaluator-launcher run --config-dir /absolute/path/to/configs --config-name my_config ``` ## Job Management Issues @@ -96,13 +96,13 @@ nv-eval run --config-dir /absolute/path/to/configs --config-name my_config **Diagnosis**: ```bash # Check job status -nv-eval status +nemo-evaluator-launcher status # List all runs -nv-eval ls runs +nemo-evaluator-launcher ls runs # Check specific job -nv-eval status +nemo-evaluator-launcher status ``` **Common Issues**: @@ -113,7 +113,7 @@ Error: Invocation 'abc123' not found ``` **Fix**: Use correct invocation ID from run output or list recent runs: ```bash -nv-eval ls runs +nemo-evaluator-launcher ls runs ``` 2. **Stale Job Database**: @@ -130,10 +130,10 @@ ls -la ~/.nemo-evaluator/exec-db/exec.v1.jsonl **Solutions**: ```bash # Kill entire invocation -nv-eval kill +nemo-evaluator-launcher kill # Kill specific job -nv-eval kill +nemo-evaluator-launcher kill ``` **Executor-Specific Issues**: @@ -253,10 +253,10 @@ Error: Deployment failed to reach Ready state **Diagnosis**: ```bash # List completed runs -nv-eval ls runs +nemo-evaluator-launcher ls runs # Try export -nv-eval export --dest local --format json +nemo-evaluator-launcher export --dest local --format json ``` **Common Issues**: @@ -289,13 +289,13 @@ When reporting launcher issues, include: 1. **Configuration Details**: ```bash # Show resolved configuration -nv-eval run --config-dir examples --config-name --dry-run +nemo-evaluator-launcher run --config-dir examples --config-name --dry-run ``` 2. **System Information**: ```bash # Launcher version -nv-eval --version +nemo-evaluator-launcher --version # System info python --version @@ -307,10 +307,10 @@ lep workspace list # For Lepton executor 3. **Job Information**: ```bash # Job status -nv-eval status +nemo-evaluator-launcher status # Recent runs -nv-eval ls runs +nemo-evaluator-launcher ls runs ``` 4. **Log Files**: diff --git a/docs/troubleshooting/setup-issues/installation.md b/docs/troubleshooting/setup-issues/installation.md index 71047445..aa7835da 100644 --- a/docs/troubleshooting/setup-issues/installation.md +++ b/docs/troubleshooting/setup-issues/installation.md @@ -30,7 +30,7 @@ show_available_tasks() Or use the CLI: ```bash -nv-eval ls tasks +nemo-evaluator-launcher ls tasks ``` **Solution**: diff --git a/docs/tutorials/create-framework-definition-file.md b/docs/tutorials/create-framework-definition-file.md index 4fa60cc5..6ceb9d60 100644 --- a/docs/tutorials/create-framework-definition-file.md +++ b/docs/tutorials/create-framework-definition-file.md @@ -19,7 +19,7 @@ Learn by building a complete FDF for a simple evaluation framework. 
By the end, you'll have integrated your evaluation framework with {{ product_name_short }}, allowing users to run: ```bash -eval-factory run_eval \ +nemo-evaluator run_eval \ --eval_type domain_specific_task \ --model_id meta/llama-3.1-8b-instruct \ --model_url https://integrate.api.nvidia.com/v1/chat/completions \ @@ -233,7 +233,7 @@ pip install -e . eval-factory list_evals --framework domain-eval # Run a test evaluation -eval-factory run_eval \ +nemo-evaluator run_eval \ --eval_type medical_qa \ --model_id gpt-3.5-turbo \ --model_url https://api.openai.com/v1/chat/completions \ diff --git a/docs/tutorials/local_evaluation_of_existing_endpoint.md b/docs/tutorials/local-evaluation-of-existing-endpoint.md similarity index 89% rename from docs/tutorials/local_evaluation_of_existing_endpoint.md rename to docs/tutorials/local-evaluation-of-existing-endpoint.md index 79b5038d..019017fb 100644 --- a/docs/tutorials/local_evaluation_of_existing_endpoint.md +++ b/docs/tutorials/local-evaluation-of-existing-endpoint.md @@ -39,7 +39,7 @@ For this tutorial we will use `meta/llama-3.1-8b-instruct` from [build.nvidia.co Choose which benchmarks to evaluate. Available tasks include: ```bash -nv-eval ls tasks +nemo-evaluator-launcher ls tasks ``` For a comprehensive list of supported tasks and descriptions, see {ref}`nemo-evaluator-containers`. @@ -75,7 +75,7 @@ target: api_endpoint: model_id: meta/llama-3.1-8b-instruct # TODO: update to the model you want to evaluate url: https://integrate.api.nvidia.com/v1/chat/completions # TODO: update to the endpoint you want to evaluate - api_key_name: API_KEY # API Key with access to build.nvidia.com or model of your choice + api_key_name: NGC_API_KEY # API Key with access to build.nvidia.com or model of your choice # specify the benchmarks to evaluate evaluation: @@ -89,21 +89,21 @@ evaluation: ### 4. Run Evaluation ```bash -nv-eval run --config-dir configs --config-name local_endpoint \ - -o target.api_endpoint.api_key_name=API_KEY +nemo-evaluator-launcher run --config-dir configs --config-name local_endpoint \ + -o target.api_endpoint.api_key_name=NGC_API_KEY ``` ### 5. Run the Same Evaluation for a Different Model (Using CLI Overrides) ```bash -export API_KEY= +export NGC_API_KEY= MODEL_NAME= URL= # Note: endpoint URL needs to be FULL (e.g., https://api.example.com/v1/chat/completions) -nv-eval run --config-dir configs --config-name local_endpoint \ +nemo-evaluator-launcher run --config-dir configs --config-name local_endpoint \ -o target.api_endpoint.model_id=$MODEL_NAME \ -o target.api_endpoint.url=$URL \ - -o target.api_endpoint.api_key_name=API_KEY + -o target.api_endpoint.api_key_name=NGC_API_KEY ``` After launching, you can view logs and job status. When jobs finish, you can display results and export them using the available exporters. Refer to {ref}`exporters-overview` for available export options.
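+
+For example, assuming the launcher accepts the invocation ID printed when the run is launched (check `nemo-evaluator-launcher --help` for the exact arguments), a typical follow-up looks like this:
+
+```bash
+# Check the status of the launched jobs
+nemo-evaluator-launcher status <invocation_id>
+
+# Export the finished results locally as JSON
+nemo-evaluator-launcher export <invocation_id> --dest local --format json
+```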