feat: support for adding names to eval runs #80
base: main
Conversation
- Add MAX_RUN_NAME_LENGTH constant (100 chars) to constants.py
- Add optional run_name field to EvaluationData model with max length validation
- Prepares foundation for customizable evaluation run naming

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
Signed-off-by: Máirín Duffy <[email protected]>
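A minimal sketch of the two pieces this commit describes; the constant and field name come from the commit message, while the surrounding model shape (and the other EvaluationData fields, omitted here) is assumed:

```python
from pydantic import BaseModel, Field

# constants.py (sketch)
MAX_RUN_NAME_LENGTH = 100


# models/data.py (sketch) -- other EvaluationData fields omitted
class EvaluationData(BaseModel):
    run_name: str | None = Field(
        default=None,
        max_length=MAX_RUN_NAME_LENGTH,
        description=f"Optional run name (max {MAX_RUN_NAME_LENGTH} chars)",
    )
```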
- Create new utils package with sanitize_run_name function
- Sanitizes filesystem-unsafe characters (/, \, :, *, ?, ", ', `, <, >, |, control chars)
- Collapses multiple spaces/underscores into single underscore
- Enforces MAX_RUN_NAME_LENGTH (100 chars) limit
- Handles edge cases (empty strings, truncation)

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
Signed-off-by: Máirín Duffy <[email protected]>
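A sketch of the behavior listed above, assuming the rules in the commit message; the real implementation in src/lightspeed_evaluation/core/utils/sanitize.py may differ in detail:

```python
import re

MAX_RUN_NAME_LENGTH = 100  # mirrors core/constants.py


def sanitize_run_name(run_name: str) -> str:
    """Replace filesystem-unsafe characters and enforce the length limit."""
    # Strip surrounding whitespace, then replace unsafe and control
    # characters with underscores
    name = re.sub(r'[/\\:*?"\'`<>|\x00-\x1f]', "_", run_name.strip())
    # Collapse runs of spaces and underscores into a single underscore
    name = re.sub(r"[ _]+", "_", name)
    # Enforce the max length and drop any trailing underscore left behind
    return name[:MAX_RUN_NAME_LENGTH].rstrip("_")
```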
- Add logic to automatically set run_name from evaluation data YAML filename
- Only applies when run_name is not explicitly provided in YAML
- Uses sanitize_run_name to ensure filesystem safety
- Example: rh124_filesystem_basics.yaml → run_name: "rh124_filesystem_basics"

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
Signed-off-by: Máirín Duffy <[email protected]>
- Add --run-name CLI argument to evaluation runner
- Implement priority: CLI override > YAML value > filename default
- Pass run_name through runner to DataValidator
- Update DataValidator to accept run_name_override parameter
- All run names sanitized for filesystem safety

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
Signed-off-by: Máirín Duffy <[email protected]>
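A rough sketch of the CLI wiring this commit describes; the --run-name flag and the override semantics come from the commit message, while the parser setup and help text are assumed:

```python
import argparse

MAX_RUN_NAME_LENGTH = 100  # mirrors core/constants.py

parser = argparse.ArgumentParser(prog="lightspeed-eval")
parser.add_argument(
    "--run-name",
    default=None,
    help=f"Optional name for this evaluation run "
    f"(max {MAX_RUN_NAME_LENGTH} chars; overrides YAML and filename)",
)

# Example invocation; the runner would pass args.run_name through as
# run_name_override, where it is sanitized before use
args = parser.parse_args(["--run-name", "smoke test: v1"])
print(args.run_name)  # -> "smoke test: v1" (sanitized downstream)
```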
- Extract run_name from evaluation data in runner
- Pass run_name to OutputHandler constructor
- Prepend run_name to output filenames when provided
- Format: {run_name}_{base_filename}_{timestamp}
- Backwards compatible when run_name is not provided
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
Signed-off-by: Máirín Duffy <[email protected]>
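An illustrative sketch of the {run_name}_{base_filename}_{timestamp} format described above; the function name and structure are hypothetical, not the actual OutputHandler code:

```python
from datetime import datetime


def build_output_filename(base_filename: str, run_name: str | None = None) -> str:
    """Prepend run_name when provided; otherwise keep the legacy format."""
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    if run_name:
        return f"{run_name}_{base_filename}_{timestamp}"
    # Backwards-compatible path when no run_name is set
    return f"{base_filename}_{timestamp}"
```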
- Include run_name in JSON summary output
- Add run_name to text summary report (conditionally)
- Improves traceability of evaluation runs in output files

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
Signed-off-by: Máirín Duffy <[email protected]>
- Test basic alphanumeric strings and edge cases
- Test filesystem-unsafe character replacement
- Test whitespace/underscore collapsing
- Test max length enforcement with trailing underscore removal
- Test Unicode support (emojis, Japanese kanji, Chinese characters)
- Test Unicode + unsafe character combinations
- All 15 tests passing

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
Signed-off-by: Máirín Duffy <[email protected]>
Walkthrough

Introduces a run name concept across evaluation: adds MAX_RUN_NAME_LENGTH, a sanitize_run_name utility, a run_name field on EvaluationData, optional run_name plumbing through validator, runner, and output generator, adjusts filename/summary generation, and extends the CLI with --run-name.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
autonumber
actor User as CLI User
participant CLI as CLI Parser
participant Runner as run_evaluation
participant Validator as DataValidator
participant Utils as sanitize_run_name
participant Model as EvaluationData
participant Output as OutputHandler
User->>CLI: lightspeed-eval ... --run-name "<name>"
CLI->>Runner: run_evaluation(..., run_name)
Runner->>Validator: load_evaluation_data(path, run_name_override=run_name)
alt override provided
Validator->>Utils: sanitize_run_name(override)
Utils-->>Validator: sanitized_name
Validator->>Model: construct EvaluationData(run_name=sanitized_name)
else YAML has run_name
Validator->>Model: construct EvaluationData(run_name=yaml_value)
else derive from filename
Validator->>Utils: sanitize_run_name(filename_stem)
Utils-->>Validator: sanitized_stem
Validator->>Model: construct EvaluationData(run_name=sanitized_stem)
end
Validator-->>Runner: list[EvaluationData]
Runner->>Output: OutputHandler(..., run_name=first_item.run_name)
Output->>Output: generate_reports(prefix with run_name, include in summaries)
Output-->>Runner: artifacts
Runner-->>CLI: summary/status
CLI-->>User: results
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks and finishing touches: ✅ Passed checks (3 passed)
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)

- src/lightspeed_evaluation/core/constants.py (1 hunks)
- src/lightspeed_evaluation/core/models/data.py (2 hunks)
- src/lightspeed_evaluation/core/output/generator.py (4 hunks)
- src/lightspeed_evaluation/core/system/validator.py (3 hunks)
- src/lightspeed_evaluation/core/utils/__init__.py (1 hunks)
- src/lightspeed_evaluation/core/utils/sanitize.py (1 hunks)
- src/lightspeed_evaluation/runner/evaluation.py (4 hunks)
- tests/unit/core/test_sanitize.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (4)
src/lightspeed_evaluation/core/system/validator.py (2)
- src/lightspeed_evaluation/core/utils/sanitize.py (1): sanitize_run_name (8-49)
- src/lightspeed_evaluation/core/models/data.py (1): EvaluationData (141-193)

src/lightspeed_evaluation/core/utils/__init__.py (1)
- src/lightspeed_evaluation/core/utils/sanitize.py (1): sanitize_run_name (8-49)

tests/unit/core/test_sanitize.py (1)
- src/lightspeed_evaluation/core/utils/sanitize.py (1): sanitize_run_name (8-49)

src/lightspeed_evaluation/runner/evaluation.py (2)
- src/lightspeed_evaluation/core/system/validator.py (1): load_evaluation_data (86-154)
- src/lightspeed_evaluation/core/output/generator.py (1): OutputHandler (23-310)
🔇 Additional comments (16)
src/lightspeed_evaluation/core/constants.py (1)
22-22: LGTM! The `MAX_RUN_NAME_LENGTH = 100` constant is well-placed and provides a reasonable constraint for filesystem-safe run names. This value balances usability with filesystem path length limits.

src/lightspeed_evaluation/core/utils/__init__.py (1)
1-5: LGTM! The module correctly exports `sanitize_run_name` with a proper `__all__` declaration, following Python best practices for package-level exports.

src/lightspeed_evaluation/core/models/data.py (1)
154-158: LGTM! The `run_name` field is well-integrated into the `EvaluationData` model with proper Pydantic validation using `max_length=MAX_RUN_NAME_LENGTH`. The dynamic description string provides clear guidance to users.

src/lightspeed_evaluation/core/utils/sanitize.py (1)
8-49: LGTM! The `sanitize_run_name` function is well-implemented with comprehensive sanitization logic:
- Handles empty input and whitespace correctly
- Replaces filesystem-unsafe characters (including control characters) with underscores
- Collapses multiple spaces/underscores
- Enforces maximum length with trailing underscore cleanup
- Preserves Unicode characters appropriately
The implementation aligns with the extensive test coverage provided.
tests/unit/core/test_sanitize.py (1)
9-114: LGTM! The test suite for `sanitize_run_name` is comprehensive and well-structured, covering:
- Basic alphanumeric passthrough
- Edge cases (empty strings, whitespace)
- Filesystem-unsafe character replacement
- Space/underscore collapsing
- Length enforcement with trailing underscore cleanup
- Control character handling
- Unicode preservation and mixed Unicode/ASCII scenarios
- Real-world YAML filename patterns
Excellent test coverage for the sanitization utility.
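A few representative cases in the style the review describes; these are illustrative sketches assuming the sanitization rules summarized above, not excerpts from the actual test file:

```python
from lightspeed_evaluation.core.utils import sanitize_run_name


def test_unsafe_characters_replaced():
    # Each filesystem-unsafe character becomes an underscore
    assert sanitize_run_name('a/b\\c:d*e?f"g') == "a_b_c_d_e_f_g"


def test_collapses_spaces_and_underscores():
    # Runs of spaces/underscores collapse to a single underscore
    assert sanitize_run_name("rh124   filesystem__basics") == "rh124_filesystem_basics"


def test_max_length_strips_trailing_underscore():
    # Truncation never leaves a dangling underscore at the end
    result = sanitize_run_name("x" * 99 + "_" + "y" * 10)
    assert len(result) <= 100 and not result.endswith("_")
```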
src/lightspeed_evaluation/core/output/generator.py (4)
31-44: LGTM! The `OutputHandler` correctly integrates `run_name` support with proper initialization and storage.
53-56: LGTM! The filename construction logic correctly prepends `run_name` when provided, maintaining backward compatibility when it's not present.
195-195: LGTM! Including `run_name` in the JSON summary provides useful metadata for tracking evaluation runs.
242-243: LGTM! The conditional inclusion of `run_name` in the text summary is clean and user-friendly.
src/lightspeed_evaluation/runner/evaluation.py (7)
9-9: LGTM! Importing `MAX_RUN_NAME_LENGTH` for the CLI help text provides clear guidance to users about the constraint.
14-25: LGTM! The function signature is properly updated to support `run_name` with clear documentation.
57-59: LGTM! Correctly passing `run_name` as `run_name_override` to the data validator maintains the priority chain: CLI → YAML → filename.
84-84: LGTM! Correctly passing `run_name` to the `OutputHandler` for consistent labeling across all generated reports.
133-136: LGTM! The CLI argument is well-documented with the length constraint clearly communicated to users.
140-142: LGTM! Correctly plumbing the CLI argument through to the evaluation function.
76-79: Verify run_name presence and consistency in evaluation YAML. I couldn’t locate any run_name fields in config/evaluation_data.yaml; please confirm whether run_name entries exist and, if there are multiple, enforce a single value per file or document that only the first is used.
```python
# Set run_name with priority: CLI override > YAML value > filename
if run_name_override is not None:
    # CLI override takes highest priority
    data_dict["run_name"] = sanitize_run_name(run_name_override)
elif "run_name" not in data_dict or data_dict["run_name"] is None:
    # Default to YAML filename if not provided
    yaml_filename = Path(data_path).stem
    data_dict["run_name"] = sanitize_run_name(yaml_filename)
```
Sanitize YAML-provided run_name values.
The current logic sanitizes the CLI override and filename fallback but doesn't sanitize run_name values provided in the YAML file. This could allow filesystem-unsafe characters to pass through validation.
Apply this diff to ensure all run_name values are sanitized:
```diff
 # Set run_name with priority: CLI override > YAML value > filename
 if run_name_override is not None:
     # CLI override takes highest priority
     data_dict["run_name"] = sanitize_run_name(run_name_override)
 elif "run_name" not in data_dict or data_dict["run_name"] is None:
     # Default to YAML filename if not provided
     yaml_filename = Path(data_path).stem
     data_dict["run_name"] = sanitize_run_name(yaml_filename)
+else:
+    # Sanitize YAML-provided run_name
+    data_dict["run_name"] = sanitize_run_name(data_dict["run_name"])
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
# Set run_name with priority: CLI override > YAML value > filename
if run_name_override is not None:
    # CLI override takes highest priority
    data_dict["run_name"] = sanitize_run_name(run_name_override)
elif "run_name" not in data_dict or data_dict["run_name"] is None:
    # Default to YAML filename if not provided
    yaml_filename = Path(data_path).stem
    data_dict["run_name"] = sanitize_run_name(yaml_filename)
else:
    # Sanitize YAML-provided run_name
    data_dict["run_name"] = sanitize_run_name(data_dict["run_name"])
```
🤖 Prompt for AI Agents
In src/lightspeed_evaluation/core/system/validator.py around lines 119 to 127,
the code sanitizes the CLI override and filename fallback but leaves a
YAML-provided data_dict["run_name"] unsanitized; update the logic so that after
handling CLI override and filename default, any existing YAML-provided run_name
is passed through sanitize_run_name before assignment (i.e., when
run_name_override is None and "run_name" exists and is not None, replace
data_dict["run_name"] with sanitize_run_name(data_dict["run_name"]) so all
run_name sources are sanitized).
VladimirKadlec left a comment
Great feature, thanks! I'd prefer an existing filename-sanitization solution; other than that it looks very good.
```python
from lightspeed_evaluation.core.constants import MAX_RUN_NAME_LENGTH


def sanitize_run_name(run_name: str) -> str:
```
Consider pathvalidate; the advantage would be that we don't need to maintain and test this function ourselves.
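For comparison, a minimal sketch of what a pathvalidate-based version might look like, assuming pathvalidate were added as a dependency; the length cap would still need to be applied separately:

```python
from pathvalidate import sanitize_filename

MAX_RUN_NAME_LENGTH = 100  # mirrors core/constants.py


def sanitize_run_name(run_name: str) -> str:
    # pathvalidate handles platform-specific unsafe characters;
    # replacement_text controls what they are replaced with
    name = sanitize_filename(run_name.strip(), replacement_text="_")
    return name[:MAX_RUN_NAME_LENGTH].rstrip("_")
```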
Please fix the checks; they should run automatically on your new commits/PRs from now on.
This is a user-experience improvement: timestamp-based eval output is difficult to navigate. This change names eval output runs based on the eval data filename as a fallback, on a run_name field in the eval data YAML if provided, and adds a command-line flag that overrides both. This should make it easier to find results or annotate specific runs.