-
Notifications
You must be signed in to change notification settings - Fork 315
chore: optimize operator naming #5204
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Greptile Summary
This PR enhances operator naming for Python-based scan operations by replacing generic "PythonFunction Scan" names with more descriptive source-specific names like "Lance(Python) Scan", "DuckDB(Python) Scan", "ClickHouse(Python) Scan", etc.
The changes restructure the FileFormatConfig::PythonFunction
enum variant from a simple unit variant to a struct variant containing three optional fields: source_type
, module_name
, and function_name
. This metadata enables intelligent source type inference through pattern matching on module names in the new infer_source_type_from_module()
method.
Key modifications include:
-
Enhanced FileFormatConfig Structure: The
PythonFunction
variant now captures contextual metadata about the underlying data source, allowing the system to distinguish between different Python-based sources. -
Intelligent Display Name Generation: The
var_name()
method infile_format_config.rs
now returns aString
instead of&'static str
and includes logic to infer source types from module names (e.g., "daft.io.lance" → "Lance(Python)"). -
Progress Bar Optimization: The progress bar display logic in
progress_bar.rs
implements smart truncation that prioritizes preserving data source type information for Scan operations, with the maximum pipeline name length increased from 22 to 30 characters. -
Consistent Pattern Matching Updates: All files that pattern match on
FileFormatConfig::PythonFunction
have been updated to handle the new struct variant, maintaining backward compatibility through wildcard destructuring.
The implementation integrates with Daft's existing DataSource
shim pattern, where Python data sources are bridged to the Rust scan operator system. This improves debugging and monitoring by making operator types clearly distinguishable in query plans, logs, and progress displays.
Confidence score: 4/5
- This PR is safe to merge with minimal risk of breaking existing functionality
- Score reflects well-structured changes with proper backward compatibility handling, though the hardcoded string matching in source type inference could be brittle
- Pay close attention to
src/common/file-formats/src/file_format_config.rs
for the source type inference logic
10 files reviewed, no comments
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #5204 +/- ##
==========================================
- Coverage 74.82% 74.29% -0.54%
==========================================
Files 974 974
Lines 123656 124372 +716
==========================================
- Hits 92529 92402 -127
- Misses 31127 31970 +843
🚀 New features to boost your workflow:
|
@srilman This seems pretty relevant to the changes you've made with observability & logging for udf's. Could you take a look at this one? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Jay-ju I feel like there should be a nicer way to do this using the def name
or def display_name
in DataSourceShim or ScanOperator instead of this matching method.
} | ||
} else { | ||
format!("{}...", &name[..MAX_PIPELINE_NAME_LEN - 3]) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer not to have all this specialization here if possible, is this necessary? Can we just name them Read {}
or Scan {}
instead
64264cb
to
ba782fd
Compare
- Enhance operator naming: PythonFunction Scan → Lance/duckdb/clickhouse/mcap Python Scan, DuckDB Python Scan, etc.
249fa02
to
1ca03ee
Compare
@srilman The operator name is now displayed as you suggested. However, there's one remaining issue: the progress bar in on-ray mode still can't be controlled. This needs to be looked into as a to-do item. But currently, the progress bar in on-ray mode is also missing very little content. |
In the current progress, if the scan is of the Python version, its display is not user-friendly.
Therefore, this PR attempts to fix this issue.
The main approach is to convert PythonFunction into a specific Datasource Scan, for example
Before PR
After PR
Changes Made
Related Issues
Checklist
docs/mkdocs.yml
navigation