Contributor

@Jay-ju Jay-ju commented Sep 14, 2025

In the current progress bar, a Python-based scan is shown under a generic name that is not user-friendly.
This PR attempts to fix that.
The main approach is to convert PythonFunction into a specific data source scan, for example:
Before PR

tests/io/lancedb/test_lancedb_reads.py 🗡️ 🐟[1/1] ⠁    PythonFunction Scan | [00:00:00] 
🗡️ 🐟[1/1] ⠁    PythonFunction Scan | [00:00:00] 2 rows out, 0 B bytes read
🗡️ 🐟[1/1] ✓    PythonFunction Scan | [00:00:00] 2 rows out, 0 B bytes read  

After PR

tests/io/lancedb/test_lancedb_reads.py 🗡️ 🐟[1/1] ⠁             Lance(Python) Scan | [00:00:00] 
🗡️ 🐟[1/1] ⠁             Lance(Python) Scan | [00:00:00] 2 rows out, 0 B bytes read 
🗡️ 🐟[1/1] ✓             Lance(Python) Scan | [00:00:00] 2 rows out, 0 B bytes read        

Changes Made

Related Issues

Checklist

  • Documented in API Docs (if applicable)
  • Documented in User Guide (if applicable)
  • If adding a new documentation page, doc is added to docs/mkdocs.yml navigation
  • Documentation builds and is formatted properly (tag @/ccmao1130 for docs review)

@github-actions github-actions bot added the chore label Sep 14, 2025
Contributor

@greptile-apps greptile-apps bot left a comment

Greptile Summary

This PR enhances operator naming for Python-based scan operations by replacing generic "PythonFunction Scan" names with more descriptive source-specific names like "Lance(Python) Scan", "DuckDB(Python) Scan", "ClickHouse(Python) Scan", etc.

The changes restructure the FileFormatConfig::PythonFunction enum variant from a simple unit variant to a struct variant containing three optional fields: source_type, module_name, and function_name. This metadata enables intelligent source type inference through pattern matching on module names in the new infer_source_type_from_module() method.

Key modifications include:

  1. Enhanced FileFormatConfig Structure: The PythonFunction variant now captures contextual metadata about the underlying data source, allowing the system to distinguish between different Python-based sources.

  2. Intelligent Display Name Generation: The var_name() method in file_format_config.rs now returns a String instead of &'static str and includes logic to infer source types from module names (e.g., "daft.io.lance" → "Lance(Python)").

  3. Progress Bar Optimization: The progress bar display logic in progress_bar.rs implements smart truncation that prioritizes preserving data source type information for Scan operations, with the maximum pipeline name length increased from 22 to 30 characters.

  4. Consistent Pattern Matching Updates: All files that pattern match on FileFormatConfig::PythonFunction have been updated to handle the new struct variant, maintaining backward compatibility through wildcard destructuring.
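The first two items can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the three field names and the method names come from the summary above, while the `Parquet` variant, the segment-based matching, the fallback name, and the list of recognized modules are assumptions made for the example.

```rust
// Hypothetical, simplified model of the struct variant described above.
// Field names follow the summary; everything else is an assumption.
#[derive(Debug, Clone, PartialEq)]
pub enum FileFormatConfig {
    Parquet,
    PythonFunction {
        source_type: Option<String>,
        module_name: Option<String>,
        function_name: Option<String>,
    },
}

/// Guess a display label from a Python module path (illustrative list only).
fn infer_source_type_from_module(module_name: &str) -> Option<&'static str> {
    let lower = module_name.to_lowercase();
    // Compare whole dotted segments so "balance" does not match "lance".
    for segment in lower.split('.') {
        match segment {
            "lance" => return Some("Lance"),
            "duckdb" => return Some("DuckDB"),
            "clickhouse" => return Some("ClickHouse"),
            _ => {}
        }
    }
    None
}

impl FileFormatConfig {
    /// Returns an owned String because the name is built at runtime.
    pub fn var_name(&self) -> String {
        match self {
            Self::Parquet => "Parquet".to_string(),
            Self::PythonFunction { source_type, module_name, .. } => {
                // Prefer an explicit source_type, then fall back to inference.
                let inferred = source_type.clone().or_else(|| {
                    module_name
                        .as_deref()
                        .and_then(infer_source_type_from_module)
                        .map(str::to_string)
                });
                match inferred {
                    Some(src) => format!("{}(Python)", src),
                    None => "PythonFunction".to_string(),
                }
            }
        }
    }
}
```

In this sketch an explicit `source_type` always wins over inference, and an unrecognized module falls back to the old generic name, so existing displays would not change for unknown sources.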

The implementation integrates with Daft's existing DataSource shim pattern, where Python data sources are bridged to the Rust scan operator system. This improves debugging and monitoring by making operator types clearly distinguishable in query plans, logs, and progress displays.

Confidence score: 4/5

  • This PR is safe to merge with minimal risk of breaking existing functionality
  • Score reflects well-structured changes with proper backward compatibility handling, though the hardcoded string matching in source type inference could be brittle
  • Pay close attention to src/common/file-formats/src/file_format_config.rs for the source type inference logic

10 files reviewed, no comments

codecov bot commented Sep 14, 2025

Codecov Report

❌ Patch coverage is 70.37037% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 74.29%. Comparing base (f227eac) to head (1ca03ee).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/common/file-formats/src/file_format_config.rs | 64.28% | 5 Missing ⚠️ |
| src/common/file-formats/src/lib.rs | 0.00% | 1 Missing ⚠️ |
| src/common/file-formats/src/python.rs | 0.00% | 1 Missing ⚠️ |
| src/daft-scan/src/lib.rs | 0.00% | 1 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #5204      +/-   ##
==========================================
- Coverage   74.82%   74.29%   -0.54%     
==========================================
  Files         974      974              
  Lines      123656   124372     +716     
==========================================
- Hits        92529    92402     -127     
- Misses      31127    31970     +843     

| Files with missing lines | Coverage Δ |
|---|---|
| daft/io/__shim.py | 94.11% <100.00%> (ø) |
| daft/io/_generator.py | 97.36% <ø> (ø) |
| daft/io/lance/lance_scan.py | 90.83% <ø> (-3.06%) ⬇️ |
| src/daft-local-execution/src/sources/scan_task.rs | 76.94% <ø> (ø) |
| src/daft-micropartition/src/micropartition.rs | 83.48% <ø> (-0.94%) ⬇️ |
| src/daft-scan/src/glob.rs | 89.42% <ø> (ø) |
| src/daft-scan/src/python.rs | 64.60% <100.00%> (+0.47%) ⬆️ |
| src/common/file-formats/src/lib.rs | 77.77% <0.00%> (ø) |
| src/common/file-formats/src/python.rs | 36.50% <0.00%> (ø) |
| src/daft-scan/src/lib.rs | 67.00% <0.00%> (ø) |

... and 1 more

... and 28 files with indirect coverage changes

@universalmind303
Contributor

@srilman This seems pretty relevant to the changes you've made to observability and logging for UDFs. Could you take a look at this one?

@universalmind303 universalmind303 requested review from srilman and removed request for srilman September 16, 2025 14:43
@srilman srilman self-requested a review September 16, 2025 15:54
Contributor

@srilman srilman left a comment

@Jay-ju I feel like there should be a nicer way to do this using the def name or def display_name in DataSourceShim or ScanOperator instead of this matching method.
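One way to read this suggestion is that each source reports its own name and the display label is derived from it, with no module-name matching. The sketch below only illustrates that idea: `NamedScan` and `LanceSource` are hypothetical names invented for the example and are not Daft's DataSourceShim or ScanOperator API.

```rust
/// Hypothetical trait standing in for a scan operator; not Daft's actual API.
trait NamedScan {
    /// Human-readable source name supplied by the data source itself.
    fn name(&self) -> String;

    /// Label shown in plans and progress bars, derived from `name`.
    fn display_name(&self) -> String {
        format!("{} Scan", self.name())
    }
}

/// Hypothetical source used only to exercise the trait.
struct LanceSource;

impl NamedScan for LanceSource {
    fn name(&self) -> String {
        "Lance(Python)".to_string()
    }
}
```

With this shape, supporting a new source would mean implementing `name` on that source rather than extending a central list of module patterns.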

}
} else {
format!("{}...", &name[..MAX_PIPELINE_NAME_LEN - 3])
}
Contributor

I'd prefer not to have all this specialization here if possible. Is it necessary? Can we just name them Read {} or Scan {} instead?
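A version without source-specific branches might look like the sketch below. It is an assumption-laden illustration: `MAX_PIPELINE_NAME_LEN` appears in the excerpt above and the value 30 comes from the bot summary, but `truncate_pipeline_name` and `scan_label` are hypothetical helpers, and the sketch counts characters rather than slicing bytes so that a multi-byte name cannot panic on a split.

```rust
// Assumed value: the summary says the limit was raised from 22 to 30.
const MAX_PIPELINE_NAME_LEN: usize = 30;

/// Shorten a name to at most MAX_PIPELINE_NAME_LEN characters,
/// ending in "..." when anything was cut.
fn truncate_pipeline_name(name: &str) -> String {
    if name.chars().count() <= MAX_PIPELINE_NAME_LEN {
        return name.to_string();
    }
    // Count characters, not bytes, so multi-byte names are never split mid-character.
    let kept: String = name.chars().take(MAX_PIPELINE_NAME_LEN - 3).collect();
    format!("{}...", kept)
}

/// Generic label in the "Scan {}" style suggested in the review.
fn scan_label(source: &str) -> String {
    truncate_pipeline_name(&format!("Scan {}", source))
}
```

Because the label is built the same way for every source, the progress bar code needs no knowledge of individual data sources.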

@Jay-ju Jay-ju force-pushed the optimize_scan_operator_display branch 5 times, most recently from 64264cb to ba782fd Compare September 28, 2025 01:44
- Enhance operator naming: PythonFunction Scan → Lance/duckdb/clickhouse/mcap Python Scan, DuckDB Python Scan, etc.
@Jay-ju Jay-ju force-pushed the optimize_scan_operator_display branch from 249fa02 to 1ca03ee Compare September 28, 2025 02:11
@Jay-ju
Contributor Author

Jay-ju commented Sep 28, 2025

> @Jay-ju I feel like there should be a nicer way to do this using the def name or def display_name in DataSourceShim or ScanOperator instead of this matching method.

@srilman The operator name is now displayed as you suggested. One issue remains: the progress bar still cannot be controlled when running on Ray, which should be tracked as a to-do item. For now, though, the progress bar on Ray is missing very little content.
