refactor(lance): implement Lance DataSink batching mechanism with row-based flush control #5246
Conversation
Greptile Summary
This PR introduces a comprehensive batching mechanism for Lance DataSink to improve write performance by combining multiple micropartitions before writing to Lance datasets. The implementation addresses a key performance issue where writing many small micropartitions individually results in fragmented Lance files, which negatively impacts both write and read performance.
The core changes include:
API Extensions: New parameters batch_size (default: 1) and max_batch_rows (default: 100,000) are added to the write_lance() method in dataframe.py. These parameters provide dual flush control: batching flushes either by micropartition count or by total row count, whichever limit is reached first.
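As a rough usage sketch (assuming the parameter names and defaults described in this PR; they are not part of released Daft):

```python
import daft

df = daft.from_pydict({"id": list(range(1_000)), "value": [i * 2 for i in range(1_000)]})

# Buffer up to 8 micropartitions, but flush earlier once 100,000 rows
# have accumulated, whichever limit is hit first. Both parameters are
# additions proposed by this PR.
df.write_lance(
    "/tmp/batched_dataset.lance",
    batch_size=8,
    max_batch_rows=100_000,
)
```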
DataSink Implementation: The LanceDataSink class in lance_data_sink.py has been extensively modified to support stateful batching. It now maintains instance-level state with _batch_tables, _batch_row_count, and _batch_count to accumulate micropartitions across multiple write() calls. The implementation includes flush logic that triggers when either batching condition is met.
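A simplified, self-contained sketch of that accumulate-and-flush pattern (the attribute names mirror the PR summary; the class, the _flush() helper, and the Lance write call are illustrative assumptions, not the actual LanceDataSink code):

```python
import pyarrow as pa


class BatchingSinkSketch:
    def __init__(self, uri: str, batch_size: int = 1, max_batch_rows: int = 100_000):
        self._uri = uri
        self._batch_size = batch_size
        self._max_batch_rows = max_batch_rows
        # Instance-level batching state, as described in the PR summary.
        self._batch_tables: list[pa.Table] = []
        self._batch_row_count = 0
        self._batch_count = 0

    def write(self, table: pa.Table) -> None:
        # Accumulate the incoming micropartition.
        self._batch_tables.append(table)
        self._batch_row_count += table.num_rows
        self._batch_count += 1
        # Dual flush condition: flush when either limit is reached.
        if self._batch_count >= self._batch_size or self._batch_row_count >= self._max_batch_rows:
            self._flush()

    def _flush(self) -> None:
        if not self._batch_tables:
            return
        combined = pa.concat_tables(self._batch_tables)
        # A real sink would write `combined` to the Lance dataset here,
        # producing one larger file instead of many small fragments.
        self._batch_tables.clear()
        self._batch_row_count = 0
        self._batch_count = 0
```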
Query Optimization: A separate optimization in lance_scan.py adds fragment filtering to skip empty fragments during scanning, reducing unnecessary scan task creation.
Comprehensive Testing: New test suite test_write_lance_batch_size.py validates the batching mechanism with scenarios testing parameter acceptance, data integrity, fragment reduction effectiveness, and dual flush conditions.
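A hedged sketch of the kind of round-trip integrity check such a test might contain (batch_size is the parameter added by this PR; the test name and data are illustrative, not taken from the actual test suite):

```python
import daft


def test_write_lance_batch_size_roundtrip(tmp_path):
    uri = str(tmp_path / "batched.lance")
    df = daft.from_pydict({"id": list(range(500)), "value": [i * 3 for i in range(500)]})

    # batch_size is the new parameter from this PR; the written data should be
    # identical regardless of how micropartitions are batched before writing.
    df.write_lance(uri, batch_size=4)

    result = daft.read_lance(uri).sort("id").to_pydict()
    assert result["id"] == list(range(500))
    assert result["value"] == [i * 3 for i in range(500)]
```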
The batching mechanism maintains full backward compatibility since batch_size=1 (default) preserves existing behavior. When batching is enabled, multiple micropartitions are accumulated in memory and written together as larger Lance files, reducing fragmentation and improving performance. The dual-condition approach provides flexibility for users to control batching based on their specific memory constraints and performance requirements.
Confidence score: 2/5
- This PR requires careful review due to production code issues and potential thread safety concerns
- Score lowered due to debug print statements left in production code, complex error handling that may mask issues, and instance-level state management that could have thread safety implications
- Pay close attention to daft/io/lance/lance_data_sink.py for the debug prints and error handling logic
4 files reviewed, 5 comments
if fragment.count_rows(pushed_expr) == 0:
    logger.debug("Skipping fragment with fragment_id %s with 0 rows", fragment.fragment_id)
    continue
style: This optimization adds I/O overhead by calling count_rows() with filters for every fragment. Consider measuring the performance impact - for datasets with many small fragments or complex filters, this could be slower than just processing empty fragments downstream.
def test_batch_size_reduces_fragments():
    """Test that batch_size parameter is accepted and maintains data integrity."""
    test_batch_size_parameter_acceptance()
style: This function just calls another test function without adding value. Consider removing this wrapper or consolidating the test logic.
from daft.datatype import DataType
from daft.io.lance.lance_data_sink import LanceDataSink
from daft.schema import Schema
style: Import statements should be placed at the top of the file rather than within the test function.
Context Used: Rule - Import statements should be placed at the top of the file rather than inline within functions or methods. (link)
…-based flush control

This commit introduces a comprehensive batching mechanism for Lance DataSink to improve write performance and merge small files by combining multiple micropartitions before writing, creating larger Lance files and reducing fragmentation.

Key Features:
- Add batch_size parameter to control number of micropartitions to batch (default: 1)
- Add max_batch_rows parameter for row-based flush control (default: 100,000)
- Implement intelligent batching logic with dual flush conditions:
  * Flush when batch_size micropartitions are accumulated
  * Flush when max_batch_rows total rows are reached
- Maintain full backward compatibility with existing code
This commit introduces a comprehensive batching mechanism for Lance DataSink to improve write performance by combining multiple micropartitions before writing, creating larger Lance files and reducing fragmentation.