refactor(lance): implement Lance DataSink batching mechanism with row-based flush control
This commit adds a batching mechanism to the Lance DataSink. Combining multiple
micropartitions before each write produces larger Lance files, which improves
write performance and reduces fragmentation from small files.
Key Features:
- Add batch_size parameter to control number of micropartitions to batch (default: 1)
- Add max_batch_rows parameter for row-based flush control (default: 100,000)
- Implement batching logic with dual flush conditions:
  * Flush when batch_size micropartitions are accumulated
  * Flush when max_batch_rows total rows are reached
- Maintain full backward compatibility with existing code
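The dual flush conditions listed above can be sketched in plain Python. This is not the code from the commit: `batch_micropartitions` is a hypothetical helper, micropartitions are modelled as plain lists of rows, and it assumes the limits are checked after each micropartition is buffered (so a batch may exceed max_batch_rows by part of one micropartition) and that any remainder is flushed at the end of the input.

```python
from typing import Iterable, Iterator, List

Row = dict
MicroPartition = List[Row]  # stand-in for a real micropartition (assumption)


def batch_micropartitions(
    partitions: Iterable[MicroPartition],
    batch_size: int = 1,
    max_batch_rows: int = 100_000,
) -> Iterator[MicroPartition]:
    """Yield merged batches, flushing when either limit is reached."""
    pending: List[MicroPartition] = []
    pending_rows = 0
    for part in partitions:
        pending.append(part)
        pending_rows += len(part)
        # Flush on whichever condition is met first (assumed check order:
        # after the micropartition has been added to the buffer).
        if len(pending) >= batch_size or pending_rows >= max_batch_rows:
            yield [row for p in pending for row in p]
            pending, pending_rows = [], 0
    # Assumed behaviour: write out whatever is left when the input ends.
    if pending:
        yield [row for p in pending for row in p]
```

With the default batch_size=1, every micropartition is flushed on its own, which matches the "no batching" behaviour described for the default.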
  mode: The write mode. One of "create", "append", or "overwrite"
  io_config (IOConfig, optional): configurations to use when interacting with remote storage.
+ batch_size (int, optional): Number of micropartitions to batch together before writing. Default is 1 (no batching).
+ max_batch_rows (int, optional): Maximum number of rows to accumulate before flushing a batch. Default is 100,000.
  **kwargs: Additional keyword arguments to pass to the Lance writer.

  Note:
  `write_lance` requires python 3.9 or higher
  This call is **blocking** and will execute the DataFrame when called

+ Batching Parameters:
+ - batch_size=1 (default): No batching, maintains backward compatibility
+ - batch_size>1: Enables batching to combine multiple micropartitions
+ - max_batch_rows: Row-based flush control for predictable batching behavior
+
  Returns:
  DataFrame: A DataFrame containing metadata about the written Lance table, such as number of fragments, number of deleted rows, number of small files, and version.
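
A call using the new parameters might look like the sketch below. The wrapper name `write_with_batching` and the values 8 and 50,000 are illustrative assumptions, not taken from the commit. The `mode`, `batch_size`, and `max_batch_rows` keywords are the ones documented in the docstring above, and `df` is assumed to be a DataFrame that exposes `write_lance`.

```python
def write_with_batching(df, uri: str):
    """Write `df` to a Lance dataset with batching enabled (illustrative values)."""
    return df.write_lance(
        uri,
        mode="create",
        batch_size=8,           # flush after 8 micropartitions are buffered...
        max_batch_rows=50_000,  # ...or once 50,000 rows are buffered
    )
```

Omitting both keyword arguments keeps the defaults, so existing callers are unaffected.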