## 1. Data Skipping

Data Skipping allows Delta Lake to avoid scanning unnecessary data by leveraging metadata stored in Parquet statistics. This improves query performance by reading only the relevant data instead of scanning the entire dataset. There are three different ways of implementing data skipping during a write operation.
### Default data skipping (Data Skipping stats are automatically collected for the first 32 columns)
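A minimal sketch of such a write, assuming an illustrative Pandas DataFrame `df`; no extra arguments are needed, since statistics for the first 32 columns are collected automatically:

```
import pandas as pd
from deltalake import write_deltalake

# Illustrative data; any DataFrame works the same way
df = pd.DataFrame({"id": range(1_000_000), "value": range(1_000_000)})

# Stats for the first 32 columns are collected by default
write_deltalake("./delta_default_data_skipping", df)
```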
### Custom data skipping specifying column names (Data Skipping stats are collected for the column names specified)
```
write_deltalake(
    "./delta_custom_data_skipping_specify_col_names",
    df,  # the DataFrame being written (assumed, as above)
    # Table property naming the columns to collect stats for;
    # the specific column list here is illustrative
    configuration={"delta.dataSkippingStatsColumns": "value"},
)
```
> Note: Data Skipping benefits are available only during the initial read from Delta Lake. Once loaded into memory, Pandas operates on the full dataset, meaning filtering operations will not benefit from this optimization.
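To make this concrete, here is a sketch (with an assumed `value` column and predicate) contrasting a filter pushed down at read time, where file statistics can be used, with the same filter applied after the data is already in Pandas:

```
from deltalake import DeltaTable

# Filter applied at read time: file-level statistics let Delta Lake
# skip files whose min/max range cannot match the predicate
df = DeltaTable("./delta_custom_data_skipping_specify_col_names").to_pandas(
    filters=[("value", ">", 900)]
)

# Filter applied after loading: Pandas scans the full in-memory frame,
# so data skipping no longer helps
df_filtered = df[df["value"] > 900]
```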
## 2. Compaction
Compaction (file optimization) merges many small Parquet files into fewer, larger ones, reducing metadata overhead and read amplification for Pandas and other consumers, and thereby improving query performance.
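A minimal sketch, reusing the table written earlier; `optimize.compact()` merges the small files and returns metrics describing what it did:

```
from deltalake import DeltaTable

dt = DeltaTable("./delta_default_data_skipping")

# Merge small Parquet files into larger ones; returns metrics such as
# the number of files added and removed
metrics = dt.optimize.compact()
print(metrics)
```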
## 3. Z-Ordering
Z-Ordering is a data layout optimization technique that improves read performance by clustering related data together. This is particularly useful for range queries, as it reduces the number of files that need to be scanned. Z-Ordering is always performed along with Compaction.
Similar to Data Skipping, Z-Ordering only provides benefits during the initial read. Once the data is loaded into Pandas, it is held in memory, negating any advantages from the optimized layout.
```
DeltaTable(
    "./delta_default_data_skipping"
).optimize.z_order(['value'])
```
```
## Output
...
'totalFilesSkipped': 0,
'preserveInsertionOrder': True}
```
As you can see, there were 10 files initially. Since Compaction was performed along with Z-Ordering, those 10 files were removed and 7 new files were created in their place.
Let's check the history of the table to see whether Z-Ordering and Compaction have been applied:
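A sketch of the history call on the same table; each entry is a dictionary of commit information (operation, timestamp, and so on):

```
from deltalake import DeltaTable

for entry in DeltaTable("./delta_default_data_skipping").history():
    # Each entry describes one version and the operation that created it
    print(entry)
```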
As you can see above, two versions have been created.
The first version (version 0) is the initial default data skipping write. The second version (version 1) is the Compacted and Z-Ordered version.
## 4. Time Travel
Delta Lake maintains historical versions of data, allowing users to query past states, making it useful for auditing, debugging, and rollback scenarios.
As expected, we see that three different versions of the data exist at the above output_path.
We can travel to different versions of the data using the version argument when performing the read. For example, to read version 1 of the above path:
```
df_version1 = DeltaTable(
    output_path,
    version=1
).to_pandas()
```
> Note: Not specifying the version argument will automatically return the latest version of data available in the path specified.
## 5. Vacuum
While older versions of the data must be kept on disk for time travel, retaining them over a long period significantly increases storage costs. We can therefore permanently delete older files that are no longer needed for time travel.
Vacuum removes Parquet files that are no longer referenced, keeping the dataset lean. Running it regularly helps control storage costs by clearing obsolete data.
```
### Let's remove the two older versions from the Time Travel example and retain only the latest version
```
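A minimal sketch of the vacuum call, assuming the delta_table object from the Time Travel example; retention_hours=0 together with enforce_retention_duration=False forces immediate removal of files no longer referenced by the latest version (use with care):

```
# dry_run=False actually deletes the files; the default (True) only lists them
delta_table.vacuum(
    retention_hours=0,
    dry_run=False,
    enforce_retention_duration=False,
)
```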
The deletion operations have been performed successfully.
Let's have a look at whether it actually worked:
```
### Version 0
df_version0 = DeltaTable(
    output_path,
    version=0
).to_pandas()
```
```
## Output
...
Input In [69], in <cell line: 1>()
      3 version = 0
      4 ).to_pandas()
...
```
```
### Version 1
df_version1 = DeltaTable(
    output_path,
    version=1
).to_pandas()
```
```
## Output
...
Input In [70], in <cell line: 1>()
      3 version = 1
      4 ).to_pandas()
...
```
```
### Version 2 (Latest)
df_version2 = DeltaTable(
    output_path,
    version=2  # optional here, as this is the latest version anyway
).to_pandas()
```
```
Works Successfully
```
As expected, the older versions are no longer available; only the latest version remains, which is what we wanted.
Now, let's have a look at the history of changes to this table:
```
delta_table.history()
```
```
## Output
...
```

The history command will return all the operations that have been performed on the table.
To illustrate the performance benefits of Data Skipping and Z-Ordering, let's compare query times for a dataset with and without these optimizations.
## Theoretical Context
Without Data Skipping and Z-Ordering, queries must scan a significant portion (or all) of the dataset, leading to increased latency. Data Skipping improves efficiency by leveraging Parquet metadata to prune irrelevant files before reading the data, while Z-Ordering enhances range query performance by sorting and clustering related data together, minimizing unnecessary scans.
## Implementation
```
### Performance of a query with default data skipping only (no Z-Ordering)
```
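A sketch of both sides of the comparison, under stated assumptions: two hypothetical table paths holding the same data, one Z-Ordered and one not, and an illustrative range predicate on value pushed down via to_pandas(filters=...):

```
import time
from deltalake import DeltaTable

def timed_read(path):
    # Push the predicate down to the Delta read so file statistics
    # (and the Z-Ordered layout, where present) can be exploited
    start = time.perf_counter()
    df = DeltaTable(path).to_pandas(filters=[("value", ">", 900)])
    print(f"{path}: {time.perf_counter() - start:.2f}s, {len(df)} rows")
    return df

timed_read("./delta_no_zorder")  # hypothetical non-Z-Ordered copy
timed_read("./delta_zordered")   # hypothetical Z-Ordered copy
```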
As you can see above, filtering the Z-Ordered table (4.04 seconds) is roughly 50% faster than running the same query against the non-Z-Ordered table (8.23 seconds). While the difference is modest here due to the relatively small amount of data involved (100M rows and 3 columns), it becomes enormous in real-world scenarios with petabytes of data.
This demonstrates how Delta Lake optimizations reduce scan time and improve retrieval speed. However, once the data is loaded into Pandas, these benefits no longer apply, as Pandas operates entirely in-memory.
## Conclusion
Using delta-rs, we can leverage Delta Lake's powerful features with Pandas, enhancing data lake storage without requiring Spark. However, optimizations like Data Skipping and Z-Ordering only benefit initial reads, as Pandas loads all data into memory. Implementing Delta Lake outside Spark expands its accessibility, providing ACID guarantees and performance improvements for Python-based data workflows.
By leveraging delta-rs effectively, data practitioners can build robust, high-performance data pipelines using Delta Lake beyond the Spark ecosystem.