Commit 818e525

Fix formatting issues in index.mdx
Signed-off-by: Abhishek R <[email protected]>
1 parent 47d270d commit 818e525

src/blog/delta-lake-with-pandas/index.mdx

Lines changed: 44 additions & 2 deletions
@@ -13,11 +13,13 @@ Delta Lake has revolutionized data lake storage by providing ACID transactions,
## Setting Up delta-rs with Pandas

To get started, install the required dependencies:

```
pip install deltalake
```

Next, create a Delta table from a Pandas DataFrame (3 cols, 100 mil rows):

```
import numpy as np
import pandas as pd
@@ -39,20 +41,27 @@ write_deltalake("./delta_table", data)
# Key Delta Lake Techniques in Pandas

## 1. Data Skipping

Data Skipping allows Delta Lake to avoid scanning unnecessary data by leveraging metadata stored in Parquet statistics. This improves query performance by reading only relevant data instead of scanning the entire dataset. There are 3 different ways of implementing data-skipping during a write operation.

### Default data skipping (Data Skipping stats are automatically collected for the first 32 columns)

```
write_deltalake("./delta_default_data_skipping", data)
```

### Custom data skipping with indexed columns (Data Skipping stats are collected for the first "n" columns)

```
write_deltalake(
    "./delta_custom_data_skipping_n_cols",
    data,
    configuration={"delta.dataSkippingNumIndexedCols": "2"}
)
```

### Custom data skipping specifying column names (Data Skipping stats are collected for the column names specified)

```
write_deltalake(
    "./delta_custom_data_skipping_specify_col_names",
@@ -64,22 +73,28 @@ write_deltalake(
> Note: Data Skipping benefits are available only during the initial read from Delta Lake. Once loaded into memory, Pandas operates on the full dataset, meaning filtering operations will not benefit from this optimization.
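
To actually benefit from Data Skipping, push the filter into the Delta read itself rather than filtering after the data is already in Pandas. Below is a minimal sketch; the column name 'value' and the threshold are illustrative assumptions, not code from the original post:

```
from deltalake import DeltaTable

# The filter is applied during the scan, so Parquet statistics can be used to skip files
skipped_read = DeltaTable("./delta_default_data_skipping").to_pandas(
    filters=[("value", ">", 0.9)]
)

# Filtering after the read, by contrast, scans everything and gains nothing from Data Skipping:
# full_read = DeltaTable("./delta_default_data_skipping").to_pandas()
# filtered = full_read[full_read["value"] > 0.9]
```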

## 2. Compaction

Compaction (file optimization) reduces the number of small Parquet files to improve query performance by minimizing metadata overhead and enhancing read efficiency.

```
delta_table = DeltaTable("./delta_default_data_skipping")
delta_table.optimize.compact()
```

This process merges small files into larger ones, reducing read overhead for Pandas and other consumers.
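
One quick way to verify the effect is to compare how many data files the table references before and after compaction. This is a small sketch using DeltaTable.files(); the variable names and print statements are illustrative:

```
delta_table = DeltaTable("./delta_default_data_skipping")
print("files before:", len(delta_table.files()))

delta_table.optimize.compact()

# Re-open the table to see the post-compaction file listing
print("files after:", len(DeltaTable("./delta_default_data_skipping").files()))
```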

## 3. Z-Ordering

Z-Ordering is a data layout optimization technique that improves read performance by clustering related data together. This is particularly useful for range queries, as it reduces the number of files that need to be scanned. Z-Ordering is always performed along with Compaction.

Similar to Data Skipping, Z-Ordering only provides benefits during the initial read. Once the data is loaded into Pandas, it is held in memory, negating any advantages from the optimized layout.

```
DeltaTable(
    "./delta_default_data_skipping"
).optimize.z_order(['value'])
```

```
## Output
@@ -93,13 +108,15 @@ DeltaTable(
'totalFilesSkipped': 0,
'preserveInsertionOrder': True}
```

As you can see, there were 10 files initially. However, since Compaction was performed along with Z-Ordering, those 10 files were removed and 7 new files were created instead.

Let's check the history of the table to see whether Z-Ordering and Compaction have been applied:

```
DeltaTable("./delta_default_data_skipping").history()
```

```
## Output
@@ -134,7 +151,9 @@ As you can see above, two versions have been created.
The first version (version 0) is the initial default data-skipping write. The second version (version 1) is the Compacted and Z-Ordered version.

## 4. Time Travel

Delta Lake maintains historical versions of data, allowing users to query past states, making it useful for auditing, debugging, and rollback scenarios.

```
### Creating the version 0 of data
@@ -161,6 +180,7 @@ data_2 = pd.DataFrame({
write_deltalake(output_path, data_2, mode = 'append')
```

```
### Overwrite with new data to the same deltalake path (Version 2)
@@ -172,11 +192,13 @@ data_3 = pd.DataFrame({
write_deltalake(output_path, data_3, mode = 'overwrite')
```

```
### View history of the Delta table

DeltaTable(output_path).history()
```

```
## Output
@@ -211,21 +233,26 @@ DeltaTable(output_path).history()
'clientVersion': 'delta-rs.0.23.1',
'version': 0}]
```

As expected, three different versions of the data exist at the above output_path.

We can travel to different versions of the data using the version argument when performing the read, e.g., to read version 1 of the above path:

```
df_version1 = DeltaTable(
    output_path,
    version = 1
).to_pandas()
```

> Note: Not specifying the version argument will automatically return the latest version of data available in the path specified.
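
As a quick illustration, a read without the version argument returns the current state of the table (df_latest is just an illustrative name):

```
# No version pinned, so this returns the latest version available at output_path
df_latest = DeltaTable(output_path).to_pandas()
```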

## 5. Vacuum

While we need to keep older versions of data on disk for time travel, storing them over a long period significantly increases storage costs. So, we can permanently delete the older files that we no longer need for time-travel purposes.

Vacuum removes Parquet files that are no longer referenced, optimizing storage efficiency and keeping the dataset lean. Regularly running Vacuum helps optimize storage costs by clearing obsolete data.

```
### Let's remove the older 2 versions from the Time Travel example, and we only want to retain the latest version
@@ -250,13 +277,15 @@ dry_run = True, 
enforce_retention_duration = False
)
```

```
['part-00001-20003cd1-c799-4e7c-83b0-4691da3e35fc-c000.snappy.parquet',
 'part-00001-eb826a04-8f6b-49c1-be0c-a63425c255a6-c000.snappy.parquet']
```

The above command returns the paths of the files that will be considered for deletion. As expected, it returns the two older files.

Now let's perform the actual vacuum operation:

```
## Set 'dry_run = False' to perform the actual deletion operation
@@ -270,6 +299,7 @@ enforce_retention_duration = False
The deletion operations have been performed successfully.

Let's have a look at whether it actually worked:

```
### Version 0
@@ -278,6 +308,7 @@ df_version0 = DeltaTable(
    version = 0
).to_pandas()
```

```
## Output
@@ -289,6 +320,7 @@ Input In [69], in <cell line: 1>()
    3 version = 0
    4 ).to_pandas()
```

```
### Version 1
@@ -297,6 +329,7 @@ df_version1 = DeltaTable(
    version = 1
).to_pandas()
```

```
## Output
@@ -308,6 +341,7 @@ Input In [70], in <cell line: 1>()
    3 version = 1
    4 ).to_pandas()
```

```
### Version 2 (Latest)
@@ -316,17 +350,19 @@ df_version2 = DeltaTable(
    version = 2  # (optional here, since this is already the latest version)
).to_pandas()
```

```
Works Successfully
```

As expected, the older versions are no longer available and only the latest version remains, which is what we wanted.

Now, let's have a look at the history of changes to this table:

```
delta_table.history()
```

```
## Output
@@ -383,9 +419,11 @@ The history command will return all the operations that have been performed on t
To illustrate the performance benefits of Data Skipping and Z-Ordering, let's compare query times for a dataset with and without these optimizations.

## Theoretical Context

Without Data Skipping and Z-Ordering, queries must scan a significant portion (or all) of the dataset, leading to increased latency. Data Skipping improves efficiency by leveraging Parquet metadata to prune irrelevant rows before reading the data, while Z-Ordering enhances range query performance by sorting and clustering related data together, minimizing unnecessary scans.
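
The exact filter definition and timing harness are not shown in this diff, so the following is only a minimal sketch of how such a comparison might be set up; the column name, threshold, and use of time.perf_counter are assumptions rather than the original code:

```
import time

from deltalake import DeltaTable

filters = [("value", ">", 0.5)]  # assumed DNF-style filter on the 'value' column

start = time.perf_counter()
# Filters are pushed down to the scan, so Data Skipping / Z-Ordering can prune files
df = DeltaTable("./delta_default_data_skipping").to_pandas(filters=filters)
print(f"Time Taken: {time.perf_counter() - start:.2f} seconds")
```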

## Implementation

```
### Performance of a Query with default data skipping only (No Z-Ordering)
@@ -394,6 +432,7 @@ delta_df_only_data_skipping = DeltaTable(
    "./delta_default_data_skipping"
).to_pandas(filters = filters)
```

```
Time Taken: 8.23 seconds
```
@@ -406,16 +445,19 @@ delta_df_z_ordered = DeltaTable(
    "./delta_default_data_skipping"
).to_pandas(filters = filters)
```

```
Time Taken: 4.04 seconds
```

## Results

As you can see above, the Z-Ordered query took 4.04 seconds, less than half the 8.23 seconds needed for the same query against the non-Z-Ordered table (roughly a 50% reduction in query time). The absolute savings are modest here because the dataset is relatively small (100M rows and 3 cols), but the difference becomes substantial in real-world scenarios involving petabytes of data.

This demonstrates how Delta Lake optimizations reduce scan time and improve retrieval speed. However, once the data is loaded into Pandas, these benefits no longer apply, as Pandas operates entirely in-memory.

## Conclusion

Using delta-rs, we can leverage Delta Lake's powerful features with Pandas, enhancing data lake storage without requiring Spark. However, optimizations like Data Skipping and Z-Ordering only benefit initial reads, as Pandas loads all data into memory. Implementing Delta Lake outside Spark expands its accessibility, providing ACID guarantees and performance improvements for Python-based data workflows.

By leveraging delta-rs effectively, data practitioners can build robust, high-performance data pipelines using Delta Lake beyond the Spark ecosystem.
