## 1. Data Skipping

Data Skipping allows Delta Lake to avoid scanning unnecessary data by leveraging metadata stored in Parquet statistics. This improves query performance by reading only the relevant data instead of scanning the entire dataset. There are three different ways of implementing data skipping during a write operation.
### Default data skipping (Data Skipping stats are automatically collected for the first 32 columns)
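A minimal sketch of such a write, assuming an illustrative Pandas DataFrame `df`; no extra arguments are needed, since statistics for the first 32 columns are collected automatically:

```
import pandas as pd
from deltalake import write_deltalake

# Illustrative data; any DataFrame works the same way
df = pd.DataFrame({"id": range(1_000_000), "value": range(1_000_000)})

# Stats for the first 32 columns are collected by default
write_deltalake("./delta_default_data_skipping", df)
```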
### Custom data skipping specifying column names (Data Skipping stats are collected for the column names specified)
```
write_deltalake(
    "./delta_custom_data_skipping_specify_col_names",
    df,  # the DataFrame being written (assumed, as above)
    # Table property naming the columns to collect stats for;
    # the specific column list here is illustrative
    configuration={"delta.dataSkippingStatsColumns": "value"},
)
```
> Note: Data Skipping benefits are available only during the initial read from Delta Lake. Once loaded into memory, Pandas operates on the full dataset, meaning filtering operations will not benefit from this optimization.
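To make this concrete, here is a sketch (with an assumed `value` column and predicate) contrasting a filter pushed down at read time, where file statistics can be used, with the same filter applied after the data is already in Pandas:

```
from deltalake import DeltaTable

# Filter applied at read time: file-level statistics let Delta Lake
# skip files whose min/max range cannot match the predicate
df = DeltaTable("./delta_custom_data_skipping_specify_col_names").to_pandas(
    filters=[("value", ">", 900)]
)

# Filter applied after loading: Pandas scans the full in-memory frame,
# so data skipping no longer helps
df_filtered = df[df["value"] > 900]
```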
## 2. Compaction
Compaction (file optimization) merges many small Parquet files into fewer, larger ones, reducing metadata overhead and read amplification for Pandas and other consumers, and thereby improving query performance.
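A minimal sketch, reusing the table written earlier; `optimize.compact()` merges the small files and returns metrics describing what it did:

```
from deltalake import DeltaTable

dt = DeltaTable("./delta_default_data_skipping")

# Merge small Parquet files into larger ones; returns metrics such as
# the number of files added and removed
metrics = dt.optimize.compact()
print(metrics)
```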
## 3. Z-Ordering
Z-Ordering is a data layout optimization technique that improves read performance by clustering related data together. This is particularly useful for range queries, as it reduces the number of files that need to be scanned. Z-Ordering is always performed along with Compaction.
Similar to Data Skipping, Z-Ordering only provides benefits during the initial read. Once the data is loaded into Pandas, it is held in memory, negating any advantages from the optimized layout.
```
DeltaTable(
    "./delta_default_data_skipping"
).optimize.z_order(['value'])
```
```
## Output
...
'totalFilesSkipped': 0,
'preserveInsertionOrder': True}
```
As you can see, there were 10 files initially. Since Compaction was performed along with Z-Ordering, those 10 files were removed and 7 new files were created in their place.
Let's check the history of the table to see whether Z-Ordering and Compaction have been applied:
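A sketch of the history call on the same table; each entry is a dictionary of commit information (operation, timestamp, and so on):

```
from deltalake import DeltaTable

for entry in DeltaTable("./delta_default_data_skipping").history():
    # Each entry describes one version and the operation that created it
    print(entry)
```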
As you can see above, two versions have been created.
The first version (version 0) is the initial default data skipping write. The second version (version 1) is the Compacted and Z-Ordered version.
## 4. Time Travel
Delta Lake maintains historical versions of data, allowing users to query past states, making it useful for auditing, debugging, and rollback scenarios.
As expected, we see that three different versions of the data exist at the above output_path.
We can travel to different versions of the data using the version argument when performing the read. For example, to read version 1 of the above path:
```
df_version1 = DeltaTable(
    output_path,
    version=1
).to_pandas()
```
> Note: Not specifying the version argument will automatically return the latest version of data available in the path specified.
## 5. Vacuum
While older versions of the data must be kept on disk for time travel, retaining them over a long period significantly increases storage costs. We can therefore permanently delete older files that are no longer needed for time travel.
Vacuum removes Parquet files that are no longer referenced, keeping the dataset lean. Running it regularly helps control storage costs by clearing obsolete data.
```
### Let's remove the two older versions from the Time Travel example and retain only the latest version
```
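A minimal sketch of the vacuum call, assuming the delta_table object from the Time Travel example; retention_hours=0 together with enforce_retention_duration=False forces immediate removal of files no longer referenced by the latest version (use with care):

```
# dry_run=False actually deletes the files; the default (True) only lists them
delta_table.vacuum(
    retention_hours=0,
    dry_run=False,
    enforce_retention_duration=False,
)
```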
The deletion operations have been performed successfully.
Let's have a look at whether it actually worked:
```
### Version 0
df_version0 = DeltaTable(
    output_path,
    version=0
).to_pandas()
```
```
## Output
...
Input In [69], in <cell line: 1>()
      3 version = 0
      4 ).to_pandas()
...
```
```
### Version 1
df_version1 = DeltaTable(
    output_path,
    version=1
).to_pandas()
```
```
## Output
...
Input In [70], in <cell line: 1>()
      3 version = 1
      4 ).to_pandas()
...
```
```
### Version 2 (Latest)
df_version2 = DeltaTable(
    output_path,
    version=2  # optional here, as this is the latest version anyway
).to_pandas()
```
```
Works Successfully
```
As expected, the older versions are no longer available; only the latest version remains, which is what we wanted.
Now, let's have a look at the history of changes to this table:
```
delta_table.history()
```
```
## Output
...
```

The history command will return all the operations that have been performed on the table.
To illustrate the performance benefits of Data Skipping and Z-Ordering, let's compare query times for a dataset with and without these optimizations.
## Theoretical Context
Without Data Skipping and Z-Ordering, queries must scan a significant portion (or all) of the dataset, leading to increased latency. Data Skipping improves efficiency by leveraging Parquet metadata to prune irrelevant files before reading the data, while Z-Ordering enhances range query performance by sorting and clustering related data together, minimizing unnecessary scans.
## Implementation
```
### Performance of a query with default data skipping only (no Z-Ordering)
```
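A sketch of both sides of the comparison, under stated assumptions: two hypothetical table paths holding the same data, one Z-Ordered and one not, and an illustrative range predicate on value pushed down via to_pandas(filters=...):

```
import time
from deltalake import DeltaTable

def timed_read(path):
    # Push the predicate down to the Delta read so file statistics
    # (and the Z-Ordered layout, where present) can be exploited
    start = time.perf_counter()
    df = DeltaTable(path).to_pandas(filters=[("value", ">", 900)])
    print(f"{path}: {time.perf_counter() - start:.2f}s, {len(df)} rows")
    return df

timed_read("./delta_no_zorder")  # hypothetical non-Z-Ordered copy
timed_read("./delta_zordered")   # hypothetical Z-Ordered copy
```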
As you can see above, filtering the Z-Ordered table (4.04 seconds) is roughly 50% faster than running the same query against the non-Z-Ordered table (8.23 seconds). While the difference is modest here due to the relatively small amount of data involved (100M rows and 3 columns), it becomes enormous in real-world scenarios with petabytes of data.
This demonstrates how Delta Lake optimizations reduce scan time and improve retrieval speed. However, once the data is loaded into Pandas, these benefits no longer apply, as Pandas operates entirely in-memory.
## Conclusion
Using delta-rs, we can leverage Delta Lake's powerful features with Pandas, enhancing data lake storage without requiring Spark. However, optimizations like Data Skipping and Z-Ordering only benefit initial reads, as Pandas loads all data into memory. Implementing Delta Lake outside Spark expands its accessibility, providing ACID guarantees and performance improvements for Python-based data workflows.
By leveraging delta-rs effectively, data practitioners can build robust, high-performance data pipelines using Delta Lake beyond the Spark ecosystem.