
Commit 911073a

Merge pull request #119 from thliang01/Working-with-Apache-Arrow
DuckDB : Improve Apache Arrow + DuckDB notebook clarity and technical accuracy
2 parents a100272 + 2604086 commit 911073a

File tree: 1 file changed (+6 additions, -8 deletions)


duckdb/011_working_with_apache_arrow.py

Lines changed: 6 additions & 8 deletions
```diff
@@ -14,7 +14,7 @@
 
 import marimo
 
-__generated_with = "0.14.11"
+__generated_with = "0.14.12"
 app = marimo.App(width="medium")
 
```
```diff
@@ -300,17 +300,15 @@ def _(mo):
         ### Key Benefits:
 
         - **Memory Efficiency**: Arrow's columnar format uses 20-40% less memory than traditional DataFrames through compact columnar representation and better compression ratios
-        - **Zero-Copy Operations**: Data can be shared between DuckDB and Arrow-compatible systems (Polars, Pandas) without any data copying, eliminating redundant memory usage
+        - **Zero-Copy Operations**: Data can be shared between DuckDB and Arrow-compatible systems (Polars, Pandas) without any data copying, eliminating redundant memory usage
         - **Query Performance**: 2-10x faster queries compared to traditional approaches that require data copying
-        - **Larger-than-Memory Analysis**: Since both libraries support streaming query results, you can execute queries on data bigger than available memory by processing one batch at a time
+        - **Larger-than-Memory Analysis**: Both DuckDB and Arrow-compatible libraries support streaming query results, allowing you to execute queries on data larger than available memory by processing data in batches.
         - **Advanced Query Optimization**: DuckDB's optimizer can push down filters and projections directly into Arrow scans, reading only relevant columns and partitions
         Let's demonstrate these benefits with concrete examples:
         """
     )
     return
 
-
-
 @app.cell(hide_code=True)
 def _(mo):
     mo.md(r"""### Memory Efficiency Demonstration""")
```
```diff
@@ -529,7 +527,6 @@ def _(mo):
 
 @app.cell
 def _(polars_data, time):
-    import psutil
     import os
     import pyarrow.compute as pc  # Add this import
 
```
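This hunk moves `import psutil` out of the measurement cell (it is re-added to the shared imports cell later in the diff). The RSS-based measurement the notebook relies on can be sketched as follows (a minimal standalone sketch, assuming `psutil` is installed):

```python
import os
import psutil

# Resident set size (RSS) of the current process, as the notebook
# samples it before and after the Arrow and copy-based operations.
process = psutil.Process(os.getpid())
rss_mb = process.memory_info().rss / 1024 / 1024  # bytes -> MB
print(f"Current RSS: {rss_mb:.2f} MB")
```

Note that RSS is a whole-process number, so differences between two samples include allocator and interpreter overhead, not just the DataFrames themselves.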
```diff
@@ -554,14 +551,14 @@ def _(polars_data, time):
     # Compare with traditional copy-based operations
     latest_start_time = time.time()
 
-    # These operations create copies
+    # These operations may create copies depending on Pandas' Copy-on-Write (CoW) behavior
     pandas_copy = polars_data.to_pandas()
     pandas_sliced = pandas_copy.iloc[:100000].copy()
     pandas_filtered = pandas_copy[pandas_copy['value'] > 500000].copy()
 
     copy_ops_time = time.time() - latest_start_time
     memory_after_copy = process.memory_info().rss / 1024 / 1024  # MB
-
+
     print("Memory Usage Comparison:")
     print(f"Initial memory: {memory_before:.2f} MB")
     print(f"After Arrow operations: {memory_after_arrow:.2f} MB (diff: +{memory_after_arrow - memory_before:.2f} MB)")
```
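The comment change here hedges correctly: under Pandas' Copy-on-Write mode, slicing no longer eagerly duplicates data. A minimal sketch of that behavior (assuming pandas ≥ 2.0, where the `mode.copy_on_write` option exists; column and variable names are illustrative):

```python
import pandas as pd

pd.set_option("mode.copy_on_write", True)

df = pd.DataFrame({"value": range(1_000_000)})

# Under CoW, slicing returns a frame that shares buffers with `df`;
# no data is copied at this point.
sliced = df.iloc[:100_000]

# The copy happens lazily, only when one side is written to.
sliced.loc[0, "value"] = -1

# The original frame is unaffected by the write.
print(df.loc[0, "value"])  # 0
```

This is why the updated comment says the operations "may" create copies: whether and when a copy materializes depends on the CoW setting and on whether the result is subsequently mutated.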
```diff
@@ -606,6 +603,7 @@ def _():
     import pandas as pd
     import duckdb
     import sqlglot
+    import psutil
     return duckdb, mo, pa, pd, pl
 
 
```