
Commit 00378cc

Add Parquet documentation with Apache Arrow integration details
Added a new topic on reading Parquet files via Apache Arrow to the documentation. Updated the data sources index and table of contents to include the Parquet guide. The documentation covers method overloads, examples, requirements, performance tips, and current limitations.
1 parent 9c325ce commit 00378cc

3 files changed: +194 −0 lines changed

docs/StardustDocs/d.tree

Lines changed: 1 addition & 0 deletions
@@ -209,6 +209,7 @@
 <toc-element topic="CSV-TSV.md"/>
 <toc-element topic="Excel.md"/>
 <toc-element topic="ApacheArrow.md"/>
+<toc-element topic="Parquet.md"/>
 <toc-element topic="SQL.md">
 <toc-element topic="PostgreSQL.md"/>
 <toc-element topic="MySQL.md"/>

docs/StardustDocs/topics/dataSources/Data-Sources.md

Lines changed: 1 addition & 0 deletions
@@ -21,6 +21,7 @@ Below you'll find a list of supported sources along with instructions on how to
 - [CSV / TSV](CSV-TSV.md)
 - [Excel](Excel.md)
 - [Apache Arrow](ApacheArrow.md)
+- [Parquet](Parquet.md)
 - [SQL](SQL.md):
   - [PostgreSQL](PostgreSQL.md)
   - [MySQL](MySQL.md)
Lines changed: 192 additions & 0 deletions
@@ -0,0 +1,192 @@
# Parquet

<web-summary>
Read Parquet files via Apache Arrow in Kotlin DataFrame — high‑performance columnar storage for analytics.
</web-summary>

<card-summary>
Use Kotlin DataFrame to read Parquet datasets using Apache Arrow for fast, typed, columnar I/O.
</card-summary>

<link-summary>
Kotlin DataFrame can read Parquet files through Apache Arrow’s Dataset API. Learn how and when to use it.
</link-summary>

Kotlin DataFrame supports reading [Apache Parquet](https://parquet.apache.org/) files through the Apache Arrow integration.

Requires the [`dataframe-arrow` module](Modules.md#dataframe-arrow), which is included by default in the general [`dataframe`](Modules.md#dataframe-general) artifact and in [`%use dataframe`](SetupKotlinNotebook.md#integrate-kotlin-dataframe) for Kotlin Notebook.

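For a standalone Gradle project, a minimal dependency sketch might look like this; `<version>` is a placeholder for the current release, and the `dataframe-arrow` coordinate is only needed if you don't use the general artifact:

```kotlin
// build.gradle.kts (minimal sketch; replace <version> with the current release)
dependencies {
    // The general artifact already bundles Parquet/Arrow support:
    implementation("org.jetbrains.kotlinx:dataframe:<version>")

    // Alternatively, depend on the Arrow module alone:
    // implementation("org.jetbrains.kotlinx:dataframe-arrow:<version>")
}
```
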
> We currently support **reading** Parquet via Apache Arrow only; writing Parquet is not supported in Kotlin DataFrame.
> {style="note"}

> Apache Arrow is not supported on Android, so reading Parquet files on Android is not available.
> {style="warning"}

> Structured (nested) Arrow types such as Struct are not supported yet in Kotlin DataFrame.
> See the tracking issue: [Add inner / Struct type support in Arrow](https://github.com/Kotlin/dataframe/issues/536)
> {style="warning"}

## Reading Parquet Files

Kotlin DataFrame provides four `readParquet()` overloads that read from different source types.
All overloads accept an optional `nullability` inference setting and a `batchSize` for Arrow scanning.

```kotlin
// 1) URLs
public fun DataFrame.Companion.readParquet(
    vararg urls: URL,
    nullability: NullabilityOptions = NullabilityOptions.Infer,
    batchSize: Long = ARROW_PARQUET_DEFAULT_BATCH_SIZE,
): AnyFrame

// 2) Strings (interpreted as file paths or URLs, e.g., "data/file.parquet", "file://", or "http(s)://")
public fun DataFrame.Companion.readParquet(
    vararg strUrls: String,
    nullability: NullabilityOptions = NullabilityOptions.Infer,
    batchSize: Long = ARROW_PARQUET_DEFAULT_BATCH_SIZE,
): AnyFrame

// 3) Paths
public fun DataFrame.Companion.readParquet(
    vararg paths: Path,
    nullability: NullabilityOptions = NullabilityOptions.Infer,
    batchSize: Long = ARROW_PARQUET_DEFAULT_BATCH_SIZE,
): AnyFrame

// 4) Files
public fun DataFrame.Companion.readParquet(
    vararg files: File,
    nullability: NullabilityOptions = NullabilityOptions.Infer,
    batchSize: Long = ARROW_PARQUET_DEFAULT_BATCH_SIZE,
): AnyFrame
```

These overloads are defined in the `dataframe-arrow` module and internally use `FileFormat.PARQUET` from Apache Arrow’s
Dataset API to scan the data and materialize it as a Kotlin `DataFrame`.

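To make the mechanism concrete, here is a minimal, illustrative sketch of such a Dataset-API scan using classes from the `org.apache.arrow:arrow-dataset` artifact. It is not the library's actual source, and the `file://` URI and batch size are hypothetical examples:

```kotlin
import org.apache.arrow.dataset.file.FileFormat
import org.apache.arrow.dataset.file.FileSystemDatasetFactory
import org.apache.arrow.dataset.jni.NativeMemoryPool
import org.apache.arrow.dataset.scanner.ScanOptions
import org.apache.arrow.memory.RootAllocator

// Illustrative only: scan a Parquet file batch by batch with Arrow's Dataset API.
fun scanParquet(uri: String = "file:///tmp/sales.parquet", batchSize: Long = 32 * 1024L) {
    RootAllocator().use { allocator ->
        FileSystemDatasetFactory(allocator, NativeMemoryPool.getDefault(), FileFormat.PARQUET, uri).use { factory ->
            factory.finish().use { dataset ->
                dataset.newScan(ScanOptions(batchSize)).use { scanner ->
                    scanner.scanBatches().use { reader ->
                        while (reader.loadNextBatch()) {
                            // Each batch arrives as an Arrow VectorSchemaRoot of columnar vectors.
                            println("Read batch with ${reader.vectorSchemaRoot.rowCount} rows")
                        }
                    }
                }
            }
        }
    }
}
```

Each loaded batch is an Arrow `VectorSchemaRoot` of columnar vectors, which can then be converted into `DataFrame` columns.
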
### Examples

```kotlin
// Read from file paths (as strings)
val df1 = DataFrame.readParquet("data/sales.parquet")

// Read from File objects
val file = File("data/sales.parquet")
val df2 = DataFrame.readParquet(file)

// Read from Path objects
val path = Paths.get("data/sales.parquet")
val df3 = DataFrame.readParquet(path)

// Read from URLs
val url = URL("https://example.com/data/sales.parquet")
val df4 = DataFrame.readParquet(url)

// Customize nullability inference and batch size
val df5 = DataFrame.readParquet(
    File("data/sales.parquet"),
    nullability = NullabilityOptions.Infer,
    batchSize = 64L * 1024 // tune Arrow scan batch size if needed
)
```

### Multiple Files

It's possible to read multiple Parquet files:

```kotlin
val df = DataFrame.readParquet("file1.parquet", "file2.parquet", "file3.parquet")
```

**Requirements:**

- All files must have compatible schemas
- Files are vertically concatenated (union of rows)
- Column types must match exactly
- Missing columns in some files will result in null values

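For instance, a short sketch of combining several monthly exports (hypothetical file names) and checking the unified result:

```kotlin
// Hypothetical monthly exports sharing the same schema; rows from all files
// are appended in the order the files are passed.
val sales = DataFrame.readParquet(
    "sales_2024_01.parquet",
    "sales_2024_02.parquet",
    "sales_2024_03.parquet",
)

println(sales.rowsCount()) // total number of rows across all files
println(sales.schema())    // inspect the combined schema
```
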
### Batch Size Tuning

- **Default**: `ARROW_PARQUET_DEFAULT_BATCH_SIZE` (typically 1024)
- **Small files** (< 100 MB): use the default
- **Large files** (> 1 GB): increase to `64 * 1024` or `128 * 1024`
- **Memory constrained**: decrease to `256` or `512`

```kotlin
// For large files with enough memory
DataFrame.readParquet("large_file.parquet", batchSize = 64L * 1024)

// For memory-constrained environments
DataFrame.readParquet("file.parquet", batchSize = 256L)
```

### Nullability Inference

The `nullability` parameter controls how column nullability is handled:

```kotlin
// Infer nullability from the actual data (default)
DataFrame.readParquet("file.parquet", nullability = NullabilityOptions.Infer)

// Take nullability from the Arrow schema; fail if the data contradicts it
DataFrame.readParquet("file.parquet", nullability = NullabilityOptions.Checking)

// Take nullability from the Arrow schema, widening to nullable if the data requires it
DataFrame.readParquet("file.parquet", nullability = NullabilityOptions.Widening)
```

## About Parquet

[Apache Parquet](https://parquet.apache.org/) is an open-source, column-oriented data file format designed for efficient data storage and retrieval. It provides several advantages:

- **Columnar storage**: Data is stored column-by-column, which enables efficient compression and encoding schemes
- **Schema evolution**: Supports adding new columns without breaking existing data readers
- **Efficient querying**: Optimized for analytics workloads where you typically read a subset of columns
- **Cross-platform**: Works across different programming languages and data processing frameworks
- **Compression**: Built-in support for various compression algorithms (GZIP, Snappy, etc.)

Parquet files are commonly used in data lakes, data warehouses, and big data processing pipelines. They're frequently created by tools like Apache Spark, Pandas, Dask, and various cloud data services.

## Typical use cases

- Exchanging columnar datasets between Spark and Kotlin/JVM applications (see the sketch below).
- Analytical workloads where columnar compression and predicate pushdown matter.
- Reading data exported from data lakes and lakehouse tables (e.g., from Spark, Hive, or Delta/Iceberg exports).

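For the Spark-exchange case, a minimal sketch is to pass every part file written by Spark's `df.write.parquet(...)` to `readParquet`; the `exported/sales` directory is a hypothetical example:

```kotlin
import java.io.File

// Spark typically writes one part-*.parquet file per partition into the target directory.
val parts = File("exported/sales")
    .listFiles { f -> f.extension == "parquet" }
    ?.sortedBy { it.name }
    .orEmpty()

val sales = DataFrame.readParquet(*parts.toTypedArray())
```
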
If you want to see a complete, realistic data‑engineering example using Spark and Parquet with Kotlin DataFrame,
check out the [example project](https://github.com/Kotlin/dataframe/tree/master/examples/idea-examples/spark-parquet-dataframe).

### Performance tips

- **Column selection**: `readParquet()` reads all columns, so use DataFrame operations like `select()` immediately after reading to reduce memory usage in later operations (see the sketch below)
- **Predicate pushdown**: currently not supported; filtering happens after the data is loaded into memory
- Use Arrow‑compatible JVMs as documented in
  [Apache Arrow Java compatibility](https://arrow.apache.org/docs/java/install.html#java-compatibility).
- Adjust `batchSize` if you read huge files and need to tune throughput vs. memory.

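A minimal sketch of the column-selection tip; the column names are hypothetical:

```kotlin
// Keep only the columns the analysis actually needs right after reading.
val slim = DataFrame.readParquet("data/sales.parquet")
    .select("orderId", "amount", "date")
```
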
## Limitations

### Structured Data Support

> **Important**: We currently don't support reading nested/structured data from Parquet files. Complex types like nested objects, arrays of structs, and maps are not yet supported.
>
> This limitation is tracked in issue [#536: Add inner/Struct type support in Arrow](https://github.com/Kotlin/dataframe/issues/536).
> {style="warning"}

If your Parquet file contains nested structures, you may need to flatten the data before processing or use alternative tools for initial data preparation.

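As one option, a hedged sketch of flattening nested columns with Apache Spark before handing the data to Kotlin DataFrame; the column names and paths are hypothetical:

```kotlin
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").appName("flatten-parquet").getOrCreate()

// Promote nested struct fields to top-level columns and re-export flat Parquet.
spark.read().parquet("data/nested.parquet")
    .select(
        col("order.id").alias("order_id"),
        col("order.total").alias("order_total"),
        col("customer.country").alias("customer_country"),
    )
    .write().mode("overwrite").parquet("data/flat.parquet")

spark.stop()
```

The flattened output can then be read with `DataFrame.readParquet("data/flat.parquet")`.
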
### Android Compatibility

> **Note**: Parquet file reading is **not available on Android** because Apache Arrow is not supported on the Android platform.
> {style="warning"}

If you need to process Parquet files in an Android application, consider:

- Processing files on a server and exposing the data via an API
- Converting Parquet files to a supported format (JSON, CSV) for Android consumption
- Using cloud-based data processing services

### See also

- [](ApacheArrow.md) — reading and writing Arrow IPC formats
- [Parquet official site](https://parquet.apache.org/)
- Example: [Spark + Parquet + Kotlin DataFrame](https://github.com/Kotlin/dataframe/tree/master/examples/idea-examples/spark-parquet-dataframe)
- [](Data-Sources.md) — overview of all supported formats
