@Tishj commented Oct 7, 2025

This PR is a follow-up to the PR that added the ability to provide a SHREDDING copy option to manually set the shredded type of a VARIANT column.
That change implemented shredding and made it easy to control, but it required manual configuration by the user.

With this PR, VARIANT columns are always shredded, based on an analysis of the first row group.
(TODO: this should be queryable with parquet_metadata / parquet_schema)
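
For illustration, here is a minimal sketch of the difference in usage, assuming a table `tbl` with a VARIANT column `v`; the table/column names and the exact `SHREDDING` option value syntax are assumptions, while `parquet_schema` is the existing table function mentioned in the TODO above:

```
-- Hypothetical table with a VARIANT column (names are illustrative).
CREATE TABLE tbl (v VARIANT);
INSERT INTO tbl VALUES (42::VARIANT), ('hello'::VARIANT);

-- Before: shredding had to be requested explicitly via the SHREDDING copy option
-- (the exact option value syntax below is an assumption):
-- COPY tbl TO 'manual.parquet' (FORMAT parquet, SHREDDING {v: 'VARCHAR'});

-- Now: the shredded type is detected automatically from the first row group.
COPY tbl TO 'auto.parquet' (FORMAT parquet);

-- Per the TODO: ideally the resulting (shredded) schema would be inspectable here.
SELECT * FROM parquet_schema('auto.parquet');
```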

This required some changes to the Parquet writing path:

  • SchemaElement items are no longer created during bind.
  • ParquetColumnSchema schema_index is now an optional_idx, as this index refers to the SchemaElements vector, which is populated later.
  • ColumnWriter no longer takes a reference to a ParquetColumnSchema object; instead, it now owns the schema.
  • FillParquetSchema is removed; its logic is consolidated into CreateWriterRecursive.
  • [1] ParquetWriteTransformData is introduced; this struct holds the reusable state needed for a pre-processing step that transforms the input columns into the shape required by the ColumnWriter (relevant for VARIANT).
  • FinalizeSchema has been added to create the SchemaElements and populate the schema_index in the schema objects.
  • HasTransform, TransformedType and TransformExpression are added to perform the pre-processing step using the ParquetWriteTransformData.
  • AnalyzeSchemaInit, AnalyzeSchema and AnalyzeSchemaFinalize are added to auto-detect the shredding for VARIANT columns and apply it to the schema; they are used before FinalizeSchema is called.

[1] This is not as clean as I would like: the ParquetWriteTransformData can't live only in the ParquetWriteLocalState; a copy is also needed in the ParquetWriteGlobalState, as well as in the ParquetWritePrepareBatch call.

Maxxen and others added 30 commits October 3, 2025 02:30
…ol GeoParquet version (duckdb#19244)

In duckdb#18832 we added support for
reading and writing geometry as native parquet geometry types. I also
made it so that we write native geometry types by default, always. I
thought this would be safe to do as the new parquet geometry native type
is just a logical type, so older parquet readers that don't support it
yet would just continue to read it like any other blob.

Unfortunately, as shown in
duckdb/duckdb-spatial#688 (comment),
this turns out to be wrong: lots of readers throw errors instead of
ignoring the logical type annotation.

This PR thus changes the behavior back to writing blobs + geoparquet
metadata, but also adds a new `GEOPARQUET_VERSION` flag to select what
geoparquet version to use. Unfortunately again, GeoParquet 2.0 is
supposed to be based on the new native geometry type, but there is no
official standard released yet. Therefore, the valid options are as
follows:

- `NONE` = Don't write GeoParquet metadata, store geometries as native
parquet geometry logical type with geo stats

- `V1` = Write GeoParquet metadata, store geometries as blobs with no
stats.
  Made default in this PR, same behavior as DuckDB v1.3.2
  
- `BOTH` = Write GeoParquet metadata, _and_ store geometries as native
parquet geometry logical type with stats
  Accidentally made default in v1.4.0
  
Hopefully `NONE` or a future `V2` can be default in DuckDB v1.5 together
with the larger geometry overhaul.
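
For reference, a hedged usage sketch of the new flag; the table name is illustrative and the exact spelling/quoting of the option value as a COPY parameter is an assumption based on the options listed above:

```
-- Default after this PR: GeoParquet metadata + blobs (same behavior as DuckDB v1.3.2).
COPY geo_table TO 'v1.parquet' (FORMAT parquet, GEOPARQUET_VERSION 'V1');

-- No GeoParquet metadata; store geometries as the native parquet geometry logical type.
COPY geo_table TO 'native.parquet' (FORMAT parquet, GEOPARQUET_VERSION 'NONE');

-- Both: GeoParquet metadata and the native geometry logical type (the v1.4.0 behavior).
COPY geo_table TO 'both.parquet' (FORMAT parquet, GEOPARQUET_VERSION 'BOTH');
```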

This is an actual fix compared to
duckdb#19243.

Ideally this gets in as part of v1.4.1.
This PR bumps the following extensions:
- `iceberg` from `9b8c6de6b7` to `4f3c5499e5`
This PR bumps the following extensions:
- `avro` from `0c97a61781` to `7b75062f63`
- `httpfs` from `0518838dae` to `8356a90174 (patches removed: 1)`
…rquetSchema step, this is now part of CreateWriterRecursive
Addresses the mismatch between Azure's 1.4.0 tag and the docs;
see also duckdb/duckdb-azure#123
Mytherin and others added 30 commits October 14, 2025 19:18
The `imdb` library is only used in benchmarks, but was built pretty much
always.
… acquire a read

or write lock on `UpdateSegment::lock`. Therefore, an active query traversing
the `UpdateInfo` chain is now mutually exclusive with `UpdateSegment::CleanupUpdate`,
and `old_transactions` is no longer necessary.
…ize (duckdb#19390)

Fixes an issue with the line reconstruction code, which would otherwise
cause the `first_nl` to be read from an uninitialized piece of memory,
leading to potential off-by-one errors in line/byte positions.
This PR adds a visitor class that encapsulates the logic to traverse a
variant, removing some duplication and protecting against
implementations missing logic for handling certain types (such as
GEOMETRY)

This pattern is now used in `ConvertVariantToValue`,
`ConvertVariantToJSON`, and `variant_to_variant.hpp`.
Not a *great* sign that the additions are higher than the deletions, but
in the long run it should help
This PR switches the `RowVersionManager` that tracks whether or not rows
are relevant for a given transaction to using the `FixedSizeAllocator`.
This makes the majority of its memory usage trackable with the buffer
manager. The `FixedSizeAllocator` does not support eviction yet, but it
could in theory also start supporting that in the future using this
setup.
Ensure that profiling metrics can be obtained from the appender.

Closes: duckdblabs/duckdb-internal#5393
PR based on: duckdb#19181,
duckdb#19229

This PR introduces the first `Statement` in the transformer, the
`UseStatement`, which selects the database and/or schema.
 
Other things in this PR:
- During tokenization we now keep track of the offset and length for
each token
- I added a new config to the `Main` workflow that tests with the
`allow_parser_override_extension` set to `fallback`. This should test
that the new transformer will either (1) parse and transform correctly
and return the correct result or (2) throw an error and let the old
parser/transformer handle the query. So it should pass all tests.
- The `ParserOverrideResult` now stores `ErrorData` rather than the raw
string for better error handling and cleaner error messages.
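
As a rough sketch of what this enables end-to-end (the setting name and the `fallback` value come from this PR; exposing it via a plain `SET`, and the database/schema names, are assumptions):

```
-- Route parsing through the new transformer, falling back to the old parser on error.
SET allow_parser_override_extension = 'fallback';

-- The first statement handled by the new transformer: USE, selecting a database and/or schema.
USE my_db;
USE my_db.my_schema;
```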
See duckdb#19335 for details.

Currently, all functions that access or modify the `UpdateInfo` chain
acquire a read or write lock on `UpdateSegment::lock`. Therefore, an
active query traversing the `UpdateInfo` chain is now mutually exclusive
with `UpdateSegment::CleanupUpdate`, and `old_transactions` is no
longer necessary.
This PR introduces the `arg_min_nulls_last` and `arg_max_nulls_last`
functions, which implement top-N aggregation with NULLS LAST semantics
equivalent to regular ordering. The `TopNWindowElimination` optimizer
rule (duckdb#19280) uses these functions to enable window rewriting with
aggregates on columns containing null values. With this change, the
optimizer rule shows its first effect on a benchmark, namely
`h2oai/group/q08`, which improves by 2-3x in the regression test.

**Example**
Given a table `T` with a column `T.a` and values `1`, `2`, and `NULL`:
```
arg_min(a, a, 3) => [1, 2]
arg_min_nulls_last(a, a, 3) => [1, 2, NULL]
arg_max(a, a, 3) => [2, 1]
arg_max_nulls_last(a, a, 3) => [2, 1, NULL]
```

With `T.a` only containing nulls:
```
arg_max(a, a) => NULL
arg_max_nulls_last(a, a) => [NULL]
```
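
For context, here is a sketch of the kind of top-N window query the `TopNWindowElimination` rule targets; whether this exact shape is rewritten is not confirmed here, it only illustrates why NULLS LAST semantics matter for the rewrite:

```
-- Top-3 values of T.a with NULLs ordered last, written as a window query.
-- Under NULLS LAST semantics this matches arg_min_nulls_last(a, a, 3),
-- so the window can (in principle) be replaced by the aggregate.
SELECT a
FROM (
    SELECT a, row_number() OVER (ORDER BY a ASC NULLS LAST) AS rn
    FROM T
) ranked
WHERE rn <= 3;
```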
Changes all exception class constructors to use const references (`ARGS
const &...params`) instead of pass-by-value (`ARGS... params`).
…uckdb#19337)

This PR implements `variant_normalize`, which rewrites the
variant internals to facilitate comparison operations.
We push `variant_normalize` onto a `VARIANT` column as a collation
so that comparison operations and hashing function correctly.

I've copied all tests from `list_distinct.test` and
`struct_distinct.test` and adjusted them to cast at least one side of
the comparison to VARIANT.
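
A minimal sketch of the intended effect, assuming struct literals and VARIANT casts as used in those tests (the values are illustrative):

```
-- Two logically equal values; with variant_normalize applied as a collation,
-- equality and hashing on VARIANT behave as expected.
SELECT {'a': 1, 'b': [1, 2]}::VARIANT = {'a': 1, 'b': [1, 2]}::VARIANT;

-- variant_normalize can also be called directly; it rewrites the internal
-- representation without changing the logical value (illustrative usage).
SELECT variant_normalize({'a': 1, 'b': [1, 2]}::VARIANT);
```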