forked from duckdb/duckdb
[Parquet] Implement automatic shredding for VARIANT columns in COPY TO parquet #146
Open
Tishj wants to merge 351 commits into write_parquet_variants_shredding from write_parquet_variants_shredding_auto
Conversation
…ol GeoParquet version (duckdb#19244)

In duckdb#18832 we added support for reading and writing geometry as native parquet geometry types. I also made it so that we write native geometry types by default, always. I thought this would be safe to do, as the new parquet geometry native type is just a logical type, so older parquet readers that don't support it yet would simply continue to read it like any other blob. Unfortunately, as shown in duckdb/duckdb-spatial#688 (comment), this turns out to be wrong: lots of readers throw errors instead of ignoring the logical type annotation.

This PR thus changes the behavior back to writing blobs + GeoParquet metadata, but also adds a new `GEOPARQUET_VERSION` flag to select which GeoParquet version to use. Unfortunately again, GeoParquet 2.0 is supposed to be based on the new native geometry type, but there is no official standard released yet. Therefore, the valid options are as follows:

- `NONE` = Don't write GeoParquet metadata; store geometries as the native parquet geometry logical type with geo stats.
- `V1` = Write GeoParquet metadata; store geometries as blobs with no stats. Made the default in this PR, same behavior as DuckDB v1.3.2.
- `BOTH` = Write GeoParquet metadata _and_ store geometries as the native parquet geometry logical type with stats. Accidentally made the default in v1.4.0.

Hopefully `NONE` or a future `V2` can become the default in DuckDB v1.5 together with the larger geometry overhaul. This is an actual fix compared to duckdb#19243; ideally this gets in as part of v1.4.1.
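A minimal usage sketch of the new flag, assuming it is passed as a regular COPY option and accepts the version as a string literal; the `boundaries` table is hypothetical:

```sql
-- Requires the spatial extension for the GEOMETRY type.
INSTALL spatial;
LOAD spatial;

-- Hypothetical table with a GEOMETRY column.
CREATE TABLE boundaries AS
    SELECT 1 AS id, ST_GeomFromText('POINT(1 2)') AS geom;

-- Default restored by this PR: blobs + GeoParquet metadata (same behavior as v1.3.2).
COPY boundaries TO 'boundaries_v1.parquet' (FORMAT parquet, GEOPARQUET_VERSION 'V1');

-- Native parquet geometry logical type with geo stats, no GeoParquet metadata.
COPY boundaries TO 'boundaries_native.parquet' (FORMAT parquet, GEOPARQUET_VERSION 'NONE');
```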
This PR bumps the following extensions: - `iceberg` from `9b8c6de6b7` to `4f3c5499e5`
This PR bumps the following extensions: - `avro` from `0c97a61781` to `7b75062f63` - `httpfs` from `0518838dae` to `8356a90174 (patches removed: 1)`
…rquetSchema step, this is now part of CreateWriterRecursive
addresses the mismatch between Azure's 1.4.0 tag and the docs, see also duckdb/duckdb-azure#123
The `imdb` library is only used in benchmarks, but was being built in almost every configuration.
… acquire a read or write lock on `UpdateSegment::lock`. Therefore, an Active Query traversing the `UpdateInfo` chain is now mutually exclusive with `UpdateSegment::CleanupUpdate`, and `old_transactions` is no longer necessary.
…ize (duckdb#19390) Fixes an issue with the line reconstruction code, which would otherwise cause the `first_nl` to be read from an uninitialized piece of memory leading to potential off-by-one errors in line / byte positions.
This PR adds a visitor class that encapsulates the logic to traverse a variant, removing some duplication and protecting against implementations missing logic for handling certain types (such as GEOMETRY). This pattern is now used in `ConvertVariantToValue`, `ConvertVariantToJSON`, and `variant_to_variant.hpp`. It is not a *great* sign that the additions outnumber the deletions, but in the long run it should help.
This PR switches the `RowVersionManager` that tracks whether or not rows are relevant for a given transaction to using the `FixedSizeAllocator`. This makes the majority of its memory usage trackable with the buffer manager. The `FixedSizeAllocator` does not support evicting yet - but it could in theory also start supporting that in the future using this setup.
Ensure that profiling metrics can be obtained from the appender. Closes: duckdblabs/duckdb-internal#5393
PR based on: duckdb#19181, duckdb#19229

This PR introduces the first `Statement` in the transformer, the `UseStatement`, which selects the database and/or schema. Other things in this PR:

- During tokenization we now keep track of the offset and length for each token.
- I added a new config to the `Main` workflow that tests with `allow_parser_override_extension` set to `fallback`. This should test that the new transformer will either (1) parse and transform correctly and return the correct result or (2) throw an error and let the old parser/transformer handle the query. So it should pass all tests.
- The `ParserOverrideResult` now stores `ErrorData` rather than the raw string, for better error handling and cleaner error messages.
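A brief usage sketch, assuming `allow_parser_override_extension` can be set with a regular `SET` statement (the commit message only mentions it as a workflow config, so the exact mechanism is an assumption):

```sql
-- Assumption: the override behavior is controlled via a settable option named
-- allow_parser_override_extension, where 'fallback' means "try the new parser,
-- fall back to the old one on error".
SET allow_parser_override_extension = 'fallback';

-- The first statement handled by the new transformer: select a database and/or schema.
USE memory;
USE memory.main;
```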
See duckdb#19335 for details. Currently, all functions that access or modify the `UpdateInfo` chain acquire a read or write lock on `UpdateSegment::lock`. Therefore, an Active Query traversing the `UpdateInfo` chain is now mutually exclusive with `UpdateSegment::CleanupUpdate`, and `old_transactions` is no longer necessary.
This PR introduces the `arg_min_nulls_last` and `arg_max_nulls_last` functions, which implement top-N aggregation with NULLS LAST semantics equivalent to regular ordering. The `TopNWindowElimination` optimizer rule (duckdb#19280) uses these functions to enable window rewriting with aggregates on columns containing null values. With this change, the optimizer rule shows its first effect on a benchmark, namely `h2oai/group/q08`, which improves by 2-3x in the regression test.

**Example**

Given a table `T` with a column `T.a` and values `1`, `2`, and `NULL`:

```
arg_min(a, a, 3)            => [1, 2]
arg_min_nulls_last(a, a, 3) => [1, 2, NULL]
arg_max(a, a, 3)            => [2, 1]
arg_max_nulls_last(a, a, 3) => [2, 1, NULL]
```

With `T.a` only containing nulls:

```
arg_max(a, a)            => NULL
arg_max_nulls_last(a, a) => [NULL]
```
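A runnable sketch of the same example, using a throwaway table `T` that mirrors the description and the `arg_min(arg, val, n)` argument order shown above:

```sql
-- Table with values 1, 2 and NULL, as in the example.
CREATE TABLE T AS SELECT * FROM (VALUES (1), (2), (NULL)) t(a);

SELECT
    arg_min(a, a, 3)            AS min_drops_nulls,  -- [1, 2]
    arg_min_nulls_last(a, a, 3) AS min_nulls_last,   -- [1, 2, NULL]
    arg_max(a, a, 3)            AS max_drops_nulls,  -- [2, 1]
    arg_max_nulls_last(a, a, 3) AS max_nulls_last    -- [2, 1, NULL]
FROM T;
```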
…Data` object, as we don't use any offsets
Changes all exception class constructors to use const references (`ARGS const &...params`) instead of pass-by-value (`ARGS... params`).
…uckdb#19337) This PR implements `variant_normalize`, which rewrites the variant internals to facilitate comparison operations. We push `variant_normalize` onto a `VARIANT` column as a collation so that comparison operations and hashing function correctly. I've copied all tests from `list_distinct.test` and `struct_distinct.test` and adjusted them to cast at least one side of the comparison to VARIANT.
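A small sketch of the kind of comparisons this enables, assuming the usual cast syntax works for `VARIANT` (the values and queries are illustrative, not taken from the PR's tests):

```sql
-- Equality on VARIANT values relies on the normalized internal representation.
SELECT [1, 2, 3]::VARIANT = [1, 2, 3]::VARIANT AS lists_equal;

-- Distinct-style operations on VARIANT values need the same normalization.
SELECT DISTINCT v
FROM (VALUES ({'a': 1}::VARIANT), ({'a': 1}::VARIANT), ({'a': 2}::VARIANT)) t(v);
```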
…nts_shredding_auto
This PR is a follow-up to the PR that added the ability to provide a `SHREDDING` copy option to manually set the shredded type of a VARIANT column. That implemented shredding and made it easy to control, but it requires manual configuration by the user.

With this PR, VARIANT columns will always be shredded based on the analysis of the first rowgroup.
(TODO: this should be queryable with `parquet_metadata`/`parquet_schema`)

This required some changes to the Parquet writing path:

- `SchemaElement` items are no longer created during bind.
- `ParquetColumnSchema`'s `schema_index` is now an `optional_idx`, as this index refers to the `SchemaElements` vector, which is populated later.
- `ColumnWriter` no longer takes a reference to a `ParquetColumnSchema` object; instead it now owns the schema.
- `FillParquetSchema` is removed, consolidated into `CreateWriterRecursive`.
- `ParquetWriteTransformData` is introduced; this struct holds the reusable state needed to perform a pre-processing step to transform the input columns into the shape required by the `ColumnWriter` (relevant for VARIANT).
- `FinalizeSchema` has been added to create the `SchemaElements` and populate the `schema_index` in the schema objects.
- `HasTransform`, `TransformedType` and `TransformExpression` are added to perform the pre-processing step, using the `ParquetWriteTransformData`.
- `AnalyzeSchemaInit`, `AnalyzeSchema` and `AnalyzeSchemaFinalize` are added to auto-detect the shredding for VARIANT columns and apply this change to the schema, to be used before `FinalizeSchema` is called.

[1] This is not as clean as I want it to be: the `ParquetWriteTransformData` can't only live in the `ParquetWriteLocalState`, there also needs to be a copy in the `ParquetWriteGlobalState` as well as in the `ParquetWritePrepareBatch` call.
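A hedged usage sketch of the behavior described above; the table, the column, and the manual `SHREDDING` option syntax are illustrative only, since the exact option format is defined in the earlier PR rather than spelled out here:

```sql
-- Hypothetical table with a VARIANT column.
CREATE TABLE events AS
    SELECT {'user': 'alice', 'clicks': 3}::VARIANT AS payload;

-- Before this PR: shredding had to be configured manually via the SHREDDING copy option
-- (illustrative syntax; see the earlier SHREDDING PR for the actual format).
-- COPY events TO 'events.parquet'
--     (FORMAT parquet, SHREDDING {payload: 'STRUCT(user VARCHAR, clicks BIGINT)'});

-- With this PR: no option needed, the shredded type is auto-detected from the first rowgroup.
COPY events TO 'events.parquet' (FORMAT parquet);
```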