Parquet variant intermediate conversion #145
base: variant_logical_type
Conversation
```cpp
//! NOTE: we initialize these to 0 because some rows will not set them, if the row is NULL
//! To simplify the implementation we just allocate 'dictionary.Size()' keys for each row
for (idx_t i = 0; i < keys_offset; i++) {
	keys_selvec.set_index(i, 0);
}
```
I think this can be removed; it's copied from `to_variant.cpp`, but in this case we actually know that the keys are used.
```
array_of_struct_of_variants	STRUCT(v VARIANT)[]	YES	NULL	NULL	NULL
struct_of_array_of_variants	STRUCT(v VARIANT[])	YES	NULL	NULL	NULL
```

```
# FIXME: RemapStruct is not supported for VARIANT, I don't think it can be, since the schema is part of the data
```
This is old; it has been resolved already.
…n the DatabaseFilePathManager
…cription_get_column_type
Right now duckdb will always set `RULE_LAUNCH_COMPILE` when either ccache or sccache is found. `RULE_LAUNCH_COMPILE` is [not meant for](https://cmake.org/cmake/help/latest/prop_gbl/RULE_LAUNCH_COMPILE.html) external use:

> *"This property is intended for internal use by ctest(1). Projects and developers should use the `<LANG>_COMPILER_LAUNCHER` target properties or the associated `CMAKE_<LANG>_COMPILER_LAUNCHER` variables instead."*

Using these vars plays nicer with upstreams that depend on DuckDB. This fixes e.g. duckdb/duckdb-python#100.
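The documented replacement the commit points to would look roughly like this (a sketch, not DuckDB's actual CMake; `CCACHE_PROGRAM` is a hypothetical variable name):

```cmake
# Instead of setting the internal RULE_LAUNCH_COMPILE global property...
find_program(CCACHE_PROGRAM ccache)
if(CCACHE_PROGRAM)
  # ...use the documented per-language launcher variables.
  set(CMAKE_C_COMPILER_LAUNCHER "${CCACHE_PROGRAM}")
  set(CMAKE_CXX_COMPILER_LAUNCHER "${CCACHE_PROGRAM}")
endif()
```

These variables initialize the corresponding `<LANG>_COMPILER_LAUNCHER` target properties, so downstream projects that embed DuckDB can override them per target.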
…ermediate_conversion
This allows settings to be set prior to connecting to the initial database. Syntax is:

```json
"settings": [
    {"name": "storage_compatibility_version", "value": "latest"}
]
```
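Embedded in a full config file, that block might look like the following (the surrounding file layout is a hypothetical illustration; only the `settings` key and its entry come from the commit message):

```json
{
    "settings": [
        {"name": "storage_compatibility_version", "value": "latest"}
    ]
}
```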
…m the hotpath here
Correctly handle identifiers for `IngestionMode::CREATE`
…#19420) Follow-up from duckdb#19393 There are a number of issues still caused by serializing CTE nodes - this PR makes it so that we only serialize CTE nodes when MATERIALIZED is explicitly defined, and serialize only the CommonTableExpressionMap otherwise. In addition, we never deserialize CTENodes anymore - and always reconstruct them from the CommonTableExpressionMap.
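Concretely, under the change described above only the explicitly materialized form would still be serialized as a CTE node (illustrative SQL; the queries themselves are not from the PR):

```sql
-- Explicit MATERIALIZED: serialized as a CTE node.
WITH t AS MATERIALIZED (SELECT 42 AS i)
SELECT i FROM t;

-- No modifier: only the CommonTableExpressionMap is serialized,
-- and the CTE node is reconstructed from it on deserialization.
WITH t AS (SELECT 42 AS i)
SELECT i FROM t;
```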
This is a follow-up to duckdb#19368. In the current configuration of the file path it will fail in CI - this PR fixes it. We also might need to bump httpfs again, since we'll keep getting the previous state of this test case from the currently bumped version, as [it happens now](https://github.com/duckdb/duckdb/actions/runs/18546600528/job/52866519080#step:6:5028).
Includes duckdb#19420

This PR adds forwards compatibility testing to the CI. Example usage of the script:

```shell
python3 scripts/test_storage_compatibility.py --versions "1.2.1|1.3.2" --new-unittest build/release/test/unittest
```

The script works by running the unittester (compiled with the latest version) with the config `test/configs/storage_compatibility.json` for each test with a `CREATE TABLE` or `CREATE VIEW` statement. The provided config then runs the test in `--force-storage` mode, but with a fixed database path (`bwc_storage_test.db`). This gives us a database file that has a number of tables / views in it as created in the test. For each version specified, we then try to read all tables and views stored within that file. If any of the queries fail, we report an error **if** the original CLI can successfully query the table / view. The reason we also do this is that there might be invalid views in the file, e.g. views that refer to files that have since been deleted. In order to get the older versions, our install script is used (i.e. we run `curl https://install.duckdb.org | DUCKDB_VERSION=1.3.2 sh` for each included version).
…non-default schemas. (duckdb#19363)

### Description

Query `information_schema.tables` to get each table's schema instead of parsing SQL. Emit `CREATE SCHEMA` statements before tables so dumps are replayable.

- Added `getTableSchema()` to look up a table's schema from the catalog
- Use qualified names for `TableColumnList()` and SELECT queries
- Emit `CREATE SCHEMA IF NOT EXISTS` for non-main schemas
- Fixed error code check (`SQLITE_DONE` is success, not error)

### Testing

**Verified manually:**

```sql
CREATE SCHEMA other;
CREATE TABLE other.t(a INT);
.dump
-- Now outputs: CREATE SCHEMA IF NOT EXISTS other; ... COMMIT;
```

Added unit tests in `test_shell_basics.py`:
- Empty and populated tables in non-default schemas
- Multiple schemas simultaneously
- Quoted identifiers with special chars
- `CREATE TABLE IF NOT EXISTS` variant

Fixes: duckdb#19264
I've extended the `README` a bit, and also added schemata for the json files. In follow-up PRs, I think we should add the optional parameters to the files, to then nicely generate the headers including "use instead" and "destroy with" documentation. Reworked my initial take on the schemata a bit after I found similar work in @Maxxen's PR here: duckdb#19186
…x_nulls_last (duckdb#19434) I also realized that the slow_test was not able to catch this, even though it should have. fixes: duckdblabs/duckdb-internal/issues/6270
Based on duckdb#19229 and duckdb#19302 (these should be merged first and then I'll merge this with main and undraft). This PR introduces the `SET` and `RESET` statements, which are transformed in `extension/autocomplete/transformer/transform_set.cpp`
Follow up to: duckdb#17992 Replaces all `const string` with `DuckDB::String` in the operators.
…ct Hash Join (duckdb#19332) Started from duckdb#19274 This PR makes the cached hashes of string dictionaries thread-safe by adding a `mutex`, so that we can use `Vector` in parallel. We don't do this anywhere in the code base, so to also test this functionality, I've changed the perfect hash join to emit dictionary vectors that have a size and identifier (just like the string dictionaries that come from our storage). Emitting these allows further dictionary-based optimizations during execution, such as our dictionary aggregation and dictionary functions. This seemed to give minor speedups in the regression test on my fork, but it was quite small so it might be just noise.
duckdb#19436) The following are now enabled by default:

- `MetricsType::WAITING_TO_ATTACH_LATENCY`
- `MetricsType::ATTACH_LOAD_STORAGE_LATENCY`
- `MetricsType::ATTACH_REPLAY_WAL_LATENCY`
- `MetricsType::CHECKPOINT_LATENCY`

Follow up to:
- duckdb#19367
…ermediate_conversion
…ake inequality operations work
DO NOT MERGE
This PR is just to be able to check the diff between these two branches correctly