-
Notifications
You must be signed in to change notification settings - Fork 122
Add MERGE incremental strategy with DuckDB extensions support #605
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Add MERGE incremental strategy with DuckDB extensions support #605
Conversation
…d merge to the valid incremental strategies for dbt
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
happy to just keep running these tests as you push new changes in; the linting stuff is done via pre-commit run --all-files
IIRC
dbt/adapters/duckdb/impl.py
Outdated
_, cursor = self.connections.add_select_query("SELECT version()") | ||
version_string = cursor.fetchone()[0] | ||
|
||
logger.info(f"DuckDB version: '{version_string}'") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI that I'm going to want these logger lines to go or be downgraded to debug prior to merging
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jwills Thanks for the hints. I removed some of the logger lines that were meant for debugging, ran pre-commit
and tox -p auto
, and pushed again.
I checked the logs of some of the tests — they are failing because tox is not using DuckDB 1.4.0, right? I ran the tests locally after updating tox to use the DuckDB preview version 1.4.0.dev2079 with .tox/functional/bin/pip install --pre duckdb --upgrade
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah duh of course-- I'll have to try them out locally
FYI: @jwills I just added a validation for the DuckLake MERGE restrictions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looking really good-- couple of things i'd like to fix and one more of a style thing
dbt/adapters/duckdb/impl.py
Outdated
"""DuckDB supports MERGE statements starting from version 1.4.0.""" | ||
base_strategies = ["append", "delete+insert"] | ||
|
||
if self._supports_merge_strategy(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not wild about doing this on every call to valid_incremental_strategies()
-- i'd rather the set of valid strategies be cached s.t. this is only done once per dbt run instead of every time we need to check for this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I’ve updated the code to use cached properties for the DuckDB version and incremental strategy values, and I’m now using the Version class from packaging to compare versions. This is a more Pythonic and elegant approach.
dbt/adapters/duckdb/impl.py
Outdated
if self._supports_merge_strategy(): | ||
return base_strategies + ["merge"] | ||
|
||
logger.info( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another logger line that I would like to be downgraded to debug and/or removed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed the logger line but added an exception in case the user tries to use merge with an older version. I had to override the get_incremental_strategy_macro
for this; otherwise, the user would get an exception stating that merge is not available for the adapter, which is no longer true after this PR.
@@ -0,0 +1,236 @@ | |||
{% macro validate_merge_update_options(update_all, update_by_name, update_by_position) %} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly a style/vibe comment here-- this is a lot of validation code entirely in jinja; is this how other adapters that have complex merge logic handle it? I'm inclined to keep it b/c I think moving stuff to jinja is generally a good thing as we look towards a dbt-fusion future, just something I wanted to call out in case there were better pattern for this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
None of the adapters' merge logic is as complex as this one. The ones for BigQuery and Snowflake just use the default get_merge_sql
from dbt-core with some additional code. And yes, usually they just use Jinja for this kind of stuff.
DuckDB introduced many additional extensions to the basic merge and i am trying to take advantage of these extensions, which is why the code is more complex in our case. Thus we also need more validations to ensure that people configure them correctly. In other adapters, I couldn't find much validation code in incremental strategies because they simply don't have a lot of configurations to validate.
That said, I still agree with you that some of the code could be removed to improve performance and readability. I could redesign it in the following way:
- More compact configurations — e.g.,
merge_update_by
expecting a value'position'
or'name'
instead of having two booleans. Still, I would need to validate if the user provides the correct options. - Removing some basic configuration validations by trusting that people won’t mess them up.
- Only one merge validation macro with validation rules and a loop, instead of having many separate macros.
What do you think @jwills?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good to me-- thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please check the README.md
first — I think it’s the easiest way to understand my approach. In short: it requires only minimal configuration for basic use cases (about 90% I’d guess) while still providing maximum flexibility for more complex scenarios by taking advantage of DuckDB’s MERGE
syntax extensions.
In terms of validations, I used a more modular approach this time, which reduced the overall code. Most of the validation logic isn’t even used for basic configurations, so performance shouldn’t be an issue. For complex use cases, the few extra milliseconds added to execution time shouldn’t bother anyone.
Also added a validation to check for DuckLake based on the restrictions from duckdb/ducklake#351
…constants for strategy selection
…ew three configuration options
@VahOkan ahh now maybe ducklake 0.3 does support MERGE? |
I think |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@VahOkan sorry for the insane lag here, just a hectic month at work. Tripped over some stuff with the merge logic that I think we need to fix up + have test coverage for before we can merge it in.
{%- set exclude_columns = when_matched.get('update', {}).get('exclude', []) -%} | ||
{%- set set_expressions = when_matched.get('update', {}).get('set_expressions', {}) -%} | ||
|
||
{%- set update_columns = get_merge_update_columns(include_columns, exclude_columns, dest_columns) -%} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wheres does this method get defined? Did I lose it somehow in my rebases or am I just missing it?
{%- elif when_not_matched.get('mode') == 'star' -%} | ||
INSERT * | ||
{%- elif when_not_matched.get('mode') == 'explicit' -%} | ||
{%- set insert_columns = when_matched.get('insert', {}).get('columns', []) -%} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these lines supposed to reference when_not_matched
? I don't totally follow the logic here
MERGE INTO {{ target }} AS DBT_INTERNAL_DEST | ||
USING {{ source }} AS DBT_INTERNAL_SOURCE | ||
{%- if merge_on_using_columns %} | ||
USING ({{ merge_on_using_columns | join(', ') }}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this supposed to be ON ({{ merge_on_using_columns | join(', ') }})
? I don't grok the USING ... USING ...
syntax
…port Merged changes from PR duckdb#605 (feature/merge-into-incremental-strategy) which adds comprehensive MERGE incremental strategy support with DuckDB extensions, while preserving microbatch incremental strategy support. Key changes: - Add comprehensive MERGE strategy with validation and custom clauses - Support for DuckDB MERGE extensions (BY NAME, BY POSITION, etc.) - Extensive configuration options for merge behavior - DuckLake restrictions validation - Preserve microbatch incremental strategy support - Reorganize macros into incremental_strategy/ directory - Add comprehensive merge strategy documentation to README 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
Implements MERGE incremental strategy for dbt-duckdb with support for DuckDB's advanced
MERGE extensions from PR #18135.
Features
UPDATE/INSERT BY NAME
,UPDATE/INSERT BY POSITION
,UPDATE SET *
WHEN NOT MATCHED BY SOURCE
DO NOTHING
andERROR
actionsUSING
clause with explicit columnsConfigurations Added
unique_key
incremental_predicates
merge_on_using_columns
merge_update_condition
merge_insert_condition
merge_update_columns
merge_exclude_columns
merge_update_set_expressions
merge_returning_columns
merge_clauses
with support forwhen_matched
andwhen_not_matched
actions/modesupdate
,insert
,delete
,do_nothing
,error
Notes
UPDATE
orDELETE
action is allowed inwhen_matched
clauses due to current limitations.