Skip to content

Conversation

VahOkan
Copy link

@VahOkan VahOkan commented Aug 5, 2025

Implements MERGE incremental strategy for dbt-duckdb with support for DuckDB's advanced
MERGE extensions from PR #18135.

Features

  • Standard MERGE functionality
  • DuckDB extensions:
    • UPDATE/INSERT BY NAME, UPDATE/INSERT BY POSITION, UPDATE SET *
    • WHEN NOT MATCHED BY SOURCE
    • DO NOTHING and ERROR actions
    • USING clause with explicit columns
  • Custom merge clauses with multiple actions, conditions, and error handling
  • Version checking (requires DuckDB >= 1.4.0)
  • Comprehensive test suite covering basic, enhanced, and custom configurations

Configurations Added

  • Basic config:
    • unique_key
  • Enhanced config:
    • incremental_predicates
    • merge_on_using_columns
    • merge_update_condition
    • merge_insert_condition
    • merge_update_columns
    • merge_exclude_columns
    • merge_update_set_expressions
    • merge_returning_columns
  • Custom config:
    • merge_clauses with support for when_matched and when_not_matched actions/modes
  • Supported actions: update, insert, delete, do_nothing, error

Notes

  • MERGE requires DuckDB >= 1.4.0
  • When using DuckLake, only a single UPDATE or DELETE action is allowed in when_matched clauses due to current limitations.

Copy link
Collaborator

@jwills jwills left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

happy to just keep running these tests as you push new changes in; the linting stuff is done via pre-commit run --all-files IIRC

_, cursor = self.connections.add_select_query("SELECT version()")
version_string = cursor.fetchone()[0]

logger.info(f"DuckDB version: '{version_string}'")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI that I'm going to want these logger lines to go or be downgraded to debug prior to merging

Copy link
Author

@VahOkan VahOkan Aug 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jwills Thanks for the hints. I removed some of the logger lines that were meant for debugging, ran pre-commit and tox -p auto, and pushed again.

I checked the logs of some of the tests — they are failing because tox is not using DuckDB 1.4.0, right? I ran the tests locally after updating tox to use the DuckDB preview version 1.4.0.dev2079 with .tox/functional/bin/pip install --pre duckdb --upgrade

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah duh of course-- I'll have to try them out locally

@VahOkan
Copy link
Author

VahOkan commented Aug 8, 2025

FYI: @jwills I just added a validation for the DuckLake MERGE restrictions.

Copy link
Collaborator

@jwills jwills left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looking really good-- couple of things i'd like to fix and one more of a style thing

"""DuckDB supports MERGE statements starting from version 1.4.0."""
base_strategies = ["append", "delete+insert"]

if self._supports_merge_strategy():
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not wild about doing this on every call to valid_incremental_strategies()-- i'd rather the set of valid strategies be cached s.t. this is only done once per dbt run instead of every time we need to check for this

Copy link
Author

@VahOkan VahOkan Aug 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’ve updated the code to use cached properties for the DuckDB version and incremental strategy values, and I’m now using the Version class from packaging to compare versions. This is a more Pythonic and elegant approach.

if self._supports_merge_strategy():
return base_strategies + ["merge"]

logger.info(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another logger line that I would like to be downgraded to debug and/or removed

Copy link
Author

@VahOkan VahOkan Aug 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the logger line but added an exception in case the user tries to use merge with an older version. I had to override the get_incremental_strategy_macro for this; otherwise, the user would get an exception stating that merge is not available for the adapter, which is no longer true after this PR.

@@ -0,0 +1,236 @@
{% macro validate_merge_update_options(update_all, update_by_name, update_by_position) %}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly a style/vibe comment here-- this is a lot of validation code entirely in jinja; is this how other adapters that have complex merge logic handle it? I'm inclined to keep it b/c I think moving stuff to jinja is generally a good thing as we look towards a dbt-fusion future, just something I wanted to call out in case there were better pattern for this

Copy link
Author

@VahOkan VahOkan Aug 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

None of the adapters' merge logic is as complex as this one. The ones for BigQuery and Snowflake just use the default get_merge_sql from dbt-core with some additional code. And yes, usually they just use Jinja for this kind of stuff.

DuckDB introduced many additional extensions to the basic merge and i am trying to take advantage of these extensions, which is why the code is more complex in our case. Thus we also need more validations to ensure that people configure them correctly. In other adapters, I couldn't find much validation code in incremental strategies because they simply don't have a lot of configurations to validate.

That said, I still agree with you that some of the code could be removed to improve performance and readability. I could redesign it in the following way:

  • More compact configurations — e.g., merge_update_by expecting a value 'position' or 'name' instead of having two booleans. Still, I would need to validate if the user provides the correct options.
  • Removing some basic configuration validations by trusting that people won’t mess them up.
  • Only one merge validation macro with validation rules and a loop, instead of having many separate macros.

What do you think @jwills?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good to me-- thanks!

Copy link
Author

@VahOkan VahOkan Aug 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please check the README.md first — I think it’s the easiest way to understand my approach. In short: it requires only minimal configuration for basic use cases (about 90% I’d guess) while still providing maximum flexibility for more complex scenarios by taking advantage of DuckDB’s MERGE syntax extensions.

In terms of validations, I used a more modular approach this time, which reduced the overall code. Most of the validation logic isn’t even used for basic configurations, so performance shouldn’t be an issue. For complex use cases, the few extra milliseconds added to execution time shouldn’t bother anyone.

Also added a validation to check for DuckLake based on the restrictions from duckdb/ducklake#351

@jwills
Copy link
Collaborator

jwills commented Sep 17, 2025

@VahOkan ahh now maybe ducklake 0.3 does support MERGE?

@VahOkan
Copy link
Author

VahOkan commented Sep 18, 2025

@VahOkan ahh now maybe ducklake 0.3 does support MERGE?

I think MERGE is supported in Ducklake, but with some restrictions. I’ve tried to account for those restrictions in my PR, but it should be tested to be sure.

Copy link
Collaborator

@jwills jwills left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@VahOkan sorry for the insane lag here, just a hectic month at work. Tripped over some stuff with the merge logic that I think we need to fix up + have test coverage for before we can merge it in.

{%- set exclude_columns = when_matched.get('update', {}).get('exclude', []) -%}
{%- set set_expressions = when_matched.get('update', {}).get('set_expressions', {}) -%}

{%- set update_columns = get_merge_update_columns(include_columns, exclude_columns, dest_columns) -%}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wheres does this method get defined? Did I lose it somehow in my rebases or am I just missing it?

{%- elif when_not_matched.get('mode') == 'star' -%}
INSERT *
{%- elif when_not_matched.get('mode') == 'explicit' -%}
{%- set insert_columns = when_matched.get('insert', {}).get('columns', []) -%}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these lines supposed to reference when_not_matched? I don't totally follow the logic here

MERGE INTO {{ target }} AS DBT_INTERNAL_DEST
USING {{ source }} AS DBT_INTERNAL_SOURCE
{%- if merge_on_using_columns %}
USING ({{ merge_on_using_columns | join(', ') }})
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this supposed to be ON ({{ merge_on_using_columns | join(', ') }})? I don't grok the USING ... USING ... syntax

gfrmin pushed a commit to gfrmin/dbt-duckdb that referenced this pull request Oct 8, 2025
…port

Merged changes from PR duckdb#605 (feature/merge-into-incremental-strategy)
which adds comprehensive MERGE incremental strategy support with DuckDB
extensions, while preserving microbatch incremental strategy support.

Key changes:
- Add comprehensive MERGE strategy with validation and custom clauses
- Support for DuckDB MERGE extensions (BY NAME, BY POSITION, etc.)
- Extensive configuration options for merge behavior
- DuckLake restrictions validation
- Preserve microbatch incremental strategy support
- Reorganize macros into incremental_strategy/ directory
- Add comprehensive merge strategy documentation to README

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants