Changes from all commits (46 commits)
e99a625
feat(datasets): `ibis.TableDataset` add configurable save mode for ta…
gitgud5000 May 31, 2025
1f609c5
chore: reformat code with Ruff for consistency
gitgud5000 May 31, 2025
523c7aa
feat(datasets): `ibis.TableDataset` add credentials param and semi-de…
gitgud5000 May 31, 2025
8b0b2a6
- refactor: streamlines mode dispatch logic in save operations by rep…
gitgud5000 May 31, 2025
6445239
fix: added missing 'errorifexists' to conditions in save modes
gitgud5000 May 31, 2025
256795d
Merge branch 'main' into ibis-dataset-savemode-support
ankatiyar Jun 3, 2025
6130581
fix(datasets): handle empty DataFrames and improve append mode
gitgud5000 Aug 15, 2025
c19d1b3
fix(ibis): handle mode overwrite and empty save edge cases
gitgud5000 Aug 18, 2025
55b8666
test(tabledataset): add tests for save modes: append, error, ignore, …
gitgud5000 Aug 18, 2025
5a84f44
docs(RELEASE.md): update major features and improvements section with…
gitgud5000 Aug 18, 2025
7b04389
Merge remote-tracking branch 'origin/main' into ibis-dataset-savemode…
gitgud5000 Aug 18, 2025
6cfb8fe
fix lint
gitgud5000 Aug 18, 2025
2f50c6d
docs(table_dataset): fix indentation in docstring
gitgud5000 Aug 18, 2025
4c6ddb9
Fix lint
ankatiyar Aug 18, 2025
f6ce09b
lint fix
gitgud5000 Aug 18, 2025
8b05334
Signed-off-by: gitgud5000 <[email protected]>
gitgud5000 Aug 18, 2025
ff1d0bb
fix comment
gitgud5000 Aug 18, 2025
ff64396
fix: address some of @deepyaman comments
gitgud5000 Aug 18, 2025
1b971cc
fix: replace NotImplementedError with DatasetError for unsupported in…
gitgud5000 Sep 9, 2025
bb253e9
fix(ibis): Validate save mode during initialization and restrict to …
gitgud5000 Sep 9, 2025
4ba11b2
fix(table_dataset): Improve error message for invalid save mode
gitgud5000 Sep 9, 2025
b29d7a0
ibis: improve connection/credentials handling and defaults
gitgud5000 Sep 10, 2025
c2655dd
ibis: accept ibis.Table for save and treat empty tables as no-op
gitgud5000 Sep 10, 2025
9b96b88
tests(ibis): update tests for new credentials handling and ibis.Table…
gitgud5000 Sep 10, 2025
1c92e36
tests(ibis): remove conftest.py for obsolete backend stubbing
gitgud5000 Sep 10, 2025
63d9bd7
format fix
gitgud5000 Sep 10, 2025
51fc565
test(ibis): re-adds a test fixture that stubs optional ibis backends …
gitgud5000 Sep 10, 2025
0af5b82
fix(tests): replace realistic DB credentials in ibis table tests to a…
gitgud5000 Sep 10, 2025
940d901
fix(tests): more of... replace realistic DB credentials in ibis tabl…
gitgud5000 Sep 10, 2025
3c836e3
Fixes credential reference when constructing connection
gitgud5000 Sep 10, 2025
398d292
fix: remove unused pandas import from table_dataset.py
gitgud5000 Sep 19, 2025
50acb2d
chore(datasets): move notes under upcoming release
deepyaman Sep 23, 2025
e4636b7
revert(datasets): remove changes unrelated to mode
deepyaman Sep 23, 2025
d922af1
feat(datasets): add back changes unrelated to mode
deepyaman Sep 23, 2025
6a23b81
revert(datasets): remove changes unrelated to mode
deepyaman Sep 23, 2025
513aab0
refactor(datasets): define save mode using an enum
deepyaman Sep 23, 2025
2629069
chore(datasets): remove notes unrelated to this PR
deepyaman Sep 24, 2025
c4534e5
chore(datasets): reword release note on save modes
deepyaman Sep 24, 2025
48c5e85
refactor(datasets): simplify branching in `save()`
deepyaman Sep 24, 2025
c75643d
revert(datasets): remove extra blank added to file
deepyaman Sep 24, 2025
b4ba79a
test(datasets): clean up any tables that we create
deepyaman Sep 24, 2025
4299844
test(datasets): condense multi-parametrizes to one
deepyaman Sep 24, 2025
5f9fc08
test(datasets): check raises a narrower error type
deepyaman Sep 24, 2025
4946541
test(datasets): test legacy overwrite with fixture
deepyaman Sep 24, 2025
55ba8a0
chore(datasets): use `<=` instead of `.issubset()`
deepyaman Sep 24, 2025
586c24c
Merge branch 'main' into ibis-dataset-savemode-support
ankatiyar Oct 8, 2025
8 changes: 7 additions & 1 deletion kedro-datasets/RELEASE.md
@@ -1,4 +1,5 @@
# Upcoming Release

## Major features and improvements

- Group datasets documentation according to the dependencies to clean up the nav bar.
@@ -15,14 +16,20 @@
| ------------------------------ | ------------------------------------------------------------- | ------------------------------------ |
| `polars.PolarsDatabaseDataset` | A dataset to load and save data to a SQL backend using Polars | `kedro_datasets_experimental.polars` |

- Added `mode` save argument to `ibis.TableDataset`, supporting "append", "overwrite", "error"/"errorifexists", and "ignore" save modes. The `overwrite` save argument is mapped to `mode` for backward compatibility; specifying both results in an error.
@ankatiyar (Contributor), Oct 8, 2025:
I'd add that overwrite argument is deprecated here and remove the entry from "breaking changes" entirely


## Bug fixes and other changes

- Added primary key constraint to BaseTable.
- Added save/load with `use_pyarrow=True` save_args for LazyPolarsDataset partitioned parquet files.
- Updated the json schema for Kedro 1.0.0.

## Breaking Changes

- `ibis.TableDataset`: Deprecated `save_args.overwrite` and the `connection` parameter in favor of `save_args.mode` and `credentials`. Using both `overwrite` and `mode` together raises an error; providing both `credentials` and `connection` emits a deprecation warning. The deprecated options will be removed in a future release.
Comment on lines +28 to +29
Member (suggested change: delete the "Breaking Changes" entry above):
I don't think it's a breaking change even if we raise a DeprecationWarning for using 'overwrite' (separate issue about actually raising the DeprecationWarning; see below).

(Also, the credentials and connection bit shouldn't have been left in here when I split out the PR; my bad.)


## Community contributions

- [Minura Punchihewa](https://github.com/MinuraPunchihewa)
- [gitgud5000](https://github.com/gitgud5000)

@@ -56,7 +63,6 @@ Many thanks to the following Kedroids for contributing PRs to this release:
- [Seohyun Park](https://github.com/soyamimi)
- [Daniel Russell-Brain](https://github.com/killerfridge)


# Release 7.0.0

## Major features and improvements
67 changes: 62 additions & 5 deletions kedro-datasets/kedro_datasets/ibis/table_dataset.py
@@ -1,18 +1,35 @@
 """Provide data loading and saving functionality for Ibis's backends."""
 from __future__ import annotations
 
+import sys
 from copy import deepcopy
+from enum import auto
 from typing import TYPE_CHECKING, Any, ClassVar
+
+if sys.version_info >= (3, 11):
+    from enum import StrEnum  # pragma: no cover
+else:
+    from backports.strenum import StrEnum  # pragma: no cover
 
 import ibis.expr.types as ir
-from kedro.io import AbstractDataset
+from kedro.io import AbstractDataset, DatasetError
 
 from kedro_datasets._utils import ConnectionMixin
 
 if TYPE_CHECKING:
     from ibis import BaseBackend


class SaveMode(StrEnum):
Member:
@gitgud5000 I took the liberty of refactoring save mode logic into an enum.

    """`SaveMode` is used to specify the expected behavior of saving a table."""

    APPEND = auto()
    OVERWRITE = auto()
    ERROR = auto()
    ERRORIFEXISTS = auto()
Member:
Do we need both "error" and "errorifexists"? "errorifexists" is a bit more explicit, but also a bit more verbose.

I decided to check if one or the other is preferred by Spark (maybe we could just use that), but it turns out it's not so simple. https://issues.apache.org/jira/browse/SPARK-21640 added "errorifexists" to the Python (and other API) options—"error" used to be the only option—but https://github.com/apache/spark/blob/v4.0.0/sql/api/src/main/java/org/apache/spark/sql/SaveMode.java only defines "errorifexists". Since we don't have the baggage, I'd be fine with just keeping "error" (feels simpler), but I also don't have any strong opinion against supporting both. Happy to have another maintainer (or @gitgud5000) weigh in.

    IGNORE = auto()
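As background on the `StrEnum` usage above (`auto()` lowercases the member name, and members compare equal to plain strings — which is also why the "error" and "errorifexists" spellings debated in the review can coexist cheaply), here is a rough stdlib-only approximation. This is an illustrative sketch (the class name `SaveModeSketch` is invented), not the code under review:

```python
from enum import Enum


class SaveModeSketch(str, Enum):
    """Approximates StrEnum: members are also plain strings."""

    APPEND = "append"
    OVERWRITE = "overwrite"
    ERROR = "error"
    ERRORIFEXISTS = "errorifexists"
    IGNORE = "ignore"


# Construction from user-supplied strings and str comparison both work:
assert SaveModeSketch("errorifexists") is SaveModeSketch.ERRORIFEXISTS
assert SaveModeSketch.OVERWRITE == "overwrite"
assert isinstance(SaveModeSketch.APPEND, str)
```

This is why `self._mode` can later be compared directly against string literals such as `"append"`.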


class TableDataset(ConnectionMixin, AbstractDataset[ir.Table, ir.Table]):
    """`TableDataset` loads/saves data from/to Ibis table expressions.

@@ -28,14 +45,18 @@ class TableDataset(ConnectionMixin, AbstractDataset[ir.Table, ir.Table]):
         database: company.db
       save_args:
         materialized: table
+        mode: append
 
 motorbikes:
   type: ibis.TableDataset
   table_name: motorbikes
   connection:
     backend: duckdb
     database: company.db
-```
+  save_args:
+    materialized: view
+    mode: overwrite
+```

Using the [Python API](https://docs.kedro.org/en/stable/catalog-data/advanced_data_catalog_usage/):

@@ -62,7 +83,7 @@ class TableDataset(ConnectionMixin, AbstractDataset[ir.Table, ir.Table]):
     DEFAULT_LOAD_ARGS: ClassVar[dict[str, Any]] = {}
     DEFAULT_SAVE_ARGS: ClassVar[dict[str, Any]] = {
         "materialized": "view",
-        "overwrite": True,
+        "mode": "overwrite",
     }

    _CONNECTION_GROUP: ClassVar[str] = "ibis"
@@ -109,7 +130,12 @@ def __init__(  # noqa: PLR0913
                 `create_{materialized}` method. By default, ``ir.Table``
                 objects are materialized as views. To save a table using
                 a different materialization strategy, supply a value for
-                `materialized` in `save_args`.
+                `materialized` in `save_args`. The `mode` parameter controls
+                the behavior when saving data:
+                - _"overwrite"_: Overwrite existing data in the table.
+                - _"append"_: Append contents of the new data to the existing table (does not overwrite).
+                - _"error"_ or _"errorifexists"_: Throw an exception if the table already exists.
+                - _"ignore"_: Silently ignore the operation if the table already exists.
             metadata: Any arbitrary metadata. This is ignored by Kedro,
                 but may be consumed by users or external plugins.
             """
@@ -134,6 +160,22 @@
 
         self._materialized = self._save_args.pop("materialized")
 
+        # Handle mode/overwrite conflict.
+        if save_args and "mode" in save_args and "overwrite" in self._save_args:
+            raise ValueError("Cannot specify both 'mode' and deprecated 'overwrite'.")
Member:
There's actually no DeprecationWarning, but I'm also not 100% sure there needs to be one. However, if there isn't a DeprecationWarning, it doesn't make sense to call it "deprecated", and it shouldn't be on the list of breaking changes in the release notes.

Contributor:
If we're doing a deprecation and then removal then we could emit a DeprecationWarning if overwrite is used whether or not it's used in conjunction with the mode and keep this ValueError as it is. Either way deprecation is not a breaking change so I would suggest not adding it to the "breaking change" section in the docs
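For reference, a minimal sketch of the `DeprecationWarning` being discussed; this helper is hypothetical — the PR as reviewed does not emit it:

```python
import warnings


def warn_legacy_overwrite() -> None:
    """Hypothetical: warn when save_args['overwrite'] is used instead of 'mode'."""
    warnings.warn(
        "'overwrite' is deprecated; use save_args['mode'] instead.",
        DeprecationWarning,
        stacklevel=2,  # point the warning at the caller's catalog/config code
    )


# Demonstrate that the warning is emitted and catchable:
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warn_legacy_overwrite()
assert caught[0].category is DeprecationWarning
```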


+        # Map legacy overwrite if present.
+        if "overwrite" in self._save_args:
+            legacy = self._save_args.pop("overwrite")
+            # Remove any lingering 'mode' key from defaults to avoid
+            # leaking into writer kwargs.
+            del self._save_args["mode"]
+            mode = "overwrite" if legacy else "error"
+        else:
+            mode = self._save_args.pop("mode")
+
+        self._mode = SaveMode(mode)

    def _connect(self) -> BaseBackend:
        import ibis  # noqa: PLC0415

@@ -151,7 +193,21 @@ def load(self) -> ir.Table:
 
     def save(self, data: ir.Table) -> None:
         writer = getattr(self.connection, f"create_{self._materialized}")
-        writer(self._table_name, data, **self._save_args)
+        if self._mode == "append":
+            if not self._exists():
+                writer(self._table_name, data, overwrite=False, **self._save_args)
+            elif hasattr(self.connection, "insert"):
+                self.connection.insert(self._table_name, data, **self._save_args)
+            else:
+                raise DatasetError(
+                    f"The {self.connection.name} backend for Ibis does not support inserts."
+                )
+        elif self._mode == "overwrite":
+            writer(self._table_name, data, overwrite=True, **self._save_args)
+        elif self._mode in {"error", "errorifexists"}:
+            writer(self._table_name, data, overwrite=False, **self._save_args)
+        elif self._mode == "ignore" and not self._exists():
+            writer(self._table_name, data, overwrite=False, **self._save_args)
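The `save()` dispatch above can be exercised end-to-end with a toy stand-in backend. The `FakeBackend` class and `save` function below are invented for this sketch — a real Ibis backend exposes `create_table`/`create_view` and, on some backends, `insert` — but the branching mirrors the dispatch in the diff:

```python
class FakeBackend:
    """Dict-backed stand-in for an Ibis backend (invented for this sketch)."""

    def __init__(self):
        self.tables = {}

    def create_table(self, name, rows, overwrite=False):
        if name in self.tables and not overwrite:
            raise RuntimeError(f'Table with name "{name}" already exists')
        self.tables[name] = list(rows)

    def insert(self, name, rows):
        self.tables[name].extend(rows)


def save(backend, name, rows, mode):
    """Mirror of the mode dispatch in TableDataset.save(), simplified."""
    exists = name in backend.tables
    if mode == "append":
        if not exists:
            backend.create_table(name, rows, overwrite=False)
        elif hasattr(backend, "insert"):
            backend.insert(name, rows)
        else:
            raise RuntimeError("backend does not support inserts")
    elif mode == "overwrite":
        backend.create_table(name, rows, overwrite=True)
    elif mode in {"error", "errorifexists"}:
        backend.create_table(name, rows, overwrite=False)
    elif mode == "ignore" and not exists:
        backend.create_table(name, rows, overwrite=False)


be = FakeBackend()
save(be, "t", [1, 2], "append")  # table absent: created
save(be, "t", [3], "append")     # table present: rows inserted
assert be.tables["t"] == [1, 2, 3]
save(be, "t", [9], "overwrite")  # replaces contents
assert be.tables["t"] == [9]
save(be, "t", [0], "ignore")     # no-op: table already exists
assert be.tables["t"] == [9]
```

A backend without an `insert` method (e.g. the polars backend, per the tests below) would raise on the second append.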

    def _describe(self) -> dict[str, Any]:
        load_args = deepcopy(self._load_args)
@@ -165,6 +221,7 @@ def _describe(self) -> dict[str, Any]:
             "load_args": load_args,
             "save_args": save_args,
             "materialized": self._materialized,
+            "mode": self._mode,
         }

    def _exists(self) -> bool:
1 change: 1 addition & 0 deletions kedro-datasets/pyproject.toml
@@ -11,6 +11,7 @@ description = "Kedro-Datasets is where you can find all of Kedro's data connecto
requires-python = ">=3.10"
license = {text = "Apache Software License (Apache 2.0)"}
dependencies = [
    "backports.strenum; python_version < '3.11'",
Member:
It's a backport of a standard lib module; I don't personally think it's a big deal to add this dependency for everything, and just for Python 3.10.

    "kedro>=1.0.0rc1, <2.0.0",
    "lazy_loader",
]
134 changes: 130 additions & 4 deletions kedro-datasets/tests/ibis/test_table_dataset.py
@@ -1,6 +1,8 @@
import duckdb
import ibis
import pandas as pd
import pytest
from kedro.io import DatasetError
from packaging.version import Version
from pandas.testing import assert_frame_equal

@@ -37,13 +39,15 @@ def connection_config(request, database):
 
 @pytest.fixture
 def table_dataset(database_name, connection_config, load_args, save_args):
-    return TableDataset(
+    ds = TableDataset(
         table_name="test",
         database=database_name,
         connection=connection_config,
         load_args=load_args,
         save_args=save_args,
     )
+    yield ds
+    getattr(ds._connection, f"drop_{ds._materialized}")("test", force=True)


@pytest.fixture
@@ -77,9 +81,10 @@ def test_save_and_load(self, table_dataset, dummy_table, database):
         assert "test" in con.sql("SELECT * FROM duckdb_views").fetchnumpy()["view_name"]
 
     @pytest.mark.parametrize(
-        "connection_config", [{"backend": "polars"}], indirect=True
+        ("connection_config", "save_args"),
+        [({"backend": "polars"}, {"materialized": "table"})],
+        indirect=True,
     )
-    @pytest.mark.parametrize("save_args", [{"materialized": "table"}], indirect=True)
def test_save_and_load_polars(
self, table_dataset, connection_config, save_args, dummy_table
):
@@ -102,14 +107,135 @@ def test_exists(self, table_dataset, dummy_table):
        table_dataset.save(dummy_table)
        assert table_dataset.exists()

    @pytest.mark.parametrize(
        "save_args", [{"materialized": "table", "mode": "append"}], indirect=True
    )
    def test_save_mode_append(self, table_dataset, dummy_table):
        """Saving with mode=append should add rows to an existing table."""
        df1 = dummy_table
        df2 = dummy_table

        table_dataset.save(df1)
        table_dataset.save(df2)

        df1 = df1.execute()
        df2 = df2.execute()
        reloaded = table_dataset.load().execute()
        assert len(reloaded) == len(df1) + len(df2)

    @pytest.mark.parametrize(
        "save_args",
        [
            {"materialized": "table", "mode": "error"},
            {"materialized": "table", "mode": "errorifexists"},
        ],
        indirect=True,
    )
    def test_save_mode_error_variants(self, table_dataset, dummy_table):
        """Saving with error/errorifexists should raise when table exists."""
        table_dataset.save(dummy_table)
        with pytest.raises(DatasetError, match='Table with name "test" already exists'):
            table_dataset.save(dummy_table)

    @pytest.mark.parametrize(
        "save_args", [{"materialized": "table", "mode": "ignore"}], indirect=True
    )
    def test_save_mode_ignore(self, table_dataset, dummy_table):
        """Saving with ignore should not change existing table."""
        df1 = dummy_table
        df2 = dummy_table

        table_dataset.save(df1)
        table_dataset.save(df2)
        df1 = df1.execute()

        reloaded = table_dataset.load().execute()
        # Should remain as first save only
        assert_frame_equal(reloaded.reset_index(drop=True), df1.reset_index(drop=True))

    def test_unsupported_save_mode_raises(self, database_name, connection_config):
        """Providing an unsupported save mode should raise a ValueError."""
        with pytest.raises(
            ValueError, match="'unsupported_mode' is not a valid SaveMode"
        ):
            TableDataset(
                table_name="unsupported_mode",
                database=database_name,
                connection=connection_config,
                save_args={"materialized": "table", "mode": "unsupported_mode"},
            )

    def test_legacy_overwrite_conflict_raises(self, database_name, connection_config):
        """Providing both mode and overwrite should raise a ValueError."""
        with pytest.raises(ValueError):
            TableDataset(
                table_name="conflict",
                database=database_name,
                connection=connection_config,
                save_args={
                    "materialized": "table",
                    "mode": "append",
                    "overwrite": True,
                },
            )

    @pytest.mark.parametrize(
        ("connection_config", "save_args"),
        [({"backend": "polars"}, {"materialized": "table", "mode": "append"})],
        indirect=True,
    )
    def test_append_mode_no_insert_raises(self, table_dataset, dummy_table):
        """Saving with mode=append on a backend without 'insert' (polars) should raise a DatasetError."""
        # Save once to create the table
        table_dataset.save(dummy_table)
        # Try to append again, should raise DatasetError
        with pytest.raises(DatasetError, match="does not support inserts"):
            table_dataset.save(dummy_table)

    @pytest.mark.parametrize(
        "save_args",
        [
            {"materialized": "table", "overwrite": True},
            {"materialized": "table", "overwrite": False},
        ],
        indirect=True,
    )
    def test_legacy_overwrite_behavior(self, table_dataset, save_args, dummy_table):
        """Legacy overwrite should map to overwrite or error behavior."""
        legacy_overwrite = save_args["overwrite"]
        df2 = ibis.memtable(pd.DataFrame({"col1": [7], "col2": [8], "col3": [9]}))

        table_dataset.save(dummy_table)  # First save should always work
        if legacy_overwrite:
            # Should overwrite existing table with new contents
            table_dataset.save(df2)
            df2 = df2.execute()
            out = table_dataset.load().execute().reset_index(drop=True)
            assert_frame_equal(out, df2.reset_index(drop=True))
        else:
            # Should raise on second save when table exists
            with pytest.raises(DatasetError):
                table_dataset.save(df2)

    def test_describe_includes_backend_mode_and_materialized(self, table_dataset):
        """_describe should expose backend, mode and materialized; nested args exclude database."""
        desc = table_dataset._describe()

        assert {"backend", "mode", "materialized"} <= desc.keys()
        assert "database" in desc
        # database key should not be duplicated inside nested args
        assert "database" not in desc["load_args"]
        assert "database" not in desc["save_args"]

    @pytest.mark.parametrize("load_args", [{"database": "test"}], indirect=True)
    def test_load_extra_params(self, table_dataset, load_args):
        """Test overriding the default load arguments."""
        for key, value in load_args.items():
            assert table_dataset._load_args[key] == value

     @pytest.mark.parametrize("save_args", [{"materialized": "table"}], indirect=True)
-    def test_save_extra_params(self, table_dataset, save_args, dummy_table, database):
+    def test_save_extra_params(self, table_dataset, dummy_table, database):
         """Test overriding the default save arguments."""
         table_dataset.save(dummy_table)
