-
Notifications
You must be signed in to change notification settings - Fork 107
feat(datasets): make table write mode configurable #1093
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
e99a625
1f609c5
523c7aa
8b0b2a6
6445239
256795d
6130581
c19d1b3
55b8666
5a84f44
7b04389
6cfb8fe
2f50c6d
4c6ddb9
f6ce09b
8b05334
ff1d0bb
ff64396
1b971cc
bb253e9
4ba11b2
b29d7a0
c2655dd
9b96b88
1c92e36
63d9bd7
51fc565
0af5b82
940d901
3c836e3
398d292
50acb2d
e4636b7
d922af1
6a23b81
513aab0
2629069
c4534e5
48c5e85
c75643d
b4ba79a
4299844
5f9fc08
4946541
55ba8a0
586c24c
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
@@ -1,4 +1,5 @@ | ||||||
# Upcoming Release | ||||||
|
||||||
## Major features and improvements | ||||||
|
||||||
- Group datasets documentation according to the dependencies to clean up the nav bar. | ||||||
|
@@ -15,14 +16,20 @@ | |||||
| ------------------------------ | ------------------------------------------------------------- | ------------------------------------ | | ||||||
| `polars.PolarsDatabaseDataset` | A dataset to load and save data to a SQL backend using Polars | `kedro_datasets_experimental.polars` | | ||||||
|
||||||
- Added `mode` save argument to `ibis.TableDataset`, supporting "append", "overwrite", "error"/"errorifexists", and "ignore" save modes. The `overwrite` save argument is mapped to `mode` for backward compatibility; specifying both results in an error. | ||||||
|
||||||
## Bug fixes and other changes | ||||||
|
||||||
- Added primary key constraint to BaseTable. | ||||||
- Added save/load with `use_pyarrow=True` save_args for LazyPolarsDataset partitioned parquet files. | ||||||
- Updated the json schema for Kedro 1.0.0. | ||||||
|
||||||
## Breaking Changes | ||||||
|
||||||
- `ibis.TableDataset`: Deprecated `save_args.overwrite` and the `connection` parameter in favor of `save_args.mode` and `credentials`. Using both `overwrite` and `mode` together raises an error; providing both `credentials` and `connection` emits a deprecation warning. The deprecated options will be removed in a future release. | ||||||
Comment on lines
+28
to
+29
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
I don't think it's a breaking change even if raise a (Also, the |
||||||
|
||||||
## Community contributions | ||||||
|
||||||
- [Minura Punchihewa](https://github.com/MinuraPunchihewa) | ||||||
- [gitgud5000](https://github.com/gitgud5000) | ||||||
|
||||||
|
@@ -56,7 +63,6 @@ Many thanks to the following Kedroids for contributing PRs to this release: | |||||
- [Seohyun Park](https://github.com/soyamimi) | ||||||
- [Daniel Russell-Brain](https://github.com/killerfridge) | ||||||
|
||||||
|
||||||
# Release 7.0.0 | ||||||
|
||||||
## Major features and improvements | ||||||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,18 +1,35 @@ | ||
"""Provide data loading and saving functionality for Ibis's backends.""" | ||
from __future__ import annotations | ||
|
||
import sys | ||
from copy import deepcopy | ||
from enum import auto | ||
from typing import TYPE_CHECKING, Any, ClassVar | ||
|
||
if sys.version_info >= (3, 11): | ||
from enum import StrEnum # pragma: no cover | ||
else: | ||
from backports.strenum import StrEnum # pragma: no cover | ||
|
||
import ibis.expr.types as ir | ||
from kedro.io import AbstractDataset | ||
from kedro.io import AbstractDataset, DatasetError | ||
|
||
from kedro_datasets._utils import ConnectionMixin | ||
|
||
if TYPE_CHECKING: | ||
from ibis import BaseBackend | ||
|
||
|
||
class SaveMode(StrEnum): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @gitgud5000 I took the liberty of refactoring save mode logic into an enum. |
||
"""`SaveMode` is used to specify the expected behavior of saving a table.""" | ||
|
||
APPEND = auto() | ||
OVERWRITE = auto() | ||
ERROR = auto() | ||
ERRORIFEXISTS = auto() | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Do we need both "error" and "errorifexists"? "errorifexists" is a bit more explicit, but also a bit more verbose. I decided to check if one or the other is preferred by Spark (maybe we could just use that), but it turns out it's not so simple. https://issues.apache.org/jira/browse/SPARK-21640 added "errorifexists" to the Python (and other API) options—"error" used to be the only option—but https://github.com/apache/spark/blob/v4.0.0/sql/api/src/main/java/org/apache/spark/sql/SaveMode.java only defines "errorifexists". Since we don't have the baggage, I'd be fine with just keeping "error" (feels simpler), but I also don't have any strong opinion against supporting both. Happy to have another maintainer (or @gitgud5000) weigh in. |
||
IGNORE = auto() | ||
|
||
|
||
class TableDataset(ConnectionMixin, AbstractDataset[ir.Table, ir.Table]): | ||
"""`TableDataset` loads/saves data from/to Ibis table expressions. | ||
|
||
|
@@ -28,14 +45,18 @@ class TableDataset(ConnectionMixin, AbstractDataset[ir.Table, ir.Table]): | |
database: company.db | ||
save_args: | ||
materialized: table | ||
mode: append | ||
|
||
motorbikes: | ||
type: ibis.TableDataset | ||
table_name: motorbikes | ||
connection: | ||
backend: duckdb | ||
database: company.db | ||
``` | ||
save_args: | ||
materialized: view | ||
mode: overwrite | ||
``` | ||
|
||
Using the [Python API](https://docs.kedro.org/en/stable/catalog-data/advanced_data_catalog_usage/): | ||
|
||
|
@@ -62,7 +83,7 @@ class TableDataset(ConnectionMixin, AbstractDataset[ir.Table, ir.Table]): | |
DEFAULT_LOAD_ARGS: ClassVar[dict[str, Any]] = {} | ||
DEFAULT_SAVE_ARGS: ClassVar[dict[str, Any]] = { | ||
"materialized": "view", | ||
"overwrite": True, | ||
"mode": "overwrite", | ||
} | ||
|
||
_CONNECTION_GROUP: ClassVar[str] = "ibis" | ||
|
@@ -109,7 +130,12 @@ def __init__( # noqa: PLR0913 | |
`create_{materialized}` method. By default, ``ir.Table`` | ||
objects are materialized as views. To save a table using | ||
a different materialization strategy, supply a value for | ||
`materialized` in `save_args`. | ||
`materialized` in `save_args`. The `mode` parameter controls | ||
the behavior when saving data: | ||
- _"overwrite"_: Overwrite existing data in the table. | ||
- _"append"_: Append contents of the new data to the existing table (does not overwrite). | ||
- _"error"_ or _"errorifexists"_: Throw an exception if the table already exists. | ||
- _"ignore"_: Silently ignore the operation if the table already exists. | ||
metadata: Any arbitrary metadata. This is ignored by Kedro, | ||
but may be consumed by users or external plugins. | ||
""" | ||
|
@@ -134,6 +160,22 @@ def __init__( # noqa: PLR0913 | |
|
||
self._materialized = self._save_args.pop("materialized") | ||
|
||
# Handle mode/overwrite conflict. | ||
if save_args and "mode" in save_args and "overwrite" in self._save_args: | ||
raise ValueError("Cannot specify both 'mode' and deprecated 'overwrite'.") | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. There's actually no There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If we're doing a deprecation and then removal then we could emit a |
||
|
||
# Map legacy overwrite if present. | ||
if "overwrite" in self._save_args: | ||
legacy = self._save_args.pop("overwrite") | ||
# Remove any lingering 'mode' key from defaults to avoid | ||
# leaking into writer kwargs. | ||
del self._save_args["mode"] | ||
mode = "overwrite" if legacy else "error" | ||
else: | ||
mode = self._save_args.pop("mode") | ||
|
||
self._mode = SaveMode(mode) | ||
|
||
def _connect(self) -> BaseBackend: | ||
import ibis # noqa: PLC0415 | ||
|
||
|
@@ -151,7 +193,21 @@ def load(self) -> ir.Table: | |
|
||
def save(self, data: ir.Table) -> None: | ||
writer = getattr(self.connection, f"create_{self._materialized}") | ||
writer(self._table_name, data, **self._save_args) | ||
if self._mode == "append": | ||
if not self._exists(): | ||
writer(self._table_name, data, overwrite=False, **self._save_args) | ||
elif hasattr(self.connection, "insert"): | ||
self.connection.insert(self._table_name, data, **self._save_args) | ||
else: | ||
raise DatasetError( | ||
f"The {self.connection.name} backend for Ibis does not support inserts." | ||
) | ||
elif self._mode == "overwrite": | ||
writer(self._table_name, data, overwrite=True, **self._save_args) | ||
elif self._mode in {"error", "errorifexists"}: | ||
writer(self._table_name, data, overwrite=False, **self._save_args) | ||
elif self._mode == "ignore" and not self._exists(): | ||
writer(self._table_name, data, overwrite=False, **self._save_args) | ||
|
||
def _describe(self) -> dict[str, Any]: | ||
load_args = deepcopy(self._load_args) | ||
|
@@ -165,6 +221,7 @@ def _describe(self) -> dict[str, Any]: | |
"load_args": load_args, | ||
"save_args": save_args, | ||
"materialized": self._materialized, | ||
"mode": self._mode, | ||
} | ||
|
||
def _exists(self) -> bool: | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -11,6 +11,7 @@ description = "Kedro-Datasets is where you can find all of Kedro's data connecto | |
requires-python = ">=3.10" | ||
license = {text = "Apache Software License (Apache 2.0)"} | ||
dependencies = [ | ||
"backports.strenum; python_version < '3.11'", | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It's a backport of a standard lib module; I don't personally think it's a big deal to add this dependency for everything, and just for Python 3.10. |
||
"kedro>=1.0.0rc1, <2.0.0", | ||
"lazy_loader", | ||
] | ||
|
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd add that
overwrite
argument is deprecated here and remove the entry from "breaking changes" entirely