-
Notifications
You must be signed in to change notification settings - Fork 106
feat(datasets): make table write mode configurable #1093
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
feat(datasets): make table write mode configurable #1093
Conversation
…ble materialization - feat: Introduces a `mode` parameter for save operations, allowing "append", "overwrite", "error", and "ignore" options to control write behavior. - Supports legacy `overwrite` flag with backward compatibility and deprecation warning. - Adds mode dispatching logic to handle different write scenarios. - Updates examples and docs to reflect new parameter. - Prevents simultaneous use of both `mode` and legacy `overwrite` to avoid ambiguity. - docs: Improves documentation and usage examples for new save modes. Semi-breaking change: Replaces `overwrite` save argument with `mode`; users should update configurations to use `mode`. Signed-off-by: gitgud5000 <[email protected]>
2e9e539
to
a41bb19
Compare
Signed-off-by: gitgud5000 <[email protected]>
a41bb19
to
1f609c5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR!
Did a quick-pass review. I see you're still actively updating, so I'll revisit this later.
|
||
self._materialized = self._save_args.pop("materialized") | ||
|
||
# Handle mode / overwrite conflict |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wonder if insert + view needs to be disallowed; I haven't actually checked yet.
I've been activily working with this because I need the features now 😅. I've also added support for passing credentials to the |
ibis.TableDataset
add configurable save mode for table materializationibis.TableDataset
add configurable save mode for table materialization and credentials
support
…precate connection - feat: introduce a new `credentials` parameter to accept connection info as a string or dict, preferred over the deprecated `connection` param - warn if both `credentials` and `connection` are provided, prioritizing `credentials` - support connection strings and dicts with connection strings for backend connections - feat: add backend name extraction to be used in `_describe()` method - docs: update docstrings to explain `credentials` and deprecation of `connection` !Semi- Breaking Change: deprecates the `connection` parameter in favor of `credentials` Signed-off-by: gitgud5000 <[email protected]>
…lacing the function dispatch dict with direct conditional handling, improving clarity and maintainability - docs: expands and clarifies docstring to explain available save modes and their effects, referencing Spark semantics for familiarity Signed-off-by: gitgud5000 <[email protected]>
Signed-off-by: gitgud5000 <[email protected]>
d545871
to
6445239
Compare
Best reason. 😁
Sorry, I didn't check in time; ideal is a separate PR, but it's probably no big deal to have it in the same. If it ends up being a blocker to get something merged, can always split it out later. |
- fix: add early return for empty DataFrame - fix: ensure table creation occurs before insert operations in append mode when table doesn't exist Signed-off-by: gitgud5000 <[email protected]>
40bbfe3
to
6130581
Compare
Could you please take a look at this pull request, @ankatiyar @ravi-kumar-pilla? |
@gitgud5000 I'll take a look, in the meanwhile could you add some unit tests to go along with the changes? :D |
- fix: remove lingering 'mode' key when mapping legacy overwrite to prevent unexpected writer kwargs - fix: treat empty pandas DataFrame as a no-op in save, supporting both pandas and ibis tables Prevents accidental parameter leakage and avoids errors when saving empty data. Signed-off-by: gitgud5000 <[email protected]>
…and legacy overwrite behavior Signed-off-by: gitgud5000 <[email protected]>
db200f3
to
55b8666
Compare
… new save modes and credentials parameter Signed-off-by: gitgud5000 <[email protected]>
Signed-off-by: gitgud5000 <[email protected]>
Signed-off-by: gitgud5000 <[email protected]>
@ankatiyar Done ✅ I wasn’t able to resolve a couple of things:
=========================== short test summary info ============================
FAILED kedro-datasets/tests/ibis/test_table_dataset.py::TestTableDataset::test_connection_config[None-None-None-connection_config1-key1]
|
Signed-off-by: Ankita Katiyar <[email protected]>
63b95c1
to
4c6ddb9
Compare
0cbaabd
to
940d901
Compare
Signed-off-by: gitgud5000 <[email protected]>
@gitgud5000 Sorry for the delay; I'll review in the next few days! |
Signed-off-by: gitgud5000 <[email protected]>
Signed-off-by: Deepyaman Datta <[email protected]>
Signed-off-by: Deepyaman Datta <[email protected]>
a3c387b
to
e4636b7
Compare
This reverts commit e4636b7. Signed-off-by: Deepyaman Datta <[email protected]>
Signed-off-by: Deepyaman Datta <[email protected]>
a245267
to
6a23b81
Compare
@gitgud5000 No rush at all (will focus on getting this PR merged first), but whenever you get a chance you can create one more PR off The main reason I'm asking you to create the PR instead of doing it myself is because then only one other person needs to review it other than myself. :) |
f29fe5c
to
f819670
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excited to have this long-awaited functionality!
|
||
# Handle mode/overwrite conflict. | ||
if save_args and "mode" in save_args and "overwrite" in self._save_args: | ||
raise ValueError("Cannot specify both 'mode' and deprecated 'overwrite'.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's actually no DeprecationWarning
, but I'm also not 100% sure there needs to be one. However, if there isn't a DeprecationWarning
, it doesn't make sense to call it "deprecated", and it shouldn't be on the list of breaking changes in the release notes.
|
||
- `ibis.TableDataset`: Deprecated `save_args.overwrite` and the `connection` parameter in favor of `save_args.mode` and `credentials`. Using both `overwrite` and `mode` together raises an error; providing both `credentials` and `connection` emits a deprecation warning. The deprecated options will be removed in a future release. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- `ibis.TableDataset`: Deprecated `save_args.overwrite` and the `connection` parameter in favor of `save_args.mode` and `credentials`. Using both `overwrite` and `mode` together raises an error; providing both `credentials` and `connection` emits a deprecation warning. The deprecated options will be removed in a future release. |
I don't think it's a breaking change even if raise a DeprecationWarning
for using 'overwrite' (separate issue about actually raising the DeprecationWarning
; see below).
(Also, the credentials
and connection
bit shouldn't have been left in here when I split out the PR; my bad.)
kedro-datasets/RELEASE.md
Outdated
- Fixed `PartitionedDataset` to reliably load newly created partitions, particularly with `ParallelRunner`, by ensuring `load()` always re-scans the filesystem . | ||
- Add a parameter `encoding` inside the dataset `SQLQueryDataset` to choose the encoding format of the query. | ||
- Corrected the `APIDataset` docstring to clarify that request parameters should be passed via `load_args`, not as top-level arguments. | ||
- Improved `_connect` and `_describe` for `ibis.TableDataset`; saving an empty pandas DataFrame is now a no-op. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Improved `_connect` and `_describe` for `ibis.TableDataset`; saving an empty pandas DataFrame is now a no-op. |
This was also supposed to get split out, sorry.
from ibis import BaseBackend | ||
|
||
|
||
class SaveMode(StrEnum): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gitgud5000 I took the liberty of refactoring save mode logic into an enum.
APPEND = auto() | ||
OVERWRITE = auto() | ||
ERROR = auto() | ||
ERRORIFEXISTS = auto() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need both "error" and "errorifexists"? "errorifexists" is a bit more explicit, but also a bit more verbose.
I decided to check if one or the other is preferred by Spark (maybe we could just use that), but it turns out it's not so simple. https://issues.apache.org/jira/browse/SPARK-21640 added "errorifexists" to the Python (and other API) options—"error" used to be the only option—but https://github.com/apache/spark/blob/v4.0.0/sql/api/src/main/java/org/apache/spark/sql/SaveMode.java only defines "errorifexists". Since we don't have the baggage, I'd be fine with just keeping "error" (feels simpler), but I also don't have any strong opinion against supporting both. Happy to have another maintainer (or @gitgud5000) weigh in.
requires-python = ">=3.10" | ||
license = {text = "Apache Software License (Apache 2.0)"} | ||
dependencies = [ | ||
"backports.strenum; python_version < '3.11'", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a backport of a standard lib module; I don't personally think it's a big deal to add this dependency for everything, and just for Python 3.10.
def table_dataset(table_name, database_name, connection_config, load_args, save_args): | ||
return TableDataset( | ||
table_name="test", | ||
table_name=table_name, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gitgud5000 Was making this parametrizable necessary? Just curious.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gitgud5000 So, I ended up removing this. I see why it's necessary, but I think the correct thing to do is to clean up the environment by deleting any tables created (FWIW that's how Ibis tests also handle it). I've refactored it as such.
f819670
to
cda534a
Compare
Signed-off-by: Deepyaman Datta <[email protected]>
1057b14
to
513aab0
Compare
ibis.TableDataset
add configurable save mode for table materialization and credentials
supportSigned-off-by: Deepyaman Datta <[email protected]>
Signed-off-by: Deepyaman Datta <[email protected]>
Signed-off-by: Deepyaman Datta <[email protected]>
Signed-off-by: Deepyaman Datta <[email protected]>
Signed-off-by: Deepyaman Datta <[email protected]>
Signed-off-by: Deepyaman Datta <[email protected]>
Signed-off-by: Deepyaman Datta <[email protected]>
Signed-off-by: Deepyaman Datta <[email protected]>
Signed-off-by: Deepyaman Datta <[email protected]>
62d976a
to
55ba8a0
Compare
TODO(deepyaman): Move notes on
credentials
support to a new PR from @gitgud5000 with those changes, once it exists. I'm not deleting the detailed notes from here until they're moved elsewhere.feat(datasets): ibis.TableDataset add
mode
andcredentials
supportmode
parameter for save operations, allowing "append", "overwrite", "error", and "ignore" options to control write behavior.overwrite
flag with backward compatibility and deprecation warning.mode
and legacyoverwrite
to avoid ambiguity.credentials
parameter to accept connection info (dict or string URI), supersedingconnection
.credentials
and deprecatedconnection
are provided._describe()
to use it.Semi-breaking change:
overwrite
andconnection
are deprecated; users should migrate tomode
andcredentials
.Description
This PR introduces two key enhancements to
ibis.TableDataset
:Configurable Save Modes:
mode
parameter tosave_args
similar to Spark’sDataFrameWriter.mode
and Pandas’to_csv(mode=...)
, with support for:"append"
: Insert data into an existing table (requires the backend to implementinsert()
)."overwrite"
: Drop and recreate the table/view."error"
or"errorifexists"
: Fail if the table already exists."ignore"
: Do nothing if the table exists; otherwise, create it.overwrite=True|False
maps tomode="overwrite"|"error"
.mode
andoverwrite
are specified simultaneously.credentials
Parameter:credentials
as the preferred method for specifying Ibis backend connection configurations.con
string).connection
parameter._connect()
and_describe()
to support new functionality.Development notes
mode
argument).DataFrameWriter.mode
and Pandas’sto_csv(mode=…)
.overwrite
behavior.ibis.connect()
Expand for details:
Save Mode Handling
DEFAULT_SAVE_ARGS
updated to usemode="overwrite"
by default.__init__
:mode
/overwrite
presence.overwrite
→mode
internally.save()
dispatches based on mode:append
→ callsinsert()
if supported, else raisesNotImplementedError
.overwrite
,error
,ignore
→ handled viacreate_table
/create_view
with appropriate overwrite flag._describe()
includesmode
.Credentials Support
credentials
param to init signature and docstring.con
string._connect()
routes accordingly based on type._get_backend_name()
extracts backend info for_describe()
.All supported
mode
options were tested using both DuckDB and Postgres backends.credentials
parsing was tested for string, dict-with-con, and dict-with-backend formats.Checklist
jsonschema/kedro-catalog-X.XX.json
if necessaryRELEASE.md
file