gitgud5000
Contributor

@gitgud5000 gitgud5000 commented May 31, 2025

TODO(deepyaman): Move notes on credentials support to a new PR from @gitgud5000 with those changes, once it exists. I'm not deleting the detailed notes from here until they're moved elsewhere.

feat(datasets): ibis.TableDataset add mode and credentials support

  • feat: Introduces a mode parameter for save operations, allowing "append", "overwrite", "error", and "ignore" options to control write behavior.
    • Supports legacy overwrite flag with backward compatibility and deprecation warning.
    • Adds mode dispatching logic to handle different write scenarios.
    • Prevents simultaneous use of both mode and legacy overwrite to avoid ambiguity.
    • Updates examples and docs to reflect new parameter.
  • feat: Adds a credentials parameter to accept connection info (dict or string URI), superseding connection.
    • Raises a warning if both credentials and deprecated connection are provided.
    • Adds backend extraction helper and adjusts _describe() to use it.
    • Improves documentation to reflect the preferred usage.

Semi-breaking change: overwrite and connection are deprecated; users should migrate to mode and credentials.

Description

This PR introduces two key enhancements to ibis.TableDataset:

  1. Configurable Save Modes:

    • Adds a mode parameter to save_args similar to Spark’s DataFrameWriter.mode and Pandas’ to_csv(mode=...), with support for:
      • "append": Insert data into an existing table (requires the backend to implement insert()).
      • "overwrite": Drop and recreate the table/view.
      • "error" or "errorifexists": Fail if the table already exists.
      • "ignore": Do nothing if the table exists; otherwise, create it.
    • Backward compatible: legacy overwrite=True|False maps to mode="overwrite"|"error".
    • Raises an error if both mode and overwrite are specified simultaneously.
  2. credentials Parameter:

    • Introduces credentials as the preferred method for specifying Ibis backend connection configurations.
    • Accepts:
      • A string connection URI.
      • A dictionary of connection parameters (optionally containing a con key that holds a connection string).
    • Supersedes the older connection parameter.
    • Warns if both are provided.
    • Updates both _connect() and _describe() to support new functionality.
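
The relationship between the legacy overwrite flag and the new mode values described above can be sketched roughly as follows. This is a minimal illustration rather than the code in this PR: `resolve_mode` and `VALID_MODES` are hypothetical names, the "overwrite" default is taken from the development notes, and folding "errorifexists" into "error" is an assumption.

```python
from typing import Optional

# Hypothetical names for illustration; not the actual kedro-datasets code.
VALID_MODES = {"append", "overwrite", "error", "errorifexists", "ignore"}


def resolve_mode(save_args: Optional[dict]) -> str:
    """Return the effective save mode for a ``save_args`` mapping."""
    save_args = dict(save_args or {})  # copy so the caller's dict is untouched

    # Specifying both the new and the legacy option is ambiguous.
    if "mode" in save_args and "overwrite" in save_args:
        raise ValueError("Cannot specify both 'mode' and 'overwrite'.")

    if "overwrite" in save_args:
        # Legacy flag: True -> "overwrite", False -> "error".
        return "overwrite" if save_args["overwrite"] else "error"

    mode = save_args.get("mode", "overwrite")  # assumed default
    if mode not in VALID_MODES:
        raise ValueError(f"Unsupported save mode: {mode!r}")

    # Treat the Spark-style spelling as a synonym (an assumption).
    return "error" if mode == "errorifexists" else mode
```

With this sketch, `resolve_mode({"overwrite": False})` and `resolve_mode({"mode": "errorifexists"})` both resolve to "error", while passing both keys raises a ValueError.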

Development notes

  • Closes #834, “Support for inserting data in ibis.TableDataset” (requested by @deepyaman on Sep 14, 2024, to add “insert” support via a mode argument).
    • In the parent issue, “Enhance current integration between Kedro and Ibis” (#1174), the main ask was to allow an “append”/“insert” operation for Ibis datasets.
    • This implementation covers:
      • A reworked API akin to Spark’s DataFrameWriter.mode and Pandas’ to_csv(mode=…).
      • Backward compatibility with the existing overwrite behavior.
      • Clear, explicit behavior for each mode option.
      • Support for passing credentials in the different formats supported by ibis.connect().
Expand for details:

Save Mode Handling

  • DEFAULT_SAVE_ARGS updated to use mode="overwrite" by default.
  • __init__:
    • Validates mode/overwrite presence.
    • Maps overwrite → mode internally.
  • save() dispatches based on mode:
    • append → calls insert() if supported, else raises NotImplementedError.
    • overwrite, error, ignore → handled via create_table/create_view with appropriate overwrite flag.
  • _describe() includes mode.
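
The dispatch listed above could look roughly like the sketch below. It uses a small in-memory `FakeBackend` (a hypothetical stand-in, loosely modelled on a backend exposing `create_table` and `insert`) so that it runs without a database; the real dataset code differs. Creating the table on the first append follows a later fix in this PR.

```python
class FakeBackend:
    """In-memory stand-in for a database backend (illustration only)."""

    def __init__(self):
        self.tables = {}

    def create_table(self, name, rows, overwrite=False):
        if name in self.tables and not overwrite:
            raise ValueError(f"Table {name!r} already exists")
        self.tables[name] = list(rows)

    def insert(self, name, rows):
        self.tables[name].extend(rows)


def save(backend, name, rows, mode="overwrite"):
    """Write ``rows`` to table ``name`` according to ``mode``."""
    exists = name in backend.tables
    if mode == "append":
        if not hasattr(backend, "insert"):
            raise NotImplementedError("Backend does not support insert()")
        if exists:
            backend.insert(name, rows)
        else:
            backend.create_table(name, rows)  # create before the first append
    elif mode == "overwrite":
        backend.create_table(name, rows, overwrite=True)
    elif mode == "ignore":
        if not exists:  # do nothing when the table is already there
            backend.create_table(name, rows)
    elif mode == "error":
        backend.create_table(name, rows, overwrite=False)  # fails if it exists
    else:
        raise ValueError(f"Unsupported save mode: {mode!r}")
```

Calling `save` twice with mode="append" accumulates rows, "ignore" leaves an existing table untouched, and "error" raises once the table exists.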

Credentials Support

  • Added credentials param to init signature and docstring.
  • Accepts either:
    • A connection string (e.g. "duckdb:///my.db")
    • A dict with connection params or a con string.
  • _connect() routes accordingly based on type.
  • _get_backend_name() extracts backend info for _describe().
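
As a rough sketch of the backend extraction step, the helper below pulls the backend name out of each of the three credential shapes mentioned in these notes. It is not the PR's `_get_backend_name()`; the function name, the use of the URI scheme, and the `backend` dictionary key are assumptions made for illustration.

```python
from typing import Union
from urllib.parse import urlsplit


def get_backend_name(credentials: Union[str, dict]) -> str:
    """Best-effort extraction of the backend name from credentials."""
    if isinstance(credentials, str):
        # A URI such as "duckdb:///my.db" carries the backend as its scheme.
        return urlsplit(credentials).scheme
    if "con" in credentials:
        # Dict wrapping a connection string under the "con" key.
        return urlsplit(credentials["con"]).scheme
    # Assumed key name for the dict-with-backend format.
    return credentials["backend"]
```

For example, `get_backend_name("duckdb:///my.db")` yields "duckdb", which is what a `_describe()`-style method could report without exposing the rest of the connection details.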

All supported mode options were tested using both DuckDB and Postgres backends.
credentials parsing was tested for string, dict-with-con, and dict-with-backend formats.

Checklist

  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Updated jsonschema/kedro-catalog-X.XX.json if necessary
  • Added a description of this change in the relevant RELEASE.md file
  • Added tests to cover my changes
  • Received approvals from at least half of the TSC (required for adding a new, non-experimental dataset)

…ble materialization

- feat: Introduces a `mode` parameter for save operations, allowing "append", "overwrite", "error", and "ignore" options to control write behavior.
  - Supports legacy `overwrite` flag with backward compatibility and deprecation warning.
  - Adds mode dispatching logic to handle different write scenarios.
  - Updates examples and docs to reflect new parameter.
  - Prevents simultaneous use of both `mode` and legacy `overwrite` to avoid ambiguity.
  - docs: Improves documentation and usage examples for new save modes.

Semi-breaking change: Replaces `overwrite` save argument with `mode`; users should update configurations to use `mode`.

Signed-off-by: gitgud5000 <[email protected]>
@gitgud5000 gitgud5000 force-pushed the ibis-dataset-savemode-support branch from 2e9e539 to a41bb19 Compare May 31, 2025 01:41
@gitgud5000 gitgud5000 force-pushed the ibis-dataset-savemode-support branch from a41bb19 to 1f609c5 Compare May 31, 2025 01:44
Member

@deepyaman deepyaman left a comment

Thanks for the PR!

Did a quick-pass review. I see you're still actively updating, so I'll revisit this later.


self._materialized = self._save_args.pop("materialized")

# Handle mode / overwrite conflict
Member

Wonder if insert + view needs to be disallowed; I haven't actually checked yet.

@gitgud5000
Contributor Author

gitgud5000 commented May 31, 2025

I've been actively working with this because I need the features now 😅.

I've also added support for passing credentials to ibis.TableDataset in this PR; I'll update the OP shortly to reflect this.
Is that OK, or should I create a separate PR for this feature? @deepyaman

@gitgud5000 gitgud5000 changed the title feat(datasets): ibis.TableDataset add configurable save mode for table materialization feat(datasets): ibis.TableDataset add configurable save mode for table materialization and credentials support May 31, 2025
…precate connection

- feat: introduce a new `credentials` parameter to accept connection info as a string or dict, preferred over the deprecated `connection` param
  - warn if both `credentials` and `connection` are provided, prioritizing `credentials`
  - support connection strings and dicts with connection strings for backend connections
- feat: add backend name extraction to be used in `_describe()` method
- docs: update docstrings to explain `credentials` and deprecation of `connection`

Semi-breaking change: deprecates the `connection` parameter in favor of `credentials`

Signed-off-by: gitgud5000 <[email protected]>
…lacing the function dispatch dict with direct conditional handling, improving clarity and maintainability

- docs: expands and clarifies docstring to explain available save modes and their effects, referencing Spark semantics for familiarity

Signed-off-by: gitgud5000 <[email protected]>
@gitgud5000 gitgud5000 force-pushed the ibis-dataset-savemode-support branch from d545871 to 6445239 Compare May 31, 2025 03:52
@deepyaman
Member

deepyaman commented May 31, 2025

> I've been actively working with this because I need the features now 😅.

Best reason. 😁

> I've also added support for passing credentials to ibis.TableDataset in this PR; I'll update the OP shortly to reflect this. Is that OK, or should I create a separate PR for this feature? @deepyaman

Sorry, I didn't check in time; ideally it would be a separate PR, but it's probably no big deal to have it in the same one. If it ends up being a blocker to getting something merged, we can always split it out later.

@ankatiyar ankatiyar requested a review from deepyaman June 3, 2025 14:52
@gitgud5000 gitgud5000 marked this pull request as draft June 4, 2025 11:13
@gitgud5000 gitgud5000 marked this pull request as ready for review August 15, 2025 18:49
- fix: add early return for empty DataFrame
- fix: ensure table creation occurs before insert operations in append mode when table doesn't exist

Signed-off-by: gitgud5000 <[email protected]>
@gitgud5000 gitgud5000 force-pushed the ibis-dataset-savemode-support branch from 40bbfe3 to 6130581 Compare August 15, 2025 20:15
@gitgud5000
Contributor Author

Could you please take a look at this pull request, @ankatiyar @ravi-kumar-pilla?

@ankatiyar
Contributor

@gitgud5000 I'll take a look; in the meantime, could you add some unit tests to go along with the changes? :D

- fix: remove lingering 'mode' key when mapping legacy overwrite to prevent unexpected writer kwargs
- fix: treat empty pandas DataFrame as a no-op in save, supporting both pandas and ibis tables

Prevents accidental parameter leakage and avoids errors when saving empty data.

Signed-off-by: gitgud5000 <[email protected]>
…and legacy overwrite behavior

Signed-off-by: gitgud5000 <[email protected]>
@gitgud5000 gitgud5000 force-pushed the ibis-dataset-savemode-support branch from db200f3 to 55b8666 Compare August 18, 2025 14:31
@gitgud5000
Contributor Author

gitgud5000 commented Aug 18, 2025

@ankatiyar Done ✅

I wasn’t able to resolve a couple of things:

  • All the tests I added pass locally, except for test_connection_config with the mssql config.
=========================== short test summary info ============================
FAILED kedro-datasets/tests/ibis/test_table_dataset.py::TestTableDataset::test_connection_config[None-None-None-connection_config1-key1]
  • The lint check pipeline is still failing, and I’m not sure why or how to fix it. Would appreciate your guidance 🙂

Signed-off-by: Ankita Katiyar <[email protected]>
@gitgud5000 gitgud5000 force-pushed the ibis-dataset-savemode-support branch from 63b95c1 to 4c6ddb9 Compare August 18, 2025 16:38
@gitgud5000 gitgud5000 force-pushed the ibis-dataset-savemode-support branch from 0cbaabd to 940d901 Compare September 10, 2025 07:28
@deepyaman
Member

@gitgud5000 Sorry for the delay; I'll review in the next few days!

@deepyaman deepyaman force-pushed the ibis-dataset-savemode-support branch from a3c387b to e4636b7 Compare September 23, 2025 19:35
@deepyaman deepyaman force-pushed the ibis-dataset-savemode-support branch from a245267 to 6a23b81 Compare September 23, 2025 19:50
@deepyaman
Member

@gitgud5000 No rush at all (I'll focus on getting this PR merged first), but whenever you get a chance, you can create one more PR off main with a git cherry-pick d922af12211e301dd53245bee0d08aa5c6e59852 (there are some minor conflicts, but they look easy to resolve). This will include all the changes unrelated to adding the configurable save mode (credentials, empty datasets, etc.). If we need to split that up further later, we can do it.

The main reason I'm asking you to create the PR instead of doing it myself is that then only one person other than myself needs to review it. :)

@deepyaman deepyaman force-pushed the ibis-dataset-savemode-support branch from f29fe5c to f819670 Compare September 23, 2025 23:10
Member

@deepyaman deepyaman left a comment

Excited to have this long-awaited functionality!


# Handle mode/overwrite conflict.
if save_args and "mode" in save_args and "overwrite" in self._save_args:
    raise ValueError("Cannot specify both 'mode' and deprecated 'overwrite'.")
Member

There's actually no DeprecationWarning, but I'm also not 100% sure there needs to be one. However, if there isn't a DeprecationWarning, it doesn't make sense to call it "deprecated", and it shouldn't be on the list of breaking changes in the release notes.
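
For reference, if a warning were added, it might look something like the sketch below. This is only an illustration of the idea under discussion: `map_legacy_overwrite` is a hypothetical helper, and neither the message text nor the mapping is taken from the PR's code.

```python
import warnings


def map_legacy_overwrite(save_args: dict) -> dict:
    """Translate a legacy ``overwrite`` flag into ``mode``, with a warning."""
    save_args = dict(save_args)  # avoid mutating the caller's dict
    if "overwrite" in save_args:
        warnings.warn(
            "'overwrite' is deprecated; use 'mode' instead.",
            DeprecationWarning,
            stacklevel=2,  # point the warning at the caller
        )
        overwrite = save_args.pop("overwrite")  # do not leak the legacy key
        save_args["mode"] = "overwrite" if overwrite else "error"
    return save_args
```

Without such a call, the legacy flag is merely still supported, which is why describing it as "deprecated" would be misleading.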

Comment on lines +18 to +19

- `ibis.TableDataset`: Deprecated `save_args.overwrite` and the `connection` parameter in favor of `save_args.mode` and `credentials`. Using both `overwrite` and `mode` together raises an error; providing both `credentials` and `connection` emits a deprecation warning. The deprecated options will be removed in a future release.
Member

Suggested change
- `ibis.TableDataset`: Deprecated `save_args.overwrite` and the `connection` parameter in favor of `save_args.mode` and `credentials`. Using both `overwrite` and `mode` together raises an error; providing both `credentials` and `connection` emits a deprecation warning. The deprecated options will be removed in a future release.

I don't think it's a breaking change even if we raise a DeprecationWarning for using 'overwrite' (separate issue about actually raising the DeprecationWarning; see below).

(Also, the credentials and connection bit shouldn't have been left in here when I split out the PR; my bad.)

- Fixed `PartitionedDataset` to reliably load newly created partitions, particularly with `ParallelRunner`, by ensuring `load()` always re-scans the filesystem.
- Add a parameter `encoding` inside the dataset `SQLQueryDataset` to choose the encoding format of the query.
- Corrected the `APIDataset` docstring to clarify that request parameters should be passed via `load_args`, not as top-level arguments.
- Improved `_connect` and `_describe` for `ibis.TableDataset`; saving an empty pandas DataFrame is now a no-op.
Member

Suggested change
- Improved `_connect` and `_describe` for `ibis.TableDataset`; saving an empty pandas DataFrame is now a no-op.

This was also supposed to get split out, sorry.

from ibis import BaseBackend


class SaveMode(StrEnum):
Member

@gitgud5000 I took the liberty of refactoring save mode logic into an enum.

    APPEND = auto()
    OVERWRITE = auto()
    ERROR = auto()
    ERRORIFEXISTS = auto()
Member

Do we need both "error" and "errorifexists"? "errorifexists" is a bit more explicit, but also a bit more verbose.

I decided to check if one or the other is preferred by Spark (maybe we could just use that), but it turns out it's not so simple. https://issues.apache.org/jira/browse/SPARK-21640 added "errorifexists" to the Python (and other API) options—"error" used to be the only option—but https://github.com/apache/spark/blob/v4.0.0/sql/api/src/main/java/org/apache/spark/sql/SaveMode.java only defines "errorifexists". Since we don't have the baggage, I'd be fine with just keeping "error" (feels simpler), but I also don't have any strong opinion against supporting both. Happy to have another maintainer (or @gitgud5000) weigh in.
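
One possible middle ground, sketched below purely as an illustration, is to keep a single ERROR member while still accepting the Spark-style spelling on input. This is not the enum in the PR (which defines both members); the `_missing_` hook and the small `StrEnum` shim for interpreters older than 3.11 are assumptions made so the example is self-contained.

```python
from enum import Enum, auto

try:
    from enum import StrEnum  # Python 3.11+
except ImportError:
    # Minimal shim for older interpreters (illustration only).
    class StrEnum(str, Enum):
        def _generate_next_value_(name, start, count, last_values):
            return name.lower()


class SaveMode(StrEnum):
    APPEND = auto()  # auto() yields the lower-cased member name
    OVERWRITE = auto()
    ERROR = auto()
    IGNORE = auto()

    @classmethod
    def _missing_(cls, value):
        # Accept "errorifexists" as a synonym for ERROR.
        if isinstance(value, str) and value.lower() == "errorifexists":
            return cls.ERROR
        return None
```

Here `SaveMode("errorifexists")` and `SaveMode("error")` both return `SaveMode.ERROR`, so the longer spelling keeps working in configuration without adding a second member.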

requires-python = ">=3.10"
license = {text = "Apache Software License (Apache 2.0)"}
dependencies = [
    "backports.strenum; python_version < '3.11'",
Member

It's a backport of a standard library module; I personally don't think it's a big deal to add this dependency for everything, since it's only installed on Python 3.10.

def table_dataset(table_name, database_name, connection_config, load_args, save_args):
    return TableDataset(
-        table_name="test",
+        table_name=table_name,
Member

@gitgud5000 Was making this parametrizable necessary? Just curious.

Member

@gitgud5000 So, I ended up removing this. I see why it's necessary, but I think the correct thing to do is to clean up the environment by deleting any tables created (FWIW that's how Ibis tests also handle it). I've refactored it as such.
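
The cleanup approach can be illustrated with a small, self-contained sketch. It uses the standard library's `sqlite3` as a stand-in rather than Ibis, and `temp_table` and `table_exists` are hypothetical helpers, not the refactored test code; in a pytest suite the same pattern would typically be written as a yield fixture.

```python
import sqlite3
from contextlib import contextmanager


@contextmanager
def temp_table(con: sqlite3.Connection, name: str):
    """Yield ``name`` and drop the table afterwards, even if the test fails."""
    try:
        yield name
    finally:
        con.execute(f'DROP TABLE IF EXISTS "{name}"')


def table_exists(con: sqlite3.Connection, name: str) -> bool:
    row = con.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?", (name,)
    ).fetchone()
    return row is not None


con = sqlite3.connect(":memory:")
with temp_table(con, "test") as name:
    con.execute(f'CREATE TABLE "{name}" (x INTEGER)')
    created = table_exists(con, name)  # table is visible inside the block
cleaned_up = not table_exists(con, "test")  # and gone once the block exits
```

Because every test leaves the database as it found it, a fixed table name such as "test" can be reused across tests without parametrizing it.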

@deepyaman deepyaman force-pushed the ibis-dataset-savemode-support branch from f819670 to cda534a Compare September 23, 2025 23:30
@deepyaman deepyaman force-pushed the ibis-dataset-savemode-support branch from 1057b14 to 513aab0 Compare September 23, 2025 23:44
@deepyaman deepyaman changed the title feat(datasets): ibis.TableDataset add configurable save mode for table materialization and credentials support feat(datasets): make table write mode configurable Sep 23, 2025
@deepyaman deepyaman force-pushed the ibis-dataset-savemode-support branch from 62d976a to 55ba8a0 Compare September 24, 2025 19:29
Development

Successfully merging this pull request may close these issues.

Support for inserting data in ibis.TableDataset
3 participants