Skip to content

Conversation

gtauzin
Copy link
Contributor

@gtauzin gtauzin commented May 19, 2025

Description

While putting pipelines in production that rely on the StudyDataset that I wrote in #1021, I noticed that providing it with a password in credentials does not work. The password along with the database name, user name, host and ports are translated into a SQLAlchemy URL that was then casted to a string, resulting in the password becoming "***".

Development notes

  • Remove the casting of the storage URL into a string and keep the SQLAlchemy URL
  • Whenever the actual storage URL is needed, make use of URL.render_as_string
  • Add a test that checks the conversion (and the _study_name_exists method)

In order to try out in a real case, you can make use of the optuna_dist branch of kedro-dagster-example:

git clone --branch optuna_dist [email protected]:gtauzin/kedro-dagster-example.git
cd kedro-dagster-example

Spin the postgresql DB:

docker compose -f docker/pipelines-dev.docker-compose.yml up

In a separate terminal, do:

export POSTGRES_DB=dev_db
export POSTGRES_USER=dev_user                           
export POSTGRES_PASSWORD=dev_password
export POSTGRES_HOST=localhost
export POSTGRES_PORT=5432
export KEDRO_ENV=dev 
uv run kedro run --env KEDRO_ENV

You can change from the local StudyDataset (which is the one I pushed here as well - kedro_dagster_example.study_dataset.StudyDataset) to the one release in kedro-datasets (kedro_datasets_experimental.optuna.StudyDataset) in the conf/dev/catalog.yml to observe the fix.

Checklist

  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • [] Updated the documentation to reflect the code changes
  • [] Updated jsonschema/kedro-catalog-X.XX.json if necessary
  • Added a description of this change in the relevant RELEASE.md file
  • Added tests to cover my changes
  • Received approvals from at least half of the TSC (required for adding a new, non-experimental dataset)

@gtauzin gtauzin force-pushed the fix/optuna_pwd_db branch from ca8b961 to e827ed9 Compare May 19, 2025 19:46
pruner_config = load_args.pop("pruner")
pruner = self._get_pruner(pruner_config)

storage_url_str = self._storage_url.render_as_string(hide_password=False)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't this expose your database password? I have never used Optuna, so don't really know how this works, but ideally this should still be a secure way of handling credentials.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does indeed expose your database password. I am not sure what are the security consequences of this but I am not seeing a workaround.

Optuna's RDBStorage, requires the url of the RDB in the form of a SQLAlchemy URL string and stores it directly as a class attribute. This string includes the target db, its host, port, password and name. Previously with self._storage = str(storage), the password was hidden and translated to "***" within the string which meant authentication to a password-protected RDB was impossible. Now even if I instantiate the RDBStorage class in the dataset class constructor to reuse it in save/load methods, its url class attribute would still contain all the credentials.

Would you have any pointers regarding to the security scenario where that would be problematic or to a discussion regarding secure handling of credentials? I am happy to learn more about this and try and come back with a better solution!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @merelcht
It's been a few months already and I would really like to get this merge as the current StudyDataset does not really allow for distributed hyperparameter tuning. I am happy to work on a solution but I need some pointers as mentioned above. Please let me know :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @gtauzin , sincere apologies for not getting back to you earlier! This slipped my mind during and after the 1.0 release of Kedro.

My main concern is whether this could accidentally leak credentials and make it possible for bad actors to retrieve your database keys. Would it work if you just directly call:

storage = optuna.storages.RDBStorage(url=self._storage_url)

And skip the rendering as string?

Copy link
Member

@deepyaman deepyaman Sep 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm OK with the password handling and don't necessarily see something better. The rendered password is only in a variable, not in anything saved to the class, right? Unless I'm missing something in my pass through.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@deepyaman Indeed, all credentials are stored in a class variable _storage_url which means they can be directly accessed as my_dataset._storage_url.render_as_string(hide_password=False). This is the only place where they appear.

@merelcht No worries! Initially I figured you were busy with 1.0 and then I forgot about it until I started needing it again :)

For your suggestion, the url parameter required by RDBStorage is a string, which is why I use the render_as_string method.

I have had a look at redis.PickleDataset and it seems to me you can get back credentials from a dataset instance as well by doing my_pickle_dataset._redis_db.connection_pool.connection_kwargs.

Is the problem related to credentials are stored in the class and can be accessed clearly or the presence of a line of code which exposes them clearly?

Thanks to both of you for the help.

## Bug fixes and other changes

- Fixed `PartitionedDataset` to reliably load newly created partitions, particularly with `ParallelRunner`, by ensuring `load()` always re-scans the filesystem .
- Fixed `StudyDataset` to properly propagate a RDB password through the dataset's `credentials`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- Fixed `StudyDataset` to properly propagate a RDB password through the dataset's `credentials`
- Fixed `StudyDataset` to properly propagate a RDB password through the dataset's `credentials`.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants