Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 2 additions & 1 deletion kedro-datasets/RELEASE.md
Original file line number Diff line number Diff line change
Expand Up @@ -7,6 +7,7 @@
## Bug fixes and other changes

- Fixed `PartitionedDataset` to reliably load newly created partitions, particularly with `ParallelRunner`, by ensuring `load()` always re-scans the filesystem .
- Fixed `StudyDataset` to properly propagate a RDB password through the dataset's `credentials`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- Fixed `StudyDataset` to properly propagate a RDB password through the dataset's `credentials`
- Fixed `StudyDataset` to properly propagate a RDB password through the dataset's `credentials`.


## Breaking changes

Expand All @@ -16,7 +17,7 @@

Many thanks to the following Kedroids for contributing PRs to this release:

- ...
- [Guillaume Tauzin](https://github.com/gtauzin)

# Release 7.0.0

Expand Down
26 changes: 18 additions & 8 deletions kedro-datasets/kedro_datasets_experimental/optuna/study_dataset.py
Original file line number Diff line number Diff line change
Expand Up @@ -116,13 +116,13 @@ def __init__( # noqa: PLR0913
self._study_name = self._validate_study_name(study_name=study_name)

credentials = self._validate_credentials(backend=backend, credentials=credentials)
storage = URL.create(
storage_url = URL.create(
drivername=backend,
database=database,
**credentials,
)

self._storage = str(storage)
self._storage_url = storage_url
self.metadata = metadata

filepath = None
Expand Down Expand Up @@ -286,8 +286,10 @@ def load(self) -> optuna.Study:
pruner_config = load_args.pop("pruner")
pruner = self._get_pruner(pruner_config)

storage_url_str = self._storage_url.render_as_string(hide_password=False)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't this expose your database password? I have never used Optuna, so don't really know how this works, but ideally this should still be a secure way of handling credentials.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does indeed expose your database password. I am not sure what are the security consequences of this but I am not seeing a workaround.

Optuna's RDBStorage, requires the url of the RDB in the form of a SQLAlchemy URL string and stores it directly as a class attribute. This string includes the target db, its host, port, password and name. Previously with self._storage = str(storage), the password was hidden and translated to "***" within the string which meant authentication to a password-protected RDB was impossible. Now even if I instantiate the RDBStorage class in the dataset class constructor to reuse it in save/load methods, its url class attribute would still contain all the credentials.

Would you have any pointers regarding to the security scenario where that would be problematic or to a discussion regarding secure handling of credentials? I am happy to learn more about this and try and come back with a better solution!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @merelcht
It's been a few months already and I would really like to get this merge as the current StudyDataset does not really allow for distributed hyperparameter tuning. I am happy to work on a solution but I need some pointers as mentioned above. Please let me know :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @gtauzin , sincere apologies for not getting back to you earlier! This slipped my mind during and after the 1.0 release of Kedro.

My main concern is whether this could accidentally leak credentials and make it possible for bad actors to retrieve your database keys. Would it work if you just directly call:

storage = optuna.storages.RDBStorage(url=self._storage_url)

And skip the rendering as string?

Copy link
Member

@deepyaman deepyaman Sep 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm OK with the password handling and don't necessarily see something better. The rendered password is only in a variable, not in anything saved to the class, right? Unless I'm missing something in my pass through.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@deepyaman Indeed, all credentials are stored in a class variable _storage_url which means they can be directly accessed as my_dataset._storage_url.render_as_string(hide_password=False). This is the only place where they appear.

@merelcht No worries! Initially I figured you were busy with 1.0 and then I forgot about it until I started needing it again :)

For your suggestion, the url parameter required by RDBStorage is a string, which is why I use the render_as_string method.

I have had a look at redis.PickleDataset and it seems to me you can get back credentials from a dataset instance as well by doing my_pickle_dataset._redis_db.connection_pool.connection_kwargs.

Is the problem related to credentials are stored in the class and can be accessed clearly or the presence of a line of code which exposes them clearly?

Thanks to both of you for the help.

storage = optuna.storages.RDBStorage(url=storage_url_str)
study = optuna.load_study(
storage=self._storage,
storage=storage,
study_name=self._get_load_study_name(),
sampler=sampler,
pruner=pruner,
Expand All @@ -297,37 +299,45 @@ def load(self) -> optuna.Study:

def save(self, study: optuna.Study) -> None:
save_study_name = self._get_save_study_name()

storage_url_str = self._storage_url.render_as_string(hide_password=False)
if self._backend == "sqlite":
os.makedirs(os.path.dirname(self._filepath), exist_ok=True)

if not os.path.isfile(self._filepath):
optuna.create_study(
storage=self._storage,
storage=storage_url_str,
)

storage = optuna.storages.RDBStorage(url=storage_url_str)

# To overwrite an existing study, we need to first delete it if it exists
if self._study_name_exists(save_study_name):
optuna.delete_study(
storage=self._storage,
storage=storage,
study_name=save_study_name,
)

optuna.copy_study(
from_study_name=study.study_name,
from_storage=study._storage,
to_storage=self._storage,
to_storage=storage,
to_study_name=save_study_name,
)

def _study_name_exists(self, study_name) -> bool:
if self._backend == "sqlite" and not os.path.isfile(self._database):
return False

study_names = optuna.study.get_all_study_names(storage=self._storage)
storage_url_str = self._storage_url.render_as_string(hide_password=False)
storage = optuna.storages.RDBStorage(url=storage_url_str)
study_names = optuna.study.get_all_study_names(storage=storage)
return study_name in study_names

def _study_name_glob(self, pattern):
study_names = optuna.study.get_all_study_names(storage=self._storage)
storage_url_str = self._storage_url.render_as_string(hide_password=False)
storage = optuna.storages.RDBStorage(url=storage_url_str)
study_names = optuna.study.get_all_study_names(storage=storage)
for study_name in study_names:
if fnmatch.fnmatch(study_name, pattern):
yield study_name
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -105,6 +105,26 @@ def test_invalid_credentials(self):
credentials={"username": "user", "pwd": "pass"}, # pragma: allowlist secret
)

def test_study_existence(self):
"""Test invalid credentials raise ValueError."""
study_dataset = StudyDataset(
study_name="test",
backend="postgresql",
database="optuna_db",
credentials={"username": "user", "password": "pass"}, # pragma: allowlist secret
)

# Test that RDB storage can be created but DB access module cannot be imported
with pytest.raises(ImportError, match="Failed to import DB access module"):
study_dataset._study_name_exists(study_name="blah")

study_dataset = StudyDataset(
study_name="test",
backend="sqlite",
database="optuna.db",
)
assert study_dataset._study_name_exists(study_name="blah") is False

def test_study_name_exists(self, study_dataset, dummy_study):
"""Test `_study_name_exists` method."""
assert not study_dataset._study_name_exists("test")
Expand Down