
Conversation

Pijukatel
Collaborator

Description

This change enables scenarios with mixed storage clients, for example running on the Apify platform while also using some specialized storage backed by a different storage client.

  • StorageInstanceManager now also caches based on the storage client in use, using client_opener.__qualname__ as the cache key. This allows, for example, keeping two storages of the same name if they use different storage clients.
  • RecoverableState now has an optional key_value_store argument that allows explicitly setting the KeyValueStore used for persistence. This prevents a situation where storage-client-specific state is persisted to a different storage client just because of the global service locator. For example, when using an SQL-based storage client on the Apify platform, the recoverable state should be persisted to the SQL-backed KVS and not to the Apify-based one. See the sketch below.
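
A rough sketch of the mixed-client scenario, assuming the KeyValueStore.open(..., storage_client=...) call shape exercised in the new tests; the RecoverableState usage is only indicated in a comment, since its remaining constructor arguments are omitted here:

import asyncio

from crawlee.storage_clients import MemoryStorageClient
from crawlee.storages import KeyValueStore


async def main() -> None:
    # The default storage client (e.g. the Apify one when running on the platform)
    # is resolved via the global service locator.
    default_kvs = await KeyValueStore.open(name='state')

    # A storage of the same name, backed by a different storage client. With the
    # __qualname__-based cache key, the two instances no longer collide.
    memory_kvs = await KeyValueStore.open(name='state', storage_client=MemoryStorageClient())
    assert default_kvs is not memory_kvs

    # RecoverableState can now be pinned to a specific KVS explicitly,
    # e.g. RecoverableState(..., key_value_store=memory_kvs), instead of always
    # falling back to the service-locator default.


asyncio.run(main())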

Issues

Testing

Added new unit tests.

@github-actions github-actions bot added this to the 121st sprint - Tooling team milestone Aug 22, 2025
@github-actions github-actions bot added labels: t-tooling (Issues with this label are in the ownership of the tooling team), tested (Temporary label used only programmatically for some analytics) Aug 22, 2025
@Pijukatel Pijukatel changed the title feat: Update RecoverableState and StorageInstanceManager to ensure correct proper persitance feat: Update RecoverableState and StorageInstanceManager to ensure proper persitance Aug 22, 2025
@Pijukatel Pijukatel marked this pull request as ready for review August 22, 2025 05:58
@janbuchar janbuchar changed the title feat: Update RecoverableState and StorageInstanceManager to ensure proper persitance feat: Update RecoverableState and StorageInstanceManager to ensure proper persistence Aug 25, 2025
Comment on lines 56 to 57
key_value_store: KeyValueStore to use for persistence. If not provided, the service locator is used to
provide suitable KeyValueStore.
Collaborator

I'd reword this slightly to something like "a system-wide KeyValueStore will be used, based on service locator configuration"

Collaborator Author

Reworded.

@@ -19,6 +20,18 @@
"""Type alias for the client opener function."""


class StorageClientCache:
Collaborator

This looks a lot like a dataclass. Also, could you make it private, or add __all__ and exclude it from there? I'd like to make it clear that it's just an auxiliary class for StorageInstanceManager.

Collaborator Author

Updated

if (
id is None
and name is None
and cls in self._cache_by_storage_client[client_opener.__qualname__].default_instances
Collaborator

There's no guarantee that client_opener will be stable enough to use as a cache key. Something similar to id(storage_client) would be more appropriate IMO, but that would require a more sizable change to the code.

Collaborator Author

Yes, with the current state, we do not have access to the storage_client before caching, so I was left with client_opener.

I was thinking about accessing storage_client through client_opener.__self__, but that would work only for bound methods of the storage clients, and client_opener can be any callable. That is why I opted for client_opener.__qualname__, which is generic enough and I think stable enough for the cache, as the most common case will be just one of our prepared factory methods, like MemoryStorageClient.create_kvs_client.
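
A small illustration of what __qualname__ yields in the two cases described above (assuming the factory method named here; the custom opener is purely illustrative):

from crawlee.storage_clients import MemoryStorageClient

storage_client = MemoryStorageClient()

# Bound factory method - the typical case with the prepared storage clients.
print(storage_client.create_kvs_client.__qualname__)  # 'MemoryStorageClient.create_kvs_client'


# Any other callable works too, since plain functions also carry __qualname__
# (although two unrelated openers sharing a function name would collide).
async def custom_opener(name=None):
    # Illustrative only - the real ClientOpener signature may differ.
    return await storage_client.create_kvs_client(name=name)


print(custom_opener.__qualname__)  # 'custom_opener'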

Collaborator

I guess that's another consideration related to the instance cache mechanism. We should resolve that before v1.

storage_client=MemoryStorageClient(),
configuration=config,
)
assert kvs1 is kvs2
Collaborator

Is this really correct? Those two clients could be configured differently (imagine e.g., two filesystem clients with different storage dirs)

Collaborator Author

That is also the current behavior, and I have not changed it. It is a fair point, though I would question it based on the scope of the PR:

The current state is that only one storage of a given type is allowed per ID or name.

Basically, caching is f(storage_type, id or name)

I am expanding it to f(storage_type, id or name, client_opener) to support the use case in the issue I am dealing with.

To support the use case you are describing, I would have to expand it to f(storage_type, id or name, client_opener, configuration)

So I can further expand the caching to support this as well, if required, even though it is not necessary for this PR.
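
For illustration, the composite key being discussed might look like this (a sketch only; the function and parameter names are made up, not the real StorageInstanceManager internals):

def make_cache_key(cls, id=None, name=None, client_opener=None, configuration=None):
    """Illustrative composite cache key, not the actual implementation."""
    # Before this PR the key was effectively (cls, id or name).
    # This PR adds the opener's __qualname__ so that same-named storages backed
    # by different storage clients get separate instances.
    opener_key = client_opener.__qualname__ if client_opener is not None else None
    # A possible follow-up could also fold in the configuration (e.g. the storage
    # directory of a filesystem client), which is the case raised above.
    return (cls, id or name, opener_key, configuration)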

Collaborator

Okay, let's not make a behemoth PR here, but please create an issue to track this

@Pijukatel Pijukatel requested a review from janbuchar August 26, 2025 07:19
Collaborator

@Mantisus Mantisus left a comment

LGTM

@janbuchar
Collaborator

For example, when using an SQL-based storage client on the Apify platform, the recoverable state should be persisted to the SQL-backed KVS and not to the Apify-based one.

After thinking about this for a while during the review of #1339, I'm not so sure. The Apify-based KVS has the advantage of having a default storage for each run. An SQL database server outside of the platform has no way of telling different Actor runs apart, so it can only have a single shared default KVS for all of them.

Point being... maybe RecoverableState should try hard to put its state in the Apify KVS if given the chance. I might be wrong though 🤷 I just wanted to write this idea down.

@Pijukatel
Collaborator Author

For example, when using an SQL-based storage client on the Apify platform, the recoverable state should be persisted to the SQL-backed KVS and not to the Apify-based one.

After thinking about this for a while during the review of #1339, I'm not so sure. The Apify-based KVS has the advantage of having a default storage for each run. An SQL database server outside of the platform has no way of telling different Actor runs apart, so it can only have a single shared default KVS for all of them.

Point being... maybe RecoverableState should try hard to put its state in the Apify KVS if given the chance. I might be wrong though 🤷 I just wanted to write this idea down.

I think that it is the storage's responsibility to ensure it is suitable for a recoverable state. This change is mainly about allowing (not enforcing) decoupling between RecoverableState and the global KVS provided by the service locator.

from crawlee.storages._key_value_store import KeyValueStore # noqa: PLC0415 avoid circular imports

# Prepare kvs for recoverable state
kvs_client = await FileSystemStorageClient().create_kvs_client(name=f'__RQ_STATE_{metadata.id}')
Collaborator

Wait a minute. This is a subtle bug waiting to happen - the FileSystemStorageClient that creates this RQ client might have a different configuration than the one you create here.

Can you pass the actual storage client down to this method instead?
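
A hypothetical sketch of that suggestion (the helper name and parameters are illustrative, not the actual code): the request queue would hand its own storage client down, so the state KVS inherits the same configuration, e.g. the storage directory.

async def _open_state_kvs_client(storage_client, metadata):
    # Reuse the client (and thus the configuration) that created the RQ client,
    # instead of a fresh FileSystemStorageClient() with possibly different settings.
    return await storage_client.create_kvs_client(name=f'__RQ_STATE_{metadata.id}')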

Development

Successfully merging this pull request may close these issues.

All crawlee persistance names should be valid names for Apify platform