feat: Update RecoverableState and StorageInstanceManager to ensure proper persistence #1368
base: master
Conversation
key_value_store: KeyValueStore to use for persistence. If not provided, the service locator is used to
    provide suitable KeyValueStore.
I'd reword this slightly to something like "a system-wide KeyValueStore will be used, based on service locator configuration"
Reworded.
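For context, a minimal sketch of how the reworded parameter documentation might read; the class name matches the PR, but the surrounding signature is an assumption, not copied from the diff:

```python
from crawlee.storages import KeyValueStore


class RecoverableState:
    """Sketch only - the real class takes more parameters."""

    def __init__(self, *, key_value_store: KeyValueStore | None = None) -> None:
        """Initialize the recoverable state.

        Args:
            key_value_store: KeyValueStore to use for persistence. If not provided, a system-wide
                KeyValueStore will be used, based on service locator configuration.
        """
        # An explicitly provided KVS takes precedence over the service-locator default.
        self._key_value_store = key_value_store
```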
@@ -19,6 +20,18 @@
"""Type alias for the client opener function."""


class StorageClientCache:
This looks a lot like a dataclass. Also, could you make it private, or add __all__ and exclude it from there? I'd like to make it clear that it's just an auxiliary class for StorageInstanceManager.
Updated
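A sketch of the suggested shape - a private dataclass excluded from the module's public API. Apart from default_instances, which appears in the diff below, the field names are illustrative:

```python
from dataclasses import dataclass, field
from typing import Any

# Only the manager is exported; the cache holder stays internal.
__all__ = ['StorageInstanceManager']


@dataclass
class _StorageClientCache:
    """Auxiliary per-storage-client instance cache for StorageInstanceManager."""

    instances_by_id: dict[str, Any] = field(default_factory=dict)
    instances_by_name: dict[str, Any] = field(default_factory=dict)
    default_instances: dict[type, Any] = field(default_factory=dict)
```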
if (
    id is None
    and name is None
    and cls in self._cache_by_storage_client[client_opener.__qualname__].default_instances
There's no guarantee that client_opener will be stable enough to use as a cache key. Something similar to id(storage_client) would be more appropriate IMO, but that would require a more sizable change to the code.
Yes, with the current state, we do not have access to the storage_client before caching, so I was left with client_opener. I was thinking about accessing storage_client through client_opener.__self__, but that would work only for bound methods of the storage clients, and client_opener can be any callable. That is why I opted for client_opener.__qualname__, which is generic enough and should be stable enough for the cache, since the most common case will be just one of our prepared factory methods, like MemoryStorageClient.create_kvs_client.
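To make the trade-off concrete, a small illustration of the key derivation being discussed (the helper name is assumed):

```python
from collections.abc import Callable


def _cache_key(client_opener: Callable) -> str:
    """Derive the instance-cache key from the opener's qualified name."""
    # For the common case - a factory method such as
    # MemoryStorageClient.create_kvs_client - this yields the stable string
    # 'MemoryStorageClient.create_kvs_client'. For an arbitrary callable such
    # as a lambda it degrades to '<lambda>', which is exactly the stability
    # concern raised above.
    return client_opener.__qualname__
```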
I guess that's another consideration related to the instance cache mechanism. We should resolve that before v1.
    storage_client=MemoryStorageClient(),
    configuration=config,
)
assert kvs1 is kvs2
Is this really correct? Those two clients could be configured differently (imagine, e.g., two filesystem clients with different storage dirs).
That is also the current behavior, and I have not changed it. It is a fair point, but I would question it based on the scope of this PR:
- The current state is that one storage of a specific type is allowed per ID or name; basically, caching is f(storage_type, id or name).
- I am expanding it to f(storage_type, id or name, client_opener) to support the use case in the issue I am dealing with.
- To support the use case you are describing, I would have to expand it to f(storage_type, id or name, client_opener, configuration).
So I can further expand the caching to support this as well if required, even though it is not necessary for this PR (a sketch of the expanded key follows below).
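An illustrative sketch of the three key variants (the function name and the use of configuration.storage_dir are assumptions, not the PR's code):

```python
from collections.abc import Callable

from crawlee.configuration import Configuration


def _cache_key(
    storage_type: type,
    id_or_name: str,
    client_opener: Callable,
    configuration: Configuration | None = None,
) -> tuple:
    # Before this PR: (storage_type, id or name).
    # This PR adds the opener: (storage_type, id or name, client_opener).
    key: tuple = (storage_type, id_or_name, client_opener.__qualname__)
    if configuration is not None:
        # Possible follow-up: also key on configuration, so that e.g. two
        # filesystem clients with different storage dirs are cached separately.
        key = (*key, configuration.storage_dir)
    return key
```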
Okay, let's not make a behemoth PR here, but please create an issue to track this.
LGTM
After thinking about this for a while during the review of #1339, I'm not so sure. The Apify-based KVS has the advantage of having a default storage for each run. An SQL database server outside of the platform has no way of telling different Actor runs apart, so it can only have a single shared default KVS for all of them. Point being... maybe a shared system-wide default is not always suitable for recoverable state.
I think that it is the storage's responsibility to ensure it is suitable for a recoverable state. This change is mainly about allowing (not enforcing) decoupling between RecoverableState persistence and the storage client configured in the global service locator.
from crawlee.storages._key_value_store import KeyValueStore  # noqa: PLC0415 avoid circular imports

# Prepare kvs for recoverable state
kvs_client = await FileSystemStorageClient().create_kvs_client(name=f'__RQ_STATE_{metadata.id}')
Wait a minute. This is a subtle bug waiting to happen - the FileSystemStorageClient that creates this RQ client might have a different configuration than the one you create here. Can you pass the actual storage client down to this method instead?
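A minimal sketch of the suggested fix - accept the storage client from the caller instead of instantiating a fresh FileSystemStorageClient (function name and signature are assumptions):

```python
from crawlee.storage_clients import StorageClient


async def _open_state_kvs_client(storage_client: StorageClient, metadata_id: str):
    # Reusing the storage client that created the RQ client guarantees the
    # recoverable state is persisted with the same configuration.
    return await storage_client.create_kvs_client(name=f'__RQ_STATE_{metadata_id}')
```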
Description
This change enables scenarios with mixed storage clients - for example, running on the Apify platform but also using some specialized storage based on a different storage client.
- StorageInstanceManager now also caches based on the storage client used, with client_opener.__qualname__ as part of the cache key. This allows, for example, keeping two storages of the same name if they use different storage clients (see the usage sketch below).
- RecoverableState now has an optional argument key_value_store that allows explicitly setting the desired KVS to be used for persistence. This prevents storage-client-specific state from being persisted to a different storage client just because of the global service locator. For example, when using an SQL-based storage client on the Apify platform, it is desirable to persist the recoverable state to the SQL-storage-based KVS and not to the Apify-based KVS.

Issues
Testing
Added new unit tests.
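An end-to-end usage sketch of the scenario this PR enables; the open() call shape mirrors the diff snippets above, though exact import paths may differ:

```python
import asyncio

from crawlee.configuration import Configuration
from crawlee.storage_clients import FileSystemStorageClient, MemoryStorageClient
from crawlee.storages import KeyValueStore


async def main() -> None:
    config = Configuration()

    # Same name, different storage clients: with per-client caching these are
    # two distinct instances instead of one shared cached storage.
    memory_kvs = await KeyValueStore.open(
        name='state',
        storage_client=MemoryStorageClient(),
        configuration=config,
    )
    fs_kvs = await KeyValueStore.open(
        name='state',
        storage_client=FileSystemStorageClient(),
        configuration=config,
    )
    assert memory_kvs is not fs_kvs


asyncio.run(main())
```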