refactor!: Refactor storage creation and caching, configuration and services #1386
base: master
Conversation
Rework of `service_locator` implicit setting of services, storages, and storage creation.
Can you please expand the PR description with an explanation of how this updated logic works with the Apify SDK? Namely, I'm interested in the way it overrides the global storage client. For instance,
Judging from apify/apify-sdk-python#576, it should be fine. But we should make sure that this is covered by tests.
A few comments
```diff
@@ -36,15 +35,13 @@ async def open(
     *,
     id: str | None = None,
     name: str | None = None,
     configuration: Configuration | None = None,
```
Is this intentional? It doesn't look like it.
Good catch, thanks. I am actually disappointed that mypy was not complaining about this.
```python
    storage_client: StorageClient | None = None,
) -> Storage:
    """Open a storage, either restore existing or create a new one.

    Args:
        id: The storage ID.
        name: The storage name.
        configuration: Configuration object used during the storage creation or restoration process.
```
Is this intentional? It doesn't look like it.
Good catch, thanks.
```diff
@@ -0,0 +1,123 @@
+from pathlib import Path
```
Maybe we could use a parametrized fixture to test it across all storages (dataset, kvs, rq).
Ok, parametrized where it made sense.
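A rough sketch of what such a parametrized fixture could look like (the storage classes are real crawlee imports; the fixture and test shapes are assumptions, not the code from this PR):

```python
import pytest

from crawlee.storages import Dataset, KeyValueStore, RequestQueue


@pytest.fixture(params=[Dataset, KeyValueStore, RequestQueue])
def storage_type(request: pytest.FixtureRequest) -> type:
    # Each test using this fixture runs once per storage class (dataset, kvs, rq).
    return request.param


async def test_open_by_name(storage_type: type) -> None:
    storage = await storage_type.open(name='my-storage')
    assert storage.name == 'my-storage'
```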
```python
storage_client=self._service_locator.get_storage_client(),
configuration=self._service_locator.get_configuration(),
```
Okay, so we're passing this from `BasicCrawler` so that the storage does not use the value from the global service locator, correct?
Exactly. And if there was no custom configuration/storage_client, then the crawler will use the same ones as the global service_locator.
```diff
@@ -28,6 +30,13 @@ class StorageClient(ABC):
     (where applicable), and consistent access patterns across all storage types it supports.
     """

+    def get_additional_cache_key(self, configuration: Configuration) -> Hashable:  # noqa: ARG002
```
Suggested change:

```diff
-    def get_additional_cache_key(self, configuration: Configuration) -> Hashable:  # noqa: ARG002
+    def get_additional_cache_key(self, _: Configuration) -> Hashable:
```
I can't use that here. It can be called as a named argument, and child classes, for example `FileSystemStorageClient`, will use it.
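For illustration, a sketch of the kind of override meant here, based on the cache-key description in this PR (the real `FileSystemStorageClient` body may differ; other abstract methods are omitted):

```python
from collections.abc import Hashable


class FileSystemStorageClient(StorageClient):
    def get_additional_cache_key(self, configuration: Configuration) -> Hashable:
        # Filesystem-backed storages live under configuration.storage_dir, so
        # clients pointing at different directories must not share cached instances.
        return configuration.storage_dir
```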
```diff
     configuration=configuration,
-    client_opener=storage_client.create_dataset_client,
+    client_opener=client_opener,
+    storage_client_type=storage_client.__class__,
```
Okay, because we use the storage client class for caching, right?
Exactly
src/crawlee/storages/_dataset.py (Outdated)
```diff
@@ -317,7 +319,6 @@ async def export_to(
     to_kvs_id: str | None = None,
     to_kvs_name: str | None = None,
     to_kvs_storage_client: StorageClient | None = None,
-    to_kvs_configuration: Configuration | None = None,
```
Why?
Good catch. This was deleted by mistake. I reverted it and added a test to cover those parameters.
```python
@pytest.fixture
async def rq(
    storage_client: StorageClient,
    configuration: Configuration,
```
So the `configuration` fixture is now being used implicitly even if we don't specify it as an argument? (Maybe it is my limited knowledge of pytest.)
Actually, I forgot to delete the `configuration` fixture, which is not needed at all, because the global autouse fixture `_isolate_test_environment` sets `monkeypatch.setenv('CRAWLEE_STORAGE_DIR', str(tmp_path))`, and this ensures that any implicitly created configuration will point to this temp path, which is unique per test case.
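A minimal sketch of such an autouse fixture (the fixture name and the environment variable come from the comment above; the exact body in the repo may differ):

```python
from pathlib import Path

import pytest


@pytest.fixture(autouse=True)
def _isolate_test_environment(monkeypatch: pytest.MonkeyPatch, tmp_path: Path) -> None:
    # Any implicitly created Configuration resolves its storage dir from this
    # env var, so every test gets its own unique temporary storage directory.
    monkeypatch.setenv('CRAWLEE_STORAGE_DIR', str(tmp_path))
```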
```diff
-    if request.param == 'memory':
-        return MemoryStorageClient()
-
-    return FileSystemStorageClient()
+    storage_client: StorageClient
+    storage_client = MemoryStorageClient() if request.param == 'memory' else FileSystemStorageClient()
+    service_locator.set_storage_client(storage_client)
+    return storage_client
```
Don't do this, as we will have the `sql` client.
This is kind of an edit by the linter, not really important. Once the SQL client is merged, we can just expand the argument and add one more `elif`.
The SDK PR includes the tests that cover the interaction of `BasicCrawler` and `Actor` with respect to init and services; the description is also included in the linked PR.
No, configuring the crawler with a custom `storage_client` will not set it in the global `service_locator`. This side effect was removed.
It will work, and `Actor.init` will take it from the global service locator. But if you set (explicitly or implicitly) a storage client in the global service locator and then try to use a different storage client in `Actor`, it will raise `ServiceConflictError`. See the tests in the linked PR.
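A hedged sketch of that set-once behavior (assuming `ServiceConflictError` is importable from `crawlee.errors`; the storage client names are real crawlee classes):

```python
from crawlee import service_locator
from crawlee.errors import ServiceConflictError
from crawlee.storage_clients import FileSystemStorageClient, MemoryStorageClient

service_locator.set_storage_client(MemoryStorageClient())  # first set: fine

try:
    # Any later attempt to replace an already set service now fails.
    service_locator.set_storage_client(FileSystemStorageClient())
except ServiceConflictError:
    print('The storage client was already set.')
```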
That means that this:

```python
crawler = BasicCrawler(storage_client=MemoryStorageClient())
# ...
await crawler.run()
dataset = await Dataset.open()
await dataset.export_data()
```

will behave differently after this PR, correct? While I agree that it is better to be explicit about this, I'm pretty sure that it will surprise someone.
Yes, it can surprise people who are used to the old behavior. But I think this makes more sense, especially since we have public methods for getting storages on `BasicCrawler`.
Alternatively, you can set the `storage_client` globally and not pass it to the crawler.
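For illustration, the explicit pattern suggested above might look like this (a sketch; treat `get_dataset` and the exact calls as assumptions based on the discussion):

```python
from crawlee.crawlers import BasicCrawler
from crawlee.storage_clients import MemoryStorageClient

crawler = BasicCrawler(storage_client=MemoryStorageClient())
# ... register handlers, then:
await crawler.run(['https://example.com'])

# Ask the crawler for its own dataset instead of opening the global one.
dataset = await crawler.get_dataset()
print(await dataset.get_data())
```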
Can we come up with some warning if this happens? Another concern: what if a crawler uses a slightly modified instance of
Adding a warning and a test to the SDK PR: apify/apify-sdk-python@5bf51f7
Description
This is a collection of closely related changes that are hard to separate from one another. The main purpose is to enable flexible storage use across the code base without unexpected limitations, and to limit unexpected side effects in global services.
Top-level changes:
- Added `ServiceConflictError`, raised when code attempts to change a service that has already been set.
- `StorageInstanceManager` allows similar but different storage instances to be used at the same time (previously, a similar storage instance could be incorrectly retrieved instead of creating a new storage instance). Such instances use the same `StorageClient` and are different only by using a different `Configuration`.
- `Crawler` can no longer cause side effects in the global `service_locator` (apart from adding new instances to `StorageInstanceManager`).
- The global `service_locator` can be used at the same time as local instances of `ServiceLocator` (for example, each `Crawler` has its own `ServiceLocator` instance, which does not interfere with the global `service_locator`).
- Services in `ServiceLocator` can be set only once. Any attempt to reset them will throw an error. Not setting the services and using them is possible; that will set the services in `ServiceLocator` to some implicit default, and it will log warnings, as implicit services can lead to hard-to-predict code. The preferred way is to set services explicitly, either manually or through some helper code, for example through `Actor`. See the related PR.
Implementation notes:
- Storage caching now takes into account `name`, `id`, `storage_type`, and `storage_client_type`, and there is also an `additional_cache_key`. This can be used by the `StorageClient` to define a unique way to distinguish between two similar but different instances. For example, `FileSystemStorageClient` depends on `Configuration.storage_dir`, which is included in the custom cache key for `FileSystemStorageClient`, but this is not true for `MemoryStorageClient`, as the `storage_dir` is not relevant for it; see the sketch below. (This `additional_cache_key` could possibly be used for caching of NDU in feat: Add support for NDU storages #1401.)
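A hedged sketch of the distinction described above (the `open` signature matches this PR's diff; the `Configuration(storage_dir=...)` kwargs are illustrative, and the calls would run inside an async function):

```python
from crawlee.configuration import Configuration
from crawlee.storage_clients import FileSystemStorageClient
from crawlee.storages import Dataset

config_a = Configuration(storage_dir='./storage_a')
config_b = Configuration(storage_dir='./storage_b')
client = FileSystemStorageClient()

# Same name and storage client, but different storage_dir: the client's
# additional cache key differs, so two distinct instances are created
# instead of one cached instance being incorrectly reused.
dataset_a = await Dataset.open(name='items', configuration=config_a, storage_client=client)
dataset_b = await Dataset.open(name='items', configuration=config_b, storage_client=client)
assert dataset_a is not dataset_b
```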
- Each crawler has its own `ServiceLocator`. It will use either the services explicitly passed to the crawler init (configuration, storage client, event manager) or the services from the global `service_locator` as implicit defaults. This allows multiple differently configured crawlers to work in the same code; for example, see the sketch below.
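A sketch of what that enables (the classes are real crawlee names; the scenario is illustrative):

```python
from crawlee.crawlers import BasicCrawler
from crawlee.storage_clients import FileSystemStorageClient, MemoryStorageClient

# Two differently configured crawlers can now coexist in one script; neither
# overrides the global service_locator as a side effect.
fast_crawler = BasicCrawler(storage_client=MemoryStorageClient())
persistent_crawler = BasicCrawler(storage_client=FileSystemStorageClient())
```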
- `ServiceLocator` is now way more strict when it comes to setting the services. Previously, it allowed changing services until some service had its `_was_retrieved` flag set to `True`; then it would throw a runtime error. This led to hard-to-predict code, as the global `service_locator` could be changed as a side effect from many places. Now the services in `ServiceLocator` can be set only once, and the side effects of attempting to change the services are limited as much as possible. Such side effects are also accompanied by warning messages to draw attention to code that could cause a `RuntimeError`.
Issues
Closes: #1379
Connected to:
- (`StorageInstanceManager`)
- (`StorageInstanceManager` and storage clients/configuration related changes in `service_locator`)
Testing
Tested on the Apify platform together with SDK changes in the related PR.