
Conversation


@Pijukatel Pijukatel commented Sep 2, 2025

Description

This is a collection of closely related changes that are hard to separate from one another. The main purpose is to enable flexible storage use across the code base without unexpected limitations, and to limit unexpected side effects in global services.

Top-level changes:

  • There can be multiple crawlers with different storage clients, configurations, or event managers (previously, this would cause a ServiceConflictError).
  • StorageInstanceManager allows similar but different storage instances to be used at the same time (previously, a similar storage instance could be incorrectly retrieved instead of creating a new storage instance).
  • Differently configured storages can be used at the same time, even storages that use the same StorageClient and differ only in their Configuration.
  • A crawler can no longer cause side effects in the global service_locator (apart from adding new instances to StorageInstanceManager).
  • The global service_locator can be used at the same time as local instances of ServiceLocator (for example, each Crawler has its own ServiceLocator instance, which does not interfere with the global service_locator).
  • Services in a ServiceLocator can be set only once; any attempt to reset them will throw an error. Using the services without setting them is still possible: that sets them to some implicit default and logs warnings, as implicit services can lead to hard-to-predict code. The preferred way is to set services explicitly, either manually or through some helper code, for example through Actor. See related PR

Implementation notes:

  • Storage caching now supports all relevant ways to distinguish storage instances. Apart from generic parameters like name, id, storage_type, and storage_client_type, there is also an additional_cache_key. A StorageClient can use it to define a unique way to distinguish between two similar but different instances. For example, FileSystemStorageClient depends on Configuration.storage_dir, which is therefore included in its custom cache key, while for MemoryStorageClient the storage_dir is not relevant, so it is not. See the example:
    (This additional_cache_key could possibly be used for caching of NDU in feat: Add support for NDU storages #1401)
storage_client = FileSystemStorageClient()
d1 = await Dataset.open(storage_client=storage_client, configuration=Configuration(storage_dir="path1"))
d2 = await Dataset.open(storage_client=storage_client, configuration=Configuration(storage_dir="path2"))
d3 = await Dataset.open(storage_client=storage_client, configuration=Configuration(storage_dir="path1"))

assert d2 is not d1
assert d3 is d1

storage_client_2 = MemoryStorageClient()
d4 = await Dataset.open(storage_client=storage_client_2, configuration=Configuration(storage_dir="path1"))
d5 = await Dataset.open(storage_client=storage_client_2, configuration=Configuration(storage_dir="path2"))
assert d4 is d5
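
The caching behavior above can be pictured as building a composite cache key. The following is a self-contained toy sketch, not the actual crawlee implementation: a plain dict stands in for StorageInstanceManager, and the classes are simplified stand-ins for the real storage clients.

```python
# Toy sketch of configuration-aware storage caching (illustrative only).

class StorageClient:
    def get_additional_cache_key(self, configuration):
        # Base implementation: configuration is irrelevant for caching.
        return None

class MemoryStorageClient(StorageClient):
    pass  # storage_dir does not matter, so the base (None) key is kept

class FileSystemStorageClient(StorageClient):
    def get_additional_cache_key(self, configuration):
        # storage_dir changes where data lives, so it must split the cache.
        return configuration["storage_dir"]

_cache = {}  # stands in for StorageInstanceManager

def open_dataset(*, name=None, storage_client, configuration):
    key = (
        name,
        "dataset",                # storage_type
        type(storage_client),     # storage_client_type
        storage_client.get_additional_cache_key(configuration),
    )
    if key not in _cache:
        _cache[key] = object()    # stands in for a real Dataset instance
    return _cache[key]

fs = FileSystemStorageClient()
d1 = open_dataset(storage_client=fs, configuration={"storage_dir": "path1"})
d2 = open_dataset(storage_client=fs, configuration={"storage_dir": "path2"})
d3 = open_dataset(storage_client=fs, configuration={"storage_dir": "path1"})
assert d2 is not d1 and d3 is d1

mem = MemoryStorageClient()
d4 = open_dataset(storage_client=mem, configuration={"storage_dir": "path1"})
d5 = open_dataset(storage_client=mem, configuration={"storage_dir": "path2"})
assert d4 is d5  # storage_dir is ignored by the memory client
```
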
  • Each crawler will create its own instance of ServiceLocator. It will use either the services explicitly passed to the crawler init (configuration, storage_client, event_manager), or the services from the global service_locator as implicit defaults. This allows multiple differently configured crawlers to work in the same code. For example:
custom_configuration_1 = Configuration()
custom_event_manager_1 = LocalEventManager.from_config(custom_configuration_1)
custom_storage_client_1 = MemoryStorageClient()

custom_configuration_2 = Configuration()
custom_event_manager_2 = LocalEventManager.from_config(custom_configuration_2)
custom_storage_client_2 = MemoryStorageClient()

crawler_1 = BasicCrawler(
    configuration=custom_configuration_1,
    event_manager=custom_event_manager_1,
    storage_client=custom_storage_client_1,
)

crawler_2 = BasicCrawler(
    configuration=custom_configuration_2,
    event_manager=custom_event_manager_2,
    storage_client=custom_storage_client_2,
)

# use crawlers without runtime crash...
  • ServiceLocator is now much stricter about setting services. Previously, it allowed changing services until some service had its _was_retrieved flag set to True, and only then would it throw a runtime error. This led to hard-to-predict code, as the global service_locator could be changed as a side effect from many places. Now the services in a ServiceLocator can be set only once, and the side effects of attempting to change them are limited as much as possible. Such attempts are also accompanied by warning messages to draw attention to code that could cause a RuntimeError.
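
The set-once semantics can be sketched with a minimal stand-alone model. This is not the real crawlee ServiceLocator (ServiceConflictError here is a stand-in for the crawlee exception of the same name, and only the configuration service is shown):

```python
import warnings

# Minimal model of "services can be set only once" (illustrative only).

class ServiceConflictError(RuntimeError):
    """Stand-in for the crawlee exception of the same name."""

class ServiceLocator:
    def __init__(self):
        self._configuration = None

    def set_configuration(self, configuration):
        # Previously a reset was allowed until the service was first
        # retrieved; now any second set fails immediately.
        if self._configuration is not None:
            raise ServiceConflictError('configuration was already set')
        self._configuration = configuration

    def get_configuration(self):
        if self._configuration is None:
            # Implicit default; the real code logs a warning like this one.
            warnings.warn('configuration was not set, using implicit default')
            self._configuration = {'implicit': True}
        return self._configuration

locator = ServiceLocator()
locator.set_configuration({'storage_dir': 'path1'})
assert locator.get_configuration() == {'storage_dir': 'path1'}
try:
    locator.set_configuration({'storage_dir': 'path2'})  # second set: rejected
except ServiceConflictError:
    pass
```
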

Issues

Closes: #1379
Connected to:

Testing

  • New unit tests were added.
  • Tested on the Apify platform together with SDK changes in related PR

@github-actions github-actions bot added this to the 122nd sprint - Tooling team milestone Sep 2, 2025
@github-actions github-actions bot added t-tooling Issues with this label are in the ownership of the tooling team. tested Temporary label used only programmatically for some analytics. labels Sep 2, 2025
@Pijukatel Pijukatel changed the title feat: Rework storage creation and caching, configuration and services feat!: Rework storage creation and caching, configuration and services Sep 10, 2025
@Pijukatel Pijukatel changed the title feat!: Rework storage creation and caching, configuration and services refactor!: Refactor storage creation and caching, configuration and services Sep 10, 2025
@Pijukatel Pijukatel marked this pull request as ready for review September 10, 2025 15:01

janbuchar commented Sep 11, 2025

Can you please expand the PR description with an explanation of how this updated logic works with the Apify SDK? Namely, I'm interested in the way it overrides the global storage client. For instance,

  • If I reconfigure the storage client in a crawler constructor, will that one be preserved after Actor.init?
  • If I reconfigure the storage client in the global service locator, will Actor.init keep it that way? Will it not crash?

Judging from apify/apify-sdk-python#576, it should be fine. But we should make sure that this is covered by tests.


@vdusek vdusek left a comment

A few comments

@@ -36,15 +35,13 @@ async def open(
*,
id: str | None = None,
name: str | None = None,
configuration: Configuration | None = None,
Collaborator

Is this intentional? Doesn't look like.

Collaborator Author

Good catch, thanks. I am actually disappointed that mypy was not complaining about this.

storage_client: StorageClient | None = None,
) -> Storage:
"""Open a storage, either restore existing or create a new one.

Args:
id: The storage ID.
name: The storage name.
configuration: Configuration object used during the storage creation or restoration process.
Collaborator

Is this intentional? Doesn't look like.

Collaborator Author

@Pijukatel Pijukatel Sep 12, 2025

Good catch, thanks.

@@ -0,0 +1,123 @@
from pathlib import Path
Collaborator

Maybe we could use a parametrized fixture to test it across all storages (dataset, kvs, rq).

Collaborator Author

Ok, parametrized where it made sense.

Comment on lines +578 to +579
storage_client=self._service_locator.get_storage_client(),
configuration=self._service_locator.get_configuration(),
Collaborator

okay, so we're passing this from basic crawler so that the storage does not use the value from the global service locator, correct?

Collaborator Author

Exactly. And if there was no custom configuration/storage_client, then the crawler will have the same as the global service_locator has.

@@ -28,6 +30,13 @@ class StorageClient(ABC):
(where applicable), and consistent access patterns across all storage types it supports.
"""

def get_additional_cache_key(self, configuration: Configuration) -> Hashable: # noqa: ARG002
Collaborator

Suggested change
def get_additional_cache_key(self, configuration: Configuration) -> Hashable: # noqa: ARG002
def get_additional_cache_key(self, _: Configuration) -> Hashable:

Collaborator Author

I can't use it here. It can be called as a named argument, and child classes, for example FileSystemStorageClient, will use it.
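
A minimal stand-alone illustration of this point (names follow the PR, the bodies are simplified stand-ins):

```python
# Why `configuration` cannot be renamed to `_`: the method is part of the
# StorageClient interface, callers may pass the argument by keyword, and
# subclasses actually use it.

class StorageClient:
    def get_additional_cache_key(self, configuration):
        # Unused in the base class (hence the noqa in the real code).
        return None

class FileSystemStorageClient(StorageClient):
    def get_additional_cache_key(self, configuration):
        return configuration["storage_dir"]

clients = [StorageClient(), FileSystemStorageClient()]
# A caller using the keyword name would break if the parameter were `_`:
keys = [c.get_additional_cache_key(configuration={"storage_dir": "p"}) for c in clients]
assert keys == [None, "p"]
```
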

configuration=configuration,
client_opener=storage_client.create_dataset_client,
client_opener=client_opener,
storage_client_type=storage_client.__class__,
Collaborator

okay - because we use the storage client class for caching, right?

Collaborator Author

Exactly

@@ -317,7 +319,6 @@ async def export_to(
to_kvs_id: str | None = None,
to_kvs_name: str | None = None,
to_kvs_storage_client: StorageClient | None = None,
to_kvs_configuration: Configuration | None = None,
Collaborator

Why?

Collaborator Author

Good catch. This was deleted by mistake. I reverted it and added test to cover those parameters.



@pytest.fixture
async def rq(
storage_client: StorageClient,
configuration: Configuration,
Collaborator

So the configuration fixture is now being used implicitly even if we don't specify it as an argument? (maybe it is my limited knowledge of pytest)

Collaborator Author

@Pijukatel Pijukatel Sep 12, 2025

Actually, I forgot to delete the configuration fixture, which is not needed at all, because in the global autouse fixture _isolate_test_environment we set monkeypatch.setenv('CRAWLEE_STORAGE_DIR', str(tmp_path)), and this ensures that any implicitly created configuration will point to this temp path, which is unique per test case.
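
The isolation described here can be sketched without pytest. The real setup uses the autouse _isolate_test_environment fixture with monkeypatch and tmp_path; this stand-alone version only shows the resulting effect:

```python
import os
import tempfile

# Each "test" gets a fresh CRAWLEE_STORAGE_DIR, so any implicitly created
# Configuration points at a unique temp directory.

def isolate_test_environment() -> str:
    storage_dir = tempfile.mkdtemp()
    os.environ["CRAWLEE_STORAGE_DIR"] = storage_dir
    return storage_dir

dir_for_test_1 = isolate_test_environment()
dir_for_test_2 = isolate_test_environment()
assert dir_for_test_1 != dir_for_test_2  # unique per "test"
assert os.environ["CRAWLEE_STORAGE_DIR"] == dir_for_test_2
```
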

Comment on lines 26 to 29
if request.param == 'memory':
return MemoryStorageClient()

return FileSystemStorageClient()
storage_client: StorageClient
storage_client = MemoryStorageClient() if request.param == 'memory' else FileSystemStorageClient()
service_locator.set_storage_client(storage_client)
return storage_client
Collaborator

Don't do this as we will have the sql client

Collaborator Author

This is kind of an edit by the linter, not really important. Once the SQL client is merged, we can just expand the argument and add one more elif.

@Pijukatel
Collaborator Author

Can you please expand the PR description with an explanation of how this updated logic works with the Apify SDK? Namely I'm interested in the way it overrides the global storage client. For instance,

Judging from apify/apify-sdk-python#576, it should be fine. But we should make sure that this is covered by tests.

The SDK PR includes the tests that cover the interaction of BasicCrawler and Actor with respect to init and services; the description is also included in the linked PR.

* If I reconfigure the storage client in a crawler constructor, will that one be preserved after `Actor.init`?

No, configuring the crawler with a custom storage_client will not set it in the global service_locator. This side effect was removed.

* If I reconfigure the storage client in the global service locator, will `Actor.init` keep it that way? Will it not crash?

It will work, and Actor.init will take it from the global service locator. But if you set (explicitly or implicitly) a storage client in the global service locator and then try to use a different storage client in Actor, it will raise a ServiceConflictError. See the tests from the linked PR:

async def test_existing_apify_config_respected_by_actor() -> None:
    """Set Apify Configuration in service_locator and verify that Actor respects it."""
    max_used_cpu_ratio = 0.123456  # Some unique value to verify configuration
    apify_config = ApifyConfiguration(max_used_cpu_ratio=max_used_cpu_ratio)
    service_locator.set_configuration(apify_config)
    async with Actor:
        pass

    returned_config = service_locator.get_configuration()
    assert returned_config is apify_config


async def test_existing_apify_config_throws_error_when_set_in_actor() -> None:
    """Test that passing explicit configuration to actor after service locator configuration was already set,
    raises exception."""
    service_locator.set_configuration(ApifyConfiguration())
    with pytest.raises(ServiceConflictError):
        async with Actor(configuration=ApifyConfiguration()):
            pass

@janbuchar
Collaborator

* If I reconfigure the storage client in a crawler constructor, will that one be preserved after `Actor.init`?

No, configuring the crawler with a custom storage_client will not set it in the global service_locator. This side effect was removed.

That means that this:

crawler = BasicCrawler(storage_client=MemoryStorageClient())
# ...
await crawler.run()
dataset = await Dataset.open()
await dataset.export_data()

will behave differently after this PR, correct? While I agree that it is better to be explicit about this, I'm pretty sure that it will surprise someone.

@Pijukatel
Collaborator Author

Pijukatel commented Sep 12, 2025

That means that this:

crawler = BasicCrawler(storage_client=MemoryStorageClient())
# ...
await crawler.run()
dataset = await Dataset.open()
await dataset.export_data()

will behave differently after this PR, correct? While I agree that it is better to be explicit about this, I'm pretty sure that it will surprise someone.

Yes, it can surprise people who are used to the old behavior. But I think this makes more sense, especially since we have public methods for getting storages on BasicCrawler.
Example for dataset (the same applies to RQ and KVS):

# Depends on the global service_locator
default_dataset = await Dataset.open()
# Depends on the crawler service_locator (it can be the same as the previous, but it can be custom)
crawler_default_dataset = await crawler.get_dataset() 

Alternatively, you can set the storage_client globally and not pass it to the crawler: service_locator.set_storage_client(MemoryStorageClient())

@Pijukatel Pijukatel requested a review from vdusek September 12, 2025 12:00
@janbuchar
Collaborator

janbuchar commented Sep 12, 2025

That means that this:

crawler = BasicCrawler(storage_client=MemoryStorageClient())
# ...
await crawler.run()
dataset = await Dataset.open()
await dataset.export_data()

will behave differently after this PR, correct? While I agree that it is better to be explicit about this, I'm pretty sure that it will surprise someone.

Yes, it can surprise people who are used to the old behavior. But I think this makes more sense, especially since we have public methods for getting storages on BasicCrawler. Example for dataset (Same situation for RQ and KVS):

Can we come up with some warning if this happens?

Another concern: if a crawler uses a slightly modified instance of FileSystemStorageClient (e.g., a different path), Actor.init won't replace that. While it's true that this is more predictable, I believe it might worsen the DX... Also, that way the user will have to use the specialized ApifyFilesystemStorageClient explicitly if they want input.json to be handled as expected.

@Pijukatel
Collaborator Author

...
Can we come up with some warning if this happens?

Adding a warning and a test to the SDK PR: apify/apify-sdk-python@5bf51f7

@vdusek vdusek mentioned this pull request Sep 12, 2025
1 task
Successfully merging this pull request may close these issues.

StorageInstanceManager should support caching based on used configuration
3 participants