
feat: Add deduplication to add_batch_of_requests #534


Merged — 60 commits merged into master on Aug 19, 2025

Conversation

@Pijukatel (Contributor) commented Aug 7, 2025

Description

  • Ensure that already known requests are excluded from api_client.batch_add_requests calls to avoid expensive and pointless API calls.
  • Add all new requests to the cache when calling batch_add_requests.
  • Add test with real API usage measurement.
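The first two bullets can be sketched as follows (a minimal sketch with hypothetical names, not the actual client implementation): split incoming requests by `unique_key` against a local cache, send only the new ones to the API, and cache them on success.

```python
def split_new_and_known(requests: list[dict], cache: set[str]) -> tuple[list[dict], list[dict]]:
    """Split requests into (new, already_present) based on unique_key."""
    new, known = [], []
    for request in requests:
        if request["unique_key"] in cache:
            known.append(request)  # known: skip the API call entirely
        else:
            new.append(request)
    return new, known

cache: set[str] = {"http://example.com/1"}  # already added earlier
requests = [
    {"unique_key": "http://example.com/1", "url": "http://example.com/1"},
    {"unique_key": "http://example.com/2", "url": "http://example.com/2"},
]
new, known = split_new_and_known(requests, cache)
# Only `new` would be passed to api_client.batch_add_requests; on success,
# their keys join the cache so later calls skip them too.
cache.update(r["unique_key"] for r in new)
```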

Issues

Testing

  • Added new integration tests to verify reduced API usage.
  • Compared a benchmark actor built on master vs. this PR. The actor is a simple ParselCrawler that crawls all of crawlee.dev, which contains many duplicate links, as the documentation is thoroughly cross-linked. Results:
    • Massive reduction in request queue cost.
    • Significant overall speed-up due to fewer API calls.
[Screenshot: benchmark comparison of request queue usage and cost, master vs. this PR]

@Pijukatel Pijukatel changed the title Add deduplication to add_batch_of_requests and test feat: Add deduplication to add_batch_of_requests and test Aug 7, 2025
@Pijukatel Pijukatel added enhancement New feature or request. t-tooling Issues with this label are in the ownership of the tooling team. labels Aug 7, 2025
@Pijukatel Pijukatel changed the title feat: Add deduplication to add_batch_of_requests and test feat: Add deduplication to add_batch_of_requests Aug 7, 2025
Base automatically changed from new-apify-storage-clients to master August 12, 2025 16:45
@github-actions github-actions bot added this to the 121st sprint - Tooling team milestone Aug 13, 2025
@github-actions github-actions bot added the tested Temporary label used only programmatically for some analytics. label Aug 13, 2025
@Pijukatel Pijukatel requested review from vdusek and Mantisus August 13, 2025 12:48
@Pijukatel Pijukatel marked this pull request as ready for review August 13, 2025 12:48
Comment on lines 145 to 146
await rq.add_request(request)
await rq.add_request(request)
Contributor:

Maybe you could make two distinct Request instances with the same uniqueKey here?

Contributor Author:

Yes, good point. I also added one more test to make it explicit that deduplication works based on unique_key only: unless we use the use_extended_unique_key argument, some attributes of the request may be ignored. That test makes this behavior clearly intentional, to avoid confusion in the future.
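The point about two distinct instances sharing a `unique_key` can be shown with a toy model (a hypothetical `FakeRequest` dataclass, standing in for the real Request class): the second instance's differing attributes are dropped by key-based deduplication.

```python
from dataclasses import dataclass, field

@dataclass
class FakeRequest:
    url: str
    unique_key: str
    user_data: dict = field(default_factory=dict)

# Two distinct instances with the same unique_key: deduplication keeps
# only the first, so the second's differing user_data is ignored (unless
# something like use_extended_unique_key folds it into the key).
seen: set[str] = set()
accepted: list[FakeRequest] = []
for req in (
    FakeRequest("http://example.com", "http://example.com"),
    FakeRequest("http://example.com", "http://example.com", {"label": "B"}),
):
    if req.unique_key not in seen:
        seen.add(req.unique_key)
        accepted.append(req)
```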

await rq.add_requests(requests)

add_requests_workers = [asyncio.create_task(add_requests_worker()) for _ in range(10)]
await asyncio.gather(*add_requests_workers)
Contributor:

I guess you made sure that these do in fact run in parallel? To the naked eye, 100 requests doesn't seem like much, I'd expect that the event loop may run the tasks in sequence.

Maybe you could add the requests in each worker in smaller batches and add some random delays? Or just add a comment saying that you verified parallel execution empirically 😁

Contributor Author:

I wrote the test against an implementation that did not take parallel execution into account, and it failed consistently, so from that perspective I consider the test sufficient.

Anyway, I added some chunking to make the test slightly more challenging. The parallel execution can be verified in the logs, for example below: the add_batch_of_requests call that started first did not finish first, as another worker "took over" during its await.

DEBUG Tried to add new requests: 10, succeeded to add new requests: 10, skipped already present requests: 10
DEBUG Tried to add new requests: 10, succeeded to add new requests: 10, skipped already present requests: 20
DEBUG Tried to add new requests: 10, succeeded to add new requests: 10, skipped already present requests: 90
DEBUG Tried to add new requests: 10, succeeded to add new requests: 10, skipped already present requests: 80
DEBUG Tried to add new requests: 10, succeeded to add new requests: 10, skipped already present requests: 0
DEBUG Tried to add new requests: 10, succeeded to add new requests: 10, skipped already present requests: 40
DEBUG Tried to add new requests: 10, succeeded to add new requests: 10, skipped already present requests: 50
DEBUG Tried to add new requests: 10, succeeded to add new requests: 10, skipped already present requests: 60
DEBUG Tried to add new requests: 10, succeeded to add new requests: 10, skipped already present requests: 30
DEBUG Tried to add new requests: 10, succeeded to add new requests: 10, skipped already present requests: 70
INFO  {'readCount': 0, 'writeCount': 100, 'deleteCount': 0, 'headItemReadCount': 0, 'storageBytes': 7400}
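The interleaving described above can be sketched as follows (a simplified, hypothetical stand-in for the real test — worker count, chunk size, and the in-memory "queue" are assumptions): ten workers each add the same 100 requests in chunks of 10, yielding to the event loop between chunks, and key-based deduplication keeps the total write count at 100 regardless of scheduling order.

```python
import asyncio

async def main() -> tuple[int, int]:
    seen: set[str] = set()
    writes = skipped = 0
    lock = asyncio.Lock()

    async def worker() -> None:
        nonlocal writes, skipped
        urls = [f"http://example.com/{i}" for i in range(100)]
        for start in range(0, len(urls), 10):  # add in chunks of 10
            chunk = urls[start:start + 10]
            await asyncio.sleep(0)  # yield, letting workers interleave
            async with lock:
                for url in chunk:
                    if url in seen:
                        skipped += 1  # deduplicated: no write
                    else:
                        seen.add(url)
                        writes += 1

    await asyncio.gather(*(asyncio.create_task(worker()) for _ in range(10)))
    return writes, skipped

writes, skipped = asyncio.run(main())
```

Ten workers each attempt 100 adds (1000 total), but only 100 writes happen; the remaining 900 are skipped, mirroring the `writeCount: 100` in the log above.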

Comment on lines +235 to +243
with mock.patch(
'apify_client.clients.resource_clients.request_queue.RequestQueueClientAsync.batch_add_requests',
side_effect=return_unprocessed_requests,
):
# Simulate failed API call for adding requests. Request was not processed and should not be cached.
await apify_named_rq.add_requests(['http://example.com/1'])

# This will succeed.
await apify_named_rq.add_requests(['http://example.com/1'])
Contributor:

Any chance we could verify that the request was actually not cached between the two add_requests calls?

Contributor Author:

This is checked implicitly in the last line where it is asserted that there was exactly 1 writeCount difference. The first call is "hardcoded" to fail, even on all retries, so it never even sends the API request and thus has no chance of increasing the writeCount.

The second call can make the write only if it is not cached, as cached requests do not make the call (tested in other tests). So this means the request was not cached in between.

I could assert the state of the cache in between those calls, but since it is kind of an implementation detail, I would prefer not to.
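The writeCount reasoning can be illustrated with a toy model (a hypothetical `FakeQueue`, not the real client): a failed batch add must not populate the cache, otherwise the retry would be silently skipped and the write would never happen.

```python
class FakeQueue:
    def __init__(self) -> None:
        self.cache: set[str] = set()
        self.write_count = 0
        self.fail_next = False  # simulates the mocked, always-failing API call

    def add_request(self, url: str) -> None:
        if url in self.cache:
            return  # cached: no API call, writeCount unchanged
        if self.fail_next:
            self.fail_next = False
            return  # API call failed: the request must NOT be cached
        self.write_count += 1
        self.cache.add(url)

q = FakeQueue()
q.fail_next = True
q.add_request("http://example.com/1")  # fails, stays uncached
q.add_request("http://example.com/1")  # succeeds only because it was not cached
```

If the failing call had cached the request, the second call would be a no-op and the write count would stay at 0; the observed difference of exactly one write is what the integration test asserts.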

Contributor:

Fair enough, can you explain this in a comment then?

Contributor Author:

Yes, added to the test description.


for request in requests:
if self._requests_cache.get(request.id):
# We are no sure if it was already handled at this point, and it is not worth calling API for it.
Contributor:

Did you mean "We are now sure that it was already handled..."? I'm not sure 😁

Contributor Author:

Yes, that was not very clear. Updated.

]

# Send requests to API.
response = await self._api_client.batch_add_requests(requests=requests_dict, forefront=forefront)
Contributor:

It's probably out of the scope of the PR, but it might be worth it to validate the response with a Pydantic model.

Contributor Author:

That already happens in the original code a few lines down: api_response = AddRequestsResponse.model_validate(response)

Contributor:

I'm sorry, I meant validating the whole response object with the two lists, so that you wouldn't need to do response['unprocessedRequests']

Contributor Author:

I see, added.
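The suggestion — validating the whole response object with both lists, so callers never index `response['unprocessedRequests']` directly — can be sketched without depending on a particular Pydantic version (hypothetical names; the real code uses AddRequestsResponse.model_validate):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AddRequestsResponseSketch:
    """Typed view over the raw batch_add_requests response dict."""

    processed_requests: list[dict]
    unprocessed_requests: list[dict]

    @classmethod
    def from_api(cls, response: dict) -> "AddRequestsResponseSketch":
        # Convert the camelCase API payload into attribute access,
        # so the rest of the code never touches raw dict keys.
        return cls(
            processed_requests=list(response.get("processedRequests", [])),
            unprocessed_requests=list(response.get("unprocessedRequests", [])),
        )

resp = AddRequestsResponseSketch.from_api(
    {"processedRequests": [{"uniqueKey": "a"}], "unprocessedRequests": []}
)
```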

@Pijukatel Pijukatel requested a review from janbuchar August 15, 2025 09:43
already_present_requests: list[ProcessedRequest] = []

for request in requests:
if self._requests_cache.get(request.id):
Contributor:

Judging by apify/crawlee#3120, a day may come when we try to limit the size of _requests_cache somehow. Perhaps we should think ahead and come up with a more space-efficient way of tracking already added requests?

EDIT: hollup a minute, do you use the ID here for deduplication instead of unique key?

Contributor Author (@Pijukatel, Aug 15, 2025):

Since there is the deterministic transformation function unique_key_to_request_id, which respects the Apify platform's way of creating IDs, this seems OK. If someone starts creating Requests with a custom id, deduplication will most likely stop working.
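The determinism argument can be illustrated with a sketch in the spirit of unique_key_to_request_id (the exact platform hashing and encoding details here are assumptions, not the verified algorithm): because the same unique_key always maps to the same id, an id-keyed cache deduplicates exactly as a unique_key-keyed one would.

```python
import base64
import hashlib

def unique_key_to_request_id_sketch(unique_key: str, length: int = 15) -> str:
    """Deterministically derive a short, alphanumeric id from a unique_key."""
    digest = hashlib.sha256(unique_key.encode("utf-8")).digest()
    encoded = base64.b64encode(digest).decode("ascii")
    # Keep only alphanumeric characters and truncate to a fixed length.
    cleaned = "".join(ch for ch in encoded if ch.isalnum())
    return cleaned[:length]
```

The mapping breaks down only if Requests are created with a custom id that bypasses this derivation — which is exactly the caveat raised above.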

There are two issues I created based on the discussion about this:

@Pijukatel Pijukatel requested a review from janbuchar August 15, 2025 12:39
Collaborator @Mantisus left a comment:

LGTM

Contributor @vdusek left a comment:

Good job! LGTM (let's wait for Honza's approval as well)

@Pijukatel Pijukatel merged commit dd03c4d into master Aug 19, 2025
23 checks passed
@Pijukatel Pijukatel deleted the add-deduplication branch August 19, 2025 11:56
Labels
enhancement New feature or request. t-tooling Issues with this label are in the ownership of the tooling team. tested Temporary label used only programmatically for some analytics.
Development

Successfully merging this pull request may close these issues.

Ensure that duplicate links are handled in a cost effective way when using Apify RequestQueue
4 participants