
feat: Add deduplication to add_batch_of_requests #534


Merged — 60 commits merged into master on Aug 19, 2025

Conversation

@Pijukatel (Contributor) commented Aug 7, 2025

Description

  • Ensure that already known requests are excluded from api_client.batch_add_requests calls to avoid expensive and pointless API calls.
  • Add all new requests to the cache when calling batch_add_requests.
  • Add test with real API usage measurement.
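The first two bullets can be sketched as follows (a minimal sketch with hypothetical names, not the actual client implementation): split incoming requests by `unique_key` against a local cache, send only the new ones to the API, and cache them on success.

```python
def split_new_and_known(requests: list[dict], cache: set[str]) -> tuple[list[dict], list[dict]]:
    """Split requests into (new, already_present) based on unique_key."""
    new, known = [], []
    for request in requests:
        if request["unique_key"] in cache:
            known.append(request)  # known: skip the API call entirely
        else:
            new.append(request)
    return new, known

cache: set[str] = {"http://example.com/1"}  # already added earlier
requests = [
    {"unique_key": "http://example.com/1", "url": "http://example.com/1"},
    {"unique_key": "http://example.com/2", "url": "http://example.com/2"},
]
new, known = split_new_and_known(requests, cache)
# Only `new` would be passed to api_client.batch_add_requests; on success,
# their keys join the cache so later calls skip them too.
cache.update(r["unique_key"] for r in new)
```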

Issues

Testing

  • Added new integration tests to verify reduced API usage.
  • Compared a benchmark actor built on master vs. this PR. The actor is a simple ParselCrawler that crawls all of crawlee.dev, which contains many duplicate links, as the documentation is thoroughly cross-linked. Results:
    • Massive reduction in request queue cost.
    • Significant overall speed-up due to fewer API calls.
[Screenshot: benchmark comparison of request queue usage and cost, master vs. this PR]

@Pijukatel Pijukatel changed the title Add deduplication to add_batch_of_requests and test feat: Add deduplication to add_batch_of_requests and test Aug 7, 2025
@Pijukatel Pijukatel added enhancement New feature or request. t-tooling Issues with this label are in the ownership of the tooling team. labels Aug 7, 2025
@Pijukatel Pijukatel changed the title feat: Add deduplication to add_batch_of_requests and test feat: Add deduplication to add_batch_of_requests Aug 7, 2025
Base automatically changed from new-apify-storage-clients to master August 12, 2025 16:45
@github-actions github-actions bot added this to the 121st sprint - Tooling team milestone Aug 13, 2025
@github-actions github-actions bot added the tested Temporary label used only programmatically for some analytics. label Aug 13, 2025
@Pijukatel Pijukatel requested review from vdusek and Mantisus August 13, 2025 12:48
@Pijukatel Pijukatel marked this pull request as ready for review August 13, 2025 12:48
Comment on lines 145 to 146
await rq.add_request(request)
await rq.add_request(request)
Contributor:

Maybe you could make two distinct Request instances with the same uniqueKey here?

Contributor Author:

Yes, good point. I also added one more test to make it explicit that deduplication works based on unique_key only: unless we use the use_extended_unique_key argument, some attributes of the request may be ignored. That test makes this behavior clearly intentional, to avoid confusion in the future.
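The point about two distinct instances sharing a `unique_key` can be shown with a toy model (a hypothetical `FakeRequest` dataclass, standing in for the real Request class): the second instance's differing attributes are dropped by key-based deduplication.

```python
from dataclasses import dataclass, field

@dataclass
class FakeRequest:
    url: str
    unique_key: str
    user_data: dict = field(default_factory=dict)

# Two distinct instances with the same unique_key: deduplication keeps
# only the first, so the second's differing user_data is ignored (unless
# something like use_extended_unique_key folds it into the key).
seen: set[str] = set()
accepted: list[FakeRequest] = []
for req in (
    FakeRequest("http://example.com", "http://example.com"),
    FakeRequest("http://example.com", "http://example.com", {"label": "B"}),
):
    if req.unique_key not in seen:
        seen.add(req.unique_key)
        accepted.append(req)
```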

await rq.add_requests(requests)

add_requests_workers = [asyncio.create_task(add_requests_worker()) for _ in range(10)]
await asyncio.gather(*add_requests_workers)
Contributor:

I guess you made sure that these do in fact run in parallel? To the naked eye, 100 requests doesn't seem like much, I'd expect that the event loop may run the tasks in sequence.

Maybe you could add the requests in each worker in smaller batches and add some random delays? Or just add a comment saying that you verified parallel execution empirically 😁

Contributor Author:

I wrote the test against an implementation that did not take parallel execution into account, and it failed consistently, so from that perspective I consider the test sufficient.

Anyway, I added some chunking to make the test slightly more challenging. The parallel execution can be verified in the logs, for example below: the add_batch_of_requests call that started first did not finish first, as another worker "took over" during its await.

DEBUG Tried to add new requests: 10, succeeded to add new requests: 10, skipped already present requests: 10
DEBUG Tried to add new requests: 10, succeeded to add new requests: 10, skipped already present requests: 20
DEBUG Tried to add new requests: 10, succeeded to add new requests: 10, skipped already present requests: 90
DEBUG Tried to add new requests: 10, succeeded to add new requests: 10, skipped already present requests: 80
DEBUG Tried to add new requests: 10, succeeded to add new requests: 10, skipped already present requests: 0
DEBUG Tried to add new requests: 10, succeeded to add new requests: 10, skipped already present requests: 40
DEBUG Tried to add new requests: 10, succeeded to add new requests: 10, skipped already present requests: 50
DEBUG Tried to add new requests: 10, succeeded to add new requests: 10, skipped already present requests: 60
DEBUG Tried to add new requests: 10, succeeded to add new requests: 10, skipped already present requests: 30
DEBUG Tried to add new requests: 10, succeeded to add new requests: 10, skipped already present requests: 70
INFO  {'readCount': 0, 'writeCount': 100, 'deleteCount': 0, 'headItemReadCount': 0, 'storageBytes': 7400}
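The interleaving described above can be sketched as follows (a simplified, hypothetical stand-in for the real test — worker count, chunk size, and the in-memory "queue" are assumptions): ten workers each add the same 100 requests in chunks of 10, yielding to the event loop between chunks, and key-based deduplication keeps the total write count at 100 regardless of scheduling order.

```python
import asyncio

async def main() -> tuple[int, int]:
    seen: set[str] = set()
    writes = skipped = 0
    lock = asyncio.Lock()

    async def worker() -> None:
        nonlocal writes, skipped
        urls = [f"http://example.com/{i}" for i in range(100)]
        for start in range(0, len(urls), 10):  # add in chunks of 10
            chunk = urls[start:start + 10]
            await asyncio.sleep(0)  # yield, letting workers interleave
            async with lock:
                for url in chunk:
                    if url in seen:
                        skipped += 1  # deduplicated: no write
                    else:
                        seen.add(url)
                        writes += 1

    await asyncio.gather(*(asyncio.create_task(worker()) for _ in range(10)))
    return writes, skipped

writes, skipped = asyncio.run(main())
```

Ten workers each attempt 100 adds (1000 total), but only 100 writes happen; the remaining 900 are skipped, mirroring the `writeCount: 100` in the log above.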

Comment on lines +235 to +243
with mock.patch(
'apify_client.clients.resource_clients.request_queue.RequestQueueClientAsync.batch_add_requests',
side_effect=return_unprocessed_requests,
):
# Simulate failed API call for adding requests. Request was not processed and should not be cached.
await apify_named_rq.add_requests(['http://example.com/1'])

# This will succeed.
await apify_named_rq.add_requests(['http://example.com/1'])
Contributor:

Any chance we could verify that the request was actually not cached between the two add_requests calls?

Contributor Author:

This is checked implicitly in the last line where it is asserted that there was exactly 1 writeCount difference. The first call is "hardcoded" to fail, even on all retries, so it never even sends the API request and thus has no chance of increasing the writeCount.

The second call can make the write only if it is not cached, as cached requests do not make the call (tested in other tests). So this means the request was not cached in between.

I could assert the state of the cache in between those calls, but since it is kind of an implementation detail, I would prefer not to.
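The writeCount reasoning can be illustrated with a toy model (a hypothetical `FakeQueue`, not the real client): a failed batch add must not populate the cache, otherwise the retry would be silently skipped and the write would never happen.

```python
class FakeQueue:
    def __init__(self) -> None:
        self.cache: set[str] = set()
        self.write_count = 0
        self.fail_next = False  # simulates the mocked, always-failing API call

    def add_request(self, url: str) -> None:
        if url in self.cache:
            return  # cached: no API call, writeCount unchanged
        if self.fail_next:
            self.fail_next = False
            return  # API call failed: the request must NOT be cached
        self.write_count += 1
        self.cache.add(url)

q = FakeQueue()
q.fail_next = True
q.add_request("http://example.com/1")  # fails, stays uncached
q.add_request("http://example.com/1")  # succeeds only because it was not cached
```

If the failing call had cached the request, the second call would be a no-op and the write count would stay at 0; the observed difference of exactly one write is what the integration test asserts.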

Contributor:

Fair enough, can you explain this in a comment then?

Contributor Author:

Yes, added to the test description.


for request in requests:
if self._requests_cache.get(request.id):
# We are no sure if it was already handled at this point, and it is not worth calling API for it.
Contributor:

Did you mean "We are now sure that it was already handled..."? I'm not sure 😁

Contributor Author:

Yes, that was not very clear. Updated.

]

# Send requests to API.
response = await self._api_client.batch_add_requests(requests=requests_dict, forefront=forefront)
Contributor:

It's probably out of the scope of the PR, but it might be worth it to validate the response with a Pydantic model.

Contributor Author:

That already happens in the original code a few lines down: api_response = AddRequestsResponse.model_validate(response)

Contributor:

I'm sorry, I meant validating the whole response object with the two lists, so that you wouldn't need to do response['unprocessedRequests']

Contributor Author:

I see, added.
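The suggestion — validating the whole response object with both lists, so callers never index `response['unprocessedRequests']` directly — can be sketched without depending on a particular Pydantic version (hypothetical names; the real code uses AddRequestsResponse.model_validate):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AddRequestsResponseSketch:
    """Typed view over the raw batch_add_requests response dict."""

    processed_requests: list[dict]
    unprocessed_requests: list[dict]

    @classmethod
    def from_api(cls, response: dict) -> "AddRequestsResponseSketch":
        # Convert the camelCase API payload into attribute access,
        # so the rest of the code never touches raw dict keys.
        return cls(
            processed_requests=list(response.get("processedRequests", [])),
            unprocessed_requests=list(response.get("unprocessedRequests", [])),
        )

resp = AddRequestsResponseSketch.from_api(
    {"processedRequests": [{"uniqueKey": "a"}], "unprocessedRequests": []}
)
```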

@Pijukatel Pijukatel requested a review from janbuchar August 15, 2025 09:43
already_present_requests: list[ProcessedRequest] = []

for request in requests:
if self._requests_cache.get(request.id):
Contributor:

Judging by apify/crawlee#3120, a day may come when we try to limit the size of _requests_cache somehow. Perhaps we should think ahead and come up with a more space-efficient way of tracking already added requests?

EDIT: hollup a minute, do you use the ID here for deduplication instead of unique key?

Contributor Author (@Pijukatel, Aug 15, 2025):

Since there is the deterministic transformation function unique_key_to_request_id, which respects the Apify platform's way of creating IDs, this seems OK. If someone starts creating Requests with a custom id, deduplication will most likely stop working.
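The determinism argument can be illustrated with a sketch in the spirit of unique_key_to_request_id (the exact platform hashing and encoding details here are assumptions, not the verified algorithm): because the same unique_key always maps to the same id, an id-keyed cache deduplicates exactly as a unique_key-keyed one would.

```python
import base64
import hashlib

def unique_key_to_request_id_sketch(unique_key: str, length: int = 15) -> str:
    """Deterministically derive a short, alphanumeric id from a unique_key."""
    digest = hashlib.sha256(unique_key.encode("utf-8")).digest()
    encoded = base64.b64encode(digest).decode("ascii")
    # Keep only alphanumeric characters and truncate to a fixed length.
    cleaned = "".join(ch for ch in encoded if ch.isalnum())
    return cleaned[:length]
```

The mapping breaks down only if Requests are created with a custom id that bypasses this derivation — which is exactly the caveat raised above.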

There are two issues I created based on the discussion about this:

@Pijukatel Pijukatel requested a review from janbuchar August 15, 2025 12:39
Collaborator @Mantisus left a comment:

LGTM

Contributor @vdusek left a comment:

Good job! LGTM (let's wait for Honza's approval as well)

@Pijukatel Pijukatel merged commit dd03c4d into master Aug 19, 2025
23 checks passed
@Pijukatel Pijukatel deleted the add-deduplication branch August 19, 2025 11:56
Labels
enhancement New feature or request. t-tooling Issues with this label are in the ownership of the tooling team. tested Temporary label used only programmatically for some analytics.
Development

Successfully merging this pull request may close these issues.

Ensure that duplicate links are handled in a cost effective way when using Apify RequestQueue
4 participants