Conversation

Collaborator

@Mantisus commented on Sep 3, 2025

Description

  • Updates the deduplication logic used in add_batch_of_requests to improve overall request queue performance; a sketch of the idea follows below.

Issues

@Mantisus requested review from Pijukatel and vdusek on Sep 3, 2025 04:27
@Mantisus self-assigned this on Sep 3, 2025
Collaborator Author

@Mantisus commented on Sep 3, 2025

MemoryStorageClient before

┌───────────────────────────────┬─────────────┐
│ requests_finished             │ 2363        │
│ requests_failed               │ 0           │
│ retry_histogram               │ [2363]      │
│ request_avg_failed_duration   │ None        │
│ request_avg_finished_duration │ 505.1ms     │
│ requests_finished_per_minute  │ 2248        │
│ requests_failed_per_minute    │ 0           │
│ request_total_duration        │ 19min 53.4s │
│ requests_total                │ 2363        │
│ crawler_runtime               │ 1min 3.1s   │
└───────────────────────────────┴─────────────┘

After

┌───────────────────────────────┬───────────┐
│ requests_finished             │ 2363      │
│ requests_failed               │ 0         │
│ retry_histogram               │ [2363]    │
│ request_avg_failed_duration   │ None      │
│ request_avg_finished_duration │ 206.6ms   │
│ requests_finished_per_minute  │ 4697      │
│ requests_failed_per_minute    │ 0         │
│ request_total_duration        │ 8min 8.1s │
│ requests_total                │ 2363      │
│ crawler_runtime               │ 30.19s    │
└───────────────────────────────┴───────────┘

FileSystemStorageClient before

┌───────────────────────────────┬──────────────────┐
│ requests_finished             │ 4512             │
│ requests_failed               │ 0                │
│ retry_histogram               │ [4511, 1]        │
│ request_avg_failed_duration   │ None             │
│ request_avg_finished_duration │ 2min 9.7s        │
│ requests_finished_per_minute  │ 81               │
│ requests_failed_per_minute    │ 0                │
│ request_total_duration        │ 162h 33min 11.4s │
│ requests_total                │ 4512             │
│ crawler_runtime               │ 55min 27.1s      │
└───────────────────────────────┴──────────────────┘

After

┌───────────────────────────────┬─────────────┐
│ requests_finished             │ 4512        │
│ requests_failed               │ 0           │
│ retry_histogram               │ [4512]      │
│ request_avg_failed_duration   │ None        │
│ request_avg_finished_duration │ 463.6ms     │
│ requests_finished_per_minute  │ 2633        │
│ requests_failed_per_minute    │ 0           │
│ request_total_duration        │ 34min 51.5s │
│ requests_total                │ 4512        │
│ crawler_runtime               │ 1min 42.8s  │

@vdusek requested review from janbuchar and removed request for Pijukatel on Sep 3, 2025 08:55
Collaborator

@vdusek left a comment

This code (the same we ran on the platform):

import asyncio

from crawlee.crawlers import ParselCrawler, ParselCrawlingContext
from crawlee.http_clients import HttpxHttpClient
from crawlee.storage_clients import FileSystemStorageClient


async def main() -> None:
    storage_client = FileSystemStorageClient()
    http_client = HttpxHttpClient()

    crawler = ParselCrawler(
        storage_client=storage_client,
        http_client=http_client,
    )

    @crawler.router.default_handler
    async def request_handler(context: ParselCrawlingContext) -> None:
        context.log.info(f'Processing URL: {context.request.url}...')
        data = {
            'url': context.request.url,
            'title': context.selector.css('title::text').get(),
        }
        await context.push_data(data)
        await context.enqueue_links()

    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())

And here are the statistics:

┌───────────────────────────────┬────────────┐
│ requests_finished             │ 2363       │
│ requests_failed               │ 0          │
│ retry_histogram               │ [2363]     │
│ request_avg_failed_duration   │ None       │
│ request_avg_finished_duration │ 214.1ms    │
│ requests_finished_per_minute  │ 1598       │
│ requests_failed_per_minute    │ 0          │
│ request_total_duration        │ 8min 26.0s │
│ requests_total                │ 2363       │
│ crawler_runtime               │ 1min 28.7s │
└───────────────────────────────┴────────────┘

So the results are similar to what we observed on the platform - good job.

Collaborator

@vdusek left a comment

And the memory:

import asyncio

from crawlee.crawlers import ParselCrawler, ParselCrawlingContext
from crawlee.http_clients import HttpxHttpClient
from crawlee.storage_clients import MemoryStorageClient


async def main() -> None:
    storage_client = MemoryStorageClient()
    http_client = HttpxHttpClient()

    crawler = ParselCrawler(
        storage_client=storage_client,
        http_client=http_client,
    )

    @crawler.router.default_handler
    async def request_handler(context: ParselCrawlingContext) -> None:
        context.log.info(f'Processing URL: {context.request.url}...')
        data = {
            'url': context.request.url,
            'title': context.selector.css('title::text').get(),
        }
        await context.push_data(data)
        await context.enqueue_links()

    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())

Logs:

┌───────────────────────────────┬────────────┐
│ requests_finished             │ 2363       │
│ requests_failed               │ 0          │
│ retry_histogram               │ [2363]     │
│ request_avg_failed_duration   │ None       │
│ request_avg_finished_duration │ 168.4ms    │
│ requests_finished_per_minute  │ 1611       │
│ requests_failed_per_minute    │ 0          │
│ request_total_duration        │ 6min 38.0s │
│ requests_total                │ 2363       │
│ crawler_runtime               │ 1min 28.0s │
└───────────────────────────────┴────────────┘

It's much better, of course, but how did you manage to run it in around 30 secs?

Collaborator Author

@Mantisus commented on Sep 3, 2025

It's much better, of course, but how did you manage to run it in around 30 secs?

I ran it with the desired concurrency set to 20:

import asyncio

from crawlee import ConcurrencySettings
from crawlee.crawlers import ParselCrawler, ParselCrawlingContext
from crawlee.http_clients import HttpxHttpClient
from crawlee.storage_clients import MemoryStorageClient


async def main() -> None:
    storage_client = MemoryStorageClient()
    http_client = HttpxHttpClient()

    crawler = ParselCrawler(
        storage_client=storage_client,
        http_client=http_client,
        concurrency_settings=ConcurrencySettings(desired_concurrency=20),
    )

    @crawler.router.default_handler
    async def request_handler(context: ParselCrawlingContext) -> None:
        context.log.info(f'Processing URL: {context.request.url}...')
        data = {
            'url': context.request.url,
            'title': context.selector.css('title::text').get(),
        }
        await context.push_data(data)
        await context.enqueue_links()

    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())

[ParselCrawler] INFO  Final request statistics:
┌───────────────────────────────┬────────────┐
│ requests_finished             │ 2363       │
│ requests_failed               │ 0          │
│ retry_histogram               │ [2363]     │
│ request_avg_failed_duration   │ None       │
│ request_avg_finished_duration │ 220.0ms    │
│ requests_finished_per_minute  │ 4497       │
│ requests_failed_per_minute    │ 0          │
│ request_total_duration        │ 8min 39.9s │
│ requests_total                │ 2363       │
│ crawler_runtime               │ 31.53s     │
└───────────────────────────────┴────────────┘

Collaborator

@vdusek left a comment

One small note: the add_batch_of_requests method is quite long and deeply nested. It would be good to break it into smaller private helper functions later. Also, it seems that the current performance bottleneck comes from the default settings; we might want to investigate whether more suitable defaults could be chosen (for today's average machine?). Otherwise, good job!

Collaborator

@janbuchar left a comment

LGTM, fix the type errors and you're good to go

@vdusek merged commit 7b2a44a into apify:master on Sep 3, 2025
35 of 36 checks passed
Successfully merging this pull request may close these issues.

FileSystemStorageClient performance issues