FileSystemStorageClient performance issues #1382

@Mantisus

Description

@Mantisus

FileSystemStorageClient shows a significant performance drop compared to version 0.6.12.

Test code for the current version

import asyncio

from crawlee.crawlers import ParselCrawler, ParselCrawlingContext
from crawlee import ConcurrencySettings
from crawlee.storage_clients import FileSystemStorageClient


async def main() -> None:
    storage_client = FileSystemStorageClient()
    crawler = ParselCrawler(
        storage_client=storage_client,
        concurrency_settings=ConcurrencySettings(desired_concurrency=20),
    )

    @crawler.router.default_handler
    async def request_handler(context: ParselCrawlingContext) -> None:
        data = {
            'url': context.request.url,
            'title': context.selector.css('title::text').get(),
        }
        await context.push_data(data)
        await context.enqueue_links(strategy='same-domain')

    await crawler.run(['http://crawlee.dev/'])


if __name__ == '__main__':
    asyncio.run(main())

Do not set max_requests_per_crawl when reproducing; the slowdown grows as the number of processed links increases.
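If it helps with triage, the run above can be profiled with the standard library's cProfile to confirm where the time goes. This is only a sketch: the main() below is a placeholder standing in for the crawl coroutine from the snippet above.

```python
import asyncio
import cProfile
import io
import pstats


async def main() -> None:
    # Placeholder for the crawl coroutine from the test code above.
    await asyncio.sleep(0)


# Profile the whole crawl and report the most expensive functions
# by cumulative time.
profiler = cProfile.Profile()
profiler.enable()
asyncio.run(main())
profiler.disable()

out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats('cumulative').print_stats(10)
print(out.getvalue())
```

In the real run the hot spots should show up in the file-system write path if the storage client is the bottleneck.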

Results:

[crawlee._autoscaling.autoscaled_pool] INFO  Waiting for remaining tasks to finish
[ParselCrawler] INFO  Error analysis: total_errors=1 unique_errors=1
[ParselCrawler] INFO  Final request statistics:
┌───────────────────────────────┬──────────────────┐
│ requests_finished             │ 4512             │
│ requests_failed               │ 0                │
│ retry_histogram               │ [4511, 1]        │
│ request_avg_failed_duration   │ None             │
│ request_avg_finished_duration │ 2min 9.7s        │
│ requests_finished_per_minute  │ 81               │
│ requests_failed_per_minute    │ 0                │
│ request_total_duration        │ 162h 33min 11.4s │
│ requests_total                │ 4512             │
│ crawler_runtime               │ 55min 27.1s      │
└───────────────────────────────┴──────────────────┘

Test code for 0.6.12

import asyncio

from crawlee.crawlers import ParselCrawler, ParselCrawlingContext
from crawlee import ConcurrencySettings
from crawlee.configuration import Configuration
from crawlee.storage_clients import MemoryStorageClient

async def main() -> None:
    storage_client = MemoryStorageClient.from_config(Configuration(write_metadata=True, persist_storage=True))
    crawler = ParselCrawler(
        storage_client=storage_client,
        concurrency_settings=ConcurrencySettings(desired_concurrency=20),
    )

    @crawler.router.default_handler
    async def request_handler(context: ParselCrawlingContext) -> None:
        data = {
            'url': context.request.url,
            'title': context.selector.css('title::text').get(),
        }
        await context.push_data(data)
        await context.enqueue_links(strategy='same-domain')

    await crawler.run(['http://crawlee.dev/'])


if __name__ == '__main__':
    asyncio.run(main())

Results:

[crawlee._autoscaling.autoscaled_pool] INFO  Waiting for remaining tasks to finish
[ParselCrawler] INFO  Final request statistics:
┌───────────────────────────────┬──────────────┐
│ requests_finished             │ 4512         │
│ requests_failed               │ 0            │
│ retry_histogram               │ [4512]       │
│ request_avg_failed_duration   │ None         │
│ request_avg_finished_duration │ 3.910247     │
│ requests_finished_per_minute  │ 874          │
│ requests_failed_per_minute    │ 0            │
│ request_total_duration        │ 17643.033593 │
│ requests_total                │ 4512         │
│ crawler_runtime               │ 309.715327   │
└───────────────────────────────┴──────────────┘
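For quick reference, the gap between the two runs can be computed directly from the statistics tables above (all numbers are copied from the tables; nothing here is measured independently):

```python
# New FileSystemStorageClient run vs. 0.6.12 MemoryStorageClient run,
# using the values reported in the two statistics tables above.
new_rpm = 81                # requests_finished_per_minute, new client
old_rpm = 874               # requests_finished_per_minute, 0.6.12

new_avg_s = 2 * 60 + 9.7    # request_avg_finished_duration: 2min 9.7s
old_avg_s = 3.910247        # request_avg_finished_duration, 0.6.12

new_runtime_s = 55 * 60 + 27.1  # crawler_runtime: 55min 27.1s
old_runtime_s = 309.715327      # crawler_runtime, 0.6.12

print(f'throughput ratio:     {old_rpm / new_rpm:.1f}x')          # ~10.8x
print(f'avg duration ratio:   {new_avg_s / old_avg_s:.1f}x')
print(f'total runtime ratio:  {new_runtime_s / old_runtime_s:.1f}x')
```

So for the same 4512 requests, the new client is roughly an order of magnitude slower end to end, and the per-request average is inflated even further because requests spend time queued behind slow storage writes.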

Labels: t-tooling (Issues with this label are in the ownership of the tooling team.)