Description
FileSystemStorageClient shows a significant performance drop compared to version 0.6.12.
Test code
import asyncio

from crawlee.crawlers import ParselCrawler, ParselCrawlingContext
from crawlee import ConcurrencySettings
from crawlee.storage_clients import FileSystemStorageClient


async def main() -> None:
    storage_client = FileSystemStorageClient()

    crawler = ParselCrawler(
        storage_client=storage_client,
        concurrency_settings=ConcurrencySettings(desired_concurrency=20),
    )

    @crawler.router.default_handler
    async def request_handler(context: ParselCrawlingContext) -> None:
        data = {
            'url': context.request.url,
            'title': context.selector.css('title::text').get(),
        }
        await context.push_data(data)
        await context.enqueue_links(strategy='same-domain')

    await crawler.run(['http://crawlee.dev/'])


if __name__ == '__main__':
    asyncio.run(main())
Do not set max_requests_per_crawl: the performance degradation grows as the number of processed links increases, so a capped run will not show the full effect.
Results:
[crawlee._autoscaling.autoscaled_pool] INFO Waiting for remaining tasks to finish
[ParselCrawler] INFO Error analysis: total_errors=1 unique_errors=1
[ParselCrawler] INFO Final request statistics:
┌───────────────────────────────┬──────────────────┐
│ requests_finished │ 4512 │
│ requests_failed │ 0 │
│ retry_histogram │ [4511, 1] │
│ request_avg_failed_duration │ None │
│ request_avg_finished_duration │ 2min 9.7s │
│ requests_finished_per_minute │ 81 │
│ requests_failed_per_minute │ 0 │
│ request_total_duration │ 162h 33min 11.4s │
│ requests_total │ 4512 │
│ crawler_runtime │ 55min 27.1s │
└───────────────────────────────┴──────────────────┘
Test code for 0.6.12
import asyncio

from crawlee.crawlers import ParselCrawler, ParselCrawlingContext
from crawlee import ConcurrencySettings
from crawlee.configuration import Configuration
from crawlee.storage_clients import MemoryStorageClient


async def main() -> None:
    storage_client = MemoryStorageClient.from_config(
        Configuration(write_metadata=True, persist_storage=True)
    )

    crawler = ParselCrawler(
        storage_client=storage_client,
        concurrency_settings=ConcurrencySettings(desired_concurrency=20),
    )

    @crawler.router.default_handler
    async def request_handler(context: ParselCrawlingContext) -> None:
        data = {
            'url': context.request.url,
            'title': context.selector.css('title::text').get(),
        }
        await context.push_data(data)
        await context.enqueue_links(strategy='same-domain')

    await crawler.run(['http://crawlee.dev/'])


if __name__ == '__main__':
    asyncio.run(main())
Results:
[crawlee._autoscaling.autoscaled_pool] INFO Waiting for remaining tasks to finish
[ParselCrawler] INFO Final request statistics:
┌───────────────────────────────┬──────────────┐
│ requests_finished │ 4512 │
│ requests_failed │ 0 │
│ retry_histogram │ [4512] │
│ request_avg_failed_duration │ None │
│ request_avg_finished_duration │ 3.910247 │
│ requests_finished_per_minute │ 874 │
│ requests_failed_per_minute │ 0 │
│ request_total_duration │ 17643.033593 │
│ requests_total │ 4512 │
│ crawler_runtime │ 309.715327 │
└───────────────────────────────┴──────────────┘
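To put a rough number on the drop (computed here from the two tables above, not part of the original logs), the slowdown works out to about 10x in wall-clock time and about 33x per request:

# Back-of-the-envelope comparison of the two runs above.
fs_runtime_s = 55 * 60 + 27.1     # FileSystemStorageClient: crawler_runtime 55min 27.1s
mem_runtime_s = 309.715327        # MemoryStorageClient (0.6.12): crawler_runtime in seconds

fs_avg_request_s = 2 * 60 + 9.7   # request_avg_finished_duration 2min 9.7s
mem_avg_request_s = 3.910247      # request_avg_finished_duration in seconds

print(f'runtime slowdown:     {fs_runtime_s / mem_runtime_s:.1f}x')          # ~10.7x
print(f'per-request slowdown: {fs_avg_request_s / mem_avg_request_s:.1f}x')  # ~33.2x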