chore: Update add_batch_of_requests for MemoryRequestQueueClient and FileSystemRequestQueueClient #1388
Conversation
MemoryStorageClient before:

┌───────────────────────────────┬─────────────┐
│ requests_finished             │ 2363        │
│ requests_failed               │ 0           │
│ retry_histogram               │ [2363]      │
│ request_avg_failed_duration   │ None        │
│ request_avg_finished_duration │ 505.1ms     │
│ requests_finished_per_minute  │ 2248        │
│ requests_failed_per_minute    │ 0           │
│ request_total_duration        │ 19min 53.4s │
│ requests_total                │ 2363        │
│ crawler_runtime               │ 1min 3.1s   │
└───────────────────────────────┴─────────────┘

MemoryStorageClient after:

┌───────────────────────────────┬───────────┐
│ requests_finished             │ 2363      │
│ requests_failed               │ 0         │
│ retry_histogram               │ [2363]    │
│ request_avg_failed_duration   │ None      │
│ request_avg_finished_duration │ 206.6ms   │
│ requests_finished_per_minute  │ 4697      │
│ requests_failed_per_minute    │ 0         │
│ request_total_duration        │ 8min 8.1s │
│ requests_total                │ 2363      │
│ crawler_runtime               │ 30.19s    │
└───────────────────────────────┴───────────┘

FileSystemStorageClient before:

┌───────────────────────────────┬──────────────────┐
│ requests_finished             │ 4512             │
│ requests_failed               │ 0                │
│ retry_histogram               │ [4511, 1]        │
│ request_avg_failed_duration   │ None             │
│ request_avg_finished_duration │ 2min 9.7s        │
│ requests_finished_per_minute  │ 81               │
│ requests_failed_per_minute    │ 0                │
│ request_total_duration        │ 162h 33min 11.4s │
│ requests_total                │ 4512             │
│ crawler_runtime               │ 55min 27.1s      │
└───────────────────────────────┴──────────────────┘

FileSystemStorageClient after:

┌───────────────────────────────┬─────────────┐
│ requests_finished             │ 4512        │
│ requests_failed               │ 0           │
│ retry_histogram               │ [4512]      │
│ request_avg_failed_duration   │ None        │
│ request_avg_finished_duration │ 463.6ms     │
│ requests_finished_per_minute  │ 2633        │
│ requests_failed_per_minute    │ 0           │
│ request_total_duration        │ 34min 51.5s │
│ requests_total                │ 4512        │
│ crawler_runtime               │ 1min 42.8s  │
└───────────────────────────────┴─────────────┘
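For context on where the gain comes from: the batched path can deduplicate a whole batch in memory and persist it in one pass, instead of doing a lookup and a write per request. A minimal illustrative sketch of that idea (the class and helpers below are hypothetical, not the actual client code):

import asyncio


class BatchingSketch:
    """Hypothetical illustration of the batching idea, not the real client."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    async def add_batch_of_requests(self, unique_keys: list[str]) -> list[str]:
        # Deduplicate the whole batch in memory in a single pass.
        new_keys = [key for key in unique_keys if key not in self._seen]
        self._seen.update(new_keys)
        # Persist everything in one batched step instead of one write
        # per request (a no-op stand-in here).
        await self._persist_batch(new_keys)
        return new_keys

    async def _persist_batch(self, keys: list[str]) -> None:
        await asyncio.sleep(0)  # stand-in for a single batched I/O operation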
This is the code (the same one we ran on the platform):
import asyncio

from crawlee.crawlers import ParselCrawler, ParselCrawlingContext
from crawlee.http_clients import HttpxHttpClient
from crawlee.storage_clients import FileSystemStorageClient


async def main() -> None:
    storage_client = FileSystemStorageClient()
    http_client = HttpxHttpClient()

    crawler = ParselCrawler(
        storage_client=storage_client,
        http_client=http_client,
    )

    @crawler.router.default_handler
    async def request_handler(context: ParselCrawlingContext) -> None:
        context.log.info(f'Processing URL: {context.request.url}...')
        data = {
            'url': context.request.url,
            'title': context.selector.css('title::text').get(),
        }
        await context.push_data(data)
        await context.enqueue_links()

    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
And here are the statistics:
┌───────────────────────────────┬────────────┐
│ requests_finished             │ 2363       │
│ requests_failed               │ 0          │
│ retry_histogram               │ [2363]     │
│ request_avg_failed_duration   │ None       │
│ request_avg_finished_duration │ 214.1ms    │
│ requests_finished_per_minute  │ 1598       │
│ requests_failed_per_minute    │ 0          │
│ request_total_duration        │ 8min 26.0s │
│ requests_total                │ 2363       │
│ crawler_runtime               │ 1min 28.7s │
└───────────────────────────────┴────────────┘
So the results are similar to what we observed on the platform - good job.
And the same run with MemoryStorageClient:
import asyncio

from crawlee.crawlers import ParselCrawler, ParselCrawlingContext
from crawlee.http_clients import HttpxHttpClient
from crawlee.storage_clients import MemoryStorageClient


async def main() -> None:
    storage_client = MemoryStorageClient()
    http_client = HttpxHttpClient()

    crawler = ParselCrawler(
        storage_client=storage_client,
        http_client=http_client,
    )

    @crawler.router.default_handler
    async def request_handler(context: ParselCrawlingContext) -> None:
        context.log.info(f'Processing URL: {context.request.url}...')
        data = {
            'url': context.request.url,
            'title': context.selector.css('title::text').get(),
        }
        await context.push_data(data)
        await context.enqueue_links()

    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
Logs:
┌───────────────────────────────┬────────────┐
│ requests_finished             │ 2363       │
│ requests_failed               │ 0          │
│ retry_histogram               │ [2363]     │
│ request_avg_failed_duration   │ None       │
│ request_avg_finished_duration │ 168.4ms    │
│ requests_finished_per_minute  │ 1611       │
│ requests_failed_per_minute    │ 0          │
│ request_total_duration        │ 6min 38.0s │
│ requests_total                │ 2363       │
│ crawler_runtime               │ 1min 28.0s │
└───────────────────────────────┴────────────┘
It's much better, of course, but how did you manage to run it in around 30 secs?
I ran it with an initial concurrency of 20:

import asyncio

from crawlee import ConcurrencySettings
from crawlee.crawlers import ParselCrawler, ParselCrawlingContext
from crawlee.http_clients import HttpxHttpClient
from crawlee.storage_clients import MemoryStorageClient


async def main() -> None:
    storage_client = MemoryStorageClient()
    http_client = HttpxHttpClient()

    crawler = ParselCrawler(
        storage_client=storage_client,
        http_client=http_client,
        concurrency_settings=ConcurrencySettings(desired_concurrency=20),
    )

    @crawler.router.default_handler
    async def request_handler(context: ParselCrawlingContext) -> None:
        context.log.info(f'Processing URL: {context.request.url}...')
        data = {
            'url': context.request.url,
            'title': context.selector.css('title::text').get(),
        }
        await context.push_data(data)
        await context.enqueue_links()

    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())

[ParselCrawler] INFO Final request statistics:
┌───────────────────────────────┬────────────┐
│ requests_finished             │ 2363       │
│ requests_failed               │ 0          │
│ retry_histogram               │ [2363]     │
│ request_avg_failed_duration   │ None       │
│ request_avg_finished_duration │ 220.0ms    │
│ requests_finished_per_minute  │ 4497       │
│ requests_failed_per_minute    │ 0          │
│ request_total_duration        │ 8min 39.9s │
│ requests_total                │ 2363       │
│ crawler_runtime               │ 31.53s     │
└───────────────────────────────┴────────────┘
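For anyone reproducing this: desired_concurrency only seeds the autoscaled pool, and ConcurrencySettings can also bound it. A hedged example (parameter names as I recall them from crawlee's ConcurrencySettings; verify against the current docs):

from crawlee import ConcurrencySettings

# Assumed parameter names; check the current ConcurrencySettings docs.
concurrency_settings = ConcurrencySettings(
    min_concurrency=5,       # lower bound for the autoscaled pool
    desired_concurrency=20,  # starting point, as in the run above
    max_concurrency=50,      # upper bound for the autoscaled pool
)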
One small note: the add_batch_of_requests method is quite long and deeply nested. It would be good to break it into smaller private helper functions later; a hypothetical sketch follows below. Also, it seems that the remaining performance bottleneck comes from the current default settings; we might want to investigate whether more suitable defaults could be chosen (for today's average machine?). Otherwise, good job!
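To make that suggestion concrete, here is one hypothetical decomposition (the class and helper names are illustrative, not from this PR):

class RequestQueueClientSketch:
    """Hypothetical shape only; helper names are not from this PR."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    async def add_batch_of_requests(self, requests: list[str]) -> list[str]:
        # Each step lives in a small, individually testable helper.
        new_requests = self._filter_out_known_requests(requests)
        await self._persist_new_requests(new_requests)
        return new_requests

    def _filter_out_known_requests(self, requests: list[str]) -> list[str]:
        fresh = [r for r in requests if r not in self._seen]
        self._seen.update(fresh)
        return fresh

    async def _persist_new_requests(self, requests: list[str]) -> None:
        ...  # a single batched write in the real client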
Co-authored-by: Vlada Dusek <[email protected]>
LGTM; fix the type errors and you're good to go.
Description
Update add_batch_of_requests to improve overall queue performance.

Issues
FileSystemStorageClient performance issues #1382