Description
I am using version 1.0.4 with SQL storage in Postgres. My queue has around 2 million rows (only about 10% of them handled so far) and I am running 10 crawler instances. At some point, all my instances stopped crawling. I found out it is because the query that fetches new requests is timing out (it takes more than a minute). The query is:
SELECT *
FROM request_queue_records
WHERE request_queue_id = [queue id]
  AND is_handled = false
  AND (time_blocked_until IS NULL OR time_blocked_until < now())
ORDER BY sequence_number ASC
LIMIT 10
FOR UPDATE SKIP LOCKED;
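To see why it is slow, the same query can be wrapped in EXPLAIN. This is just the invocation (with [queue id] left as a placeholder as above); the plan output itself will of course vary per database:

EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
SELECT *
FROM request_queue_records
WHERE request_queue_id = [queue id]
  AND is_handled = false
  AND (time_blocked_until IS NULL OR time_blocked_until < now())
ORDER BY sequence_number ASC
LIMIT 10
FOR UPDATE SKIP LOCKED;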
It seems that Postgres isn't using the existing index and does a full table scan instead, which the EXPLAIN output above shows. I think it is because of the filtering on time_blocked_until IS NULL, but I am not sure. I manually added my own index:

CREATE INDEX sorting_by_sequence_number
    ON public.request_queue_records USING btree (sequence_number);

and now the query goes through the index (no table scan) and takes only a few milliseconds. Could you add this index (or fix the existing idx_fetch_available one)?
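For what it's worth, an index that also covers the queue id and the is_handled filter might serve this query even better than an index on sequence_number alone. This is only a sketch based on the column names in the query above, not tested against the project's migrations, and idx_fetch_available_seq is a made-up name:

-- Partial composite index: rows are narrowed by queue and handled-state,
-- then read in sequence_number order, which matches the ORDER BY ... LIMIT.
-- The time_blocked_until < now() condition cannot go into the index
-- predicate because now() is not immutable, so it stays in the query.
CREATE INDEX idx_fetch_available_seq
    ON request_queue_records (request_queue_id, sequence_number)
    WHERE is_handled = false;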
Also, as a side issue: while this query was timing out, crawlee didn't fire any warnings or errors; it just kept scheduling more tasks that eventually timed out. Looking at the logs, even in debug mode, nothing helped me find the issue (I had to fork the project and add logs to see the root problem); I just saw that no new pages were being crawled.
Thanks a lot!