Replies: 4 comments
-
Very interesting. I initially thought it was a basic "poll X items on every cron run", but it is much more subtle: it is a fetch during a period T, so you preserve the low latency and long polling. The obvious advantage of your approach (apart from controlling spawning) is the reload of newly deployed code. You can control the forward/backward compatibility window for a period T, but it is still there technically, as you continue polling even if new code is deployed. BUT stopping polling on a new code deploy completely eliminates the need for forward compatibility (no old worker will get a new payload once you deploy). So this solution is better the lower T is, but it does not make deploys deterministic.

IMO the biggest disadvantage is that you want T as small as possible, so it will be a mess for long-running tasks, which will technically be run only once per invocation. If T = 15s and a job can last 1min30, then the worker will stay alive, and depending on the implementation, either you create a new worker and you have the MAX_PARALLEL issue, OR no new job is spawned for 1m15s even if small jobs could be run (we have concurrency in a single worker). Basically it seems interesting, but I think there may be a design issue if max(job_time) > T.

Costwise, if T gets under 15s this new approach will start costing a few USD a month, so it is marginally more expensive. My personal opinion is that:
Has the same benefit, is a bit more reliable (it can survive when cron is not working 100% of the time, has no negative interaction when job duration exceeds cron periodicity, and requires no forward compatibility), and is slightly cheaper if control of parallelism is built into cron polling (the same otherwise). That is my 2 cents. The proposed approach is a good one.
-
hey @nel! thanks for the thoughtful feedback - you actually hit on something that made me rethink this whole thing... your point about jobs > T is spot on. if a job takes 60s and we poll every 15s, that's definitely a problem with the simple approach i outlined. but here's where it gets interesting - what if the worker uses its time budget more intelligently?

the breakthrough: time-budgeted polling

what if each HTTP request sets a new "time budget" for the worker, so it knows how long it can continue to poll?
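a rough sketch of what that budgeted loop could look like - all the names here are made up (this is not pgflow's actual API), and the final `Promise.all` stands in for what `EdgeRuntime.waitUntil()` would cover on Supabase:

```typescript
type Job = { id: number; durationMs: number };

// Poll for jobs until the time budget expires; jobs run in the background
// so a long job never blocks the polling loop.
async function runWithBudget(
  budgetMs: number,
  pollIntervalMs: number,
  fetchJobs: () => Job[],
  results: number[],
): Promise<void> {
  const deadline = Date.now() + budgetMs;
  const inFlight: Promise<void>[] = [];

  while (Date.now() < deadline) {
    // Poll the queue; anything found starts running in the background.
    for (const job of fetchJobs()) {
      inFlight.push(
        new Promise<void>((resolve) =>
          setTimeout(() => {
            results.push(job.id);
            resolve();
          }, job.durationMs),
        ),
      );
    }
    // Keep polling on a short interval; we do NOT await jobs here.
    await new Promise((r) => setTimeout(r, pollIntervalMs));
  }
  // On Supabase this drain would be EdgeRuntime.waitUntil(); here we just await.
  await Promise.all(inFlight);
}
```

so a 100ms job started in the first iteration finishes fine even though the loop keeps polling every 10ms in the meantime.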
the key insight: polling and job processing are decoupled. jobs run in the background while the polling loop continues. so even if a job takes 60s, the next HTTP request (at t=30s) just adds more polling iterations to keep finding work.

why this changes everything
bonus: it unifies all approaches
the HTTP request just becomes a way to say "you are allowed to poll for the next N seconds" rather than "spawn a new worker".

how the budget works

each HTTP request resets the budget completely.
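a tiny sketch of that reset behavior (hypothetical names, not pgflow's actual API):

```typescript
// Hypothetical budget tracker: each incoming HTTP request RESETS the
// polling deadline instead of spawning a new worker.
class PollBudget {
  private deadlineMs = 0;

  // Called by the HTTP handler on every cron ping:
  // "you are allowed to poll for the next N milliseconds".
  grant(ms: number): void {
    this.deadlineMs = Date.now() + ms; // reset, never accumulate
  }

  get active(): boolean {
    return Date.now() < this.deadlineMs;
  }
}
```

with a 30s budget granted every 15s, a healthy cron keeps the worker alive indefinitely; once cron stops pinging, the worker winds down within 30s.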
addressing your other points
yep, this approach makes it trivial - each HTTP request is handled once, no runaway spawning
still useful! but now deploys have natural boundaries at each HTTP request
cron is guaranteed to hit a newly deployed edge function, and the first request will start it

what do you think?

feels like this might give us the best of both worlds - the control of pg_cron with the efficiency of continuous polling, and no issues with long-running jobs. and for the deployment detection i think the best approach would be to be explicit about new deployments and just wrap "functions deploy" with some pgflow cli command that will deprecate the existing workers by setting a "deprecated_at" flag in pgflow.workers, so they can check it with each polling iteration and just stop polling if they detect the deprecation
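a minimal sketch of that per-iteration check - `fetchWorkerRow` is a hypothetical helper (e.g. a supabase-js query against `pgflow.workers`), not an existing pgflow API:

```typescript
interface WorkerRow {
  worker_id: string;
  // would be set by the (proposed) wrapped "functions deploy" command
  deprecated_at: string | null;
}

// Run once per polling iteration: a non-null deprecated_at means a newer
// deployment replaced this code, so this worker stops polling and no old
// worker ever picks up a new payload.
async function shouldStopPolling(
  fetchWorkerRow: (workerId: string) => Promise<WorkerRow>,
  workerId: string,
): Promise<boolean> {
  const row = await fetchWorkerRow(workerId);
  return row.deprecated_at !== null;
}
```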
-
Hmm, I am really sorry but I must admit I don't understand the proposal. When a new HTTP request arrives there are 2 cases:
From your table it seems you consider it's only 2). As far as I can tell we don't really have control over that (or maybe we do, and it's just that currently we are no longer listening and that is why it is spawning workers uncontrollably - I don't know the supabase/EdgeRuntime logic that decides to spawn a new process). I am also not completely sure if your system supports multiple parallel workers for the same queue (I consider it a given but maybe I am over-extending); if it does, there is a third case depending on which process gets the request.
-
I tried to kill a few birds with one stone (use HTTP requests as a kind of dead man's switch), but I think it is too complicated for this stage. The solution you are using is probably enough for 90% of cases. I need to rethink it, get some perspective and decide what to implement. It would be good to have a configuration table that describes the workers we want to keep up and how many instances are acceptable (your threshold), which overlaps with the idea for the cron worker. Will get back to it soon!
-
Current State & Problem
Edge-worker uses self-respawning: a single HTTP request starts a worker that polls continuously until receiving `beforeunload`, then attempts to restart itself. While reliable, it's not 100% bulletproof - workers may fail to respawn if they hit the execution time limit before sending the restart request.

Recommended workaround: Add a pg_cron "safety net" that periodically checks and starts workers if needed.
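The self-respawn pattern can be sketched roughly as follows - `on` and `respawn` are hypothetical stand-ins (e.g. `globalThis.addEventListener` and a `fetch()` to the worker's own endpoint), not the actual edge-worker implementation:

```typescript
// On `beforeunload`, fire one request at the worker's own endpoint so the
// runtime boots a fresh instance.
function installRespawn(
  on: (event: string, cb: () => void) => void, // e.g. globalThis.addEventListener
  respawn: () => Promise<unknown>,             // e.g. fetch() to the worker's own URL
): void {
  on("beforeunload", () => {
    // Fire-and-forget: the instance is shutting down, so we cannot await.
    // If the hard execution time limit is hit before this handler runs,
    // no respawn happens - the gap the pg_cron safety net covers.
    respawn().catch(() => {});
  });
}
```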
The problem with the workaround: These safety nets can cause Supabase's Edge Runtime to spawn up to MAX_PARALLEL workers simultaneously due to how it scales based on response latency (details), leading to:
Two Solutions
1. Smart pg_cron Safety Net
Keep the current architecture but add a worker count check before spawning (as implemented by Renaud):
Could also be implemented at EdgeWorker.start level (though "just die" isn't straightforward - would need to skip polling and return immediately, still consuming some startup resources).
2. pg_cron-Based Polling
Key change: Workers no longer poll indefinitely until `beforeunload`. Instead, each cron invocation processes messages for a defined time period, stopping just before the next cron call is expected:

- stop condition: `current_time + expected_processing_time > next_cron_time`
- shortened `read_with_poll` timeout on the last call before the next HTTP request (shorter poll time to prevent overlap)
- `EdgeRuntime.waitUntil()` keeps background work alive during processing

Built-in overlap prevention: The pg_cron approach would include a worker configuration table that defines desired state (max workers, intervals, etc.). The cron job checks both this config AND the current worker count in `pgflow.workers` before sending HTTP requests, preventing overlap by design rather than relying on manual implementation.

Comparison
Note: If you're already using pg_cron as a safety net (recommended), you've already handled vault setup and cron costs. The pg_cron approach just uses them more effectively while adding benefits.
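The stop condition `current_time + expected_processing_time > next_cron_time` and the shortened final `read_with_poll` timeout can be sketched as follows (hypothetical helper names, not pgflow's actual API):

```typescript
// Before starting another read_with_poll, verify it cannot run past the
// next expected cron invocation.
function shouldDoAnotherPoll(
  nowMs: number,
  nextCronMs: number,
  expectedPollMs: number,
): boolean {
  // Inverse of: current_time + expected_processing_time > next_cron_time
  return nowMs + expectedPollMs <= nextCronMs;
}

// Shorten the last poll so it ends just before the next HTTP request,
// preventing overlap between consecutive cron invocations.
function finalPollTimeoutMs(
  nowMs: number,
  nextCronMs: number,
  maxPollMs: number,
): number {
  return Math.max(0, Math.min(maxPollMs, nextCronMs - nowMs));
}
```

For example, with cron firing at t=15s and a normal 5s poll, a worker at t=12s would skip the full poll and run a shortened 3s one instead.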
Implementation & Discussion
Both approaches will be supported - self-respawning as default, pg_cron as opt-in alternative.
Question for Discussion:
How should `beforeunload` be handled in the pg_cron model? Should it: