Replies: 4 comments
-
Very interesting. I initially thought it was a basic "poll X items on every cron run", but it is much more subtle: it is a fetch during a period T, so you preserve the low latency and long polling. The obvious advantage of your approach (apart from controlling spawning) is the reload of newly deployed code. You can control the forward/backward compatibility window for a period T, but it is still there technically, as you continue polling even if new code is deployed. BUT stopping polling on a new code deploy completely eliminates the need for forward compatibility (no old worker will get a new payload once you deploy). So this solution is better the lower T is, but it does not make deploys deterministic.

IMO the biggest disadvantage is that you want T as small as possible, so it will be a mess for long-running tasks, which will technically be run only once per invocation. If T = 15s and a job can last 1min30, then the worker will stay alive, and depending on the implementation, either you create a new worker and you have the MAX_PARALLEL issue, OR no new job is spawned for 1m15s even if small jobs could be run (we have concurrency in a single worker). Basically it seems interesting, but I think there may be a design issue if max(job_time) > T.

Costwise, if T gets under 15s this new approach will start costing a few USD a month, so it is marginally more expensive. My personal opinion is that:
Has the same benefit, is a bit more reliable (it can survive when cron is not working 100% of the time, has no negative interaction when job duration exceeds cron periodicity, and requires no forward compatibility), and is slightly cheaper if control of parallelism is built into cron polling (the same otherwise). That is my 2 cents. The proposed approach is a good one.
-
hey @nel! thanks for the thoughtful feedback - you actually hit on something that made me rethink this whole thing... your point about jobs > T is spot on. if a job takes 60s and we poll every 15s, that's definitely a problem with the simple approach i outlined. but here's where it gets interesting - what if the worker uses its time budget more intelligently?

the breakthrough: time-budgeted polling

what if each HTTP request sets a new "time budget" for the worker, so it knows how long it can continue to poll?
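a rough sketch of what that budgeted loop could look like - all the names here are made up (this is not pgflow's actual API), and the final `Promise.all` stands in for what `EdgeRuntime.waitUntil()` would cover on Supabase:

```typescript
type Job = { id: number; durationMs: number };

// Poll for jobs until the time budget expires; jobs run in the background
// so a long job never blocks the polling loop.
async function runWithBudget(
  budgetMs: number,
  pollIntervalMs: number,
  fetchJobs: () => Job[],
  results: number[],
): Promise<void> {
  const deadline = Date.now() + budgetMs;
  const inFlight: Promise<void>[] = [];

  while (Date.now() < deadline) {
    // Poll the queue; anything found starts running in the background.
    for (const job of fetchJobs()) {
      inFlight.push(
        new Promise<void>((resolve) =>
          setTimeout(() => {
            results.push(job.id);
            resolve();
          }, job.durationMs),
        ),
      );
    }
    // Keep polling on a short interval; we do NOT await jobs here.
    await new Promise((r) => setTimeout(r, pollIntervalMs));
  }
  // On Supabase this drain would be EdgeRuntime.waitUntil(); here we just await.
  await Promise.all(inFlight);
}
```

so a 100ms job started in the first iteration finishes fine even though the loop keeps polling every 10ms in the meantime.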
the key insight: polling and job processing are decoupled. jobs run in the background while the polling loop continues. so even if a job takes 60s, the next HTTP request (at t=30s) just adds more polling iterations to keep finding work.

why this changes everything
bonus: it unifies all approaches
the HTTP request just becomes a way to say "you are allowed to poll for the next N seconds" rather than "spawn a new worker".

how the budget works

each HTTP request resets the budget completely.
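a tiny sketch of that reset behavior (hypothetical names, not pgflow's actual API):

```typescript
// Hypothetical budget tracker: each incoming HTTP request RESETS the
// polling deadline instead of spawning a new worker.
class PollBudget {
  private deadlineMs = 0;

  // Called by the HTTP handler on every cron ping:
  // "you are allowed to poll for the next N milliseconds".
  grant(ms: number): void {
    this.deadlineMs = Date.now() + ms; // reset, never accumulate
  }

  get active(): boolean {
    return Date.now() < this.deadlineMs;
  }
}
```

with a 30s budget granted every 15s, a healthy cron keeps the worker alive indefinitely; once cron stops pinging, the worker winds down within 30s.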
addressing your other points
yep, this approach makes it trivial - each HTTP request is handled once, no runaway spawning
still useful! but now deploys have natural boundaries at each HTTP request
cron is guaranteed to hit a newly deployed edge function, and the first request will start it

what do you think?

feels like this might give us the best of both worlds - the control of pg_cron with the efficiency of continuous polling, and no issues with long-running jobs. and for the deployment detection i think the best approach would be to be explicit about new deployments and just wrap "functions deploy" with some pgflow cli command that will deprecate the existing workers by setting a "deprecated_at" flag in pgflow.workers, so they can check it with each polling iteration and just stop polling if they detect the deprecation
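a minimal sketch of that per-iteration check - `fetchWorkerRow` is a hypothetical helper (e.g. a supabase-js query against `pgflow.workers`), not an existing pgflow API:

```typescript
interface WorkerRow {
  worker_id: string;
  // would be set by the (proposed) wrapped "functions deploy" command
  deprecated_at: string | null;
}

// Run once per polling iteration: a non-null deprecated_at means a newer
// deployment replaced this code, so this worker stops polling and no old
// worker ever picks up a new payload.
async function shouldStopPolling(
  fetchWorkerRow: (workerId: string) => Promise<WorkerRow>,
  workerId: string,
): Promise<boolean> {
  const row = await fetchWorkerRow(workerId);
  return row.deprecated_at !== null;
}
```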
-
Hmm, I am really sorry but I must admit I don't understand the proposal. When a new HTTP request arrives there are 2 cases:
From your table it seems you consider it's only 2). As far as I can tell we don't really have control over that (or maybe we do, and it's just that currently we are no longer listening and that is why it is spawning workers uncontrollably - I don't know the supabase/EdgeRuntime logic that decides to spawn a new process). I am also not completely sure if your system supports multiple parallel workers for the same queue (I consider it a given but maybe I am over-extending); if it does, there is a third case depending on which process gets the request.
-
I tried to kill a few birds with one stone (use HTTP requests as a kind of dead man's switch), but I think it is too complicated for this stage. The solution you are using is probably enough for 90% of cases. I need to rethink it, get some perspective and decide what to implement. It would be good to have a configuration table that describes the workers we want to keep up and how many instances are acceptable (your threshold), which overlaps with the idea for the cron worker. Will get back to it soon!
-
Current State & Problem
Edge-worker uses self-respawning: a single HTTP request starts a worker that polls continuously until receiving `beforeunload`, then attempts to restart itself. While reliable, it's not 100% bulletproof - workers may fail to respawn if they hit the execution time limit before sending the restart request.

Recommended workaround: Add a pg_cron "safety net" that periodically checks and starts workers if needed.
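The self-respawn pattern can be sketched roughly as follows - `on` and `respawn` are hypothetical stand-ins (e.g. `globalThis.addEventListener` and a `fetch()` to the worker's own endpoint), not the actual edge-worker implementation:

```typescript
// On `beforeunload`, fire one request at the worker's own endpoint so the
// runtime boots a fresh instance.
function installRespawn(
  on: (event: string, cb: () => void) => void, // e.g. globalThis.addEventListener
  respawn: () => Promise<unknown>,             // e.g. fetch() to the worker's own URL
): void {
  on("beforeunload", () => {
    // Fire-and-forget: the instance is shutting down, so we cannot await.
    // If the hard execution time limit is hit before this handler runs,
    // no respawn happens - the gap the pg_cron safety net covers.
    respawn().catch(() => {});
  });
}
```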
The problem with the workaround: These safety nets can cause Supabase's Edge Runtime to spawn up to MAX_PARALLEL workers simultaneously due to how it scales based on response latency (details), leading to:
Two Solutions
1. Smart pg_cron Safety Net
Keep the current architecture but add a worker count check before spawning (as implemented by Renaud):
Could also be implemented at EdgeWorker.start level (though "just die" isn't straightforward - would need to skip polling and return immediately, still consuming some startup resources).
2. pg_cron-Based Polling
Key change: Workers no longer poll indefinitely until `beforeunload`. Instead, each cron invocation processes messages for a defined time period, stopping just before the next cron call is expected:

- stop condition: `current_time + expected_processing_time > next_cron_time`
- shortened `read_with_poll` timeout on the last call before the next HTTP request (shorter poll time to prevent overlap)
- `EdgeRuntime.waitUntil()` keeps background work alive during processing

Built-in overlap prevention: The pg_cron approach would include a worker configuration table that defines desired state (max workers, intervals, etc.). The cron job checks both this config AND the current worker count in `pgflow.workers` before sending HTTP requests, preventing overlap by design rather than relying on manual implementation.

Comparison
Note: If you're already using pg_cron as a safety net (recommended), you've already handled vault setup and cron costs. The pg_cron approach just uses them more effectively while adding benefits.
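The stop condition `current_time + expected_processing_time > next_cron_time` and the shortened final `read_with_poll` timeout can be sketched as follows (hypothetical helper names, not pgflow's actual API):

```typescript
// Before starting another read_with_poll, verify it cannot run past the
// next expected cron invocation.
function shouldDoAnotherPoll(
  nowMs: number,
  nextCronMs: number,
  expectedPollMs: number,
): boolean {
  // Inverse of: current_time + expected_processing_time > next_cron_time
  return nowMs + expectedPollMs <= nextCronMs;
}

// Shorten the last poll so it ends just before the next HTTP request,
// preventing overlap between consecutive cron invocations.
function finalPollTimeoutMs(
  nowMs: number,
  nextCronMs: number,
  maxPollMs: number,
): number {
  return Math.max(0, Math.min(maxPollMs, nextCronMs - nowMs));
}
```

For example, with cron firing at t=15s and a normal 5s poll, a worker at t=12s would skip the full poll and run a shortened 3s one instead.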
Implementation & Discussion
Both approaches will be supported - self-respawning as default, pg_cron as opt-in alternative.
Question for Discussion:
How should `beforeunload` be handled in the pg_cron model? Should it: