@pranaygp
Collaborator

@pranaygp pranaygp commented Nov 13, 2025

If a step fails because of a function timeout, its state never goes back to "pending", so we need to support retrying the step even if it's already "running", since it was terminated fatally/non-gracefully

A future improvement would involve some sort of heartbeating/locking to try to prevent concurrent step executions, and also handling the telemetry of a past run gracefully after a step gets retried, but for now this seems sufficient
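
To make the described behavior concrete, here is a minimal sketch of the relaxed status guard. The `StepStatus` type and the `canExecuteStep` helper are hypothetical names chosen for illustration (the actual set of statuses in the codebase may differ); only the idea that both "pending" and "running" steps may be started comes from this PR.

```typescript
// Hypothetical status type; the real set of step statuses may differ.
type StepStatus = 'pending' | 'running' | 'completed' | 'failed';

// A step may be (re)started from 'pending' (initial state or explicit retry)
// or from 'running' (the previous attempt was killed by a function timeout
// and never got the chance to reset its own status).
function canExecuteStep(status: StepStatus): boolean {
  return status === 'pending' || status === 'running';
}

// Example: a step left in 'running' by a timed-out invocation is retried,
// while a step that already finished is not executed again.
const retryAfterTimeout = canExecuteStep('running');
const rerunCompleted = canExecuteStep('completed');
```

Note that this gives at-least-once rather than exactly-once execution: two invocations that both observe "running" can both proceed.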

@changeset-bot

changeset-bot bot commented Nov 13, 2025

🦋 Changeset detected

Latest commit: c46ce8b

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 11 packages
Name Type
@workflow/core Patch
@workflow/builders Patch
@workflow/cli Patch
@workflow/next Patch
@workflow/nitro Patch
@workflow/web-shared Patch
workflow Patch
@workflow/sveltekit Patch
@workflow/world-testing Patch
@workflow/nuxt Patch
@workflow/ai Patch

@vercel
Contributor

vercel bot commented Nov 13, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Preview Comments Updated (UTC)
example-nextjs-workflow-turbopack Ready Ready Preview Comment Nov 18, 2025 11:39pm
example-nextjs-workflow-webpack Ready Ready Preview Comment Nov 18, 2025 11:39pm
example-workflow Ready Ready Preview Comment Nov 18, 2025 11:39pm
workbench-express-workflow Error Error Nov 18, 2025 11:39pm
workbench-hono-workflow Ready Ready Preview Comment Nov 18, 2025 11:39pm
workbench-nitro-workflow Ready Ready Preview Comment Nov 18, 2025 11:39pm
workbench-nuxt-workflow Ready Ready Preview Comment Nov 18, 2025 11:39pm
workbench-sveltekit-workflow Ready Ready Preview Comment Nov 18, 2025 11:39pm
workbench-vite-workflow Ready Ready Preview Comment Nov 18, 2025 11:39pm
workflow-docs Ready Ready Preview Comment Nov 18, 2025 11:39pm

Collaborator Author

pranaygp commented Nov 13, 2025

This stack of pull requests is managed by Graphite. Learn more about stacking.

@pranaygp pranaygp force-pushed the pranaygp/11-12-fix_workflow_failing_because_of_function_timeouts branch from 1d9a6c3 to e6cc973 Compare November 13, 2025 01:58
@pranaygp pranaygp force-pushed the pranaygp/11-12-fix_workflow_failing_because_of_function_timeouts branch from e6cc973 to 8430eee Compare November 13, 2025 02:03
await world.steps.update(workflowRunId, stepId, {
status: 'completed',
output: result as Serializable,
});
Member

The problem now is that if we fail to create the event (e.g. a timeout between this update and the event creation), the run is stuck forever while the step shows up as completed

Collaborator Author

yeah true :( this really needs to be atomic. Let's :itg:
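
As a rough illustration of what "atomic" could mean here, the sketch below commits the step update and the completion event together or not at all. Everything in it is an assumption made for the example: `InMemoryStore`, `transaction`, `completeStep`, and the `step_completed` event shape are hypothetical and are not the `world` API used in this PR.

```typescript
// Hypothetical in-memory stand-in for the backing store (not the real `world` API).
interface StepRecord { status: 'running' | 'completed'; output?: unknown }
interface StepEvent { type: 'step_completed'; stepId: string }

class InMemoryStore {
  steps: Record<string, StepRecord> = {};
  events: StepEvent[] = [];

  // Runs `fn` against copies of the data and commits the copies only if `fn`
  // returns normally, so a failure midway leaves the store untouched.
  transaction(fn: (steps: Record<string, StepRecord>, events: StepEvent[]) => void): void {
    const steps: Record<string, StepRecord> = {};
    for (const id of Object.keys(this.steps)) {
      steps[id] = { ...this.steps[id] };
    }
    const events = this.events.slice();
    fn(steps, events);
    this.steps = steps;
    this.events = events;
  }
}

// Marks the step completed and records the event as one unit.
// `failBeforeEvent` simulates a timeout between the two writes.
function completeStep(
  store: InMemoryStore,
  stepId: string,
  output: unknown,
  failBeforeEvent = false,
): void {
  store.transaction((steps, events) => {
    steps[stepId] = { status: 'completed', output };
    if (failBeforeEvent) {
      throw new Error('simulated timeout between update and event creation');
    }
    events.push({ type: 'step_completed', stepId });
  });
}
```

With this shape, a failure between the two writes leaves the step in "running" with no event recorded, so a retry can pick it up instead of the run hanging behind a step that already looks completed.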

if (step.status !== 'pending') {
// We should only be running the step if it's pending
// (initial state, or state set on re-try), so the step has been
if (!['pending', 'running'].includes(step.status)) {
Member

If this gets called with status running, should we check the updatedAt timestamp to confirm we're at least a reasonable amount of time away from the last update?

I checked the logs for Vade, which sees ~300 of these catch clauses per day; all of them would have resulted in double step runs with this code change

Collaborator Author

hmm I think that gets hairy and can cause nasty bugs down the line. What's a "safe amount of time"?

I think it's better to err towards at-least-once semantics than run the risk of hanging the workflow and never running the step

Collaborator Author

(can be convinced otherwise)
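
For reference, the `updatedAt` check suggested in this thread could look roughly like the sketch below. It was not adopted in this PR. `StepSnapshot`, `shouldExecute`, and the `STALE_AFTER_MS` value are hypothetical, and picking that threshold is exactly the hard part raised above.

```typescript
// Hypothetical shape; only `status` and `updatedAt` are mentioned in the discussion.
interface StepSnapshot {
  status: string;
  updatedAt: Date;
}

// Assumed threshold for illustration only. In practice it would have to exceed
// the longest time a healthy step can go without touching `updatedAt`.
const STALE_AFTER_MS = 5 * 60 * 1000;

function shouldExecute(
  step: StepSnapshot,
  now: Date,
  staleAfterMs: number = STALE_AFTER_MS,
): boolean {
  // Pending steps (initial state or explicit retry) always run.
  if (step.status === 'pending') return true;
  // Running steps are retried only once their last update is old enough
  // that the previous attempt is presumed dead.
  if (step.status === 'running') {
    return now.getTime() - step.updatedAt.getTime() >= staleAfterMs;
  }
  // Any other status (e.g. completed) is never re-executed.
  return false;
}
```

The trade-off is the one discussed above: a threshold that is too short still allows double runs, while one that is too long delays or blocks legitimate retries, which is why the PR leans towards at-least-once semantics instead.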

Member

@VaguelySerious VaguelySerious left a comment

I'm unsure how good this change is overall and can't say I understand the full implications, but it seems fine and the code LGTM. I left some thoughts in the comments above; up to you to merge

@VaguelySerious
Member

SGTM, let's merge and then monitor some of our prod workflows once they upgrade packages, just to double check whether this causes any perceptible issues

3 participants