Fix workflow failing because of function timeouts #322
Conversation
🦋 Changeset detected. Latest commit: c46ce8b. The changes in this PR will be included in the next version bump. This PR includes changesets to release 11 packages.
Force-pushed from 1d9a6c3 to e6cc973
Force-pushed from e6cc973 to 8430eee
```typescript
await world.steps.update(workflowRunId, stepId, {
  status: 'completed',
  output: result as Serializable,
});
```
Problem now is that if we fail to create the event, e.g. a timeout hits between this update and the event creation, the run is eternally stuck while the step shows up as complete
yeah true :( this really needs to be atomic. Let's
```diff
-if (step.status !== 'pending') {
-  // We should only be running the step if it's pending
-  // (initial state, or state set on re-try), so the step has been
+if (!['pending', 'running'].includes(step.status)) {
```
If this gets called with status running, should we check the updatedAt timestamp to confirm we're at least a reasonable amount of time away from the last update?
I checked logs for Vade, which e.g. sees about ~300 of these catch clauses per day, all of which would have resulted in double step runs with this new code change
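The guard suggested above could be sketched roughly like this; `STALE_AFTER_MS`, `isRetryable`, and the shape of `step` are all hypothetical names for illustration, and picking a "safe" threshold is exactly the contentious part.

```typescript
// Only treat a 'running' step as retryable once it looks abandoned,
// i.e. it hasn't been updated for longer than some staleness window.
const STALE_AFTER_MS = 15 * 60 * 1000; // assumed threshold, tuning is the hard part

function isRetryable(
  step: { status: string; updatedAt: number },
  now: number
): boolean {
  if (step.status === 'pending') return true;
  // 'running' steps are only reclaimed after the staleness window;
  // anything else (e.g. 'completed') is never retried.
  return step.status === 'running' && now - step.updatedAt > STALE_AFTER_MS;
}
```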
hmm I think that gets hairy and can cause nasty bugs down the line. What's a "safe amount of time"?
I think it's better to err towards at-least-once semantics than run the risk of hanging the workflow and never running the step
(can be convinced otherwise)
VaguelySerious left a comment
Overall unsure about how good this change is; it seems fine, though I can't say I understand the full implications. Code LGTM, left some thoughts in comments above. Up to you to merge.
Co-authored-by: Peter Wielander <[email protected]>
SGTM, let's merge and then monitor some of our prod workflows once they upgrade packages, just to double-check whether this causes any perceptible issues
If a step fails because of a function timeout, the state never goes back to "pending", so we actually need to support the step being retried even if it's already "running", because it was fatally/ungracefully terminated.
A future improvement would involve some sort of heartbeating/locking to try and prevent concurrent step executions, and also trying to handle the telemetry of a past run gracefully after a step gets retried, but for now this seems sufficient.
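The heartbeating/locking idea mentioned above could be sketched as a lease: an executor periodically renews an expiry timestamp while it runs a step, and a retry may only claim a "running" step once that lease has lapsed. All names here (`StepRecord`, `LEASE_MS`, `tryAcquireStep`, `heartbeat`) are illustrative assumptions, not part of the actual codebase.

```typescript
// Lease-based takeover: reduces (does not eliminate) concurrent step
// executions while still letting a retry reclaim a step whose executor
// was killed, e.g. by a function timeout, and stopped heartbeating.
interface StepRecord {
  status: 'pending' | 'running' | 'completed';
  leaseExpiresAt: number; // epoch ms; 0 when no executor holds the step
}

const LEASE_MS = 30_000; // assumed lease duration

function tryAcquireStep(step: StepRecord, now: number): boolean {
  if (step.status === 'completed') return false;
  // 'pending' steps, or 'running' steps whose lease lapsed, can be claimed.
  if (step.status === 'pending' || step.leaseExpiresAt <= now) {
    step.status = 'running';
    step.leaseExpiresAt = now + LEASE_MS;
    return true;
  }
  return false; // another executor still holds a live lease
}

// Called periodically by the executor while the step is running.
function heartbeat(step: StepRecord, now: number): void {
  step.leaseExpiresAt = now + LEASE_MS;
}
```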