@leo-amd leo-amd commented Sep 19, 2025

Short Description

This PR addresses an intermittent CI failure that occurs on our self-hosted runners.

Motivation

The workflow has been failing sporadically with an "EBUSY: resource busy or locked" error during the actions/checkout step. This happens because our runners use an NFS mount for their workspace.
If a previous workflow run terminates abnormally (e.g., it is cancelled or crashes), it can leave behind a stale .nfs lock file (the placeholder an NFS client keeps when a file is deleted while a process still has it open). When the next job starts on that same runner, the default clean: true behavior of actions/checkout attempts to delete leftover files and fails when it encounters this locked file.

Solution

This change resolves the issue by setting clean: false on the actions/checkout@v4 steps.
With this setting the action skips its pre-fetch cleanup (git clean -ffdx and git reset --hard HEAD), so it never tries to delete the stale NFS lock file. The checkout still force-updates tracked files to the requested commit, so the job gets the correct source without triggering the error. Untracked and ignored files left by earlier runs, however, remain in the workspace.
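
For reference, a minimal sketch of what such a step might look like. The step name is illustrative and the surrounding job definition is omitted; only the clean input is the subject of this change.

```yaml
steps:
  - name: Checkout source  # illustrative name, not taken from the actual workflow
    uses: actions/checkout@v4
    with:
      # Skip the built-in `git clean -ffdx && git reset --hard HEAD`,
      # which is the cleanup that trips over the stale .nfs file.
      clean: false
```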

@leo-amd leo-amd changed the base branch from master to rocm-jaxlib-v0.6.0 September 19, 2025 10:14
@charleshofer charleshofer left a comment

The NFS bug is annoying, but I'm not a fan of skipping cleanup entirely. That could lead to odd, difficult-to-debug problems if the wrong files or caches hang around. Could we at least add something like an rm -rf || true? What we really want, I think, is a cleanup step that simply ignores the NFS lock file if it's still hanging around.
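
One possible shape for such a cleanup step, as a rough sketch rather than a tested fix: keep clean: false on the checkout and follow it with a step that runs git clean directly while excluding .nfs* entries. The step name, the '.nfs*' pattern, and the placement after checkout are all assumptions. The -e (exclude) option is a standard git clean flag, but how it handles an untracked directory that contains an excluded file may differ between Git versions, so the trailing || true keeps the step best-effort.

```yaml
# Hypothetical step, placed after an actions/checkout step that sets clean: false.
- name: Best-effort workspace clean (skips stale NFS files)
  shell: bash
  run: |
    # Remove untracked and ignored files, but leave anything matching .nfs* alone.
    # "|| true" keeps the job going if an entry is still busy and cannot be removed.
    git clean -ffdx -e '.nfs*' || true
```

Running the clean through git rather than a plain rm -rf keeps the .git directory in place, so the checkout does not have to re-clone on every run.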
