@twz123
Member

@twz123 twz123 commented Aug 20, 2025

Description

Previously, the shutdown code looped endlessly until the child process finished, requesting graceful termination over and over again. Change this to a single request-termination -> wait -> bail-out sequence. This ensures that k0s won't hang when a supervised process can't be terminated for whatever reason: the code will return once the timeout has expired, at the latest.

Use a buffered channel for the wait result, so that the goroutine will be able to exit, even if nothing reads from the channel anymore. Introduce fine-grained error reporting to differentiate shutdown outcomes (graceful shutdown, forced kill, failure, and so on).
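
To illustrate the request-termination -> wait -> bail-out flow and the buffered wait channel, here is a minimal sketch. It is not the code from this PR: the names stop, demo, errKilled and errStuck are made up for illustration, and it assumes a Unix-like system with sh and sleep on the PATH.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"os/exec"
	"syscall"
	"time"
)

// Hypothetical shutdown outcomes; the names are illustrative, not from k0s.
var (
	errKilled = errors.New("process ignored graceful termination and was killed")
	errStuck  = errors.New("process still running after kill, giving up")
)

// stop sends one termination request, waits, and bails out after the timeout.
func stop(cmd *exec.Cmd, timeout time.Duration) error {
	// Buffered: the goroutine can always deliver its result and exit, even
	// if stop has already returned and nothing reads from the channel.
	waitResult := make(chan error, 1)
	go func() { waitResult <- cmd.Wait() }()

	// Request graceful termination exactly once, no retry loop.
	if err := cmd.Process.Signal(syscall.SIGTERM); err != nil && !errors.Is(err, os.ErrProcessDone) {
		return fmt.Errorf("failed to request graceful termination: %w", err)
	}
	select {
	case <-waitResult:
		return nil // graceful shutdown
	case <-time.After(timeout):
	}

	// Graceful termination timed out: kill, then wait one more bounded period.
	if err := cmd.Process.Kill(); err != nil && !errors.Is(err, os.ErrProcessDone) {
		return fmt.Errorf("failed to kill process: %w", err)
	}
	select {
	case <-waitResult:
		return errKilled
	case <-time.After(timeout):
		return errStuck // bail out instead of hanging forever
	}
}

// demo starts a child that either honors or ignores SIGTERM, then stops it.
func demo(ignoreTerm bool) error {
	script := "exec sleep 60"
	if ignoreTerm {
		script = `trap "" TERM; exec sleep 60`
	}
	cmd := exec.Command("sh", "-c", script)
	if err := cmd.Start(); err != nil {
		return err
	}
	time.Sleep(200 * time.Millisecond) // let the shell install its trap
	return stop(cmd, time.Second)
}

func main() {
	fmt.Println("cooperative child:", demo(false))
	fmt.Println("stubborn child:", demo(true))
}
```

Returning distinct errors for "killed" and "still running" lets the caller log the outcome precisely, and in every case stop returns after at most two timeout periods.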

See:

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update

How Has This Been Tested?

  • Manual test
  • Auto test added

Checklist

  • My code follows the style guidelines of this project
  • My commit messages are signed-off
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published in downstream modules
  • I have checked my code and corrected any misspellings

@twz123 twz123 added the enhancement (New feature or request) label Aug 20, 2025
@twz123 twz123 force-pushed the bail-out-after-stop-timeout branch from ed12b89 to 12a506e Compare August 21, 2025 07:05
@github-actions
Contributor

The PR is marked as stale since no activity has been recorded in 30 days

@github-actions github-actions bot added Stale and removed Stale labels Sep 20, 2025
@twz123 twz123 force-pushed the bail-out-after-stop-timeout branch 2 times, most recently from 110ae95 to a0ddd30 Compare September 22, 2025 16:12
@twz123 twz123 marked this pull request as ready for review September 23, 2025 07:07
@twz123 twz123 requested review from a team as code owners September 23, 2025 07:07
@twz123 twz123 requested review from jnummelin and ncopa September 23, 2025 07:07
@twz123 twz123 added this to the 1.34 milestone Oct 1, 2025
@github-actions
Contributor

github-actions bot commented Oct 6, 2025

This pull request has merge conflicts that need to be resolved.

@twz123 twz123 modified the milestones: 1.34, 1.35 Oct 8, 2025
The exec.Cmd API is strange in that Wait returns just an error instead
of (*ProcessState, error). This makes it difficult to distinguish between
"regular" process errors and failures that occur while actually waiting
on the process.

Nevertheless, try to distinguish between these two cases to produce
more accurate log messages: If the error is nil or unwraps into
an *exec.ExitError, treat it as a "regular" process error. Consider
everything else an error indicating a problem with waiting.

Signed-off-by: Tom Wieczorek <[email protected]>
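
The distinction described in this commit message can be sketched with errors.As. This is only an illustration under assumptions: classify, run and errSimulated are hypothetical names, not identifiers from the k0s code base, and the demo assumes a POSIX sh on the PATH.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
)

// errSimulated stands in for a failure of the wait call itself (illustrative only).
var errSimulated = errors.New("simulated wait failure")

// classify separates "the process ended" from "waiting for it failed".
func classify(err error) string {
	var exitErr *exec.ExitError
	if err == nil || errors.As(err, &exitErr) {
		// nil, a non-zero exit code, or death by signal: a regular outcome.
		return "process exited"
	}
	// Everything else points at a problem with waiting.
	return "failed to wait"
}

// run executes a shell snippet and returns what exec.Cmd.Wait reported.
func run(script string) error {
	cmd := exec.Command("sh", "-c", script)
	if err := cmd.Start(); err != nil {
		return err
	}
	return cmd.Wait()
}

func main() {
	fmt.Println(classify(run("exit 0")))
	fmt.Println(classify(run("exit 3")))
	// errors.As also finds an *exec.ExitError behind a wrapping error.
	fmt.Println(classify(fmt.Errorf("wrapped: %w", run("kill -TERM $$"))))
	fmt.Println(classify(errSimulated))
}
```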
Previously, the shutdown code looped endlessly until the child process
finished, requesting graceful termination over and over again. Change
this to a single request-termination -> wait -> bail-out sequence. This
ensures that k0s won't hang when a supervised process can't be
terminated for whatever reason: the code will return once the timeout
has expired, at the latest.

Use a buffered channel for the wait result, so that the goroutine
will be able to exit, even if nothing reads from the channel anymore.
Introduce fine-grained error reporting to differentiate shutdown
outcomes (graceful shutdown, forced kill, failure, and so on).

Signed-off-by: Tom Wieczorek <[email protected]>
@twz123 twz123 force-pushed the bail-out-after-stop-timeout branch from a0ddd30 to 0827218 Compare October 30, 2025 15:31
@twz123 twz123 marked this pull request as draft October 31, 2025 09:04
@twz123
Member Author

twz123 commented Oct 31, 2025

Oh, this PR uncovered an oversight made when implementing #6429. The k0s api subcommand doesn't have any context handling or graceful shutdown logic in general. Before #6429, there were no signal handlers, and the program just terminated. After #6429, there's now a global signal handler that relays signals to the context created in main.go. As a result, k0s api no longer terminates on the first SIGTERM. Luckily, the signal handler is unregistered after the first received signal, so the SIGTERM sent in the next iteration finally terminates the process (without any graceful termination):

time="2025-10-31 09:12:48" level=info msg="Starting to supervise" component=k0s-control-api
time="2025-10-31 09:12:48" level=info msg="Started successfully, go nuts pid 1008" component=k0s-control-api
time="2025-10-31 09:12:48" level=info msg="time=\"2025-10-31 09:12:48\" level=info msg=\"Reading runtime configuration from standard input ...\"" component=k0s-control-api stream=stdout
time="2025-10-31 09:14:10" level=info msg="Requested graceful termination" component=k0s-control-api
time="2025-10-31 09:14:15" level=info msg="Requested graceful termination" component=k0s-control-api
time="2025-10-31 09:14:15" level=error msg="Failed to wait for process" component=k0s-control-api error="signal: terminated"
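
The handler behavior described above can be reproduced in isolation. The following is a rough sketch under the assumption that the handler is built on signal.NotifyContext; the actual k0s code may well differ. shutdownContext and demo are hypothetical names, and the demo sends SIGTERM to its own process, so it is Unix-only.

```go
package main

import (
	"context"
	"fmt"
	"os"
	"os/signal"
	"syscall"
	"time"
)

// shutdownContext returns a context that is cancelled by the first of the
// given signals. The handler is unregistered afterwards, so a second signal
// gets the default disposition again (for SIGTERM: immediate termination).
func shutdownContext(parent context.Context, sigs ...os.Signal) context.Context {
	ctx, stop := signal.NotifyContext(parent, sigs...)
	go func() {
		<-ctx.Done()
		stop()
	}()
	return ctx
}

// demo reports whether a self-sent SIGTERM cancels the context instead of
// terminating the program.
func demo() bool {
	ctx := shutdownContext(context.Background(), syscall.SIGTERM)
	// Stand-in for the supervisor's "Requested graceful termination".
	if err := syscall.Kill(os.Getpid(), syscall.SIGTERM); err != nil {
		return false
	}
	select {
	case <-ctx.Done():
		return true
	case <-time.After(2 * time.Second):
		return false
	}
}

func main() {
	// A command that never looks at the context keeps running after the
	// first SIGTERM; only code that waits on ctx.Done() notices it.
	fmt.Println("context cancelled by first SIGTERM:", demo())
}
```

A subcommand that ignores such a context survives the first SIGTERM and only dies on the second one, which would match the two "Requested graceful termination" lines in the log above.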

@twz123
Member Author

twz123 commented Oct 31, 2025

The fix for the k0s api subcommand is here: #6572
