KEP-4781 restarting kubelet does not change pod status #5493
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has been approved by: HirazawaUi. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing `/approve` in a comment.
Force-pushed from eca940b to acbdb7e.
> By preserving the old state without immediate health checks, there is a delay in recognizing containers that have become unhealthy during or after kubelet's downtime. Services relying on Pod readiness for service discovery might continue directing traffic to Pods with containers that are no longer healthy but are still reported as Ready.
> We plan to immediately trigger a probe after that to reduce the risk caused by such delays.
>
> ## Design Details
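A minimal sketch of the behavior described in the excerpt above, assuming a hypothetical `onKubeletRestart` helper with `probe`/`report` callbacks standing in for the real probeManager and status manager (illustration only, not kubelet code): the first status update after a restart preserves the cached state, and a probe is then run immediately rather than waiting for the next periodic interval, which shortens the window in which a container that died during the downtime is still reported Ready.

```go
package main

import "fmt"

// containerStatus is a trimmed-down, hypothetical stand-in for the readiness
// bit of a container status; it is not a kubelet type.
type containerStatus struct {
	Name  string
	Ready bool
}

// onKubeletRestart first reports the cached (pre-restart) status unchanged,
// then immediately runs one probe and reports the corrected value, rather
// than leaving a stale Ready=true in place until the next periodic probe.
func onKubeletRestart(cached containerStatus, probe func() bool, report func(containerStatus)) {
	report(cached) // the first status update after restart preserves the old state
	cached.Ready = probe()
	report(cached) // corrected shortly afterwards by the immediate probe
}

func main() {
	// Pretend the container became unhealthy while the kubelet was down.
	probe := func() bool { return false }
	report := func(s containerStatus) { fmt.Printf("reported: %+v\n", s) }
	onKubeletRestart(containerStatus{Name: "app", Ready: true}, probe, report)
}
```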
I did not refer to the implementation approach of the previous KEP. After reviewing the POC PR related to that KEP, I found its implementation process somewhat cumbersome, and it also presented some potential edge-case issues.
After tracing the pod status transition process, I adopted a new implementation method to achieve the goal: consistently relying on the detection results of the probeManager. This approach simplifies the implementation and helps us avoid certain edge cases. This section also analyzes the kubelet's behavioral differences under several scenarios. Could you please take a look?
My POC PR: kubernetes/kubernetes#133676
Force-pushed from 303bb56 to 8f0a0c4.
> 2. We ensure that if the `Started` field in the container status is true, the container is considered started (since the startupProbe only runs during container startup and will not execute again once completed).
>
> 3. If the Kubelet restart occurs within the `nodeMonitorGracePeriod` and the Pod’s Ready condition is set to false, we will set the container’s ready status to false. It will remain in this state until subsequent probes reset it to true.
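A rough sketch of the two quoted rules, assuming they reduce to the decision below; all names (`restoredContainerStatus`, `podReadyConditionFalse`) are hypothetical and do not come from the kubelet:

```go
package main

import "fmt"

// restoredContainerStatus decides the container's Started and Ready values in
// the first status update after a kubelet restart. Hypothetical sketch only;
// field and parameter names do not match the real kubelet code.
func restoredContainerStatus(cachedStarted, cachedReady, podReadyConditionFalse bool) (started, ready bool) {
	// Rule 2: a completed startup probe never runs again, so a cached
	// Started=true can be trusted across the restart.
	started = cachedStarted

	// Rule 3: if the Pod's Ready condition has already been set to false
	// (the prolonged-downtime case discussed in the thread below), reset
	// readiness and let the next probe drive it back to true; otherwise
	// keep the cached value.
	if podReadyConditionFalse {
		ready = false
	} else {
		ready = cachedReady
	}
	return started, ready
}

func main() {
	s, r := restoredContainerStatus(true, true, true)
	fmt.Printf("prolonged downtime: started=%v ready=%v\n", s, r)
	s, r = restoredContainerStatus(true, true, false)
	fmt.Printf("short restart:      started=%v ready=%v\n", s, r)
}
```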
> If the Kubelet restart occurs within the `nodeMonitorGracePeriod`

Does this mean the case where the kubelet has been down for longer than `nodeMonitorGracePeriod` and then restarts afterward?
So basically, the Node Lifecycle Controller notices that the Lease hasn’t been updated past `nodeMonitorGracePeriod`, marks the Node as NotReady, and flips the Pods’ Ready condition to False. After the kubelet restarts, it fetches the Pod info for its own Node from the API server, and the prober manager simply carries over that Ready condition value, right?
The scenario here indeed warrants a more detailed explanation.
Since we cannot delay the container status update in the `syncPod` process to wait for the prober manager to trigger a probe, the pod status update always occurs before the probe. This means that when we first update the container status, we do not know the actual state of the container.
- For a short kubelet restart, we can confidently assume that the container's state has not changed. Therefore, we retain the container's state and let the prober manager trigger a probe for the container to correctly update its state in the pod.
- However, for a prolonged kubelet restart, when the node is already in a NotReady state, we can no longer assume that the container's state in the pod remains unchanged. In this case, we follow the previous behavior by setting the container's Ready field to false (as mentioned in the KEP, before the changes in this KEP, when a pod is first added to the prober manager the probe result is set to an initial value, and the initial value for the readiness probe is Failure, which sets the container's Ready field to false). The probe is then performed, and the container's state is correctly updated.
In summary:
- For a short kubelet restart, we inherit the container's state from before the kubelet restart.
- For a prolonged kubelet restart, we follow the pre-change behavior: first set the container's Ready field to false, wait for the actual probe result, and eventually drive the container's state to its actual value. Compared to the pre-change behavior, this still has an advantage: it avoids an unnecessary state transition of the container's Ready field (from `true` -> `false` -> `true`) and instead transitions directly from `false` -> `true`. This prevents meaningless state flapping, reducing unnecessary reconciliation work for the various controllers that watch container status, such as the EndpointSlice controller or external controllers that depend on EndpointSlice (see the sketch after this list).
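To make the flapping point concrete, here is a small hypothetical illustration of how many Ready flips a watcher such as the EndpointSlice controller would have to reconcile for the two sequences mentioned above; the sequences are hard-coded examples, not kubelet output:

```go
package main

import "fmt"

// readyTransitions counts how many Ready flips a watcher would observe for a
// given sequence of reported values.
func readyTransitions(seq []bool) int {
	n := 0
	for i := 1; i < len(seq); i++ {
		if seq[i] != seq[i-1] {
			n++
		}
	}
	return n
}

func main() {
	preChange := []bool{true, false, true} // Ready flaps: true -> false -> true
	postChange := []bool{false, true}      // single transition: false -> true
	fmt.Println("pre-change reconciliations: ", readyTransitions(preChange))
	fmt.Println("post-change reconciliations:", readyTransitions(postChange))
}
```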
Thanks for the clear explanation!

> For a prolonged kubelet restart, we follow the pre-change behavior by first setting the container's Ready field to false, waiting for the actual probe result, and eventually driving the container's state to its actual value.

I think I finally understand the part I was a bit unclear about regarding how the `UPDATE PodStatus` step in your diagram determines readiness. If the kubelet is down longer than `nodeMonitorGracePeriod`, the container’s ready condition is set to false. In that case, in your PoC, the section below is where the ready state becomes false, right?
https://github.com/kubernetes/kubernetes/blob/d207ce94fe550ec35ff6a6b120faf759b8cb9fae/pkg/kubelet/prober/prober_manager.go#L336-L339
yes.
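For readers without the PoC open, the following is a simplified, self-contained model of the fallback being pointed at in that link, with hypothetical names and types rather than the real `prober_manager.go` code, and with the short-restart inherit path omitted for brevity: when a running container has no cached readiness result yet, readiness falls back to the initial value for readiness probes, which is failure, and the first successful post-restart probe then flips it back to true.

```go
package main

import "fmt"

// containerStatus and probeCache are hypothetical stand-ins; the logic below
// is only the behavior described in this thread, not the real prober manager.
type containerStatus struct {
	Name  string
	Ready bool
}

type probeCache map[string]bool // container name -> last readiness probe outcome

// updatePodStatus fills in Ready for each container: a cached probe result
// wins; with no result yet, readiness falls back to the initial value, which
// for readiness probes is "not ready" (this is the spot where Ready becomes
// false after a prolonged downtime), until the first post-restart probe
// succeeds and flips it to true.
func updatePodStatus(statuses []containerStatus, cache probeCache) {
	for i := range statuses {
		if ready, ok := cache[statuses[i].Name]; ok {
			statuses[i].Ready = ready
			continue
		}
		statuses[i].Ready = false // initial readiness value: Failure
	}
}

func main() {
	statuses := []containerStatus{{Name: "app"}, {Name: "sidecar"}}
	cache := probeCache{"sidecar": true} // only the sidecar has been probed so far
	updatePodStatus(statuses, cache)
	fmt.Printf("%+v\n", statuses)
}
```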
Can we include the history of this? kubernetes/kubernetes#100277
I am strongly in favor of this KEP, but I leave the specific details for people most familiar with Kubelet to iron out :)
Force-pushed from 419c3c9 to aecf8a1.
Force-pushed from aecf8a1 to 62df41c.
I don’t have many ideas for now, so I’ve simply placed these links in the Motivation section. If you feel the wording needs further description or that some context should be added to the links, please let me know — I’ll be happy to make the necessary changes.