OTA-541: enhancements/update/do-not-block-on-degraded: New enhancement proposal #1719

---
title: do-not-block-on-degraded-true-clusteroperators
authors:
- "@wking"
reviewers:
- "@PratikMahajan, update team lead"
- "@sdodson, update staff engineer"
approvers:
- "@PratikMahajan, update team lead"
api-approvers:
- None
creation-date: 2024-11-25
last-updated: 2025-01-31
tracking-link:
- https://issues.redhat.com/browse/OTA-540
---

# Do not block on Degraded=True ClusterOperators

## Summary

The cluster-version operator (CVO) uses an update mode when transitioning between releases, where the manifest operands are [sorted into a task-node graph](/dev-guide/cluster-version-operator/user/reconciliation.md#manifest-graph), and the CVO walks the graph, reconciling each manifest in turn.
Since 4.1, the cluster-version operator has blocked during update and reconcile modes (but not during install mode) on `Degraded=True` ClusterOperators.
This enhancement proposes ignoring `Degraded` when deciding whether to block on a ClusterOperator manifest.

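For orientation, here is a hypothetical console session (versions, operator name, and messages are invented for illustration) showing the pre-enhancement behavior this proposal removes: an update held at a `Degraded=True` ClusterOperator that is still `Available=True`:

```console
$ oc get clusterversion version
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.17.17   True        True          92m     Unable to apply 4.18.0: the cluster operator monitoring is degraded

$ oc get clusteroperator monitoring
NAME         VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
monitoring   4.17.17   True        False         True       47m
```

With this enhancement, the update would continue past the monitoring manifest, while the degradation would still surface via the ClusterVersion `Failing` condition.
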
## Motivation

The goal of blocking on manifests with sad resources is to avoid further destabilization.
For example, if we have not reconciled a namespace manifest or a ServiceAccount's RoleBinding, there is no point in trying to update the consuming operator Deployment.
Or if we are unable to update the kube-apiserver operator, we don't want to inject [unsupported kubelet skew][kubelet-skew] by asking the machine-config operator to update nodes.

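As a concrete illustration of the skew concern, the control-plane and kubelet versions can be compared directly; a sketch, with hypothetical node names and output:

```console
$ oc get clusteroperator kube-apiserver -o jsonpath='{.status.versions[?(@.name=="kube-apiserver")].version}{"\n"}'
1.31.3
$ oc get nodes -o custom-columns=NAME:.metadata.name,KUBELET:.status.nodeInfo.kubeletVersion
NAME                        KUBELET
ip-10-0-1-10.ec2.internal   v1.31.3
ip-10-0-1-11.ec2.internal   v1.31.3
```
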
> **Review comment:** we observed another kind of upgrade blocker here. Applying the …
>
> **Reply:** I've been trying to talk folks into the narrow …

However, blocking the update on a sad resource has the downside that later manifest-graph task-nodes are not reconciled while the CVO waits for the sad resource to return to happiness.
We maximize safety by blocking when progress would be risky, while continuing when progress would be safe, and possibly helpful.

Our experience with `Degraded=True` blocks turns up cases like:

* A 4.6 cluster going `Degraded=True` on an unreachable, user-provided node, with monitoring reporting `UpdatingnodeExporterFailed`, network reporting `RolloutHung`, and machine-config reporting `MachineConfigDaemonFailed`.
  But those ClusterOperators were all still `Available=True`, and in 4.10 and later, monitoring workloads are guarded by PodDisruptionBudgets (PDBs); a sketch of that state follows below.

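The sketch mentioned above, with invented output in the spirit of that 4.6 case: all three ClusterOperators degraded but still available:

```console
$ oc get clusteroperators monitoring network machine-config
NAME             VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
monitoring       4.6.1     True        True          True       3h
network          4.6.1     True        True          True       3h
machine-config   4.6.1     True        True          True       3h
```
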
### User Stories

* As a cluster administrator, I want the ability to defer recovering `Degraded=True` ClusterOperators without slowing ClusterVersion updates.

### Goals

ClusterVersion updates will no longer block on ClusterOperators solely because they are `Degraded=True`.

> **Review comment:** Does it mean, if no operator is unavailable, then the upgrade should always complete?
>
> **Reply:** ClusterOperators aren't the only CVO-manifested resources, and if something else breaks, like we fail to reconcile a RoleBinding or whatever, that will block further update progress. And for ClusterOperators, we'll still block on …
>
> ```console
> $ oc adm release extract --to manifests quay.io/openshift-release-dev/ocp-release:4.18.0-rc.0-x86_64
> Extracted release payload from digest sha256:054e75395dd0879e8c29cd059cf6b782742123177a303910bf78f28880431d1c created at 2024-12-02T21:11:00Z
> $ yaml2json <manifests/0000_20_kube-apiserver-operator_07_clusteroperator.yaml | jq -c '.status.versions[]'
> {"name":"operator","version":"4.18.0-rc.0"}
> {"name":"raw-internal","version":"4.18.0-rc.0"}
> {"name":"kube-apiserver","version":"1.31.3"}
> ```
>
> A recent example of this being useful is openshift/machine-config-operator#4637, which got the CVO to block until the MCO had rolled out a single-arch -> multi-arch transition, without the MCO needing to touch its …
>
> **Review comment:** So could I say, if `Failing=True` for an upgrade, the reason should not be …
>
> **Reply:** No, we'll still propagate …

### Non-Goals

* Adjusting how the cluster-version operator treats `Available` and `versions` in ClusterOperator status.
  The CVO will still block on `Available=False` ClusterOperators, and will still wait for the `status.versions` declared in the ClusterOperator's release manifest to be reported.

* Adjusting whether `Degraded` ClusterOperator conditions are propagated through to the ClusterVersion `Failing` condition.
  As with the current install mode, the sad condition will be propagated through to `Failing=True`, unless outweighed by a more serious condition like `Available=False`. An illustrative check appears below.

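The illustrative check referenced above, with hypothetical condition values, showing a `Degraded=True` ClusterOperator surfacing as `Failing=True` on ClusterVersion:

```console
$ oc get clusteroperator image-registry -o jsonpath='{.status.conditions[?(@.type=="Degraded")].status}{"\n"}'
True
$ oc get clusterversion version -o jsonpath='{.status.conditions[?(@.type=="Failing")].message}{"\n"}'
Cluster operator image-registry is degraded
```
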
## Proposal

The cluster-version operator currently has [a mode switch][cvo-degraded-mode-switch] that, during installs, makes `Degraded` a non-blocking ClusterOperator condition that is still propagated through to `Failing`.
This enhancement proposes making that an unconditional `UpdateEffectReport`, regardless of the CVO's current mode (installing, updating, reconciling, etc.).

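Under the proposed behavior, an update would keep making manifest-graph progress past a degraded ClusterOperator; a hypothetical transcript (versions, counts, and the degraded operator are invented):

```console
$ oc adm upgrade
info: An upgrade is in progress. Working towards 4.18.1: 520 of 903 done (57% complete)

$ oc get clusteroperator node-tuning
NAME          VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
node-tuning   4.18.1    True        False         True       12m
```
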
> **Review comment:** openshift/cluster-version-operator#482 is in flight with this change, if folks want to test pre-merge.

### Workflow Description

Cluster administrators will be largely unaware of this feature.
They will no longer have ClusterVersion update progress slowed by `Degraded=True` ClusterOperators, so there will be less admin involvement there.
They will continue to be notified of `Degraded=True` ClusterOperators via [the `warning`-severity `ClusterOperatorDegraded` alert][ClusterOperatorDegraded] and the `Failing=True` ClusterVersion condition.

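Both notification channels can be checked from the CLI; a sketch, assuming the alert still ships in the cluster-version operator's PrometheusRule of the same name (output abbreviated and illustrative):

```console
$ oc -n openshift-cluster-version get prometheusrule cluster-version-operator -o yaml | grep 'alert: ClusterOperator'
    - alert: ClusterOperatorDegraded
    - alert: ClusterOperatorDown
    - alert: ClusterOperatorFlapping
$ oc get clusterversion version -o jsonpath='{.status.conditions[?(@.type=="Failing")].status}{"\n"}'
True
```
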
### API Extensions

No API extensions are needed for this proposal.

### Topology Considerations

#### Hypershift / Hosted Control Planes

HyperShift's ClusterOperator context is the same as standalone's, so it will receive the same benefits from the same cluster-version operator code change, and does not need special consideration.

#### Standalone Clusters

The enhancement is expected to improve the update experience on standalone clusters by decoupling ClusterVersion update completion from recovering `Degraded=True` ClusterOperators, granting the cluster administrator the flexibility to address update speed and operator degradation independently.

#### Single-node Deployments or MicroShift

Single-node's ClusterOperator context is the same as standalone's, so it will receive the same benefits from the same cluster-version operator code change, and does not need special consideration.
This change is a minor tweak to existing CVO code, so it is not expected to impact resource consumption.

MicroShift updates are managed via RPMs, without a cluster-version operator, so it is not exposed to the ClusterVersion updates this enhancement refines, and is not affected by the changes proposed here.

### Implementation Details/Notes/Constraints

The code change is expected to be a handful of lines, as discussed in [the *Proposal* section](#proposal), so no further implementation details are needed.

### Risks and Mitigations

The risk is that some ClusterOperators may currently rely on the cluster-version operator blocking during updates on ClusterOperators that are `Available=True`, `Degraded=True`, and that report the release manifest's expected `versions`.
As discussed in [the *Motivation* section](#motivation), we are not currently aware of any such ClusterOperators.
If any turn up, we can mitigate by [declaring conditional update risks](targeted-update-edge-blocking.md) using the existing `cluster_operator_conditions{condition="Degraded"}` PromQL metric, while teaching the relevant operators to set `Available=False` and/or withhold their `versions` bumps until the issue that needs to block further ClusterVersion update progress has been resolved; a sketch of such a declaration follows below.

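For illustration, a minimal sketch of what such a conditional update risk declaration might look like in [cincinnati-graph-data](https://github.com/openshift/cincinnati-graph-data), where the file name, operator name, versions, and bug URL are all hypothetical:

```console
$ cat blocked-edges/4.18.1-ExampleOperatorDegraded.yaml
to: 4.18.1
from: 4\.17\..*
url: https://issues.redhat.com/browse/OCPBUGS-00000
name: ExampleOperatorDegraded
message: The example operator is degraded, and updating into 4.18.1 may stall its recovery.
matchingRules:
- type: PromQL
  promql:
    promql: |
      cluster_operator_conditions{name="example",condition="Degraded"} == 1
      or 0 * cluster_operator_conditions{name="example",condition="Degraded"}
```
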
How will security be reviewed and by whom?
Unclear. Feedback welcome.

How will UX be reviewed and by whom?
Unclear. Feedback welcome.

### Drawbacks

As discussed in [the *Risks and Mitigations* section](#risks-and-mitigations), the main drawback is changing behavior that has been in place for many years.
But we do not expect much customer pushback along the lines of "hey, my update completed?! I expected it to stick on this sad component...".
We do expect it to reduce customer frustration when admins want the update to complete but, for reasons like administrative siloes, lack the ability to recover a component from minor degradation themselves.

## Test Plan

**Note:** *Section not required until targeted at a release.*

> **Review comment:** The enhancement and the tracking card OTA-541 are not targeted at a release. However, changes in the …
>
> **Reply:** I'm not strongly opinionated on what the test plan looks like. We don't do a lot of intentional-sad-path update testing today in CI, and I'm fuzzy on what QE does in that space that could be expanded into this new space (or maybe they already test pushing a ClusterOperator component to …
>
> **Reply:** +1, that's also what I want to explore during testing. I also had some other immature checkpoints in mind when I first read this enhancement doc, but I still need some input from @wking to help me tidy them up. For example, #1719 (comment).

Consider the following in developing a test plan for this enhancement:
- Will there be e2e and integration tests, in addition to unit tests?
- How will it be tested in isolation vs. with other components?
- What additional testing is necessary to support managed OpenShift service-based offerings?

There is no need to outline all of the test cases, just the general strategy. Anything
that would count as tricky in the implementation, and anything particularly
challenging to test, should be called out.

All code is expected to have adequate tests (eventually with coverage
expectations).

## Graduation Criteria

There are no API changes proposed by this enhancement, and it only affects sad-path handling, so we expect the code change to go straight into the next generally-available release, without feature gating or staged graduation.

### Enhancement merge

To ensure we do not surprise anyone with this shift, we are collecting acks on this pull request from at least one maintainer for each ClusterOperator, approving this proposal and the [existing semantics for `status.versions[name=operator]`](/dev-guide/cluster-version-operator/dev/clusteroperator.md#version):

> There MUST be a version with the name `operator`, which is watched by the CVO to know if a cluster operator has achieved the new level.

* [ ] authentication

> **Review comment:** Checking internal org docs, the Auth team seems like they might be responsible for the …
>
> **Reply:** … or whatever, assuming they are ok making that assertion for the operators they maintain. Also fine if they want to say "I'm a maintainer for …
>
> **Review comment:** Before I ack the authentication operator, I'd like to clarify the existing semantics for … As far as I understand, the operator sets its …
>
> AFAIU, this does not guarantee the mixed state, as described in the semantics: …
>
> When the operator starts running on the new version during an upgrade, it seems that it will update its version in the CO status, probably even before the operands have been upgraded to their new versions via the workload controllers. This seems to offend the mixed-version-state requirement as described above. @wking, any thoughts on this?
>
> **Reply (@wking):** I'm not familiar with the Auth operator's implementation, but checking CI to see if there's externally-measurable evidence of how it's currently working: https://amd64.ocp.releases.ci.openshift.org/ -> 4.18.0-rc.10 -> update from 4.17.17 -> Artifacts -> ... -> template-job artifacts:
>
> ```console
> $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/release-openshift-origin-installer-e2e-aws-upgrade/1891342561033850880/artifacts/e2e-aws-upgrade/clusteroperators.json | jq -c '.items[] | select(.metadata.name == "authentication").status.versions[]'
> {"name":"operator","version":"4.18.0-rc.10"}
> {"name":"oauth-apiserver","version":"4.18.0-rc.10"}
> {"name":"oauth-openshift","version":"4.18.0-rc.10_openshift"}
> ```
>
> You're currently asking the CVO to wait on …
>
> ```console
> $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/release-openshift-origin-installer-e2e-aws-upgrade/1891342561033850880/artifacts/e2e-aws-upgrade/events.json | jq -r '.items[] | select(.metadata.namespace == "openshift-authentication-operator" and .reason == "OperatorStatusChanged") | .firstTimestamp + " " + (.count | tostring) + " " + .message' | grep versions
> 2025-02-17T04:42:52Z 1 Status for clusteroperator/authentication changed: Degraded set to Unknown (""),Progressing set to Unknown (""),Available set to Unknown (""),Upgradeable set to Unknown (""),EvaluationConditionsDetected set to Unknown (""),status.relatedObjects changed from [] to [{"operator.openshift.io" "authentications" "" "cluster"} {"config.openshift.io" "authentications" "" "cluster"} {"config.openshift.io" "infrastructures" "" "cluster"} {"config.openshift.io" "oauths" "" "cluster"} {"route.openshift.io" "routes" "openshift-authentication" "oauth-openshift"} {"" "services" "openshift-authentication" "oauth-openshift"} {"" "namespaces" "" "openshift-config"} {"" "namespaces" "" "openshift-config-managed"} {"" "namespaces" "" "openshift-authentication"} {"" "namespaces" "" "openshift-authentication-operator"} {"" "namespaces" "" "openshift-ingress"} {"" "namespaces" "" "openshift-oauth-apiserver"}],status.versions changed from [] to [{"operator" "4.17.17"}]
> 2025-02-17T04:44:44Z 1 Status for clusteroperator/authentication changed: status.versions changed from [{"operator" "4.17.17"}] to [{"operator" "4.17.17"} {"oauth-apiserver" "4.17.17"}]
> 2025-02-17T04:55:56Z 1 Status for clusteroperator/authentication changed: status.versions changed from [{"operator" "4.17.17"} {"oauth-apiserver" "4.17.17"}] to [{"operator" "4.17.17"} {"oauth-apiserver" "4.17.17"} {"oauth-openshift" "4.17.17_openshift"}]
> 2025-02-17T05:38:09Z 1 Status for clusteroperator/authentication changed: status.versions changed from [{"operator" "4.17.17"} {"oauth-apiserver" "4.17.17"} {"oauth-openshift" "4.17.17_openshift"}] to [{"operator" "4.18.0-rc.10"} {"oauth-apiserver" "4.17.17"} {"oauth-openshift" "4.17.17_openshift"}]
> 2025-02-17T05:40:08Z 1 Status for clusteroperator/authentication changed: status.versions changed from [{"operator" "4.18.0-rc.10"} {"oauth-apiserver" "4.17.17"} {"oauth-openshift" "4.17.17_openshift"}] to [{"operator" "4.18.0-rc.10"} {"oauth-apiserver" "4.17.17"} {"oauth-openshift" "4.18.0-rc.10_openshift"}]
> 2025-02-17T05:41:43Z 1 Status for clusteroperator/authentication changed: status.versions changed from [{"operator" "4.18.0-rc.10"} {"oauth-apiserver" "4.17.17"} {"oauth-openshift" "4.18.0-rc.10_openshift"}] to [{"operator" "4.18.0-rc.10"} {"oauth-apiserver" "4.18.0-rc.10"} {"oauth-openshift" "4.18.0-rc.10_openshift"}]
> ```
>
> So: …
>
> **Reply:** service-ca-operator looks good. The version is set once the operand hits the expected generation at all replicas.

* [ ] baremetal
* [ ] cloud-controller-manager

> **Review comment:** To my knowledge there is no reason that a degradation of the CCMO would have an impact on later components in a way that would be spectacularly bad. The most likely case would be that the ingress operator needs to recreate a Service, and a broken CCM prevents that from happening. I believe this would just delay effectuation of a change, rather than actually break the cluster, though. Though should that inability be better represented through the `Available` condition?
>
> **Reply:** Scoping on …
>
> But for the purpose of updates, I don't expect things like "the ingress operator needs to recreate a Service" to be update-triggered. I'd expect that to just be either "new customer workload wants a new Service" or "unrelated to the update, something deleted an existing Service, and now we need to create a replacement". Both …
>
> **Reply:** Agreed, I think we are probably ok here. The one potential area for concern would be if the CIO suddenly decided that it was going to use the ALBO in the future (this is actually on the table) and started requiring Services to be recreated during/post upgrade. Normally an admin would be responsible for actually deleting the Service, but CCM would be needed to remove the actual LB and finalizer. Odds of this being a problem that this EP exacerbates are low, though.

* [ ] cloud-credential
* [ ] cluster-api

> **Review comment:** This is tech preview currently; anything that gets broken by this EP can be fixed before we go GA.

* [ ] cluster-autoscaler

> **Review comment:** Do we care about the ability to autoscale new capacity when the MCO is trying to reboot nodes? In theory, if the autoscaler wasn't working, then we may not be able to bring in new capacity, which may then halt the MCO rollout. I guess in the case that autoscaling was completely unavailable, we would want to be `Available=False` and block the update that way instead?
>
> **Reply:** Yeah, same …
>
> **Reply:** If your cluster is tightly packed, and you have no capacity to move the workloads to be able to complete a drain, then yes, autoscaling might come into an upgrade. But yes, I think we are probably ok on the degraded front, and we need to make sure we are setting `Available` correctly.

* [ ] config-operator
* [ ] console
* [ ] control-plane-machine-set

> **Review comment:** If this CO goes degraded, it means the control plane is in a bad state (too many replicas, too few replicas, not enough ready replicas). I think that would be a good reason to stop the upgrade. Should we be leveraging `Available` for some of these? I suspect in most cases something else (KAS?) would likely go unavailable before we did.

* [ ] csi-snapshot-controller
* [ ] dns
* [ ] etcd
* [ ] image-registry
* [ ] ingress
* [ ] insights

> **Review comment:** Ack. I think we should be fine when this enhancement is implemented. The degraded Insights Operator doesn't affect any other components in the cluster. It's marked as degraded when it cannot upload the Insights data to the console.redhat.com ingress (here an option is to disable the data gathering) or if any other connection (entitlements or cluster transfer) can't be made (here it depends on the HTTP response code being > 500).

* [ ] kube-apiserver
* [ ] kube-controller-manager
* [ ] kube-scheduler
* [ ] kube-storage-version-migrator
* [ ] machine-api

> **Review comment:** Degradation here likely doesn't impact other COs, though it would tie into the cluster-autoscaler's ability to scale up.

* [ ] machine-approver

> **Review comment:** This CO sets its status `Available=True` and `Degraded=False` and has no concept of changing them once set.
>
> **Reply:** That seems optimistic 😅 Is there nothing that could go wrong in the machine-approver component that might call for admin intervention? Or is all the current admin-summoning here routed through alerts or reliant on other components (e.g. if we stop approving new Machines, something on the …
>
> **Reply:** We've had around 6 years of that optimism so far and it's going well, I guess? We have alerting when there are too many pending CSRs, which pokes the admin sufficiently, and AFAIK that's the only real failure mode we have that needs admin intervention, so I guess we are ok for the moment.

* [x] machine-config, [acked](https://github.com/openshift/enhancements/pull/1719#discussion_r1934857504) by @yuqi-zhang
* [ ] marketplace
* [ ] monitoring
* [ ] network
* [ ] node-tuning
* [ ] olm
* [ ] openshift-apiserver
* [ ] openshift-controller-manager
* [ ] openshift-samples
* [ ] operator-lifecycle-manager
* [ ] operator-lifecycle-manager-catalog
* [ ] operator-lifecycle-manager-packageserver
* [ ] service-ca
* [ ] storage

### Dev Preview -> Tech Preview

Not applicable.

### Tech Preview -> GA

Not applicable.

### Removing a deprecated feature

Not applicable.

## Upgrade / Downgrade Strategy

This enhancement only affects the cluster-version operator's internal processing of longstanding ClusterOperator APIs, so there are no skew or compatibility issues.

## Version Skew Strategy

This enhancement only affects the cluster-version operator's internal processing of longstanding ClusterOperator APIs, so there are no skew or compatibility issues.

## Operational Aspects of API Extensions

There are no API changes proposed by this enhancement.

## Support Procedures

This enhancement is a small pivot in how the cluster-version operator processes ClusterOperator manifests during updates.
As discussed in [the *Drawbacks* section](#drawbacks), we do not expect cluster admins to open support cases related to this change.

## Alternatives

We could continue with the current approach, and absorb the occasional friction it causes.

## Infrastructure Needed

No additional infrastructure is needed for this enhancement.

[ClusterOperatorDegraded]: https://github.com/openshift/cluster-version-operator/blob/820b74aa960717aae5431f783212066736806785/install/0000_90_cluster-version-operator_02_servicemonitor.yaml#L106-L124
[cvo-degraded-mode-switch]: https://github.com/openshift/cluster-version-operator/blob/820b74aa960717aae5431f783212066736806785/pkg/cvo/internal/operatorstatus.go#L241-L245
[kubelet-skew]: https://kubernetes.io/releases/version-skew-policy/#kubelet