-
Notifications
You must be signed in to change notification settings - Fork 4.7k
OCPBUGS-56281: gatewayapicontroller: Clean up resources when done #29900
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
OCPBUGS-56281: gatewayapicontroller: Clean up resources when done #29900
Conversation
Check whether the slice of parent resource references in an httproute's status is empty before indexing the slice. Before this commit, the "Ensure HTTPRoute object is created" test sometimes panicked with "runtime error: index out of range [0] with length 0". Similarly, check whether the slice of load-balancer ingress points in a service's status is empty before indexing it. * test/extended/router/gatewayapicontroller.go (buildGateway) (createHttpRoute): Add checks.
@Miciah: This pull request references Jira Issue OCPBUGS-56281, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
fc08232
to
bf853bf
Compare
Job Failure Risk Analysis for sha: bf853bf
|
LGTM, @melvinjoseph86 PTAL |
/lgtm |
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: melvinjoseph86, Miciah The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/retest |
Job Failure Risk Analysis for sha: 1967dd2
|
Delete the gatewayclass and uninstall OSSM after all the Gateway API controller tests are done. Before this commit, the Gateway API controller tests left OSSM installed, including the subscription, CSV, installplan, bundled CRDs, RBAC resources, deployment, service, serviceaccount, etc., when the tests were finished. This clutter could cause problems for other tests, or for the same test if it was run again. The new cleanup logic uses the OperatorsV1 client from github.com/operator-framework/operator-lifecycle-manager. Importing this package requires a replace stanza for openshift/api in go.mod. This vendors github.com/operator-framework/operator-lifecycle-manager v0.30.1-0.20250114164243-1b6752ec65fa rather than the newest revision in order to avoid bringing in additional problematic vendor bumps that the newest revision would bring in. This commit fixes OCPBUGS-56281. https://issues.redhat.com/browse/OCPBUGS-56281 * test/extended/router/gatewayapicontroller.go: Delete the gatewayclass that the test creates. Use the OperatorsV1 client to look up the Operator object for OSSM, and delete all the resources that the Operator object references. * go.mod: Vendor the operatorsv1 client code from github.com/operator-framework/operator-lifecycle-manager. * go.sum: * vendor/*: Regenerate.
* test/extended/router/gatewayapicontroller.go: Add the error value to some log messages that were missing it.
1967dd2
to
ab81b79
Compare
New changes are detected. LGTM label has been removed. |
Job Failure Risk Analysis for sha: ab81b79
|
ab81b79
to
1dcc98a
Compare
https://github.com/openshift/origin/compare/1967dd22c83963e780eb9953bc38da760e090dc8..1dcc98a3c2ec7c38dcee818e750e14ce57d70892 made these changes:
Before these changes, Also, comparing |
Job Failure Risk Analysis for sha: 1dcc98a
|
/payload-aggregate periodic-ci-openshift-release-master-ci-4.19-upgrade-from-stable-4.18-e2e-gcp-ovn-rt-upgrade 5 |
@abhat: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/101e8ee0-4acb-11f0-928a-4bd1c2be89d0-0 |
e2e.Failf("Failed to delete GatewayClass %q", gatewayClassName) | ||
} | ||
|
||
g.By("Deleting the OSSM Operator resources") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious, why we don't use an owner reference for Subscription
? We could owner reference the gatewayclass and let Kube do the cascading deletion.
Upd: Deletion of Subscription doesn't delete CSV or CRDs. The CRD part is understandable: there can be some data loss. But CSV is kinda interesting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are a few reasons not to put or rely on an owner reference on the subscription:
- You could create the subscription manually; we cannot assume that the operator created it.
- You could have multiple gatewayclasses with our controller name, and then it isn't clear how we would configure the owner references on the subscription. Would we add only the first gatewayclass with our controller name? Would we add all gatewayclasses with our controller name? If we added more than one owner reference, would we need to delete old owner references when the corresponding gatewayclasses were deleted? If we did delete stale owner references, would that prevent garbage collection, or would we always leave one non-stale reference to trigger garbage collection?
- I don't know for sure that OLM doesn't look at the owner reference. We would need to check this.
- I am not confident that an owner reference would cause the subscription to be deleted as the owner reference on the Istio CR didn't cause it to be deleted (see OCPBUGS-56281: gatewayapicontroller: Clean up resources when done #29900 (comment)).
- Deleting the Istio CR only requires changing the test, it is more explicit than relying on garbage collection, and it is more obviously safe to backport.
g.By("Deleting the Istio CR") | ||
|
||
o.Expect(oc.AsAdmin().Run("delete").Args("--ignore-not-found=true", "istio", istioName).Execute()).Should(o.Succeed()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Istio CR is supposed to be garbage collected since its owner reference is gatewayclass.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The owner reference on the Istio CR didn't cause it to be deleted (see #29900 (comment)).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The owner reference on the Istio CR didn't cause it to be deleted
I didn't manage to reproduce this behavior. I saw Istio
CR gets deleted after GatewayClass
:
$ oc get gc
NAME CONTROLLER ACCEPTED AGE
openshift-default openshift.io/gateway-controller/v1 True 4m12s
04:57:08 $ oc get istio
NAME REVISIONS READY IN USE ACTIVE REVISION STATUS VERSION AGE
openshift-gateway 1 1 0 openshift-gateway Healthy v1.24.3 4m18s
04:57:14 $ oc get istio openshift-gateway -o yaml | yq .metadata.ownerReferences[0]
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
name: openshift-default
uid: 3f6ef6ed-9e6b-4821-9706-221ff0bca83e
04:57:34 $ oc -n openshift-ingress get pods
NAME READY STATUS RESTARTS AGE
istiod-openshift-gateway-7b567bc8b4-z9972 1/1 Running 0 4m48s
router-default-76c4888886-fmtzq 1/1 Running 0 77m
router-default-76c4888886-nm9mb 1/1 Running 2 (78m ago) 89m
04:57:52 $ oc delete gc openshift-default
gatewayclass.gateway.networking.k8s.io "openshift-default" deleted
04:58:07 $ oc get istio
No resources found
04:58:14 $ oc -n openshift-ingress get pods
NAME READY STATUS RESTARTS AGE
router-default-76c4888886-fmtzq 1/1 Running 0 78m
router-default-76c4888886-nm9mb 1/1 Running 2 (78m ago) 89m
if err != nil && strings.Contains(err.Error(), "not found") { | ||
e2e.Logf("Subscription %q not found; retrying...", expectedSubscriptionName) | ||
return false, nil | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that we should be consistent among all the polls we do in this block. I personally prefer how it's done for the OSSM deployment below:
if err != nil {
e2e.Logf("Failed to get OSSM operator deployment %q: %v; retrying...", deploymentOSSMName, err)
return false, nil
}
No assertions, just a retry for any error until the timeout is triggered. I think that some errors (not only "Not Found") can be temporary or intermittent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was trying to keep my changes more narrowly focused. All right, I can make the polling loop for the subscription retry on all errors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Log errors and then retry in the polling loops for the Subscription and Istio CRs. Before this commit, the gatewayapicontroller tests sometimes failed because OSSM was still installing when these polling loops ran, and the polling loops would fail on a "not found" error (if the CR had not yet been created) or a "server doesn't have a resource type" error (if the CRD had not yet been created). In order to make the tests more reliable, they need to retry on these errors. For consistency with other polling loops, this commit makes these polling loops retry on all errors (not just "not found" or "doesn't have a resource type" errors). * test/extended/router/gatewayapicontroller.go: Retry when the test fails to get the OSSM subscription or the Istio CR.
* test/extended/router/gatewayapicontroller.go: Increase the timeouts on some polling loops that have been observed to flake but then succeed on retry.
* test/extended/router/gatewayapicontroller.go (ingressNamespace): New const. (waitForIstioHealthy, createAndCheckGateway) (assertGatewayLoadbalancerReady, assertDNSRecordStatus, createHttpRoute) (assertHttpRouteSuccessful): Use the new const instead of function-level variables or string literals.
* test/extended/router/gatewayapicontroller.go: Omit the namespace when getting the Istio CR, which is cluster-scoped.
* test/extended/router/gatewayapicontroller.go (istioName): Declare const. (waitForIstioHealthy): Use the new const instead of a string literal.
* test/extended/router/gatewayapicontroller.go: Delete the Istio CR and wait for the istiod pod to be deleted as part of the test cleanup.
1dcc98a
to
38d8018
Compare
LGTM, holding off for @alebedev87 comments |
@Miciah: The following tests failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Job Failure Risk Analysis for sha: 38d8018
|
The aggregated jobs each failed while buliding the tests-openshift.origin-amd64 image, with the error message, "Error: Unable to find a match: python3-cinderclient" (missing RPM package). I'll retry in case it was glitch with the Yum repository. /payload-aggregate periodic-ci-openshift-release-master-ci-4.19-upgrade-from-stable-4.18-e2e-gcp-ovn-rt-upgrade 5 |
@Miciah: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/aad73360-4cbf-11f0-9efa-6a57a5235fed-0 |
This time all the aggregated jobs failed to build the image with the erorr message, "Error: Unable to find a match: realtime-tests rteval". /payload-aggregate periodic-ci-openshift-release-master-ci-4.19-upgrade-from-stable-4.18-e2e-gcp-ovn-rt-upgrade 5 |
@Miciah: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/c0124e40-4d46-11f0-9623-e230cc269dc8-0 |
This time all the aggregated jobs failed with, "Error: Unable to find a match: python3-cinderclient realtime-tests rteval". I have filed OCPBUGS-57921 for these failures. |
/payload-aggregate periodic-ci-openshift-release-master-ci-4.19-upgrade-from-stable-4.18-e2e-gcp-ovn-rt-upgrade 5 |
@Miciah: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/7029ee70-4e03-11f0-8a1d-a44ec557b951-0 |
gatewayapicontroller: Add checks for empty slices
Check whether the slice of parent resource references in an httproute's status is empty before indexing the slice.
Before this commit, the "Ensure HTTPRoute object is created" test sometimes panicked with "runtime error: index out of range [0] with length 0".
Similarly, check whether the slice of load-balancer ingress points in a service's status is empty before indexing it.
gatewayapicontroller: Clean up resources when done
Delete the gatewayclass and uninstall OSSM after all the Gateway API controller tests are done.
Before this change, the Gateway API controller tests left OSSM installed, including the subscription, CSV, installplan, bundled CRDs, RBAC resources, deployment, service, serviceaccount, etc., when the tests were finished. This clutter could cause problems for other tests, or for the same test if it was run again.
The new cleanup logic uses the
OperatorsV1
client fromgithub.com/operator-framework/operator-lifecycle-manager
. Importing this package requires areplace
stanza foropenshift/api
ingo.mod
.This vendors
github.com/operator-framework/operator-lifecycle-manager v0.30.1-0.20250114164243-1b6752ec65fa
rather than the newest revision in order to avoid bringing in additional problematic vendor bumps that the newest revision would bring in.gatewayapicontroller: Always log errors
Add the error value to some log messages that were missing it.