-
Notifications
You must be signed in to change notification settings - Fork 98
OCPBUGS-56736: Improve error messages for project Delete errors #520
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
@vrutkovs: This pull request references Jira Issue OCPBUGS-56736, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. In response to this: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
/payload-job periodic-ci-openshift-release-master-ci-4.20-e2e-aws-ovn-techpreview-serial |
@vrutkovs: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/41a86440-3bcb-11f0-9228-f720f943c789-0 |
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: vrutkovs The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
b1385d3
to
6983f0d
Compare
/payload-job periodic-ci-openshift-release-master-ci-4.20-e2e-aws-ovn-techpreview-serial |
@vrutkovs: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/b4464210-3c57-11f0-8efa-bdf704c2be55-0 |
/payload-aggregate periodic-ci-openshift-release-master-ci-4.20-e2e-aws-ovn-techpreview-serial 10 |
@vrutkovs: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/023d2050-3c86-11f0-9770-8c3264ee69bc-0 |
6983f0d
to
7d73533
Compare
/payload-aggregate periodic-ci-openshift-release-master-ci-4.20-e2e-aws-ovn-techpreview-serial 10 |
@vrutkovs: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/795a9cd0-3ca4-11f0-8236-ddb4864c0c81-0 |
…g an object" This reverts commit 28133f9.
7d73533
to
5f18128
Compare
5f18128
to
a5ca6cb
Compare
/payload-aggregate periodic-ci-openshift-release-master-ci-4.20-e2e-aws-ovn-techpreview-serial 10 |
@vrutkovs: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/48de3cd0-3d1f-11f0-9dae-2794b1500ac8-0 |
/jira refresh |
@vrutkovs: This pull request references Jira Issue OCPBUGS-56736, which is valid. 3 validation(s) were run on this bug
Requesting review from QA contact: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
a5ca6cb
to
3fdbc85
Compare
return false, fmt.Errorf("validation func failed: %w", err) | ||
} | ||
} | ||
err := s.client.Delete(ctx, name, opts) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This shouldn't retry unless the error is a conflict. Also, I see that the Delete call is inside the retried func even when the validation func is nil -- if there is any scenario today or in the future that could cause the delete to be retried without preconditions, that would be bad. I'd feel better having two separate delete paths to prevent it, with a comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea
// Delete deletes a Project specified by its name | ||
func (s *REST) Delete(ctx context.Context, name string, objectFunc rest.ValidateObjectFunc, options *metav1.DeleteOptions) (runtime.Object, bool, error) { | ||
var opts metav1.DeleteOptions | ||
if options != nil { | ||
opts = *options | ||
} | ||
return &metav1.Status{Status: metav1.StatusSuccess}, false, s.client.Delete(ctx, name, opts) | ||
attempt := 0 | ||
err := wait.ExponentialBackoffWithContext(ctx, wait.Backoff{Steps: maxRetriesOnConflict}, func(ctx context.Context) (bool, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this is appropriate retry function? How do you want to configure the backoff? Edit: It would be nice to see a comment with some reasoning about how we arrive at specific numbers for tuning this. This is happening on the critical path for project deletes and will have implications on client-perceived latency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've capped this to up to 1 second with additional comment to make sure it doesn't get lost
err := s.client.Delete(ctx, name, opts) | ||
return err == nil, err | ||
}) | ||
if err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ExponentialBackoffWithContext [..] immediately returns an error if the condition returns an error
I don't understand how this helped. Isn't the doc for this function telling me that it would return immediately if there were a delete conflict?
// Delete deletes a Project specified by its name | ||
func (s *REST) Delete(ctx context.Context, name string, objectFunc rest.ValidateObjectFunc, options *metav1.DeleteOptions) (runtime.Object, bool, error) { | ||
var opts metav1.DeleteOptions | ||
if options != nil { | ||
opts = *options | ||
} | ||
return &metav1.Status{Status: metav1.StatusSuccess}, false, s.client.Delete(ctx, name, opts) | ||
attempt := 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you mean to remove or use this? I see it being incremented but not used anywhere else.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it was part of extended debugging
3fdbc85
to
d429563
Compare
/payload-aggregate periodic-ci-openshift-release-master-ci-4.20-e2e-aws-ovn-techpreview-serial 10 |
@vrutkovs: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/2a4a5710-3d6d-11f0-898d-a24e04f40707-0 |
d429563
to
cc4321e
Compare
/payload-aggregate periodic-ci-openshift-release-master-ci-4.20-e2e-aws-ovn-techpreview-serial 10 |
@vrutkovs: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/a2233090-3f89-11f0-954b-910aecaa9a62-0 |
/payload-aggregate periodic-ci-openshift-release-master-ci-4.20-e2e-aws-ovn-techpreview-serial 10 |
@vrutkovs: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/21ccadd0-3f9e-11f0-81cd-6839ae5e2905-0 |
opts.Preconditions.UID = &projectObj.UID | ||
opts.Preconditions.ResourceVersion = &projectObj.ResourceVersion |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if the client already specified either precondition in its request options? We need to check for that and return a conflict response directly instead of ignoring them. Please add a test case for this too.
// Delete deletes a Project specified by its name | ||
func (s *REST) Delete(ctx context.Context, name string, objectFunc rest.ValidateObjectFunc, options *metav1.DeleteOptions) (runtime.Object, bool, error) { | ||
var opts metav1.DeleteOptions | ||
if options != nil { | ||
opts = *options | ||
} | ||
return &metav1.Status{Status: metav1.StatusSuccess}, false, s.client.Delete(ctx, name, opts) | ||
var lastErr error | ||
err := wait.ExponentialBackoffWithContext(ctx, wait.Backoff{Steps: maxRetriesOnConflict, Duration: maxDuration}, func(ctx context.Context) (bool, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I'm understanding the Godoc for Duration correctly, this means that we will sleep for one second between retries. That seems high to me. I bet it is a lot longer than a typical total latency of both namespace requests combined.
We can configure the other fields for exponential backoff so that initial retry is fairly fast.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, right, somehow I though "Duration" is max duration we're allowed to spend. I think wait.Backoff{Steps: maxRetriesOnConflict, Factor: 1/maxRetriesOnConflict, Cap: maxDuration, Duration: maxDuration/maxRetriesOnConflict}
would make it "up to 1 second" and ensure it has several retries
case err == nil: | ||
return true, nil | ||
case kerrors.IsConflict(err): | ||
lastErr = err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add tests showing that retry happens on conflict, no retry happens on non-conflict, and one where retries are exhausted please?
if err != nil && wait.ErrorInterrupted(err) != nil { | ||
return &metav1.Status{Status: metav1.StatusFailure}, false, lastErr | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The response should indicate timeout if our wait loop times out. If you use https://github.com/kubernetes/kubernetes/blob/62f72addf26d2fd25e060554bcd8cf5bdc10e50c/staging/src/k8s.io/apimachinery/pkg/api/errors/errors.go#L364 as the returned error and nil
for the returned runtime.Object
, does the client see what we want?
return false, err | ||
} | ||
}) | ||
if err != nil && wait.ErrorInterrupted(err) != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Won't wait.ErrorInterrupted(err) != nil
always be true? I think you meant wait.Interrupted(err)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, updated
} | ||
}) | ||
if err != nil && wait.ErrorInterrupted(err) != nil { | ||
return &metav1.Status{Status: metav1.StatusFailure}, false, lastErr |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this will plumb non-conflict errors to the client. For example, if the namespace is not found then we should return project not found -- is there a test for that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a test for that
cc4321e
to
15d0b7d
Compare
/test e2e-aws-ovn-serial |
@vrutkovs: The following tests failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
if err != nil && wait.Interrupted(err) { | ||
return &metav1.Status{Status: metav1.StatusFailure}, false, lastErr | ||
} | ||
return &metav1.Status{Status: metav1.StatusSuccess}, false, nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens when err != nil && !wait.Interrupted(err)
?
if opts.Preconditions.UID == nil { | ||
opts.Preconditions.UID = &projectObj.UID | ||
} | ||
if opts.Preconditions.ResourceVersion == nil { | ||
opts.Preconditions.ResourceVersion = &projectObj.ResourceVersion | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't want to retry conflicts that are caused by client-provided preconditions (they are probably doomed unless the request changes).
If we might have propagated one precondition from the request, and added a second precondition here, it becomes hard to robustly determine which precondition caused a conflict. One way to solve this might be to inspect the fresh namespace returned from Get and enforce any client-provided preconditions immediately. After that, we know that both preconditions passed to the namespace Delete came from this code and that a retry might succeed with a newer UID/RV.
No description provided.