Skip to content

OCPBUGS-56736: Improve error messages for project Delete errors #520

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

vrutkovs
Copy link
Member

No description provided.

@openshift-ci-robot openshift-ci-robot added jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels May 28, 2025
@openshift-ci-robot
Copy link

@vrutkovs: This pull request references Jira Issue OCPBUGS-56736, which is invalid:

  • expected the bug to target the "4.20.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@vrutkovs
Copy link
Member Author

/payload-job periodic-ci-openshift-release-master-ci-4.20-e2e-aws-ovn-techpreview-serial

Copy link
Contributor

openshift-ci bot commented May 28, 2025

@vrutkovs: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-ci-4.20-e2e-aws-ovn-techpreview-serial

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/41a86440-3bcb-11f0-9228-f720f943c789-0

@openshift-ci openshift-ci bot requested review from deads2k and derekwaynecarr May 28, 2025 13:56
Copy link
Contributor

openshift-ci bot commented May 28, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: vrutkovs
Once this PR has been reviewed and has the lgtm label, please assign deads2k for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@vrutkovs vrutkovs force-pushed the validate-delete-error-message branch from b1385d3 to 6983f0d Compare May 29, 2025 06:39
@vrutkovs
Copy link
Member Author

/payload-job periodic-ci-openshift-release-master-ci-4.20-e2e-aws-ovn-techpreview-serial

Copy link
Contributor

openshift-ci bot commented May 29, 2025

@vrutkovs: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-ci-4.20-e2e-aws-ovn-techpreview-serial

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/b4464210-3c57-11f0-8efa-bdf704c2be55-0

@vrutkovs
Copy link
Member Author

/payload-aggregate periodic-ci-openshift-release-master-ci-4.20-e2e-aws-ovn-techpreview-serial 10

Copy link
Contributor

openshift-ci bot commented May 29, 2025

@vrutkovs: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-ci-4.20-e2e-aws-ovn-techpreview-serial

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/023d2050-3c86-11f0-9770-8c3264ee69bc-0

@vrutkovs vrutkovs force-pushed the validate-delete-error-message branch from 6983f0d to 7d73533 Compare May 29, 2025 15:48
@vrutkovs
Copy link
Member Author

/payload-aggregate periodic-ci-openshift-release-master-ci-4.20-e2e-aws-ovn-techpreview-serial 10

Copy link
Contributor

openshift-ci bot commented May 29, 2025

@vrutkovs: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-ci-4.20-e2e-aws-ovn-techpreview-serial

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/795a9cd0-3ca4-11f0-8236-ddb4864c0c81-0

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 29, 2025
@vrutkovs vrutkovs force-pushed the validate-delete-error-message branch from 7d73533 to 5f18128 Compare May 30, 2025 06:26
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 30, 2025
@vrutkovs vrutkovs force-pushed the validate-delete-error-message branch from 5f18128 to a5ca6cb Compare May 30, 2025 06:27
@vrutkovs
Copy link
Member Author

/payload-aggregate periodic-ci-openshift-release-master-ci-4.20-e2e-aws-ovn-techpreview-serial 10

Copy link
Contributor

openshift-ci bot commented May 30, 2025

@vrutkovs: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-ci-4.20-e2e-aws-ovn-techpreview-serial

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/48de3cd0-3d1f-11f0-9dae-2794b1500ac8-0

@vrutkovs
Copy link
Member Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels May 30, 2025
@openshift-ci-robot
Copy link

@vrutkovs: This pull request references Jira Issue OCPBUGS-56736, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.20.0) matches configured target version for branch (4.20.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @wangke19

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested a review from wangke19 May 30, 2025 13:44
@vrutkovs vrutkovs force-pushed the validate-delete-error-message branch from a5ca6cb to 3fdbc85 Compare May 30, 2025 13:46
return false, fmt.Errorf("validation func failed: %w", err)
}
}
err := s.client.Delete(ctx, name, opts)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shouldn't retry unless the error is a conflict. Also, I see that the Delete call is inside the retried func even when the validation func is nil -- if there is any scenario today or in the future that could cause the delete to be retried without preconditions, that would be bad. I'd feel better having two separate delete paths to prevent it, with a comment.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea

// Delete deletes a Project specified by its name
func (s *REST) Delete(ctx context.Context, name string, objectFunc rest.ValidateObjectFunc, options *metav1.DeleteOptions) (runtime.Object, bool, error) {
var opts metav1.DeleteOptions
if options != nil {
opts = *options
}
return &metav1.Status{Status: metav1.StatusSuccess}, false, s.client.Delete(ctx, name, opts)
attempt := 0
err := wait.ExponentialBackoffWithContext(ctx, wait.Backoff{Steps: maxRetriesOnConflict}, func(ctx context.Context) (bool, error) {
Copy link
Contributor

@benluddy benluddy May 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this is appropriate retry function? How do you want to configure the backoff? Edit: It would be nice to see a comment with some reasoning about how we arrive at specific numbers for tuning this. This is happening on the critical path for project deletes and will have implications on client-perceived latency.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've capped this to up to 1 second with additional comment to make sure it doesn't get lost

err := s.client.Delete(ctx, name, opts)
return err == nil, err
})
if err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ExponentialBackoffWithContext [..] immediately returns an error if the condition returns an error

I don't understand how this helped. Isn't the doc for this function telling me that it would return immediately if there were a delete conflict?

// Delete deletes a Project specified by its name
func (s *REST) Delete(ctx context.Context, name string, objectFunc rest.ValidateObjectFunc, options *metav1.DeleteOptions) (runtime.Object, bool, error) {
var opts metav1.DeleteOptions
if options != nil {
opts = *options
}
return &metav1.Status{Status: metav1.StatusSuccess}, false, s.client.Delete(ctx, name, opts)
attempt := 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you mean to remove or use this? I see it being incremented but not used anywhere else.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it was part of extended debugging

@vrutkovs vrutkovs force-pushed the validate-delete-error-message branch from 3fdbc85 to d429563 Compare May 30, 2025 15:45
@vrutkovs
Copy link
Member Author

/payload-aggregate periodic-ci-openshift-release-master-ci-4.20-e2e-aws-ovn-techpreview-serial 10

Copy link
Contributor

openshift-ci bot commented May 30, 2025

@vrutkovs: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-ci-4.20-e2e-aws-ovn-techpreview-serial

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/2a4a5710-3d6d-11f0-898d-a24e04f40707-0

@vrutkovs vrutkovs force-pushed the validate-delete-error-message branch from d429563 to cc4321e Compare June 2, 2025 08:14
@vrutkovs
Copy link
Member Author

vrutkovs commented Jun 2, 2025

/payload-aggregate periodic-ci-openshift-release-master-ci-4.20-e2e-aws-ovn-techpreview-serial 10

Copy link
Contributor

openshift-ci bot commented Jun 2, 2025

@vrutkovs: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-ci-4.20-e2e-aws-ovn-techpreview-serial

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/a2233090-3f89-11f0-954b-910aecaa9a62-0

@vrutkovs
Copy link
Member Author

vrutkovs commented Jun 2, 2025

/payload-aggregate periodic-ci-openshift-release-master-ci-4.20-e2e-aws-ovn-techpreview-serial 10

Copy link
Contributor

openshift-ci bot commented Jun 2, 2025

@vrutkovs: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-ci-4.20-e2e-aws-ovn-techpreview-serial

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/21ccadd0-3f9e-11f0-81cd-6839ae5e2905-0

Comment on lines 245 to 246
opts.Preconditions.UID = &projectObj.UID
opts.Preconditions.ResourceVersion = &projectObj.ResourceVersion
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if the client already specified either precondition in its request options? We need to check for that and return a conflict response directly instead of ignoring them. Please add a test case for this too.

// Delete deletes a Project specified by its name
func (s *REST) Delete(ctx context.Context, name string, objectFunc rest.ValidateObjectFunc, options *metav1.DeleteOptions) (runtime.Object, bool, error) {
var opts metav1.DeleteOptions
if options != nil {
opts = *options
}
return &metav1.Status{Status: metav1.StatusSuccess}, false, s.client.Delete(ctx, name, opts)
var lastErr error
err := wait.ExponentialBackoffWithContext(ctx, wait.Backoff{Steps: maxRetriesOnConflict, Duration: maxDuration}, func(ctx context.Context) (bool, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I'm understanding the Godoc for Duration correctly, this means that we will sleep for one second between retries. That seems high to me. I bet it is a lot longer than a typical total latency of both namespace requests combined.

We can configure the other fields for exponential backoff so that initial retry is fairly fast.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, right, somehow I though "Duration" is max duration we're allowed to spend. I think wait.Backoff{Steps: maxRetriesOnConflict, Factor: 1/maxRetriesOnConflict, Cap: maxDuration, Duration: maxDuration/maxRetriesOnConflict} would make it "up to 1 second" and ensure it has several retries

case err == nil:
return true, nil
case kerrors.IsConflict(err):
lastErr = err
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add tests showing that retry happens on conflict, no retry happens on non-conflict, and one where retries are exhausted please?

Comment on lines 263 to 273
if err != nil && wait.ErrorInterrupted(err) != nil {
return &metav1.Status{Status: metav1.StatusFailure}, false, lastErr
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The response should indicate timeout if our wait loop times out. If you use https://github.com/kubernetes/kubernetes/blob/62f72addf26d2fd25e060554bcd8cf5bdc10e50c/staging/src/k8s.io/apimachinery/pkg/api/errors/errors.go#L364 as the returned error and nil for the returned runtime.Object, does the client see what we want?

return false, err
}
})
if err != nil && wait.ErrorInterrupted(err) != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't wait.ErrorInterrupted(err) != nil always be true? I think you meant wait.Interrupted(err).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, updated

}
})
if err != nil && wait.ErrorInterrupted(err) != nil {
return &metav1.Status{Status: metav1.StatusFailure}, false, lastErr
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this will plumb non-conflict errors to the client. For example, if the namespace is not found then we should return project not found -- is there a test for that?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a test for that

@vrutkovs vrutkovs force-pushed the validate-delete-error-message branch from cc4321e to 15d0b7d Compare June 10, 2025 09:03
@vrutkovs
Copy link
Member Author

/test e2e-aws-ovn-serial

Copy link
Contributor

openshift-ci bot commented Jun 11, 2025

@vrutkovs: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/okd-scos-e2e-aws-ovn 15d0b7d link false /test okd-scos-e2e-aws-ovn
ci/prow/e2e-aws-ovn-serial 15d0b7d link true /test e2e-aws-ovn-serial

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

if err != nil && wait.Interrupted(err) {
return &metav1.Status{Status: metav1.StatusFailure}, false, lastErr
}
return &metav1.Status{Status: metav1.StatusSuccess}, false, nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens when err != nil && !wait.Interrupted(err)?

Comment on lines +248 to +253
if opts.Preconditions.UID == nil {
opts.Preconditions.UID = &projectObj.UID
}
if opts.Preconditions.ResourceVersion == nil {
opts.Preconditions.ResourceVersion = &projectObj.ResourceVersion
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't want to retry conflicts that are caused by client-provided preconditions (they are probably doomed unless the request changes).

If we might have propagated one precondition from the request, and added a second precondition here, it becomes hard to robustly determine which precondition caused a conflict. One way to solve this might be to inspect the fresh namespace returned from Get and enforce any client-provided preconditions immediately. After that, we know that both preconditions passed to the namespace Delete came from this code and that a retry might succeed with a newer UID/RV.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants