
Conversation

elias-dbx

A large number of client leases can cause cascading failures within the etcd cluster. Currently, when the keepalive stream hits an error, the client always waits 500ms and then tries to recreate the stream with LeaseKeepAlive(). Since there is no backoff or jitter, if the lease streams originally broke because the servers were overloaded, the retries can cascade and put even more load on the servers.

We can back off with jitter -- similar to what is done for watch streams -- to alleviate server load when leases are causing the overload. A sketch of this retry pattern is shown below.

Related to: #20717
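
The following is a minimal sketch of the retry pattern described above: exponential backoff capped at a maximum, plus full jitter. The constant names, the 20-second cap, and the full-jitter strategy are illustrative assumptions, not the exact code in this change.

```go
// Sketch only: exponential backoff with full jitter for stream retries.
// Names and values here are assumptions for illustration, not the PR's code.
package main

import (
	"fmt"
	"math/rand"
	"time"
)

const (
	retryMinBackoff = 500 * time.Millisecond // matches the current fixed retry wait
	retryMaxBackoff = 20 * time.Second       // assumed cap on the exponential growth
)

// backoffWithJitter returns a randomized delay for the given retry attempt.
// The exponential backoff is capped at retryMaxBackoff, and a uniformly random
// duration in [0, backoff) is returned so that many clients whose streams
// broke at the same time do not retry in lockstep.
func backoffWithJitter(attempt int) time.Duration {
	backoff := retryMinBackoff << attempt
	if backoff <= 0 || backoff > retryMaxBackoff { // <= 0 guards against shift overflow
		backoff = retryMaxBackoff
	}
	return time.Duration(rand.Int63n(int64(backoff)))
}

func main() {
	for attempt := 0; attempt < 8; attempt++ {
		fmt.Printf("attempt %d: sleeping %v before recreating the keepalive stream\n",
			attempt, backoffWithJitter(attempt))
	}
}
```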

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: elias-dbx
Once this PR has been reviewed and has the lgtm label, please assign ahrtr for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot

Hi @elias-dbx. Thanks for your PR.

I'm waiting for an etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.


// retryConnWait is how long to wait before retrying request due to an error
retryConnWait = 500 * time.Millisecond
// retryConnMinBackoff is the starting backoff when retrying a request due to an error
Member

How were these values chosen?

Author

They were chosen to be in line with the default exponential backoff parameters of other widely used client-side libraries. For example, aws-sdk-go-v2 has a default max backoff of 20 seconds: https://github.com/aws/aws-sdk-go-v2/blob/main/aws/retry/standard.go#L31

I kept the initial backoff at 500ms since that is the current retry wait.
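
For concreteness, here is a hedged sketch of how the constants in the excerpt above could read after this change. retryConnWait and retryConnMinBackoff appear in the excerpt; retryConnMaxBackoff and its 20-second value are assumptions based on the explanation above.

```go
const (
	// retryConnWait is how long to wait before retrying a request due to an error.
	retryConnWait = 500 * time.Millisecond
	// retryConnMinBackoff is the starting backoff when retrying a request due to an
	// error; it keeps the previous fixed 500ms delay as the first retry interval.
	retryConnMinBackoff = 500 * time.Millisecond
	// retryConnMaxBackoff (assumed name) caps the exponential backoff, in line with
	// defaults in other client libraries such as aws-sdk-go-v2's 20-second maximum.
	retryConnMaxBackoff = 20 * time.Second
)
```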

@ronaldngounou
Member

/ok-to-test

@k8s-ci-robot

@elias-dbx: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                 | Commit  | Details | Required | Rerun command
pull-etcd-coverage-report | 54153a4 | link    | true     | /test pull-etcd-coverage-report

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.



codecov bot commented Oct 3, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 69.20%. Comparing base (4805687) to head (54153a4).
⚠️ Report is 50 commits behind head on main.

Additional details and impacted files
Files with missing lines | Coverage Δ
client/v3/lease.go       | 91.06% <100.00%> (+0.12%) ⬆️
client/v3/utils.go       | 100.00% <100.00%> (ø)

... and 28 files with indirect coverage changes

@@            Coverage Diff             @@
##             main   #20718      +/-   ##
==========================================
- Coverage   69.20%   69.20%   -0.01%     
==========================================
  Files         420      422       +2     
  Lines       34817    34826       +9     
==========================================
+ Hits        24096    24101       +5     
  Misses       9329     9329              
- Partials     1392     1396       +4     

Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4805687...54153a4. Read the comment docs.

