clientv3: backoff resetting LeaseKeepAlive stream #20718
base: main
Conversation
A large number of client leases can cause cascading failures within the etcd cluster. Currently, when the keepalive stream hits an error, the client always waits 500ms and then tries to recreate the stream with LeaseKeepAlive(). Because there is no backoff or jitter, if the lease streams originally broke due to overload on the servers, the retries can put even more load on the servers and cause a cascading failure. Backing off with jitter -- similar to what is done for watch streams -- alleviates server load in the case where the leases themselves are causing the overload.

Related to: #20717

Signed-off-by: Elias Carter <[email protected]>
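The full diff is not reproduced on this page, but as a rough, minimal sketch of the technique the description refers to (exponential backoff with jitter around stream recreation), the loop below illustrates the idea. The constant values, the openStream and jittered helpers, and the jitter range are assumptions made for illustration, not the PR's actual implementation:

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

const (
	minBackoff = 500 * time.Millisecond // same as the existing fixed retry wait
	maxBackoff = 20 * time.Second       // illustrative cap only
)

// openStream stands in for a call such as LeaseKeepAlive() that can keep
// failing while the servers are overloaded. Here it fails a few times and
// then succeeds so the example terminates.
func openStream(attempt int) error {
	if attempt < 4 {
		return errors.New("keepalive stream broken: server overloaded")
	}
	return nil
}

// jittered returns a random duration in [d/2, d) so that many clients whose
// streams broke at the same time do not all reconnect at the same instant.
func jittered(d time.Duration) time.Duration {
	return d/2 + time.Duration(rand.Int63n(int64(d/2)))
}

func main() {
	backoff := minBackoff
	for attempt := 1; ; attempt++ {
		if err := openStream(attempt); err == nil {
			fmt.Println("stream re-established")
			return
		}
		wait := jittered(backoff)
		fmt.Printf("attempt %d failed, sleeping %v\n", attempt, wait)
		time.Sleep(wait)
		// Grow the backoff exponentially, capped so a long outage never
		// leads to unboundedly long waits between attempts.
		if backoff *= 2; backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
}
```

A real client would also reset the backoff to its minimum once a recreated stream stays healthy, so a later failure starts again from the short wait rather than the cap.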
// retryConnWait is how long to wait before retrying request due to an error
retryConnWait = 500 * time.Millisecond
// retryConnMinBackoff is the starting backoff when retrying a request due to an error
How were these values chosen?
They were chosen to be in line with the default exponential backoff parameters of other widely used client-side libraries. For example, the aws-sdk-go-v2 library has a default max backoff of 20 seconds: https://github.com/aws/aws-sdk-go-v2/blob/main/aws/retry/standard.go#L31
I kept the initial backoff at 500ms, as that is the current backoff time.
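To make the effect of those parameters concrete, here is a tiny, purely illustrative snippet. The 20-second cap is borrowed from the aws-sdk-go-v2 default mentioned above and is not necessarily the value this PR uses; it shows the capped doubling sequence before any jitter is applied:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Illustrative only: 500ms initial backoff, 20s cap (not necessarily the PR's values).
	backoff, maxWait := 500*time.Millisecond, 20*time.Second
	for attempt := 1; attempt <= 8; attempt++ {
		fmt.Printf("retry %d: %v\n", attempt, backoff)
		if backoff *= 2; backoff > maxWait {
			backoff = maxWait
		}
	}
	// Output: 500ms, 1s, 2s, 4s, 8s, 16s, 20s, 20s
}
```

Jitter would then spread each of these delays across clients so that they do not all retry at the same instant.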
Codecov Report: ✅ All modified and coverable lines are covered by tests (plus 28 files with indirect coverage changes).

@@            Coverage Diff             @@
##             main   #20718      +/-   ##
==========================================
- Coverage   69.20%   69.20%   -0.01%
==========================================
  Files         420      422       +2
  Lines       34817    34826       +9
==========================================
+ Hits        24096    24101       +5
  Misses       9329     9329
- Partials     1392     1396       +4