Redo ratelimit handling #10287
Conversation
Thanks for the PR. I admit though this is a bit of a thinker for me. When I originally wrote the rate limit handling class, this was more or less the test that I and other contributors ran to ensure that it worked fine. It seems at some point one of the changes made it so 429s slip through, because this would not 429 before, especially not that much. I don't have too much time to review this changeset (or even test it), but I don't really like the use of
Anyway, this is failing CI, and I don't think I can test this code anyway.
I've added a few comments. These comments are specifically for fixing issues that cause this PR to not be testable, plus one comment about a docstring concern.
24b619a still hits 429s for test 2. The remaining CI errors are typehinting or formatting issues which I can't fix.
I remember now that the old rate limit, on Discord's side, got rewritten about 2-3 years ago to use GCRA instead of the previous windowed rate limit, and that's probably where the discrepancy is. I know for a fact we tested this for weeks, so something definitely changed.
The original code used a deque of futures because it was slightly more resource efficient since this is a low level part of the library. In
I guess this is fair.
I'm not convinced about this yet. Sleeping and checking whether your value changed under you is just brittle to me.
This is literally not at all what I said. I mentioned earlier we tested the
You can fix your CI failure by running
@Rapptz According to Discord's public changelog, the last update to the ratelimit system was 5 years ago, before the current handler was written. By the way, I was referring to code readability. But even then, doesn't the old handler make multiple futures, whereas this only has one in the deque at a time because of the lock? Also, there's no simpler way to sleep longer after a sleep has finished other than using
@Rapptz Ok, I didn't have insider info. What will it take to merge?
Truthfully, I was willing to merge and test this out earlier despite this being a large changeset made with no prior discussion whatsoever but your behaviour earlier made me way less willing to do so. Please remember that at the end of the day the person responsible for maintaining this stuff long term is me. For future purposes, please learn to communicate better. Either way, I'll do a review of this code real quick.
- This code breaks the `Client.assume_unsync_clock` option. That's what the `use_clock` parameter was for, and it is a public documented option. If you want to change this you either need to still support that parameter or you need to deprecate it. As it stands, looking at this behaviour, you're changing it to always be `True`.
- I still don't like the `Event` usage but I'll let it slide for this since the resource usage difference is a single `bool`.
- I still don't like the strange design where you're sleeping and expecting the value to change right under you.
The rest of these comments are essentially questions.
discord/http.py
Outdated
async def _wait_global(self, start_time: float):
    # Sleep up to 3 times, to account for global reset_at overwriting during sleeps
    for i in range(3):
        seconds = self.http.global_reset_at - start_time
        if seconds > 0:
            await asyncio.sleep(seconds)
            continue
        break
    else:
        raise ValueError('Global reset at changed more than 3 times')
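For reference, the for/else structure above is a bounded re-check of a deadline that other tasks may move while this one sleeps. A minimal self-contained sketch of the same pattern (names here are illustrative, not the PR's actual API):

```python
import asyncio

# Sketch of the bounded "sleep until a movable deadline" pattern: the
# deadline may be pushed forward by another task during the sleep, so it
# is re-checked a bounded number of times instead of looping forever.
class Deadline:
    def __init__(self) -> None:
        self.reset_at = 0.0  # event-loop timestamp to sleep until

async def wait_until(deadline: Deadline, max_extensions: int = 3) -> None:
    loop = asyncio.get_running_loop()
    for _ in range(max_extensions):
        seconds = deadline.reset_at - loop.time()
        if seconds > 0:
            await asyncio.sleep(seconds)
            continue  # deadline may have moved during the sleep; re-check
        break
    else:
        # Bounded: fail instead of sleeping indefinitely
        raise RuntimeError(f'deadline moved more than {max_extensions} times')

async def demo() -> None:
    d = Deadline()
    loop = asyncio.get_running_loop()
    d.reset_at = loop.time() + 0.05

    async def extend() -> None:
        await asyncio.sleep(0.02)
        d.reset_at = loop.time() + 0.05  # push the deadline forward once

    # wait_until sleeps, notices the moved deadline, and sleeps once more
    await asyncio.gather(wait_until(d), extend())

asyncio.run(demo())
```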
Due to the way the global rate limit has been rewritten, it is no longer possible to refactor this out into a customisable lock for multi-process setups. It has always been part of the bucket list to implement this. If this changeset had been communicated earlier in the Discord, I would have also told you about this.
Global rate limit is not a lock anymore; it's just a changeable timestamp that all requests must sleep until before continuing. So, just replicate the value of `http.global_reset_at` to other processes when it is hit and it will still work. By the way, you will need to change `loop.time()` to use `time.time()`.
It's not possible to sync the value between processes.
I thought processes can send each other messages, and from that you can do `http.global_reset_at = new value`.
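As a generic sketch of that idea (this is not discord.py code; names are made up), a changeable timestamp can be replicated to workers on the same machine via shared memory:

```python
import multiprocessing as mp
import time

# Whichever process hits the global 429 writes the new timestamp into a
# shared double; every worker reads it before sending a request.
def worker(reset_at, out):
    out.put(reset_at.value)  # read the replicated timestamp

if __name__ == '__main__':
    reset_at = mp.Value('d', 0.0)  # 'd' = C double in shared memory
    reset_at.value = time.time() + 30.0  # pretend a global 429 was just hit
    results = mp.Queue()
    p = mp.Process(target=worker, args=(reset_at, results))
    p.start()
    seen = results.get()
    p.join()
    assert seen == reset_at.value
```

This only covers processes on one machine; separate hosts would need an external store, which may be the multi-process setup the reviewer has in mind.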
discord/http.py
Outdated
        continue  # sleep again
    break
else:
    raise ValueError('Reset at changed more than 3 times')
I don't get why this is a failure condition that has to bubble out. Ditto for the rest of them.
I expect `reset_at` to change during the sleep 3 times at most. More than that and I think the client is experiencing extreme lag or something, so the request should not continue.
I'd rather it surface another exception, either from `aiohttp` or `HTTPException`, instead. `ValueError` is a terrible exception to raise for this stuff.
discord/http.py
Outdated
if tries == 4 or e.errno not in (54, 10054):
    raise ValueError('Connection reset by peer')
retry_seconds: int = 1 + tries * 2
fmt = 'OS error for %s %s. Retrying in %d seconds.'
_log.warning(fmt, method, url, retry_seconds)
await asyncio.sleep(retry_seconds)
Why'd you swap this?
To avoid the `continue` and have one less line.
Code clarity matters more than line count. Also, in the case where the OSError isn't a connection reset, your code now gives the wrong error message.
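A hedged sketch of the shape the reviewer is pointing at (illustrative names, not discord.py's actual code): re-raise the original `OSError` so a non-reset error keeps its own message, and retry only on connection-reset errnos.

```python
import asyncio

RESET_ERRNOS = (54, 10054)  # ECONNRESET on macOS and Windows respectively

async def request_with_retries(send, method, url, *, tries=5, base_delay=1.0):
    for attempt in range(tries):
        try:
            return await send(method, url)
        except OSError as e:
            # Out of attempts, or not a connection reset: propagate as-is,
            # preserving the original error and its message.
            if attempt == tries - 1 or e.errno not in RESET_ERRNOS:
                raise
            await asyncio.sleep((1 + attempt * 2) * base_delay)

async def demo():
    calls = []

    async def flaky(method, url):
        calls.append(url)
        if len(calls) < 3:
            raise OSError(54, 'Connection reset by peer')
        return 'ok'

    result = await request_with_retries(flaky, 'GET', '/channels', base_delay=0)
    assert result == 'ok' and len(calls) == 3

asyncio.run(demo())
```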
discord/http.py
Outdated
# We've run out of retries, raise
if response is not None:
    raise HTTPException(response, data)
Why'd you remove the 500 raise?
I forgot to add a `tries` check to fix waiting 9 extra seconds on the last try for no reason. Unless an aiohttp error happens on the last try, it's never met.
Change it back.
if copy == self.reset_at:
    self.reset()
What is this for?
If `reset_at` hasn't changed, then we can set `remaining = limit`. A successful sleep, meaning one where we don't need to sleep again, can refill the ratelimit remaining to the top.
If you mean the seeming redundancy of the `copy` variable, it's needed in case of a bucket change. Without it, `Ratelimit.reset()` would be called unnecessarily when `reset_at` has changed and there is `self.remaining` left. Saving a call is also good in case you want to replicate ratelimit state to other processes too.
# If the previous hash was an actual Discord hash then this means the
# hash has changed sporadically.
# This can be due to two reasons
# 1. It's a sub-ratelimit which is hard to handle
# 2. The rate limit information genuinely changed
# There is no good way to discern these, Discord doesn't provide a way to do so.
# At best, there will be some form of logging to help catch it.
# Alternating sub-ratelimits means that the requests oscillate between
# different underlying rate limits -- this can lead to unexpected 429s
# It is unavoidable.
Not that it matters that much but why did you remove my comment?
It conflicts with other comments explaining the strategy.
    self._buckets[ratelimit.key] = ratelimit

# Global rate limit 429 wont have ratelimit headers (also can't tell if it's one-shot)
elif response.headers.get('X-RateLimit-Global'):
Why are we using this instead of the `global` inner key, which to my knowledge is more accurate and actually consistently given?
We haven't called `json_or_text` yet, so we can only use headers at this point.
Reorder the code so it is called. The `Retry-After` header is in whole seconds; `retry_after` in the JSON is in float seconds.
I didn't tell you to remove `Client.assume_unsync_clock`.
@Rapptz
@Rapptz A future replaced the event because you mentioned resource issues.
Previous comments still apply.
- I did not ask for `Client.assume_unsync_clock` to be removed.
- I still think bubbling this error as a `RuntimeError`, or even at all, is not good design.
I raise `RuntimeError` under there because it's an unreachable exception. It is not meant to ever get hit, and if it does then it's bizarre that it happened -- it's purely there to satisfy the type checker and is not publicly facing.
Anyway, I will be going on a trip for the month of September so this is my last review of this PR until I get home and I don't really have time to do an even deeper review right now since I'm preoccupied preparing for said trip.
    self.dirty = False
    self.reset_at = 0.0

def update(self, response: aiohttp.ClientResponse, data: Union[Dict[str, Any], str]) -> bool:
This should probably only accept `Dict[str, Any]` or `None` by restricting it at the calling site so you can get rid of the `# type: ignore`. I guess it'd also be good if you made a `TypedDict` with the expected rate limit response type, since that's what this is accepting, and narrowed to that.
I tried to fix the typehint issue but that method adds extra calls, so `# type: ignore` is the simplest fix.
discord/http.py
Outdated
    self.reset_at = 0.0

def update(self, response: aiohttp.ClientResponse, data: Union[Dict[str, Any], str]) -> bool:
    # Shared scope 429 has longer "reset_at", determined using the retry-after field
I don't think this is forced to be true.
A shared scope is just a shared resource, like a channel message rate limit where you have a set number of messages sent per channel, but that doesn't necessarily mean it's always going to be longer. Right now I can't think of any shared resources that are small in size, though I do know that Emoji resources are only shared.
def update(self, response: aiohttp.ClientResponse, data: Union[Dict[str, Any], str]) -> bool:
    # Shared scope 429 has longer "reset_at", determined using the retry-after field
    limit = int(response.headers['X-Ratelimit-Limit'])
    if response.headers.get('X-RateLimit-Scope') == 'shared':
This should have a comment that this header is only returned on 429 responses.
discord/http.py
Outdated
async def _wait(self):
    # Consider waiting if none is remaining
    if not self.remaining:
I suggest using explicit checks instead of passive boolean ones for all the integer comparisons.
Suggested change:
-if not self.remaining:
+if self.remaining == 0:
discord/http.py
Outdated
if not self.remaining:
    # If reset_at is not set yet, wait for the last request, if outgoing, to finish first
    # for up to 3 seconds instead of using aiohttp's default 5 min timeout.
    if not self.reset_at and (not self._last_request or self.http.loop.time() - self._last_request < 3):
Suggested change:
-if not self.reset_at and (not self._last_request or self.http.loop.time() - self._last_request < 3):
+if self.reset_at == 0.0 and (not self._last_request or self.http.loop.time() - self._last_request < 3):
    delta = self.http.loop.time() - self._last_request
    return delta >= 300 and (self.one_shot or (self.outgoing == 0 and self.pending == 0))

def _wake(self) -> None:
The first request will always have this future set to `None`; not sure if this is what you intended.
Co-authored-by: Danny <[email protected]>
@Rapptz RuntimeError can be avoided with a
Summary
The old handler hits 429s for concurrent requests and unknown sub-ratelimits. This new handler prioritizes a lower `remaining` value and a sooner `reset_at` value, and accounts for the token window changing during a sleep, so it still maintains the same throughput the old handler had.
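One possible reading of that prioritisation, as a self-contained sketch (the names and merge rules here are illustrative, not the PR's exact code): when new rate limit information arrives, keep the more conservative state, i.e. the lower remaining count and the sooner non-zero reset time.

```python
class BucketState:
    def __init__(self) -> None:
        self.remaining = 0
        self.reset_at = 0.0

    def merge(self, remaining: int, reset_at: float) -> None:
        # Prefer the sooner non-zero reset_at
        if self.reset_at == 0.0 or reset_at < self.reset_at:
            self.reset_at = reset_at
        # Prefer the lower remaining, once one has been observed
        self.remaining = min(self.remaining, remaining) if self.remaining else remaining

state = BucketState()
state.merge(remaining=5, reset_at=10.0)
state.merge(remaining=3, reset_at=12.0)
assert state.remaining == 3 and state.reset_at == 10.0
```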
Test code:
Old results:
Test 1: Finished in 213.27s with 0 429s.
Test 2: Finished in 46.29s with 86 429s.
New results:
Test 1: Finished in 208.48s with 0 429s.
Test 2: Finished in 47.15s with 0 429s.
Checklist