
Conversation

imptype commented Aug 29, 2025

Summary

The old handler hits 429s for concurrent requests and unknown sub-ratelimits. This new handler prioritizes a lower remaining value and a sooner reset_at value, and accounts for the token window changing during a sleep, so it still maintains the same throughput the old handler had.
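A minimal sketch of one way to read that prioritization (illustrative only, not the PR's actual code; merge_ratelimit and its parameters are hypothetical names):

# Hypothetical helper: when concurrent, out-of-order responses for the same bucket
# disagree, keep the lower remaining count and the sooner reset time so the
# tracked state never becomes looser than what Discord last reported.
def merge_ratelimit(remaining: int, reset_at: float, new_remaining: int, new_reset_at: float) -> tuple[int, float]:
    return min(remaining, new_remaining), min(reset_at, new_reset_at)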

Test code:

import time
import asyncio
import discord
from discord.ext import commands

intents = discord.Intents.default()
intents.message_content = True  # needed so prefix commands can read message content
bot = commands.Bot(intents=intents, command_prefix='.')

# Test 1: Send 200 messages sequentially in the same channel
@bot.command()
async def test1(ctx):
    start = time.time()
    for i in range(200):
        await ctx.send(f'hi {i}')
    print('Finished test1 in:', time.time() - start)

# Test 2: Send 50 messages concurrently in the same channel
@bot.command()
async def test2(ctx):
    start = time.time()
    await asyncio.gather(*[
        ctx.send(f'hi {i}') for i in range(50)
    ])
    print('Finished test2 in:', time.time() - start)

bot.run('TOKEN')

Old results:
Test 1: Finished in 213.27s with 0 429s.
Test 2: Finished in 46.29s with 86 429s.

New results:
Test 1: Finished in 208.48s with 0 429s.
Test 2: Finished in 47.15s with 0 429s.

Checklist

  • If code changes were made then they have been tested.
    • I have updated the documentation to reflect the changes.
  • This PR fixes an issue.
  • This PR adds something new (e.g. new method or parameters).
  • This PR is a breaking change (e.g. methods or parameters removed/renamed)
  • This PR is not a code change (e.g. documentation, README, ...)

Rapptz (Owner) commented Aug 30, 2025

Thanks for the PR.

I admit though this is a bit of a thinker for me. When I originally wrote the rate limit handling class, this was more or less the test that other contributors and I ran to ensure that it worked fine. It seems at some point one of the changes made it so 429s slip through, because this would not 429 before, especially not that much.

I don't have too much time to review this changeset (or even test it) but I don't really like the use of Event here since the original was made with flattening of asyncio resources in mind. It also seems weird to me to handle global rate limits within that class instead of outside since the global rate limit affects all requests. It also feels incredibly brittle to me that almost all of this code is essentially just retrying things multiple times just in case something concurrently happened in the background instead of properly synchronizing it.

Anyway this is failing CI, and I don't think I can test this code anyway.

HEROgold left a comment

I've added a few comments, specifically for fixing issues that prevent this PR from being testable.

And one comment about a docstring concern.

imptype (Author) commented Aug 30, 2025

> Thanks for the PR.
>
> I admit though this is a bit of a thinker for me. When I originally wrote the rate limit handling class, this was more or less the test that other contributors and I ran to ensure that it worked fine. It seems at some point one of the changes made it so 429s slip through, because this would not 429 before, especially not that much.
>
> I don't have too much time to review this changeset (or even test it) but I don't really like the use of Event here since the original was made with flattening of asyncio resources in mind. It also seems weird to me to handle global rate limits within that class instead of outside since the global rate limit affects all requests. It also feels incredibly brittle to me that almost all of this code is essentially just retrying things multiple times just in case something concurrently happened in the background instead of properly synchronizing it.
>
> Anyway this is failing CI, and I don't think I can test this code anyway.

24b619a still hits 429s for test 2.

asyncio.Event will never cause a deadlock because it's dismissed by asyncio.wait after 3 seconds, and personally I think lock + event is simpler than lock + future.

The global wait is inside the class in case the global ratelimit was hit during the wait to acquire, so this avoids more global 429s.

During a sleep, the window can change due to requests behind being late, so it must check if there is no remaining left in the new window and, if so, sleep again - the simplest way is to sleep and check rather than make it event-based or something; it still works fine.

The use case for concurrency is more frequent and relevant than you realize: asyncio.gather(50) in the same channel is the same as 50 people running a ping command in the same channel at the same time, which currently will hit 86 429s.

The remaining CI errors are typehinting or formatting issues which I can't fix.
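As a rough illustration of the Event-with-timeout pattern described above (a sketch under assumed names, not the PR's actual code):

import asyncio

async def wait_for_wake(event: asyncio.Event) -> None:
    # Wait on the event but give up after 3 seconds, so a missed wake-up
    # can never deadlock a waiter.
    task = asyncio.ensure_future(event.wait())
    done, pending = await asyncio.wait({task}, timeout=3)
    for t in pending:
        t.cancel()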

Rapptz (Owner) commented Aug 30, 2025

> 24b619a still hits 429s for test 2.

I remember now that the old rate limit, on Discord's side, got rewritten about 2-3 years ago to use GCRA instead of the previous windowed rate limit and that's probably where the discrepancy is. I know for a fact we tested this for weeks so something definitely changed.

> asyncio.Event will never cause a deadlock because it's dismissed by asyncio.wait after 3 seconds, and personally I think lock + event is simpler than lock + future.

The original code used a deque of futures because it was slightly more resource efficient, since this is a low level part of the library. In asyncio both Lock and Event are just interfaces over a deque of futures and a boolean, with no control over the wakers compared to a semaphore. I don't really get the confusion about a Future here.
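For context, a minimal sketch of the deque-of-futures idea being described (illustrative only, not the library's actual implementation):

import asyncio
from collections import deque

waiters: deque[asyncio.Future] = deque()

async def wait_for_turn() -> None:
    # Each waiter parks on its own future appended to the deque.
    fut = asyncio.get_running_loop().create_future()
    waiters.append(fut)
    await fut

def wake_next() -> None:
    # The waker decides exactly which (and how many) waiters to wake.
    while waiters:
        fut = waiters.popleft()
        if not fut.done():
            fut.set_result(None)
            break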

> The global wait is inside the class in case the global ratelimit was hit during the wait to acquire, so this avoids more global 429s.

I guess this is fair.

> During a sleep, the window can change due to requests behind being late, so it must check if there is no remaining left in the new window and, if so, sleep again - the simplest way is to sleep and check rather than make it event-based or something; it still works fine.

I'm not convinced about this yet. Sleeping and checking whether your value changed under you is just brittle to me.

> The use case for concurrency is more frequent and relevant than you realize: asyncio.gather(50) in the same channel is the same as 50 people running a ping command in the same channel at the same time, which currently will hit 86 429s.

This is literally not at all what I said. I mentioned earlier that we tested the gather use case, so why do you assume I don't know its importance? I'm saying that the concurrency handling within your implementation using sleep is poor.

> The remaining CI errors are typehinting or formatting issues which I can't fix.

You can fix your CI failure by running ruff format on the discord directory.

imptype (Author) commented Aug 30, 2025

@Rapptz According to Discord's public changelog, the last update to the ratelimit system was 5 years ago, before the current handler was written. Btw, I was referring to code readability. But even then, doesn't the old handler make multiple futures, whereas this only has 1 in the deque at a time because of the lock? Also, there's no simpler way to sleep longer after a sleep has finished other than using asyncio.sleep and checking whether to sleep again.

Rapptz (Owner) commented Aug 31, 2025

> According to Discord's public changelog, the last update to the ratelimit system was 5 years ago, before the current handler was written.

I don't get why you're doubting me on this when I speak to employees on a regular basis. The changelog is inaccurate. It doesn't even mention the X-RateLimit-Scope header addition, nor does it mention the change to GCRA back in late 2022, which took until late 2023 for it to reach all endpoints to my knowledge.

(attached screenshot: DiscordCanary_5b0eodrdic)

imptype (Author) commented Aug 31, 2025

@Rapptz Ok I didn't have insider info. What will it take to merge?

Rapptz (Owner) left a comment

Truthfully, I was willing to merge and test this out earlier despite this being a large changeset made with no prior discussion whatsoever, but your behaviour earlier made me way less willing to do so. Please remember that at the end of the day the person responsible for maintaining this stuff long term is me. For future purposes, please learn to communicate better. Either way, I'll do a review of this code real quick.

  1. This code breaks the Client.assume_unsync_clock option. That's what the use_clock parameter was for, and it is a public documented option. If you want to change this you either need to keep supporting that parameter or you need to deprecate it. As it stands, looking at this behaviour, you're changing it to always be True.
  2. I still don't like the Event usage but I'll let it slide for this since the resource usage difference is a single bool.
  3. I still don't like the strange design where you're sleeping and expecting the value to change right under you.

The rest of these comments are essentially questions.

discord/http.py Outdated
Comment on lines 435 to 444
async def _wait_global(self, start_time: float):
    # Sleep up to 3 times, to account for global reset at overwriting during sleeps
    for i in range(3):
        seconds = self.http.global_reset_at - start_time
        if seconds > 0:
            await asyncio.sleep(seconds)
            continue
        break
    else:
        raise ValueError('Global reset at changed more than 3 times')
Rapptz (Owner):

Due to the way the global rate limit has been rewritten, it is no longer as possible to refactor this out into a customisable lock for multi-process setups. It's always been part of the bucket list in order to implement this. If this changeset had been communicated earlier in the Discord, I would have also told you about this.

imptype (Author):

The global rate limit is not a lock anymore; it's just a changeable timestamp that all requests must sleep until before continuing. So, just replicate the value of http.global_reset_at to other processes when it is hit and it will still work. Btw, this will need to change loop.time() to time.time().
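A rough sketch of the timestamp-based global wait being described, using wall-clock time so the value could in principle be shared between processes (the names here are stand-ins, not the PR's actual attributes):

import asyncio
import time

global_reset_at = 0.0  # stand-in for http.global_reset_at

async def wait_for_global() -> None:
    # Sleep until the shared timestamp passes, re-checking afterwards in case
    # another request pushed the timestamp forward during the sleep.
    while True:
        delay = global_reset_at - time.time()
        if delay <= 0:
            return
        await asyncio.sleep(delay)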

Rapptz (Owner):

It's not possible to sync the value between processes.

imptype (Author):

I thought processes can send each other messages, and from that you can do http.global_reset_at = new value.

discord/http.py Outdated
        continue  # sleep again
    break
else:
    raise ValueError('Reset at changed more than 3 times')
Rapptz (Owner):

I don't get why this is a failure condition that has to bubble out. Ditto for the rest of them.

imptype (Author):

I expect reset_at to change during the sleep 3 times at most. More than that and I think the client is experiencing extreme lag or something, so the request should not continue.

Rapptz (Owner):

I'd rather it surface another exception either from aiohttp or HTTPException instead. ValueError is a terrible exception to raise for this stuff.

discord/http.py Outdated
Comment on lines 742 to 747
if tries == 4 or e.errno not in (54, 10054):
    raise ValueError('Connection reset by peer')
retry_seconds: int = 1 + tries * 2
fmt = 'OS error for %s %s. Retrying in %d seconds.'
_log.warning(fmt, method, url, retry_seconds)
await asyncio.sleep(retry_seconds)
Rapptz (Owner):

Why'd you swap this?

imptype (Author):

To avoid the continue and have 1 less line.

Contributor:

Code clarity matters more than line count. Also, in the case where the OSError isn't a connection reset, your code now gives the wrong error message.
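A sketch of the clearer control flow being suggested (illustrative only; do_request and the surrounding structure are hypothetical, not the PR's code):

import asyncio
import logging

_log = logging.getLogger(__name__)

async def request_with_retries(do_request, method: str, url: str):
    for tries in range(5):
        try:
            return await do_request(method, url)
        except OSError as e:
            # Only retry genuine connection resets while tries remain;
            # anything else propagates with its original error message.
            if e.errno in (54, 10054) and tries < 4:
                retry_seconds = 1 + tries * 2
                _log.warning('OS error for %s %s. Retrying in %d seconds.', method, url, retry_seconds)
                await asyncio.sleep(retry_seconds)
                continue
            raise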

discord/http.py Outdated
Comment on lines 749 to 751
# We've run out of retries, raise
if response is not None:
    raise HTTPException(response, data)
Rapptz (Owner):

Why'd you remove the 500 raise?

imptype (Author):

I forgot to add a tries check to fix waiting 9 extra seconds on the last try for no reason. Unless an aiohttp error happens on the last try, it's never met.

Rapptz (Owner):

Change it back.

Comment on lines +473 to +474
if copy == self.reset_at:
    self.reset()
Rapptz (Owner):

What is this for?

imptype (Author):

If reset_at hasn't changed, then we can set remaining = limit. A successful sleep, i.e. one where we don't need to sleep again, can refill the ratelimit remaining to the top.

If you mean the seeming redundancy of the copy variable, it's needed in case of a bucket change. Without it, Ratelimit.reset() would be called unnecessarily when reset_at has changed and there is self.remaining left. Saving a call is also good in case you want to replicate ratelimit state to other processes.

Comment on lines -669 to -678
# If the previous hash was an actual Discord hash then this means the
# hash has changed sporadically.
# This can be due to two reasons
# 1. It's a sub-ratelimit which is hard to handle
# 2. The rate limit information genuinely changed
# There is no good way to discern these, Discord doesn't provide a way to do so.
# At best, there will be some form of logging to help catch it.
# Alternating sub-ratelimits means that the requests oscillate between
# different underlying rate limits -- this can lead to unexpected 429s
# It is unavoidable.
Rapptz (Owner):

Not that it matters that much but why did you remove my comment?

imptype (Author):

It conflicts with other comments explaining the strategy.

    self._buckets[ratelimit.key] = ratelimit

# Global rate limit 429 wont have ratelimit headers (also can't tell if it's one-shot)
elif response.headers.get('X-RateLimit-Global'):
Rapptz (Owner):

Why are we using this instead of the global inner key which to my knowledge is more accurate and actually consistently given?

imptype (Author):

We haven't called json_or_text yet, so can only use headers at this point.

Rapptz (Owner):

Reorder the code so it is called. Retry-After header is in whole seconds, retry_after in the JSON is in float seconds.
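A small sketch of the difference being pointed out (the helper name is hypothetical): prefer the float retry_after from the parsed JSON body and fall back to the whole-second Retry-After header.

def get_retry_after(headers, data) -> float:
    # data is the parsed JSON body (when present); its retry_after has
    # sub-second precision, while the header is rounded to whole seconds.
    if isinstance(data, dict) and 'retry_after' in data:
        return float(data['retry_after'])
    return float(headers.get('Retry-After', 0))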

Rapptz (Owner) left a comment

I didn't tell you to remove Client.assume_unsync_clock.


imptype (Author) commented Sep 1, 2025

@Rapptz unsync_clock was unused so it was removed.

imptype (Author) commented Sep 2, 2025

@Rapptz A future replaced the event because you mentioned resource issues.

Rapptz (Owner) left a comment

Previous comments still apply.

  1. I did not ask for Client.assume_unsync_clock to be removed.
  2. I still think bubbling this error as a RuntimeError, or even at all, is not good design.

I raise RuntimeError under there because it's an unreachable exception. It is not meant to ever get hit and if it does then it's bizarre that it happened -- it's purely there to shut up the type checker and not publicly facing.
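A generic example of that pattern (not the PR's code): the final raise is unreachable in practice and exists only so the type checker sees every path end.

def first_positive(values: list) -> int:
    for v in values:
        if v > 0:
            return v
    # Callers guarantee at least one positive value, so this is unreachable;
    # it exists purely to satisfy the type checker.
    raise RuntimeError('unreachable')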

Anyway, I will be going on a trip for the month of September so this is my last review of this PR until I get home and I don't really have time to do an even deeper review right now since I'm preoccupied preparing for said trip.

    self.dirty = False
    self.reset_at = 0.0

def update(self, response: aiohttp.ClientResponse, data: Union[Dict[str, Any], str]) -> bool:
Rapptz (Owner):

This should probably only accept Dict[str, Any] or None by restricting it in the calling site, so you can get rid of the # type: ignore. I guess it'd also be good if you made a TypedDict with the rate limit expected response type, since that's what this is accepting, and narrowed to that.
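For illustration, a TypedDict along those lines might look like this (a sketch based on the documented shape of Discord's 429 JSON body, not an existing type in the library):

from typing import TypedDict

# Functional form because "global" is a Python keyword.
RateLimitedPayload = TypedDict(
    'RateLimitedPayload',
    {
        'message': str,
        'retry_after': float,
        'global': bool,
    },
)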

imptype (Author):

I tried to fix the typehint issue but that method adds extra calls, so type: ignore is the simplest fix.

discord/http.py Outdated
    self.reset_at = 0.0

def update(self, response: aiohttp.ClientResponse, data: Union[Dict[str, Any], str]) -> bool:
    # Shared scope 429 has longer "reset_at", determined using the retry-after field
Rapptz (Owner):

I don't think this is forced to be true.

A shared scope is just a shared resource, like a channel message rate limit where you have a set number of messages sent per channel, but that doesn't necessarily mean it's always going to be longer. Right now I can't think of any shared resources that are small in size, though I do know that Emoji resources are only shared.

def update(self, response: aiohttp.ClientResponse, data: Union[Dict[str, Any], str]) -> bool:
    # Shared scope 429 has longer "reset_at", determined using the retry-after field
    limit = int(response.headers['X-Ratelimit-Limit'])
    if response.headers.get('X-RateLimit-Scope') == 'shared':
Rapptz (Owner):

This should have a comment that this header is only returned on 429 responses.

discord/http.py Outdated

async def _wait(self):
    # Consider waiting if none is remaining
    if not self.remaining:
Rapptz (Owner):

I suggest using explicit checks instead of passive boolean ones for all the integer comparisons.

Suggested change:
-    if not self.remaining:
+    if self.remaining == 0:

discord/http.py Outdated
if not self.remaining:
    # If reset_at is not set yet, wait for the last request, if outgoing, to finish first
    # for up to 3 seconds instead of using aiohttp's default 5 min timeout.
    if not self.reset_at and (not self._last_request or self.http.loop.time() - self._last_request < 3):
Rapptz (Owner):

Suggested change:
-    if not self.reset_at and (not self._last_request or self.http.loop.time() - self._last_request < 3):
+    if self.reset_at == 0.0 and (not self._last_request or self.http.loop.time() - self._last_request < 3):

    delta = self.http.loop.time() - self._last_request
    return delta >= 300 and (self.one_shot or (self.outgoing == 0 and self.pending == 0))

def _wake(self) -> None:
Rapptz (Owner):

The first request will always have this future set to None, not sure if this is what you intended.

imptype (Author) commented Sep 3, 2025

@Rapptz The RuntimeError can be avoided with a while True, but raising may be preferred to prevent an infinite loop.
