Improved stampede prevention with empty config cache under high loads #3530
Conversation
What do you think about https://github.com/fballiano/magento-cache-regeneration-lock?
@fballiano Ahh, very nice! I thought about using
Pros:
Cons:
Notes:
@colinmollenhour At the time of developing that patch (around 2015 or a bit earlier; I was using it for many customers before releasing it) I tried many different approaches, but GET_LOCK seems the easiest and still the most effective way :-)
Cons:
Pros:
When I developed it I tried to keep it as simple as possible so it would work across multiple versions of Magento 1, but if we implement it in the core we can make it a bit better (like with the try/catch/finally). Do you think I should make a PR out of it? Is it possible that GET_LOCK could create problems on some systems? Could it be "disabled" on some databases?
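For readers following along, here is a minimal sketch of the GET_LOCK pattern being discussed. The lock name and the `loadFromCache`/`generateConfig`/`saveToCache` helpers are hypothetical stand-ins for the real cache plumbing, not the actual patch code:

```php
<?php
// Minimal sketch of MySQL GET_LOCK-based stampede prevention.
// Lock name and helper functions are illustrative, not the patch's code.

// Placeholder helpers standing in for the real cache plumbing:
function loadFromCache() { return false; }
function generateConfig() { return '<config/>'; }
function saveToCache($config) {}

function loadConfigWithLock(PDO $pdo)
{
    $lockName = 'core_config_cache_lock'; // illustrative lock name
    $timeout  = 3;                        // seconds to wait for the lock

    // GET_LOCK returns 1 if acquired, 0 on timeout, NULL on error.
    $stmt = $pdo->prepare('SELECT GET_LOCK(?, ?)');
    $stmt->execute([$lockName, $timeout]);
    $acquired = ((int) $stmt->fetchColumn()) === 1;

    try {
        // Another process may have regenerated the cache while we waited,
        // so re-check the cache before doing the expensive work.
        $config = loadFromCache();
        if ($config === false) {
            // On timeout we regenerate anyway rather than fail the request;
            // other policies (e.g. responding with a 503) are possible.
            $config = generateConfig(); // expensive: parse all config XML
            saveToCache($config);
        }
        return $config;
    } finally {
        if ($acquired) {
            $pdo->query("SELECT RELEASE_LOCK('core_config_cache_lock')");
        }
    }
}
```

The key property is that only one process does the expensive regeneration while everyone else waits briefly, then finds the fresh cache on re-check.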
(force-pushed from 7df245e to 0b0d05f)
@fballiano I updated to use GET_LOCK (effectively the same thing your patch does), removed the old cache-based mechanism, and reorganized a bit. It uses a 3-second timeout for HTTP and a 60-second timeout for CLI, the reasoning being that for HTTP you don't want to risk overloading your web server in a high-load scenario, so it's better to basically act like maintenance mode; for CLI there shouldn't be hundreds or thousands of processes, so it is OK to wait longer. Here is a screenshot showing what happens when I added some
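A tiny sketch of how those two timeouts might be selected (the constant names are illustrative, not the PR's actual code):

```php
<?php
// Illustrative only: use a shorter lock wait for web requests than for CLI.
// PHP_SAPI is 'cli' for command-line scripts and e.g. 'fpm-fcgi' for web.
const LOCK_TIMEOUT_HTTP = 3;  // fail fast so web workers don't pile up
const LOCK_TIMEOUT_CLI  = 60; // a cron/CLI process can afford to wait

$lockTimeout = (PHP_SAPI === 'cli') ? LOCK_TIMEOUT_CLI : LOCK_TIMEOUT_HTTP;
```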
@fballiano What do you think of the new PR code? I think it is a great improvement. I hope you don't mind me adding the PR; credit is yours, just trying to move it along. :)
Apart from my last comments it looks fine to me. I ran basic tests and found no problems.
@colinmollenhour I can't find your comment about the timeout anymore, but when you say "if the timeout is 60 but after 3 the cache is generated and requests still keep piling up", I don't think it should work that way: if the lock gets released after 3 seconds, then all the piled-up requests get executed right away, I think.
Actually, I think it's good to have a slightly bigger timeout because it forbids other requests (when the timeout expires) from restarting the cache generation; that's what crashes servers, much more than requests piling up, at least in my experience.
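For context, MySQL's documented GET_LOCK behavior supports that reading: a waiter is woken as soon as the holder releases the lock, not when the waiter's own timeout expires. A sketch of the timeline (the lock name and credentials are illustrative):

```php
<?php
// Timeline of MySQL GET_LOCK across two connections (lock name illustrative).
// GET_LOCK returns 1 = acquired, 0 = timed out, NULL = error.
$a = new PDO('mysql:host=localhost;dbname=test', 'user', 'pass');

$a->query("SELECT GET_LOCK('config_lock', 3)");   // session A acquires at once

// Meanwhile a second session B blocks on:
//   SELECT GET_LOCK('config_lock', 60)
// The moment A releases, B's GET_LOCK returns 1 immediately; B does not
// sit out the remainder of its 60-second timeout. On its re-check B then
// finds the freshly generated cache and serves it without regenerating.
$a->query("SELECT RELEASE_LOCK('config_lock')");
```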
@colinmollenhour there's now a conflict
@fballiano This one should have been merged before #3533 because that one extends this one, so most if not all of this is already merged now. I will take a closer look.
… for time override.
The changes in this PR are very minor since the bulk of the code was merged with #3533. It now just contains a README update, some syntax fixes, and the commits and discussion addressing the concerns of @midlan and @fballiano.
We need another review in order to merge.
Description (*)
In a situation where generating the config takes a long time and the store is receiving heavy traffic, the existing basic stampede-prevention mechanism is not completely effective. There are three aggravating factors that this PR aims to fix.
One idea I did not implement is to respond with a 503 error when the hold time is exceeded, to avoid having a large number of processes all trying to generate a fresh cache at once. But that would be a more drastic change, so I left it out for now.
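A sketch of what that unimplemented idea could look like (entirely hypothetical; the names and numbers are illustrative, not from this PR):

```php
<?php
// Hypothetical sketch of the unimplemented 503 idea: if the lock wait times
// out, refuse the request instead of joining the regeneration stampede.
// Not part of this PR; names and values are illustrative.
$pdo  = new PDO('mysql:host=localhost;dbname=magento', 'user', 'pass');
$stmt = $pdo->prepare('SELECT GET_LOCK(?, ?)');
$stmt->execute(['core_config_cache_lock', 3]);

if ((int) $stmt->fetchColumn() !== 1) {
    // Hold time exceeded: behave like maintenance mode.
    http_response_code(503);
    header('Retry-After: 5');
    exit('Service temporarily unavailable.');
}
// ...otherwise proceed to (re)generate the cache while holding the lock,
// then SELECT RELEASE_LOCK('core_config_cache_lock').
```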
Manual testing scenarios (*)
I tested point 2 using 20 concurrent requests and some logging lines sprinkled in the code:

This shows that the mechanism caused only one process to save the cache, where previously four processes would have.
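One way to reproduce this kind of test, sketched with placeholder values (the URL and request count are assumptions, not the exact setup used here):

```php
<?php
// Fire 20 concurrent requests at a placeholder URL to simulate a stampede
// after flushing the config cache. The URL is an assumption, not from the PR.
$mh = curl_multi_init();
$handles = [];
for ($i = 0; $i < 20; $i++) {
    $ch = curl_init('http://store.local/');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_multi_add_handle($mh, $ch);
    $handles[] = $ch;
}
do {
    curl_multi_exec($mh, $running);
    if (curl_multi_select($mh) === -1) {
        usleep(1000); // avoid busy-waiting if select is unavailable
    }
} while ($running > 0);
foreach ($handles as $ch) {
    printf("HTTP %d\n", curl_getinfo($ch, CURLINFO_RESPONSE_CODE));
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);
```

With logging added around the cache-save path, only one "saving cache" line should appear across all 20 requests if the lock is doing its job.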
Questions or comments
If anyone has a production environment that experiences cache stampedes, I'd be grateful if you gave it a try and reported your findings.
NOTE: For the fix for point 1 to work, the Redis cache library needs to be updated with this as-yet-unmerged update: colinmollenhour/Cm_Cache_Backend_Redis#177. I plan to update it soon; the update is fully backwards-compatible, and this patch will still work without it, just without having any effect on the issue in point 1.