Improved stampede prevention with empty config cache under high loads #3530
Conversation
What do you think about https://github.com/fballiano/magento-cache-regeneration-lock?
@fballiano Ahh, very nice! I thought about using
Pros:
Cons:
Notes:
@colinmollenhour At the time of developing that patch (around 2015 or a bit earlier; I was using it for many customers before releasing it) I tried many different approaches, but GET_LOCK seems the easiest and still the most effective way :-)
Cons:
Pros:
When I developed it I tried to keep it as simple as possible so it would work across multiple versions of Magento 1, but if we implement it in the core we can make it a bit better (like with the try/catch/finally). Do you think I should make a PR out of it? Is it possible that GET_LOCK could create problems on some systems? Could it be "disabled" on some databases?
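For readers following along, here is a minimal sketch of the GET_LOCK pattern being discussed. The lock name and the `loadFromCache`/`generateConfig`/`saveToCache` helpers are hypothetical stand-ins for the real cache plumbing, not the actual patch code:

```php
<?php
// Minimal sketch of MySQL GET_LOCK-based stampede prevention.
// Lock name and helper functions are illustrative, not the patch's code.

// Placeholder helpers standing in for the real cache plumbing:
function loadFromCache() { return false; }
function generateConfig() { return '<config/>'; }
function saveToCache($config) {}

function loadConfigWithLock(PDO $pdo)
{
    $lockName = 'core_config_cache_lock'; // illustrative lock name
    $timeout  = 3;                        // seconds to wait for the lock

    // GET_LOCK returns 1 if acquired, 0 on timeout, NULL on error.
    $stmt = $pdo->prepare('SELECT GET_LOCK(?, ?)');
    $stmt->execute([$lockName, $timeout]);
    $acquired = ((int) $stmt->fetchColumn()) === 1;

    try {
        // Another process may have regenerated the cache while we waited,
        // so re-check the cache before doing the expensive work.
        $config = loadFromCache();
        if ($config === false) {
            // On timeout we regenerate anyway rather than fail the request;
            // other policies (e.g. responding with a 503) are possible.
            $config = generateConfig(); // expensive: parse all config XML
            saveToCache($config);
        }
        return $config;
    } finally {
        if ($acquired) {
            $pdo->query("SELECT RELEASE_LOCK('core_config_cache_lock')");
        }
    }
}
```

The key property is that only one process does the expensive regeneration while everyone else waits briefly, then finds the fresh cache on re-check.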
(force-pushed from 7df245e to 0b0d05f)
@fballiano I updated to use GET_LOCK (effectively the same thing your patch does), removed the old cache-based mechanism, and reorganized a bit. It uses a 3-second timeout for HTTP and a 60-second timeout for CLI, the reasoning being that for HTTP you don't want to risk overloading your web server in a high-load scenario, so it's better to basically act like maintenance mode; for CLI there shouldn't be hundreds or thousands of processes, so it is OK to wait longer. Here is a screenshot showing what happens when I added some
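A tiny sketch of how those two timeouts might be selected (the constant names are illustrative, not the PR's actual code):

```php
<?php
// Illustrative only: use a shorter lock wait for web requests than for CLI.
// PHP_SAPI is 'cli' for command-line scripts and e.g. 'fpm-fcgi' for web.
const LOCK_TIMEOUT_HTTP = 3;  // fail fast so web workers don't pile up
const LOCK_TIMEOUT_CLI  = 60; // a cron/CLI process can afford to wait

$lockTimeout = (PHP_SAPI === 'cli') ? LOCK_TIMEOUT_CLI : LOCK_TIMEOUT_HTTP;
```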
@fballiano What do you think of the new PR code? I think it is a great improvement. I hope you don't mind me adding the PR; credit is yours, just trying to move it along. :)
Apart from my last comments it looks fine to me. I ran basic tests and found no problems.
@colinmollenhour I can't find your comment about the timeout anymore, but when you say "if the timeout is 60 but after 3 the cache is generated and requests still keep piling up", I don't think it should work that way: if the lock gets released after 3 seconds, then all the piled-up requests get executed right away, I think.
Actually, I think it's good to have a slightly bigger timeout because it forbids other requests (when the timeout expires) from restarting the cache generation; that's what crashes servers, much more than requests piling up, at least in my experience.
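For context, MySQL's documented GET_LOCK behavior supports that reading: a waiter is woken as soon as the holder releases the lock, not when the waiter's own timeout expires. A sketch of the timeline (the lock name and credentials are illustrative):

```php
<?php
// Timeline of MySQL GET_LOCK across two connections (lock name illustrative).
// GET_LOCK returns 1 = acquired, 0 = timed out, NULL = error.
$a = new PDO('mysql:host=localhost;dbname=test', 'user', 'pass');

$a->query("SELECT GET_LOCK('config_lock', 3)");   // session A acquires at once

// Meanwhile a second session B blocks on:
//   SELECT GET_LOCK('config_lock', 60)
// The moment A releases, B's GET_LOCK returns 1 immediately; B does not
// sit out the remainder of its 60-second timeout. On its re-check B then
// finds the freshly generated cache and serves it without regenerating.
$a->query("SELECT RELEASE_LOCK('config_lock')");
```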
@colinmollenhour there's now a conflict
@fballiano This one should have been merged before #3533 because that one extends this one, so most if not all of this is already merged now. I will take a closer look.
… for time override.
The changes in this PR are very minor since the bulk of the code was merged with #3533. It now just contains a README update, some syntax fixes, and the commits and discussion addressing the concerns of @midlan and @fballiano.
We need another review in order to merge.
Description (*)
In a situation where generating the config takes a long time and the store is receiving heavy traffic, the existing basic stampede-prevention mechanism is not completely effective. There are three aggravating factors that this PR aims to fix.
One idea I did not implement is to respond with a 503 error when the hold time is exceeded, to avoid having a large number of processes all trying to generate a fresh cache at once. But that would be a more drastic change, so I left it out for now.
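A sketch of what that unimplemented idea could look like (entirely hypothetical; the names and numbers are illustrative, not from this PR):

```php
<?php
// Hypothetical sketch of the unimplemented 503 idea: if the lock wait times
// out, refuse the request instead of joining the regeneration stampede.
// Not part of this PR; names and values are illustrative.
$pdo  = new PDO('mysql:host=localhost;dbname=magento', 'user', 'pass');
$stmt = $pdo->prepare('SELECT GET_LOCK(?, ?)');
$stmt->execute(['core_config_cache_lock', 3]);

if ((int) $stmt->fetchColumn() !== 1) {
    // Hold time exceeded: behave like maintenance mode.
    http_response_code(503);
    header('Retry-After: 5');
    exit('Service temporarily unavailable.');
}
// ...otherwise proceed to (re)generate the cache while holding the lock,
// then SELECT RELEASE_LOCK('core_config_cache_lock').
```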
Manual testing scenarios (*)
I tested point 2 using 20 concurrent requests and some logging lines sprinkled in the code:

This shows that the mechanism caused only one process to save the cache, where previously four processes would have.
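One way to reproduce this kind of test, sketched with placeholder values (the URL and request count are assumptions, not the exact setup used here):

```php
<?php
// Fire 20 concurrent requests at a placeholder URL to simulate a stampede
// after flushing the config cache. The URL is an assumption, not from the PR.
$mh = curl_multi_init();
$handles = [];
for ($i = 0; $i < 20; $i++) {
    $ch = curl_init('http://store.local/');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_multi_add_handle($mh, $ch);
    $handles[] = $ch;
}
do {
    curl_multi_exec($mh, $running);
    if (curl_multi_select($mh) === -1) {
        usleep(1000); // avoid busy-waiting if select is unavailable
    }
} while ($running > 0);
foreach ($handles as $ch) {
    printf("HTTP %d\n", curl_getinfo($ch, CURLINFO_RESPONSE_CODE));
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);
```

With logging added around the cache-save path, only one "saving cache" line should appear across all 20 requests if the lock is doing its job.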
Questions or comments
If anyone has a production environment that experiences cache stampedes, I'd be grateful if you gave it a try and reported your findings.
NOTE: For the fix for point 1 to work, the Redis cache library needs to be updated with this as-yet-unmerged update: colinmollenhour/Cm_Cache_Backend_Redis#177. I plan to update it soon; the update is fully backwards-compatible, and this patch will still work without it, just without having any effect on the issue in point 1.