[Deepin-Kernel-SIG] [linux 6.6-y] [Upstream] arm64: Use load LSE atomics for the non-return per-CPU atomic operations #1296
base: linux-6.6.y
Conversation
mainline inclusion
from mainline-v6.18-rc6
category: performance
The non-return per-CPU this_cpu_*() atomic operations are implemented as
STADD/STCLR/STSET when FEAT_LSE is available. On many microarchitecture
implementations, these instructions tend to be executed "far" in the
interconnect or memory subsystem (unless the data is already in the L1
cache). This is in general more efficient when there is contention as it
avoids bouncing cache lines between CPUs. The load atomics (e.g. LDADD
without XZR as destination), OTOH, tend to be executed "near" with the
data loaded into the L1 cache.
STADD executed back to back as in srcu_read_{lock,unlock}*() incur an
additional overhead due to the default posting behaviour on several CPU
implementations. Since the per-CPU atomics are unlikely to be used
concurrently on the same memory location, encourage the hardware to
execute them "near" by issuing load atomics - LDADD/LDCLR/LDSET - with
the destination register unused (but not XZR).
Signed-off-by: Catalin Marinas <[email protected]>
Link: https://lore.kernel.org/r/e7d539ed-ced0-4b96-8ecd-048a5b803b85@paulmck-laptop
Reported-by: Paul E. McKenney <[email protected]>
Tested-by: Paul E. McKenney <[email protected]>
Cc: Will Deacon <[email protected]>
Reviewed-by: Palmer Dabbelt <[email protected]>
[will: Add comment and link to the discussion thread]
Signed-off-by: Will Deacon <[email protected]>
(cherry picked from commit 535fdfc5a228524552ee8810c9175e877e127c27)
Signed-off-by: WangYuli <[email protected]>
Signed-off-by: Wentao Guan <[email protected]>
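For context, here is a minimal sketch of the instruction-level change the commit message describes. It paraphrases only the LSE path of the kernel's per-CPU assembly template (the real code in arch/arm64/include/asm/percpu.h is macro-generated per operand size and carries an LL/SC fallback); the function and operand names are illustrative, not the verbatim diff.

```c
#include <stdint.h>

/* Before: store atomic with no destination register; on many cores the
 * operation is posted and tends to execute "far" in the memory
 * subsystem.  Requires FEAT_LSE, e.g. building with -march=armv8.1-a. */
static inline void percpu_add_stadd(uint64_t *ptr, uint64_t val)
{
	asm volatile("stadd	%[val], %[ptr]"
		     : [ptr] "+Q" (*ptr)
		     : [val] "r" (val));
}

/* After: load atomic into a scratch register (deliberately not XZR),
 * which encourages the core to execute the operation "near", pulling
 * the cache line into L1.  The loaded old value is simply discarded. */
static inline void percpu_add_ldadd(uint64_t *ptr, uint64_t val)
{
	uint64_t tmp;

	asm volatile("ldadd	%[val], %[tmp], %[ptr]"
		     : [tmp] "=&r" (tmp), [ptr] "+Q" (*ptr)
		     : [val] "r" (val));
}
```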
Reviewer's Guide
Switch ARM64 non-return per-CPU atomics from store-based LSE instructions to load-based LSE atomics by adjusting the inline assembly template, adding a temporary register operand, and updating the PERCPU_OP macros with explanatory comments and a link to the discussion.

Class diagram for updated per-CPU atomic operation macros:

```mermaid
classDiagram
    class PERCPU_OP {
        +add
        +andnot
        +or
        // Previously used stadd, stclr, stset for LSE atomics
        // Now uses ldadd, ldclr, ldset for LSE atomics
    }
    PERCPU_OP <|-- PERCPU_RET_OP : inherits
    class PERCPU_RET_OP {
        +add
        // Uses ldadd for value-returning atomic add
    }
```
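At the macro level, the change the guide describes amounts to swapping the LSE mnemonic passed to each PERCPU_OP instantiation. The sketch below paraphrases the shape of arch/arm64/include/asm/percpu.h based on the diagram above; it is not the verbatim diff.

```c
/* The third argument selects the LSE instruction used by the template. */
PERCPU_OP(add,    add, ldadd)   /* was: stadd */
PERCPU_OP(andnot, bic, ldclr)   /* was: stclr */
PERCPU_OP(or,     orr, ldset)   /* was: stset */
/* The value-returning variant already used a load atomic: */
PERCPU_RET_OP(add, add, ldadd)
```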
deepin pr auto review: Here is a detailed analysis of this diff:
Overall, this is a well-considered, performance-motivated optimization that improves the performance of per-CPU operations by using stronger atomic operations. The change itself is safe and well justified.
Suggested-by: WangYuli [email protected]
/approve |
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: Avenger-285714. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Pull Request Overview
This PR cherry-picks a performance optimization from mainline kernel v6.18-rc6 that improves ARM64 per-CPU atomic operations. The change switches from store LSE atomics (STADD/STCLR/STSET) to load LSE atomics (LDADD/LDCLR/LDSET) for non-return per-CPU operations. This optimization reduces overhead from back-to-back atomic operations by encouraging hardware to execute them "near" the CPU (e.g., in L1 cache) rather than "far" in the memory subsystem, which is particularly beneficial for CPU-local operations like those used in srcu_read_{lock,unlock}*().
Key Changes:
- Modified inline assembly template to use load atomics with destination register for non-return per-CPU operations
- Updated PERCPU_OP macro invocations to use ldadd, ldclr, and ldset instead of stadd, stclr, and stset
- Added comprehensive comment documenting the rationale with reference to upstream discussion
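To make the posting-overhead claim concrete, here is a hypothetical user-space microbenchmark (not part of the PR; all names are illustrative) that contrasts back-to-back STADD and LDADD on a single counter. It assumes an ARMv8.1+ CPU and compilation with something like gcc -O2 -march=armv8.1-a; absolute numbers will vary by microarchitecture.

```c
#include <stdio.h>
#include <stdint.h>
#include <time.h>

/* Non-return add via a store atomic (result discarded). */
static void add_st(uint64_t *p, uint64_t v)
{
	asm volatile("stadd %[v], %[p]"
		     : [p] "+Q" (*p) : [v] "r" (v) : "memory");
}

/* Non-return add via a load atomic into an unused scratch register. */
static void add_ld(uint64_t *p, uint64_t v)
{
	uint64_t tmp;

	asm volatile("ldadd %[v], %[tmp], %[p]"
		     : [tmp] "=&r" (tmp), [p] "+Q" (*p)
		     : [v] "r" (v) : "memory");
}

/* Time 10M back-to-back increments of one counter through fn. */
static double bench(void (*fn)(uint64_t *, uint64_t))
{
	static uint64_t counter;
	struct timespec t0, t1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (int i = 0; i < 10 * 1000 * 1000; i++)
		fn(&counter, 1);
	clock_gettime(CLOCK_MONOTONIC, &t1);
	return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
	printf("stadd: %.3fs\n", bench(add_st));
	printf("ldadd: %.3fs\n", bench(add_ld));
	return 0;
}
```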
Summary by Sourcery
Enhancements:
- Switch the non-return per-CPU atomic operations on arm64 from store LSE atomics to load LSE atomics to improve performance.