[Deepin-Kernel-SIG] [linux 6.6-y] [Upstream] arm64: Use load LSE atomics for the non-return per-CPU atomic operations #1296

opsiff · 2025-11-18T14:55:23Z

mainline inclusion
from mainline-v6.18-rc6
category: performance

The non-return per-CPU this_cpu_*() atomic operations are implemented as STADD/STCLR/STSET when FEAT_LSE is available. On many microarchitecture implementations, these instructions tend to be executed "far" in the interconnect or memory subsystem (unless the data is already in the L1 cache). This is in general more efficient when there is contention as it avoids bouncing cache lines between CPUs. The load atomics (e.g. LDADD without XZR as destination), OTOH, tend to be executed "near" with the data loaded into the L1 cache.

STADD executed back to back as in srcu_read_{lock,unlock}*() incurs an additional overhead due to the default posting behaviour on several CPU implementations. Since the per-CPU atomics are unlikely to be used concurrently on the same memory location, encourage the hardware to execute them "near" by issuing load atomics - LDADD/LDCLR/LDSET - with the destination register unused (but not XZR).
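To make the store-form versus load-form distinction concrete, here is a minimal userspace sketch. It is not the kernel's percpu.h code: the helper names and the `SKETCH_HAVE_LSE` switch are hypothetical, and the non-LSE fallback through the GCC/Clang `__atomic_fetch_add` builtin is an assumption added only so that the sketch builds on other targets.

```c
#include <stdint.h>

/*
 * Hypothetical helper names; a userspace illustration only.
 * The LSE path is taken when the compiler targets AArch64 with LSE
 * enabled (for example -march=armv8.1-a). The builtin fallback is an
 * assumption so the sketch compiles elsewhere.
 */
#if defined(__aarch64__) && defined(__ARM_FEATURE_ATOMICS)
#define SKETCH_HAVE_LSE 1
#else
#define SKETCH_HAVE_LSE 0
#endif

/* Store form: no destination register, the old value is never returned. */
static inline void sketch_add_store_form(uint32_t *ptr, uint32_t val)
{
#if SKETCH_HAVE_LSE
    __asm__ volatile("stadd %w[val], %[ptr]"
                     : [ptr] "+Q"(*ptr)
                     : [val] "r"(val));
#else
    (void)__atomic_fetch_add(ptr, val, __ATOMIC_RELAXED);
#endif
}

/* Load form: the old value lands in a scratch register and is ignored. */
static inline void sketch_add_load_form(uint32_t *ptr, uint32_t val)
{
#if SKETCH_HAVE_LSE
    uint32_t tmp; /* a general-purpose register, not WZR/XZR */
    __asm__ volatile("ldadd %w[val], %w[tmp], %[ptr]"
                     : [tmp] "=&r"(tmp), [ptr] "+Q"(*ptr)
                     : [val] "r"(val));
    (void)tmp;
#else
    (void)__atomic_fetch_add(ptr, val, __ATOMIC_RELAXED);
#endif
}
```

Both helpers leave the same value in memory. The only intended difference is the instruction encoding, which, per the description above, may influence whether a given CPU executes the atomic "near" or "far".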

Link: https://lore.kernel.org/r/e7d539ed-ced0-4b96-8ecd-048a5b803b85@paulmck-laptop
Reported-by: Paul E. McKenney [email protected]
Tested-by: Paul E. McKenney [email protected]
Cc: Will Deacon [email protected]
Reviewed-by: Palmer Dabbelt [email protected]
[will: Add comment and link to the discussion thread]

(cherry picked from commit 535fdfc5a228524552ee8810c9175e877e127c27)

Summary by Sourcery

Enhancements:

Switch the per-CPU add/andnot/or atomic operations from the STADD/STCLR/STSET store atomics to the LDADD/LDCLR/LDSET load LSE atomics for improved performance

mainline inclusion
from mainline-v6.18-rc6
category: performance

The non-return per-CPU this_cpu_*() atomic operations are implemented as STADD/STCLR/STSET when FEAT_LSE is available. On many microarchitecture implementations, these instructions tend to be executed "far" in the interconnect or memory subsystem (unless the data is already in the L1 cache). This is in general more efficient when there is contention as it avoids bouncing cache lines between CPUs. The load atomics (e.g. LDADD without XZR as destination), OTOH, tend to be executed "near" with the data loaded into the L1 cache.

STADD executed back to back as in srcu_read_{lock,unlock}*() incurs an additional overhead due to the default posting behaviour on several CPU implementations. Since the per-CPU atomics are unlikely to be used concurrently on the same memory location, encourage the hardware to execute them "near" by issuing load atomics - LDADD/LDCLR/LDSET - with the destination register unused (but not XZR).

Signed-off-by: Catalin Marinas <[email protected]>
Link: https://lore.kernel.org/r/e7d539ed-ced0-4b96-8ecd-048a5b803b85@paulmck-laptop
Reported-by: Paul E. McKenney <[email protected]>
Tested-by: Paul E. McKenney <[email protected]>
Cc: Will Deacon <[email protected]>
Reviewed-by: Palmer Dabbelt <[email protected]>
[will: Add comment and link to the discussion thread]
Signed-off-by: Will Deacon <[email protected]>
(cherry picked from commit 535fdfc5a228524552ee8810c9175e877e127c27)
Signed-off-by: WangYuli <[email protected]>
Signed-off-by: Wentao Guan <[email protected]>

sourcery-ai · 2025-11-18T14:55:32Z

Reviewer's Guide

Switch ARM64 non-return per-CPU atomics from store-based LSE instructions to load-based LSE atomics by adjusting the inline assembly template, adding a temporary register operand, and updating the PERCPU_OP macros with explanatory comments and a link to the discussion.

Class diagram for updated per-CPU atomic operation macros

classDiagram
    class PERCPU_OP {
        +add
        +andnot
        +or
        // Previously used stadd, stclr, stset for LSE atomics
        // Now uses ldadd, ldclr, ldset for LSE atomics
    }
    PERCPU_OP <|-- PERCPU_RET_OP : inherits
    class PERCPU_RET_OP {
        +add
        // Uses ldadd for value-returning atomic add
    }

File-Level Changes

| Change | Details | Files |
| --- | --- | --- |
| Refine inline LSE atomic template to include a tmp register operand | Modify the #op_lse instruction to pass the destination register tmp; add tmp to the inline assembly operand constraints | `arch/arm64/include/asm/percpu.h` |
| Replace store-based per-CPU ops with load LSE atomics and document rationale | Remove PERCPU_OP entries using stadd/stclr/stset; add new PERCPU_OP entries using ldadd/ldclr/ldset; insert comment block explaining the change with a link to the patch discussion | `arch/arm64/include/asm/percpu.h` |
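The mapping from the three operations to their load-form instructions can be sketched with a small macro family. This is an illustration under stated assumptions, not the contents of `arch/arm64/include/asm/percpu.h`: `SKETCH_OP` and the `sketch_*` functions are hypothetical names, and the GCC/Clang `__atomic_fetch_*` fallback exists only so the sketch builds on non-LSE targets. Note that LDCLR clears the bits set in its operand, which is why `andnot` maps to an AND with the complement in the fallback.

```c
#include <stdint.h>

/*
 * SKETCH_OP(name, lse_insn, builtin, operand) defines sketch_<name>().
 * Hypothetical macro for illustration; not the kernel's PERCPU_OP.
 */
#if defined(__aarch64__) && defined(__ARM_FEATURE_ATOMICS)
#define SKETCH_OP(name, lse_insn, builtin, operand)                   \
static inline void sketch_##name(uint32_t *ptr, uint32_t val)        \
{                                                                     \
    uint32_t tmp; /* receives the old value, deliberately unused */   \
    __asm__ volatile(#lse_insn " %w[val], %w[tmp], %[ptr]"            \
                     : [tmp] "=&r"(tmp), [ptr] "+Q"(*ptr)             \
                     : [val] "r"(val));                               \
    (void)tmp;                                                        \
}
#else
/* Assumed portable fallback using compiler builtins. */
#define SKETCH_OP(name, lse_insn, builtin, operand)                   \
static inline void sketch_##name(uint32_t *ptr, uint32_t val)        \
{                                                                     \
    (void)builtin(ptr, operand, __ATOMIC_RELAXED);                    \
}
#endif

SKETCH_OP(add,    ldadd, __atomic_fetch_add, val)   /* *ptr += val  */
SKETCH_OP(andnot, ldclr, __atomic_fetch_and, ~val)  /* *ptr &= ~val */
SKETCH_OP(or,     ldset, __atomic_fetch_or,  val)   /* *ptr |= val  */
```

Generating all three functions from one template keeps the operand list identical across operations, so switching instruction forms only requires changing the mnemonic passed at each instantiation.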

deepin-ci-robot · 2025-11-18T14:55:56Z

deepin pr auto review

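Let me analyze this diff in detail:

Code logic analysis:
This change mainly concerns per-CPU atomic operations on the ARM64 architecture. The main change replaces the previously used store-release atomic operations (stadd/stclr/stset) with load-acquire-return atomic operations (ldadd/ldclr/ldset).
Performance improvement:
This change is an optimization motivated by performance:

The original store-only operations (stadd, etc.) do not return a value, whereas the new load-return operations (ldadd, etc.) return the updated value
The comment notes that value-returning atomic operations are more likely to be executed near the CPU (for example, in the L1 cache), which speeds up access
The change refers to Paul McKenney's analysis, which indicates that this is a carefully considered performance optimization

Safety analysis:

From a functional point of view the change is safe, because the load-return operations provide stronger memory semantics
The new operations preserve the original atomicity while providing more information (the return value)
The change does not affect the existing memory model guarantees

Code quality:

The code change is clear; it simply replaces the type of atomic operation
A good comment explaining the reason for the change has been added
The change keeps a consistent code style

Suggestions for improvement:
Although the change itself is good, I suggest:

Consider adding more concrete performance data or benchmark results to the comment
Consider adding a performance test case to verify that the optimization delivers the expected improvement
During code review, pay attention to the compatibility of this change with different ARM64 chips

Overall, this is a carefully considered, performance-motivated optimization that uses stronger atomic operations to improve the performance of per-CPU operations. The change itself is safe and well justified.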

sourcery-ai

Hey there - I've reviewed your changes and they look great!

Avenger-285714 · 2025-11-19T01:14:05Z

Suggested-by: WangYuli [email protected]

Avenger-285714 · 2025-11-19T01:14:15Z

/approve

deepin-ci-robot · 2025-11-19T01:14:31Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Avenger-285714

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~deepin/OWNERS~~ [Avenger-285714]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copilot

Pull Request Overview

This PR cherry-picks a performance optimization from mainline kernel v6.18-rc6 that improves ARM64 per-CPU atomic operations. The change switches from store LSE atomics (STADD/STCLR/STSET) to load LSE atomics (LDADD/LDCLR/LDSET) for non-return per-CPU operations. This optimization reduces overhead from back-to-back atomic operations by encouraging hardware to execute them "near" the CPU (e.g., in L1 cache) rather than "far" in the memory subsystem, which is particularly beneficial for CPU-local operations like those used in srcu_read_{lock,unlock}*().

Key Changes:

Modified inline assembly template to use load atomics with destination register for non-return per-CPU operations
Updated PERCPU_OP macro invocations to use ldadd, ldclr, and ldset instead of stadd, stclr, and stset
Added comprehensive comment documenting the rationale with reference to upstream discussion

deepin-ci-robot requested a review from BLumia November 18, 2025 14:55

sourcery-ai bot reviewed Nov 18, 2025

Avenger-285714 requested review from Avenger-285714 and Copilot November 19, 2025 01:13

deepin-ci-robot added the approved label Nov 19, 2025

Copilot started reviewing on behalf of Avenger-285714 November 19, 2025 01:15 View session

Copilot finished reviewing on behalf of Avenger-285714 November 19, 2025 01:16

Copilot AI reviewed Nov 19, 2025
