@opsiff
Member

@opsiff opsiff commented Nov 18, 2025

mainline inclusion
from mainline-v6.18-rc6
category: performance

The non-return per-CPU this_cpu_*() atomic operations are implemented as STADD/STCLR/STSET when FEAT_LSE is available. On many microarchitecture implementations, these instructions tend to be executed "far" in the interconnect or memory subsystem (unless the data is already in the L1 cache). This is in general more efficient when there is contention as it avoids bouncing cache lines between CPUs. The load atomics (e.g. LDADD without XZR as destination), OTOH, tend to be executed "near" with the data loaded into the L1 cache.

STADD instructions executed back to back, as in srcu_read_{lock,unlock}*(), incur additional overhead due to the default posting behaviour on several CPU implementations. Since the per-CPU atomics are unlikely to be used concurrently on the same memory location, encourage the hardware to execute them "near" by issuing load atomics - LDADD/LDCLR/LDSET - with the destination register unused (but not XZR).

Link: https://lore.kernel.org/r/e7d539ed-ced0-4b96-8ecd-048a5b803b85@paulmck-laptop
Reported-by: Paul E. McKenney [email protected]
Tested-by: Paul E. McKenney [email protected]
Cc: Will Deacon [email protected]
Reviewed-by: Palmer Dabbelt [email protected]
[will: Add comment and link to the discussion thread]

(cherry picked from commit 535fdfc5a228524552ee8810c9175e877e127c27)

Summary by Sourcery

Enhancements:

  • Switch the per-CPU add/andnot/or atomic operations from the STADD/STCLR/STSET store LSE atomics to the LDADD/LDCLR/LDSET load LSE atomics for improved performance

mainline inclusion
from mainline-v6.18-rc6
category: performance

The non-return per-CPU this_cpu_*() atomic operations are implemented as
STADD/STCLR/STSET when FEAT_LSE is available. On many microarchitecture
implementations, these instructions tend to be executed "far" in the
interconnect or memory subsystem (unless the data is already in the L1
cache). This is in general more efficient when there is contention as it
avoids bouncing cache lines between CPUs. The load atomics (e.g. LDADD
without XZR as destination), OTOH, tend to be executed "near" with the
data loaded into the L1 cache.

STADD instructions executed back to back, as in
srcu_read_{lock,unlock}*(), incur additional overhead due to the default
posting behaviour on several CPU implementations. Since the per-CPU
atomics are unlikely to be used concurrently on the same memory location,
encourage the hardware to execute them "near" by issuing load atomics -
LDADD/LDCLR/LDSET - with the destination register unused (but not XZR).

Signed-off-by: Catalin Marinas <[email protected]>
Link: https://lore.kernel.org/r/e7d539ed-ced0-4b96-8ecd-048a5b803b85@paulmck-laptop
Reported-by: Paul E. McKenney <[email protected]>
Tested-by: Paul E. McKenney <[email protected]>
Cc: Will Deacon <[email protected]>
Reviewed-by: Palmer Dabbelt <[email protected]>
[will: Add comment and link to the discussion thread]
Signed-off-by: Will Deacon <[email protected]>
(cherry picked from commit 535fdfc5a228524552ee8810c9175e877e127c27)
Signed-off-by: WangYuli <[email protected]>
Signed-off-by: Wentao Guan <[email protected]>
@sourcery-ai

sourcery-ai bot commented Nov 18, 2025

Reviewer's Guide

Switch ARM64 non-return per-CPU atomics from store-based LSE instructions to load-based LSE atomics by adjusting the inline assembly template, adding a temporary register operand, and updating the PERCPU_OP macros with explanatory comments and a link to the discussion.

Class diagram for updated per-CPU atomic operation macros

classDiagram
    class PERCPU_OP {
        +add
        +andnot
        +or
        // Previously used stadd, stclr, stset for LSE atomics
        // Now uses ldadd, ldclr, ldset for LSE atomics
    }
    PERCPU_OP <|-- PERCPU_RET_OP : inherits
    class PERCPU_RET_OP {
        +add
        // Uses ldadd for value-returning atomic add
    }

File-Level Changes

Change: Refine inline LSE atomic template to include a tmp register operand
Details:
  • Modify the #op_lse instruction to pass the destination register tmp
  • Add tmp to the inline assembly operand constraints
Files: arch/arm64/include/asm/percpu.h

Change: Replace store-based per-CPU ops with load LSE atomics and document rationale
Details:
  • Remove PERCPU_OP entries using stadd/stclr/stset
  • Add new PERCPU_OP entries using ldadd/ldclr/ldset
  • Insert comment block explaining the change with a link to the patch discussion
Files: arch/arm64/include/asm/percpu.h
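
To illustrate the "tmp register operand" idea, here is a minimal sketch of a non-return add that issues the load form of the instruction and then ignores the loaded value. It is not the kernel's PERCPU_OP macro: the helper name `sketch_nonreturn_add` is hypothetical, the operand constraints are an assumption modelled on the description above, and the non-arm64 branch exists only so the sketch builds on other targets (it relies on the GCC/Clang `__atomic_fetch_add` builtin).

```c
#include <stdint.h>

/*
 * Hypothetical helper (not the kernel macro): add `val` to `*ptr` and
 * return nothing to the caller.
 */
static inline void sketch_nonreturn_add(uint64_t *ptr, uint64_t val)
{
#if defined(__aarch64__) && defined(__ARM_FEATURE_ATOMICS)
	uint64_t tmp;	/* receives the old value; never read afterwards */

	/* Load form: the destination is a real register, not XZR. */
	__asm__ volatile("ldadd %[val], %[tmp], %[ptr]"
			 : [tmp] "=&r" (tmp), [ptr] "+Q" (*ptr)
			 : [val] "r" (val));
	(void)tmp;
#else
	/* Portable fallback so the sketch compiles on other targets. */
	(void)__atomic_fetch_add(ptr, val, __ATOMIC_RELAXED);
#endif
}
```

The caller sees the same effect as with a store-form add; only the instruction encoding, and hence the hint given to the hardware, differs.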

@deepin-ci-robot

deepin pr auto review

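Here is my detailed analysis of this diff:

  1. Code logic:
    This change concerns per-CPU atomic operations on the ARM64 architecture. It replaces the store LSE atomics used previously (stadd/stclr/stset) with the load LSE atomics (ldadd/ldclr/ldset), leaving the loaded value unused.

  2. Performance:
    The change is a performance optimization:

  • The original store-form operations (stadd and so on) have no destination register, whereas the load-form operations (ldadd and so on) write the value previously held in memory to a destination register
  • The comment notes that value-returning atomics are more likely to be executed near the CPU (for example in the L1 cache), which speeds up access
  • The change draws on Paul McKenney's analysis, which indicates that it is a carefully considered performance optimization
  3. Safety:
  • Functionally the change is safe: the load forms perform the same atomic read-modify-write, and the value they return is simply discarded
  • The new operations preserve the original atomicity
  • The change does not affect the existing memory-model guarantees
  4. Code quality:
  • The change is clear: it simply swaps the type of atomic operation
  • A good comment explains the reason for the change
  • The change keeps a consistent code style
  5. Suggestions:
    Although the change itself is good, I suggest:
  • Consider adding concrete performance figures or benchmark results to the comment
  • Consider adding a performance test case to verify that the optimization delivers the expected gain
  • During review, pay attention to how the change behaves across different ARM64 chips

Overall, this is a carefully considered, performance-driven optimization that improves per-CPU operations by using the load form of the LSE atomics. The change itself is safe and well justified.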

@sourcery-ai sourcery-ai bot left a comment

Hey there - I've reviewed your changes and they look great!


@Avenger-285714
Member

Suggested-by: WangYuli [email protected]

@Avenger-285714
Member

/approve

@deepin-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Avenger-285714

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copilot AI left a comment

Pull Request Overview

This PR cherry-picks a performance optimization from mainline kernel v6.18-rc6 that improves ARM64 per-CPU atomic operations. The change switches from store LSE atomics (STADD/STCLR/STSET) to load LSE atomics (LDADD/LDCLR/LDSET) for non-return per-CPU operations. This optimization reduces overhead from back-to-back atomic operations by encouraging hardware to execute them "near" the CPU (e.g., in L1 cache) rather than "far" in the memory subsystem, which is particularly beneficial for CPU-local operations like those used in srcu_read_{lock,unlock}*().

Key Changes:

  • Modified inline assembly template to use load atomics with destination register for non-return per-CPU operations
  • Updated PERCPU_OP macro invocations to use ldadd, ldclr, and ldset instead of stadd, stclr, and stset
  • Added comprehensive comment documenting the rationale with reference to upstream discussion
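
As a rough illustration of how one macro can stamp out the add/andnot/or family on top of value-returning atomics whose result is dropped, consider the sketch below. Everything named `sketch_*` or `SKETCH_*` is hypothetical and is not taken from percpu.h; the sketch uses the GCC/Clang `__atomic_fetch_*` builtins rather than the kernel's inline assembly, and it models andnot as an AND with the inverted operand, which is an assumption about how the clear operation maps onto portable C.

```c
#include <stdint.h>

/* Operand transforms: pass the value through, or invert it for andnot. */
#define SKETCH_KEEP(v)	(v)
#define SKETCH_NOT(v)	(~(v))

/*
 * Hypothetical generator: defines sketch_<name>(), which performs a
 * fetch-style atomic and deliberately ignores the fetched old value.
 */
#define SKETCH_PERCPU_OP(name, builtin, xform)				\
static inline void sketch_##name(uint64_t *ptr, uint64_t val)		\
{									\
	uint64_t tmp = builtin(ptr, xform(val), __ATOMIC_RELAXED);	\
	(void)tmp;	/* old value is fetched, then discarded */	\
}

SKETCH_PERCPU_OP(add,    __atomic_fetch_add, SKETCH_KEEP)
SKETCH_PERCPU_OP(andnot, __atomic_fetch_and, SKETCH_NOT)
SKETCH_PERCPU_OP(or,     __atomic_fetch_or,  SKETCH_KEEP)
```

Calling `sketch_add(&v, 5)`, `sketch_or(&v, 0xF0)` and `sketch_andnot(&v, 0x05)` in sequence on a zeroed `v` adds 5, sets the high nibble, and then clears the bits in 0x05. Which instruction a compiler actually selects for a discarded fetch result is target- and compiler-dependent, so this sketch shows the semantics only.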
