Skip to content

Conversation

@ameryhung
Copy link
Contributor

No description provided.

To prepare for changing bpf_local_storage_map_bucket::lock to rqspinlock,
convert bpf_selem_unlink_map() to failable. It still always succeeds and
returns 0 for now.

Since some operations updating local storage cannot fail in the middle,
open-code bpf_selem_unlink_map() to take the b->lock before the
operation. There are two such locations:

- bpf_local_storage_alloc()

  The first selem will be unlinked from smap if cmpxchg owner_storage_ptr
  fails, which should not fail. Therefore, hold b->lock when linking
  until allocation complete. Helpers that assume b->lock is held by
  callers are introduced: bpf_selem_link_map_nolock() and
  bpf_selem_unlink_map_nolock().

- bpf_local_storage_update()

  The three step update process: link_map(new_selem),
  link_storage(new_selem), and unlink_map(old_selem) should not fail in
  the middle.

In bpf_selem_unlink(), bpf_selem_unlink_map() and
bpf_selem_unlink_storage() should either all succeed or fail as a whole
instead of failing in the middle. So, return if unlink_map() failed.

In bpf_local_storage_destroy(), since it cannot deadlock with itself or
bpf_local_storage_map_free() who the function might be racing with,
retry if bpf_selem_unlink_map() fails due to rqspinlock returning
errors.

Signed-off-by: Amery Hung <[email protected]>
To prepare for changing bpf_local_storage_map_bucket::lock to rqspinlock,
convert bpf_selem_link_map() to failable. It still always succeeds and
returns 0 until the change happens. No functional change.

__must_check is added to the function declaration locally to make sure
all the callers are accounted for during the conversion.

Signed-off-by: Amery Hung <[email protected]>
To prepare for changing bpf_local_storage::lock to rqspinlock, open code
bpf_selem_unlink_storage() in the only caller, bpf_selem_unlink(), since
unlink_map and unlink_storage must be done together after all the
necessary locks are acquired.

Signed-off-by: Amery Hung <[email protected]>
To prepare changing both bpf_local_storage_map_bucket::lock and
bpf_local_storage::lock to rqspinlock, convert bpf_selem_unlink() to
failable. It still always succeeds and returns 0 until the change
happens. No functional change.

For bpf_local_storage_map_free(), WARN_ON() for now as no real error
will happen until we switch to rqspinlock.

__must_check is added to the function declaration locally to make sure
all callers are accounted for during the conversion.

Signed-off-by: Amery Hung <[email protected]>
Change bpf_local_storage::lock and bpf_local_storage_map_bucket::lock to
from raw_spin_lock to rqspinlock.

Finally, propagate errors from raw_res_spin_lock_irqsave() to syscall
return or BPF helper return.

In bpf_local_storage_destroy(), WARN_ON for now. A later patch will
handle this properly.

For, __bpf_local_storage_map_cache(), instead of handling the error,
skip updating the cache.

Signed-off-by: Amery Hung <[email protected]>
The percpu counter in task local storage is no longer needed as the
underlying bpf_local_storage can now handle deadlock with the help of
rqspinlock. Remove the percpu counter and related migrate_{disable,
enable}.

Since the percpu counter is removed, merge back bpf_task_storage_get()
and bpf_task_storage_get_recur(). This will allow the bpf syscalls and
helpers to run concurrently on the same CPU, removing the spurious
-EBUSY error. bpf_task_storage_get(..., F_CREATE) will now always
succeed with enough free memory unless being called recursively.

Signed-off-by: Amery Hung <[email protected]>
The percpu counter in cgroup local storage is no longer needed as the
underlying bpf_local_storage can now handle deadlock with the help of
rqspinlock. Remove the percpu counter and related migrate_{disable,
enable}.

Signed-off-by: Amery Hung <[email protected]>
Percpu locks have been removed from cgroup and task local storage. Now
that all local storage no longer use percpu variables as locks preventing
recursion, there is no need to pass them to bpf_local_storage_map_free().
Remove the argument from the function.

Signed-off-by: Amery Hung <[email protected]>
A later patch will introduce bpf_selem_unlink_lockless() to handle
rqspinlock errors. bpf_selem_unlink_lockless() will allow an selem
to be partially unlinked from map or local storage. Therefore,
bpf_selem_free() needs to be decoupled from map and local storage
as SDATA(selem)->smap or selem->local_storage may be NULL.
Decoupling from local storage is already done when local storage
migrated from BPF memory allocator to kmalloc_nolock(). This patch
prepares to decouple from map.

Currently, map is still needed in bpf_selem_free() to:

  1. Uncharge memory
    a. map->ops->map_local_storage_uncharge
    b. map->elem_size
  2. Infer how memory should be freed
    a. map->use_kmalloc_nolock
  3. Free special fields
    a. map->record

The dependency of 1.a will be addressed by a later patch by returning
the amount of memory to uncharge directly to the owner who calls
bpf_local_storage_destroy().

The dependency of 3.a will be addressed by a later patch by freeing
special fields under b->lock, when the map is still alive.

This patch handles 1.b and 2.a by simply saving the informnation in
bpf_local_storage_elem.

Signed-off-by: Amery Hung <[email protected]>
Introduce bpf_selem_unlink_lockless() to properly handle errors returned
from rqspinlock in bpf_local_storage_map_free() and
bpf_local_storage_destroy().

The idea of bpf_selem_unlink_lockless() is to allow an selem to be
partially linked and use refcount to determine when and who can free the
selem. An selem initially is fully linked to a map and a local storage
and therefore selem->linkcnt is set to 2. Under normal circumstances,
bpf_selem_unlink_lockless() will be able to grab both locks and unlink
an selem from map and local storage just like bpf_selem_unlink().
However, if any of the lock attempts fails, it will only clear
SDATA(selem)->smap or selem->local_storage depending on the caller and
decrement linkcnt to signal that the corresponding data structure
holding a reference to the selem is gone. Then, only when both map and
local storage are gone, an selem can be free by the last caller that
turns the linkcnt to 0.

To make sure bpf_obj_free_fields() is done only once and when map is
still present, it is called when unlinking an selem from b->list under
b->lock.

To make sure uncharging memory is only done once and when owner is still
present, only unlink selem from local_storage->list in
bpf_local_storage_destroy() and return the amount of memory to uncharge
to the caller (i.e., owner) since the map associated with an selem may
already be gone and map->ops->map_local_storage_uncharge can no longer
be referenced.

Caller of bpf_selem_unlink_lockless() will be iterating b->list or
local_storage->list without lock. Therefore, an selem can only be freed
after an RCU grace period (i.e., bpf_selem_free(selem,
reuse_now = false)).

Finally, SDATA(selem)->map and selem->local_storage need to be protected
by RCU as well since bpf_local_storage_map_free() or
bpf_local_storage_destroy() can free itself after partially unlink an
selem. That means, reuse_now now has to be false when calling
bpf_local_storage_free(). The local storage map is already protected as
bpf_local_storage_map_free() waits for an RCU grace period before
freeing itself.

Signed-off-by: Amery Hung <[email protected]>
Take care of rqspinlock error in bpf_local_storage_{map_free, destroy())
properly by switching to bpf_selem_unlink_lockless().

Signed-off-by: Amery Hung <[email protected]>
Check sk_omem_alloc when the caller of bpf_local_storage_destroy()
returns. bpf_local_storage_destroy() now returns the memory to uncharge
to the caller instead of directly uncharge. Therefore, in the
sk_storage_omem_uncharge, check sk_omem_alloc when bpf_sk_storage_free()
returns instead of bpf_local_storage_destroy().

Signed-off-by: Amery Hung <[email protected]>
Update the expected result of the selftest as recursion of task local
storage syscall and helpers have been relaxed. Now that the percpu
counter is removed, task local storage helpers, bpf_task_storage_get()
and bpf_task_storage_delete() can now run on the same CPU at the same
time unless they cause deadlock.

Note that since there is no percpu counter preventing recursion in
task local storage helpers, bpf_trampoline now catches the recursion
of on_update as reported by recursion_misses.

on_enter: tp_btf/sys_enter
on_update: fentry/bpf_local_storage_update

           Old behavior                         New behavior
           ____________                         ____________
on_enter                             on_enter
  bpf_task_storage_get(&map_a)         bpf_task_storage_get(&map_a)
    bpf_task_storage_trylock succeed     bpf_local_storage_update(&map_a)
    bpf_local_storage_update(&map_a)

    on_update                            on_update
      bpf_task_storage_get(&map_a)         bpf_task_storage_get(&map_a)
        bpf_task_storage_trylock fail        on_update::misses++ (1)
        return NULL                        create and return map_a::ptr

                                           map_a::ptr += 1 (1)

                                           bpf_task_storage_delete(&map_a)
                                             return 0

      bpf_task_storage_get(&map_b)         bpf_task_storage_get(&map_b)
        bpf_task_storage_trylock fail        on_update::misses++ (2)
        return NULL                        create and return map_b::ptr

                                           map_b::ptr += 1 (1)

    create and return map_a::ptr         create and return map_a::ptr
  map_a::ptr = 200                     map_a::ptr = 200

  bpf_task_storage_get(&map_b)         bpf_task_storage_get(&map_b)
    bpf_task_storage_trylock succeed     lockless lookup succeed
    bpf_local_storage_update(&map_b)     return map_b::ptr

    on_update
      bpf_task_storage_get(&map_a)
        bpf_task_storage_trylock fail
        lockless lookup succeed
        return map_a::ptr

      map_a::ptr += 1 (201)

      bpf_task_storage_delete(&map_a)
        bpf_task_storage_trylock fail
        return -EBUSY
      nr_del_errs++ (1)

      bpf_task_storage_get(&map_b)
        bpf_task_storage_trylock fail
        return NULL

    create and return ptr

  map_b::ptr = 100

Expected result:

map_a::ptr = 201                          map_a::ptr = 200
map_b::ptr = 100                          map_b::ptr = 1
nr_del_err = 1                            nr_del_err = 0
on_update::recursion_misses = 0           on_update::recursion_misses = 2
On_enter::recursion_misses = 0            on_enter::recursion_misses = 0

Signed-off-by: Amery Hung <[email protected]>
Adjsut the error code we are checking against as bpf_task_storage_get()
now returns -EDEADLK or -ETIMEDOUT when deadlock happens.

Signed-off-by: Amery Hung <[email protected]>
Remove a test in test_maps that checks if the updating of the percpu
counter in task local storage map is preemption safe as the percpu
counter is now removed.

Signed-off-by: Amery Hung <[email protected]>
bpf_cgrp_storage_busy has been removed. Use bpf_bprintf_nest_level
instead. This percpu variable is also in the bpf subsystem so that
if it is removed in the future, BPF-CI will catch this type of CI-
breaking change.

Signed-off-by: Amery Hung <[email protected]>
@ameryhung ameryhung force-pushed the remove_ls_percpu_locks_v3.4 branch from b7ebed6 to 47fd39b Compare November 20, 2025 17:57
@kernel-patches-daemon-bpf kernel-patches-daemon-bpf bot force-pushed the bpf-next_base branch 2 times, most recently from b0a5b86 to 44cbecf Compare November 21, 2025 18:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant