mm: BPF OOM #9512
Conversation
Introduce a bpf struct ops for implementing custom OOM handling policies.

The struct ops provides the bpf_handle_out_of_memory() callback, which is expected to return 1 if it was able to free some memory and 0 otherwise. In the latter case it's guaranteed that the in-kernel OOM killer will be invoked. Otherwise the kernel also checks the bpf_memory_freed field of the oom_control structure, which is expected to be set by kfuncs suitable for releasing memory. It's a safety mechanism which prevents a bpf program from claiming forward progress without actually releasing memory. The callback program is sleepable, to enable using iterators, e.g. cgroup iterators.

The callback receives struct oom_control as an argument, so it can easily filter out OOMs it doesn't want to handle, e.g. global vs memcg OOMs.

The callback is executed just before the kernel victim task selection algorithm, so all heuristics and sysctls like panic on oom and sysctl_oom_kill_allocating_task are respected.

The struct ops also has the name field, which allows defining a custom name for the implemented policy. It's printed in the OOM report in the oom_policy=<policy> format. "default" is printed if bpf is not used or the policy name is not specified.

[ 112.696676] test_progs invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0 oom_policy=bpf_test_policy
[ 112.698160] CPU: 1 UID: 0 PID: 660 Comm: test_progs Not tainted 6.16.0-00015-gf09eb0d6badc #102 PREEMPT(full)
[ 112.698165] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-5.fc42 04/01/2014
[ 112.698167] Call Trace:
[ 112.698177] <TASK>
[ 112.698182] dump_stack_lvl+0x4d/0x70
[ 112.698192] dump_header+0x59/0x1c6
[ 112.698199] oom_kill_process.cold+0x8/0xef
[ 112.698206] bpf_oom_kill_process+0x59/0xb0
[ 112.698216] bpf_prog_7ecad0f36a167fd7_test_out_of_memory+0x2be/0x313
[ 112.698229] bpf__bpf_oom_ops_handle_out_of_memory+0x47/0xaf
[ 112.698236] ? srso_alias_return_thunk+0x5/0xfbef5
[ 112.698240] bpf_handle_oom+0x11a/0x1e0
[ 112.698250] out_of_memory+0xab/0x5c0
[ 112.698258] mem_cgroup_out_of_memory+0xbc/0x110
[ 112.698274] try_charge_memcg+0x4b5/0x7e0
[ 112.698288] charge_memcg+0x2f/0xc0
[ 112.698293] __mem_cgroup_charge+0x30/0xc0
[ 112.698299] do_anonymous_page+0x40f/0xa50
[ 112.698311] __handle_mm_fault+0xbba/0x1140
[ 112.698317] ? srso_alias_return_thunk+0x5/0xfbef5
[ 112.698335] handle_mm_fault+0xe6/0x370
[ 112.698343] do_user_addr_fault+0x211/0x6a0
[ 112.698354] exc_page_fault+0x75/0x1d0
[ 112.698363] asm_exc_page_fault+0x26/0x30
[ 112.698366] RIP: 0033:0x7fa97236db00

It's possible to load multiple bpf struct ops programs. In the case of an OOM, they will be executed one by one in the same order they have been loaded, until one of them returns 1 and bpf_memory_freed is set to 1 - an indication that the memory was freed. This allows having multiple bpf programs which focus on different types of OOMs - e.g. one program can only handle memcg OOMs in one memory cgroup. But the filtering is done in bpf, so it's fully flexible. Signed-off-by: Roman Gushchin <[email protected]>
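The dispatch semantics above (handlers run in load order; the chain stops only when a handler both returns 1 and has actually freed memory) can be sketched as a small userspace model. All names here are illustrative, not the kernel's:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model of the dispatch logic; struct and
 * function names are illustrative, not the kernel's. */
struct oom_control_sim {
	bool bpf_memory_freed;	/* set by memory-releasing kfuncs */
};

typedef int (*oom_handler_t)(struct oom_control_sim *oc);

/* A handler that claims progress without freeing memory: the safety
 * check below ignores its return value. */
static int lying_handler(struct oom_control_sim *oc)
{
	(void)oc;
	return 1;
}

/* A handler that frees memory, e.g. via bpf_oom_kill_process(),
 * which also sets bpf_memory_freed. */
static int freeing_handler(struct oom_control_sim *oc)
{
	oc->bpf_memory_freed = true;
	return 1;
}

/*
 * Handlers run in load order; the chain stops only when a handler
 * returns 1 *and* bpf_memory_freed is set. Otherwise the in-kernel
 * OOM killer runs.
 */
static bool bpf_handle_oom_sim(oom_handler_t *handlers, size_t n,
			       struct oom_control_sim *oc)
{
	oc->bpf_memory_freed = false;	/* fresh OOM event */
	for (size_t i = 0; i < n; i++) {
		if (handlers[i](oc) == 1 && oc->bpf_memory_freed)
			return true;	/* skip the kernel OOM killer */
	}
	return false;			/* kernel OOM killer is invoked */
}

static oom_handler_t chain_ok[] = { lying_handler, freeing_handler };
static oom_handler_t chain_bad[] = { lying_handler };
static struct oom_control_sim g_oc;
```

Note how the lying handler alone cannot suppress the kernel OOM killer: returning 1 without setting bpf_memory_freed is ignored, which is exactly the safety mechanism described above.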
Struct oom_control is used to describe the OOM context. Its memcg field defines the scope of the OOM: it's NULL for global OOMs and a valid memcg pointer for memcg-scoped OOMs. Teach the bpf verifier to recognize it as a trusted or NULL pointer. This provides the bpf OOM handler with a trusted memcg pointer, which for example is required for iterating over the memcg's subtree. Signed-off-by: Roman Gushchin <[email protected]>
Introduce the bpf_oom_kill_process() bpf kfunc, which is supposed to be used by bpf OOM programs. It allows killing a process in exactly the same way the OOM killer does: using the OOM reaper, bumping the corresponding memcg and global statistics, respecting memory.oom.group etc. On success, it sets oom_control's bpf_memory_freed field to true, enabling the bpf program to bypass the kernel OOM killer. Signed-off-by: Roman Gushchin <[email protected]>
To effectively operate with memory cgroups in bpf, there is a need to convert css pointers to memcg pointers. A simple container_of cast, which is used in kernel code, can't be used in bpf because from the verifier's point of view it's an out-of-bounds memory access. Introduce helper get/put kfuncs which can be used to get a refcounted memcg pointer from a css pointer:
- bpf_get_mem_cgroup(),
- bpf_put_mem_cgroup().
bpf_get_mem_cgroup() can take both the memcg's css and the corresponding cgroup's "self" css. This allows it to be used with the existing cgroup iterator, which iterates over the cgroup tree, not the memcg tree. Signed-off-by: Roman Gushchin <[email protected]>
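To see why a kfunc is needed at all, it helps to look at what the kernel-side cast actually does. The following userspace sketch (with minimal stand-in structs, not the real kernel definitions) shows that container_of is plain pointer arithmetic, stepping backwards from an embedded member to its container - which is precisely the kind of raw offset access the BPF verifier cannot prove safe:

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-ins for the kernel structs, for illustration only. */
struct cgroup_subsys_state {
	int refcnt;
};

struct mem_cgroup {
	long usage;			/* placeholder field */
	struct cgroup_subsys_state css;	/* embedded, as in the kernel */
};

/* The kernel's container_of: subtract the member offset to step from
 * the embedded css back to the enclosing mem_cgroup. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

static struct mem_cgroup *css_to_memcg(struct cgroup_subsys_state *css)
{
	return container_of(css, struct mem_cgroup, css);
}

static struct mem_cgroup g_memcg;
```

In plain C this round-trip is legal; in BPF the subtraction lands outside the object the verifier knows about, hence bpf_get_mem_cgroup()/bpf_put_mem_cgroup().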
Introduce a bpf kfunc to get a trusted pointer to the root memory cgroup. It's very handy for traversing the full memcg tree, e.g. for handling a system-wide OOM. It's possible to obtain this pointer by traversing the memcg tree up from any known memcg, but that is sub-optimal and makes bpf programs more complex and less efficient. bpf_get_root_mem_cgroup() has KF_ACQUIRE | KF_RET_NULL semantics; however, in reality it's not necessary to bump the corresponding reference counter - the root memory cgroup is immortal and reference counting is skipped, see css_get(). Once set, root_mem_cgroup is always a valid memcg pointer. It's safe to call bpf_put_mem_cgroup() on the pointer obtained with bpf_get_root_mem_cgroup(); it's effectively a no-op. Signed-off-by: Roman Gushchin <[email protected]>
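The refcounting rule above (normal get/put pairing, but no-ops for the immortal root) can be modeled in a few lines. This is a toy illustration with invented names, not the kernel's css refcounting code:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: get/put are no-ops for the immortal root memcg
 * (cf. css_get() skipping refcounting for the root), and normal
 * refcounting otherwise. Names are illustrative. */
struct memcg_sim {
	bool is_root;
	int refcnt;
};

static void memcg_get_sim(struct memcg_sim *m)
{
	if (!m->is_root)
		m->refcnt++;
}

static void memcg_put_sim(struct memcg_sim *m)
{
	if (!m->is_root)
		m->refcnt--;
}

static bool refcount_model_ok(void)
{
	struct memcg_sim root = { .is_root = true };
	struct memcg_sim child = { 0 };

	/* get/put on the root are safe and change nothing */
	memcg_get_sim(&root);
	memcg_put_sim(&root);

	memcg_get_sim(&child);
	return root.refcnt == 0 && child.refcnt == 1;
}
```

This is why bpf programs can treat the root pointer uniformly with any other memcg pointer: the KF_ACQUIRE/put discipline holds everywhere, it just happens to cost nothing for the root.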
Introduce the bpf_out_of_memory() bpf kfunc, which allows declaring an out-of-memory event and triggering the corresponding kernel OOM handling mechanism. It takes a trusted memcg pointer (or NULL for system-wide OOMs) as an argument, as well as the page order. If the wait_on_oom_lock argument is not set, only one OOM can be declared and handled in the system at once, so if the function is called in parallel to another OOM handling, it bails out with -EBUSY. This mode is suited for global OOMs: any concurrent OOM will likely do the job and release some memory. In blocking mode (which is suited for memcg OOMs) the execution will wait on the oom_lock mutex. The function is declared as sleepable. This guarantees that it won't be called from an atomic context, which is required by the OOM handling code: it is not guaranteed to work in a non-blocking context. Handling a memcg OOM almost always requires taking the css_set_lock spinlock. The fact that bpf_out_of_memory() is sleepable also guarantees that it can't be called with css_set_lock acquired, so the kernel can't deadlock on it. Signed-off-by: Roman Gushchin <[email protected]>
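The two locking modes can be sketched as a single-threaded toy model. The real code serializes on the oom_lock mutex; here a boolean stands in for it, and the names are illustrative:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy single-threaded model of the oom_lock policy; the real code
 * uses the global oom_lock mutex. */
static bool oom_lock_held;

static int bpf_out_of_memory_sim(bool wait_on_oom_lock)
{
	if (oom_lock_held) {
		if (!wait_on_oom_lock)
			return -EBUSY;	/* a concurrent OOM will help */
		/* blocking mode: mutex_lock(&oom_lock) would sleep here;
		 * model the other holder finishing and releasing it */
		oom_lock_held = false;
	}
	oom_lock_held = true;
	/* ... out_of_memory() would run here ... */
	oom_lock_held = false;
	return 0;
}

static bool oom_lock_model_ok(void)
{
	oom_lock_held = true;			/* another OOM in flight */
	if (bpf_out_of_memory_sim(false) != -EBUSY)
		return false;
	if (bpf_out_of_memory_sim(true) != 0)	/* blocking mode waits */
		return false;
	return bpf_out_of_memory_sim(false) == 0;
}
```

The asymmetry matches the use cases above: a global OOM declarer can afford to bail out, while a memcg OOM declarer typically wants its specific OOM handled and so waits its turn.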
Currently there is a hard-coded list of possible oom constraints: NONE, CPUSET, MEMORY_POLICY & MEMCG. Add a new one: CONSTRAINT_BPF. Also, add the ability to specify a custom constraint name when calling bpf_out_of_memory(). If an empty string is passed as an argument, CONSTRAINT_BPF is displayed. The resulting output in dmesg will look like this:

[ 315.224875] kworker/u17:0 invoked oom-killer: gfp_mask=0x0(), order=0, oom_score_adj=0 oom_policy=default
[ 315.226532] CPU: 1 UID: 0 PID: 74 Comm: kworker/u17:0 Not tainted 6.16.0-00015-gf09eb0d6badc #102 PREEMPT(full)
[ 315.226534] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-5.fc42 04/01/2014
[ 315.226536] Workqueue: bpf_psi_wq bpf_psi_handle_event_fn
[ 315.226542] Call Trace:
[ 315.226545] <TASK>
[ 315.226548] dump_stack_lvl+0x4d/0x70
[ 315.226555] dump_header+0x59/0x1c6
[ 315.226561] oom_kill_process.cold+0x8/0xef
[ 315.226565] out_of_memory+0x111/0x5c0
[ 315.226577] bpf_out_of_memory+0x6f/0xd0
[ 315.226580] ? srso_alias_return_thunk+0x5/0xfbef5
[ 315.226589] bpf_prog_3018b0cf55d2c6bb_handle_psi_event+0x5d/0x76
[ 315.226594] bpf__bpf_psi_ops_handle_psi_event+0x47/0xa7
[ 315.226599] bpf_psi_handle_event_fn+0x63/0xb0
[ 315.226604] process_one_work+0x1fc/0x580
[ 315.226616] ? srso_alias_return_thunk+0x5/0xfbef5
[ 315.226624] worker_thread+0x1d9/0x3b0
[ 315.226629] ? __pfx_worker_thread+0x10/0x10
[ 315.226632] kthread+0x128/0x270
[ 315.226637] ? lock_release+0xd4/0x2d0
[ 315.226645] ? __pfx_kthread+0x10/0x10
[ 315.226649] ret_from_fork+0x81/0xd0
[ 315.226652] ? __pfx_kthread+0x10/0x10
[ 315.226655] ret_from_fork_asm+0x1a/0x30
[ 315.226667] </TASK>
[ 315.239745] memory: usage 42240kB, limit 9007199254740988kB, failcnt 0
[ 315.240231] swap: usage 0kB, limit 0kB, failcnt 0
[ 315.240585] Memory cgroup stats for /cgroup-test-work-dir673/oom_test/cg2:
[ 315.240603] anon 42897408
[ 315.241317] file 0
[ 315.241493] kernel 98304
...
[ 315.255946] Tasks state (memory values in pages):
[ 315.256292] [ pid ] uid tgid total_vm rss rss_anon rss_file rss_shmem pgtables_bytes swapents oom_score_adj name
[ 315.257107] [ 675] 0 675 162013 10969 10712 257 0 155648 0 0 test_progs
[ 315.257927] oom-kill:constraint=CONSTRAINT_BPF_PSI_MEM,nodemask=(null),cpuset=/,mems_allowed=0,oom_memcg=/cgroup-test-work-dir673/oom_test/cg2,task_memcg=/cgroup-test-work-dir673/oom_test/cg2,task=test_progs,pid=675,uid=0
[ 315.259371] Memory cgroup out of memory: Killed process 675 (test_progs) total-vm:648052kB, anon-rss:42848kB, file-rss:1028kB, shmem-rss:0kB, UID:0 pgtables:152kB oom_score_adj:0

Signed-off-by: Roman Gushchin <[email protected]>
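The naming rule can be sketched as follows. The empty-string fallback to CONSTRAINT_BPF is stated above; the CONSTRAINT_BPF_<name> form for non-empty names is an assumption inferred from the CONSTRAINT_BPF_PSI_MEM line in the log, and the function is hypothetical:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical sketch of the constraint-name formatting: an empty
 * custom name falls back to plain CONSTRAINT_BPF; the prefixed form
 * is inferred from the oom-kill log line above. */
static void bpf_constraint_name(const char *custom, char *buf, size_t len)
{
	if (custom && custom[0])
		snprintf(buf, len, "CONSTRAINT_BPF_%s", custom);
	else
		snprintf(buf, len, "CONSTRAINT_BPF");
}

static bool constraint_name_ok(void)
{
	char buf[64];

	bpf_constraint_name("PSI_MEM", buf, sizeof(buf));
	if (strcmp(buf, "CONSTRAINT_BPF_PSI_MEM") != 0)
		return false;

	bpf_constraint_name("", buf, sizeof(buf));
	return strcmp(buf, "CONSTRAINT_BPF") == 0;
}
```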
Export the tsk_is_oom_victim() helper as a bpf kfunc. It's very useful for avoiding redundant oom kills. Signed-off-by: Roman Gushchin <[email protected]>
Implement read_cgroup_file() helper to read from cgroup control files, e.g. statistics. Signed-off-by: Roman Gushchin <[email protected]>
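A helper of this kind is essentially "slurp a control file into a buffer". The sketch below shows one plausible shape; the selftest helper's actual signature and behavior are assumptions:

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>

/* Sketch of a read_cgroup_file()-style helper: read a control file
 * (e.g. memory.stat) into a caller-supplied, NUL-terminated buffer.
 * The real helper's exact signature is an assumption. */
static ssize_t read_file_into(const char *path, char *buf, size_t len)
{
	FILE *f = fopen(path, "r");
	size_t n;

	if (!f)
		return -1;
	n = fread(buf, 1, len - 1, f);
	buf[n] = '\0';	/* control files are text; terminate for parsing */
	fclose(f);
	return (ssize_t)n;
}

static char g_buf[4096];
```

A caller would then parse the buffer, e.g. scan the memory.stat text for a counter line after reading /sys/fs/cgroup/<group>/memory.stat (path illustrative).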
Implement a pseudo-realistic test for the OOM handling functionality. The OOM handling policy, which is implemented in bpf, is to kill all tasks belonging to the biggest leaf cgroup which doesn't contain unkillable tasks (tasks with oom_score_adj set to -1000). Pagecache size is excluded from the accounting. The test creates a hierarchy of memory cgroups, causes an OOM at the top level, checks that the expected process is killed and checks the memcg's oom statistics. Signed-off-by: Roman Gushchin <[email protected]>
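The selection policy of the test can be modeled compactly: among leaf cgroups with no unkillable tasks, pick the one with the largest charge excluding pagecache. This is a toy userspace model with illustrative names, not the bpf test program itself:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the test policy described above. */
struct leaf_sim {
	long usage;
	long pagecache;
	bool has_unkillable;	/* any task with oom_score_adj == -1000 */
};

static int pick_victim(const struct leaf_sim *leaves, int n)
{
	int best = -1;
	long best_sz = -1;

	for (int i = 0; i < n; i++) {
		/* pagecache is excluded from the accounting */
		long sz = leaves[i].usage - leaves[i].pagecache;

		if (leaves[i].has_unkillable)
			continue;	/* never target these cgroups */
		if (sz > best_sz) {
			best_sz = sz;
			best = i;
		}
	}
	return best;
}

/* Leaf 0 is biggest but protected; leaf 1 is mostly pagecache;
 * leaf 2 wins on reclaimable-by-kill size. */
static const struct leaf_sim sample_leaves[] = {
	{ .usage = 100, .pagecache = 0,  .has_unkillable = true  },
	{ .usage = 90,  .pagecache = 60, .has_unkillable = false },
	{ .usage = 50,  .pagecache = 0,  .has_unkillable = false },
};
```

Excluding pagecache matters because killing tasks doesn't directly release page cache, so raw usage would overstate how much memory an OOM kill can actually recover.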
Currently psi_trigger_create() does a lot of things: parses the user text input, allocates and initializes the psi_trigger structure and turns on the trigger. It does this slightly differently for the two existing types of psi triggers: system-wide and cgroup-wide. In order to support a new type of psi triggers, which will be owned by a bpf program and won't have a user's text description, refactor psi_trigger_create():
1. Introduce the psi_trigger_type enum: currently PSI_SYSTEM and PSI_CGROUP are valid values.
2. Introduce the psi_trigger_params structure to avoid passing a large number of parameters to psi_trigger_create().
3. Move the parsing of the user's input into the new psi_trigger_parse() helper.
4. Move the capabilities check into the new psi_file_privileged() helper.
5. Stop relying on t->of for detecting the trigger type.
Signed-off-by: Roman Gushchin <[email protected]>
This patch implements a bpf struct ops-based mechanism to create psi triggers, attach them to cgroups or system-wide, and handle psi events in bpf. The struct ops provides 3 callbacks:
- init(): called once at load time, handy for creating psi triggers
- handle_psi_event(): called every time a psi trigger fires
- handle_cgroup_free(): called if a cgroup with an attached trigger is being freed
A single struct ops can create a number of psi triggers, both cgroup-scoped and system-wide. All 3 struct ops callbacks can be sleepable. handle_psi_event() handlers are executed using a separate workqueue, so they won't affect the latency of other psi triggers. Signed-off-by: Roman Gushchin <[email protected]>
Implement a new bpf_psi_create_trigger() bpf kfunc, which allows creating new psi triggers and attaching them to cgroups or making them system-wide. Created triggers exist as long as the struct ops is loaded and, if they are attached to a cgroup, as long as the cgroup exists. Due to the limitation of 5 kfunc arguments, the resource type and the "full" bit are squeezed into a single u32. Signed-off-by: Roman Gushchin <[email protected]>
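Squeezing a resource type and a flag bit into one u32 is a standard packing trick. The sketch below shows the idea; the particular bit layout (flag in the top bit) is an assumption for illustration, not necessarily what the kfunc uses:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative packing: psi resource type in the low bits, the "full"
 * flag in the top bit of the same u32 argument. The actual layout used
 * by bpf_psi_create_trigger() may differ. */
#define PSI_FULL_FLAG	(1u << 31)

static uint32_t psi_pack(uint32_t resource, int full)
{
	return resource | (full ? PSI_FULL_FLAG : 0);
}

static uint32_t psi_unpack_resource(uint32_t arg)
{
	return arg & ~PSI_FULL_FLAG;
}

static int psi_unpack_full(uint32_t arg)
{
	return !!(arg & PSI_FULL_FLAG);
}
```

The caller packs once, and the kfunc unpacks both values from a single argument slot, staying within the 5-argument limit.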
Add a psi struct ops test. The test creates a cgroup with two child sub-cgroups, sets up memory.high for one of them and puts a memory-hungry process there (initially frozen). Then it creates 2 psi triggers from within the init() bpf callback and attaches them to these cgroups. Then it deletes the first cgroup and runs the memory-hungry task. The task creates high memory pressure, which triggers the psi event. The psi bpf handler declares a memcg oom in the corresponding cgroup. Finally, the test checks that both the handle_cgroup_free() and handle_psi_event() handlers were executed, the correct process was killed and the oom counters were updated. Signed-off-by: Roman Gushchin <[email protected]>
Pull request for series with
subject: mm: BPF OOM
version: 1
url: https://patchwork.kernel.org/project/netdevbpf/list/?series=992643