
Inconsistent out of memory issue on node with rhods notebooks #159

@vbedida79


Summary

On OCP 4.13, using RHODS (Red Hat OpenShift Data Science) with OpenVINO notebooks, the notebook kernel restarts intermittently with out-of-memory messages.

Details

OCP 4.13 cluster with an Intel Data Center GPU Flex 170 and a notebook whose memory request and limit are both set to 56 GB (the cgroup limit reported in the log below is 58720256 kB, i.e. 56 GiB).
When using RHODS with the OpenVINO notebook image, specifically while executing the stable diffusion notebook, the Python notebook kernel restarts intermittently. dmesg on the node shows:

[    0.019134] Early memory node ranges
[    0.023751] PM: hibernation: Registered nosave memory: [mem 0x00000000-0x00000fff]
[    0.023753] PM: hibernation: Registered nosave memory: [mem 0x0009d000-0x000fffff]
[    0.023755] PM: hibernation: Registered nosave memory: [mem 0x59039000-0x59039fff]
[    0.023757] PM: hibernation: Registered nosave memory: [mem 0x590fb000-0x590fbfff]
[    0.023758] PM: hibernation: Registered nosave memory: [mem 0x5ee4e000-0x5ee4efff]
[    0.023760] PM: hibernation: Registered nosave memory: [mem 0x5ee85000-0x5ee85fff]
[    0.023760] PM: hibernation: Registered nosave memory: [mem 0x5ee86000-0x5ee86fff]
[    0.023762] PM: hibernation: Registered nosave memory: [mem 0x5eebd000-0x5eebdfff]
[    0.023764] PM: hibernation: Registered nosave memory: [mem 0x5ef0b000-0x5efecfff]
[    0.023765] PM: hibernation: Registered nosave memory: [mem 0x66d71000-0x6866dfff]
[    0.023766] PM: hibernation: Registered nosave memory: [mem 0x6866e000-0x69897fff]
[    0.023766] PM: hibernation: Registered nosave memory: [mem 0x69898000-0x69dfdfff]
[    0.023768] PM: hibernation: Registered nosave memory: [mem 0x6f800000-0x8fffffff]
[    0.023769] PM: hibernation: Registered nosave memory: [mem 0x90000000-0xfdffffff]
[    0.023769] PM: hibernation: Registered nosave memory: [mem 0xfe000000-0xfe010fff]
[    0.023770] PM: hibernation: Registered nosave memory: [mem 0xfe011000-0xfed1ffff]
[    0.023770] PM: hibernation: Registered nosave memory: [mem 0xfed20000-0xfed44fff]
[    0.023771] PM: hibernation: Registered nosave memory: [mem 0xfed45000-0xffffffff]
[    0.237871] Freeing SMP alternatives memory: 36K
[    3.572274] Non-volatile memory driver v1.3
[    3.653525] Freeing initrd memory: 89312K
[    4.228204] Freeing unused decrypted memory: 2036K
[    4.232827] Freeing unused kernel image (initmem) memory: 2788K
[    4.247331] Freeing unused kernel image (text/rodata gap) memory: 2040K
[    4.251702] Freeing unused kernel image (rodata/data gap) memory: 60K
[   11.014980] i2c i2c-0: 16/32 memory slots populated (from DMI)
[   11.014982] i2c i2c-0: Systems with more than 4 memory slots not supported yet, not instantiating SPD
[   12.964055] EDAC i10nm: No hbm memory
[ 1357.676966] i915 0000:33:00.0: [drm] Local memory IO size: 0x000000037a800000
[ 1357.676968] i915 0000:33:00.0: [drm] Local memory available: 0x000000037a800000
[407440.611017]  out_of_memory+0xed/0x2e0
[407440.611029]  mem_cgroup_out_of_memory+0x13a/0x150
[407440.611116] memory: usage 58720252kB, limit 58720256kB, failcnt 23
[407440.611117] memory+swap: usage 58720252kB, limit 58720256kB, failcnt 17987903
[407440.611133] Tasks state (memory values in pages):
[407440.612317] Memory cgroup out of memory: Killed process 1535268 (python3.8) total-vm:1209308992kB, anon-rss:41459744kB, file-rss:466276kB, shmem-rss:4kB, UID:1000750000 pgtables:151104kB oom_score_adj:778
[408339.735618]  out_of_memory+0xed/0x2e0
[408339.735629]  mem_cgroup_out_of_memory+0x13a/0x150
[408339.735686] memory: usage 58720256kB, limit 58720256kB, failcnt 23
[408339.735687] memory+swap: usage 58720256kB, limit 58720256kB, failcnt 21385997
[408339.735703] Tasks state (memory values in pages):
[408339.736085] Memory cgroup out of memory: Killed process 2725201 (python3.8) total-vm:132980172kB, anon-rss:41961444kB, file-rss:304524kB, shmem-rss:4kB, UID:1000750000 pgtables:90372kB oom_score_adj:778
[457794.119151]  out_of_memory+0xed/0x2e0
[457794.119162]  mem_cgroup_out_of_memory+0x13a/0x150
[457794.119215] memory: usage 58720256kB, limit 58720256kB, failcnt 23
[457794.119217] memory+swap: usage 58720256kB, limit 58720256kB, failcnt 24769451
[457794.119234] Tasks state (memory values in pages):
[457794.119591] Memory cgroup out of memory: Killed process 2740651 (python3.8) total-vm:132968056kB, anon-rss:41960760kB, file-rss:305636kB, shmem-rss:4kB, UID:1000750000 pgtables:90380kB oom_score_adj:778
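
To make the numbers in the OOM lines easier to compare, here is a minimal sketch that parses two of the lines above and converts the kB values to GiB. The function and variable names (`parse_oom_lines`, `SAMPLE`, etc.) are only illustrative, and the regular expressions assume the exact message layout seen in this log; other kernel versions may format these lines differently.

```python
import re

KIB_PER_GIB = 1024 * 1024  # the kernel reports these values in kB (1 kB = 1024 bytes here)

# Two lines copied verbatim from the dmesg output above.
SAMPLE = """\
[407440.611116] memory: usage 58720252kB, limit 58720256kB, failcnt 23
[407440.612317] Memory cgroup out of memory: Killed process 1535268 (python3.8) total-vm:1209308992kB, anon-rss:41459744kB, file-rss:466276kB, shmem-rss:4kB, UID:1000750000 pgtables:151104kB oom_score_adj:778
"""

# Matches the "memory: usage ..." line (not the "memory+swap: usage ..." line).
USAGE_RE = re.compile(r"memory: usage (\d+)kB, limit (\d+)kB, failcnt (\d+)")
# Matches the "Killed process ..." line and captures the per-process RSS fields.
KILL_RE = re.compile(
    r"Killed process (\d+) \(([^)]+)\) total-vm:(\d+)kB, anon-rss:(\d+)kB, "
    r"file-rss:(\d+)kB, shmem-rss:(\d+)kB"
)


def kib_to_gib(kib):
    """Convert a kB value from dmesg to GiB."""
    return kib / KIB_PER_GIB


def parse_oom_lines(log_text):
    """Return (limits, kills) parsed from cgroup OOM lines in dmesg text."""
    limits, kills = [], []
    for line in log_text.splitlines():
        m = USAGE_RE.search(line)
        if m:
            usage_kib, limit_kib, failcnt = map(int, m.groups())
            limits.append({
                "usage_gib": kib_to_gib(usage_kib),
                "limit_gib": kib_to_gib(limit_kib),
                "failcnt": failcnt,
            })
            continue
        m = KILL_RE.search(line)
        if m:
            anon, file_, shmem = (int(m.group(i)) for i in (4, 5, 6))
            kills.append({
                "pid": int(m.group(1)),
                "name": m.group(2),
                # Resident memory of the killed process: anon + file + shmem.
                "rss_gib": kib_to_gib(anon + file_ + shmem),
            })
    return limits, kills


if __name__ == "__main__":
    limits, kills = parse_oom_lines(SAMPLE)
    for entry in limits:
        print(f"cgroup usage {entry['usage_gib']:.2f} GiB of {entry['limit_gib']:.2f} GiB limit")
    for entry in kills:
        print(f"killed {entry['name']} (pid {entry['pid']}), RSS {entry['rss_gib']:.2f} GiB")
```

For the first OOM event, the limit works out to exactly 56 GiB, while the killed python3.8 process had just under 40 GiB resident. The remaining memory charged to the cgroup is not broken down in this excerpt, so where it went (other processes, page cache, or kernel memory) still needs to be checked.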

Todo/Solutions

Need to confirm the root cause: whether the kills are driven by CPU, GPU, or memory issues on the node itself.
Also execute other OpenVINO notebooks and verify whether the issue reproduces with them.
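
One way to help narrow down the root cause would be to log the container's cgroup memory usage from inside the notebook while the cells run. The following is a rough sketch, not a verified procedure for RHODS: it assumes the notebook container exposes the standard cgroup files under `/sys/fs/cgroup` (cgroup v2 `memory.current`/`memory.max`, or cgroup v1 `memory/memory.usage_in_bytes`/`memory/memory.limit_in_bytes`). The helper names `read_cgroup_memory` and `usage_fraction` are hypothetical.

```python
from pathlib import Path


def read_cgroup_memory(root="/sys/fs/cgroup"):
    """Return (usage_bytes, limit_bytes) for the current cgroup.

    limit_bytes is None when cgroup v2 reports "max" (no limit).
    Assumes the standard cgroup file layout is visible under `root`.
    """
    root = Path(root)
    candidates = [
        # cgroup v2 layout
        (root / "memory.current", root / "memory.max"),
        # cgroup v1 layout
        (root / "memory" / "memory.usage_in_bytes",
         root / "memory" / "memory.limit_in_bytes"),
    ]
    for usage_path, limit_path in candidates:
        if usage_path.exists() and limit_path.exists():
            usage = int(usage_path.read_text().strip())
            raw_limit = limit_path.read_text().strip()
            limit = None if raw_limit == "max" else int(raw_limit)
            return usage, limit
    raise FileNotFoundError(f"no cgroup memory files found under {root}")


def usage_fraction(usage, limit):
    """Fraction of the limit currently used, or None if there is no limit."""
    if not limit:
        return None
    return usage / limit


if __name__ == "__main__":
    try:
        usage, limit = read_cgroup_memory()
    except FileNotFoundError as exc:
        print(f"cgroup memory files not available: {exc}")
    else:
        frac = usage_fraction(usage, limit)
        gib = 1024 ** 3
        print(f"usage: {usage / gib:.2f} GiB")
        if frac is not None:
            print(f"limit: {limit / gib:.2f} GiB ({frac:.1%} used)")
```

Calling `read_cgroup_memory()` between notebook cells (or from a background thread) would show whether usage climbs steadily toward the limit or jumps at a particular step, which could be compared across the other OpenVINO notebooks.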


Labels: bug (Something isn't working), gpu (Intel GPU)
