GPU plugin: skip announcing parent GPU if MIG-enabled (fix #719) #776
cmd/gpu-kubelet-plugin/nvlib.go
Outdated
	if gpuInfo.migEnabled {
		if len(migdevs) == 0 {
			// Likely uninintentionally stranded capacity (misconfiguration).
			klog.Warningf("Physical GPU %s has MIG mode enabled but no configured MIG devices", gpuInfo.CanonicalName())
		}
		for _, mdev := range migdevs {
			klog.Infof("Adding MIG device %s to allocatable devices (parent: %s)", mdev.CanonicalName(), gpuInfo.CanonicalName())
			devices[mdev.CanonicalName()] = mdev
		}
	} else {
		klog.Infof("Adding device %s to allocatable devices", gpuInfo.CanonicalName())
		devices[gpuInfo.CanonicalName()] = parentdev
	}
Suggested change:
-	if gpuInfo.migEnabled {
-		if len(migdevs) == 0 {
-			// Likely uninintentionally stranded capacity (misconfiguration).
-			klog.Warningf("Physical GPU %s has MIG mode enabled but no configured MIG devices", gpuInfo.CanonicalName())
-		}
-		for _, mdev := range migdevs {
-			klog.Infof("Adding MIG device %s to allocatable devices (parent: %s)", mdev.CanonicalName(), gpuInfo.CanonicalName())
-			devices[mdev.CanonicalName()] = mdev
-		}
-	} else {
-		klog.Infof("Adding device %s to allocatable devices", gpuInfo.CanonicalName())
-		devices[gpuInfo.CanonicalName()] = parentdev
-	}
+	if !gpuInfo.migEnabled {
+		klog.Infof("Adding device %s to allocatable devices", gpuInfo.CanonicalName())
+		devices[gpuInfo.CanonicalName()] = parentdev
+		return nil
+	}
+	// Likely unintentionally stranded capacity (misconfiguration).
+	if len(migdevs) == 0 {
+		klog.Warningf("Physical GPU %s has MIG mode enabled but no configured MIG devices", gpuInfo.CanonicalName())
+	}
+	for _, mdev := range migdevs {
+		klog.Infof("Adding MIG device %s to allocatable devices (parent: %s)", mdev.CanonicalName(), gpuInfo.CanonicalName())
+		devices[mdev.CanonicalName()] = mdev
+	}
+	return nil
+ }
ack, pushed change
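For readers who want to try the early-return shape outside the plugin, here is a minimal, self-contained sketch. The `gpu` type, the `announce` function, and the use of the standard `log` package in place of `klog` are all assumptions made for illustration; none of them are the plugin's real types or API.

```go
package main

import "log"

// gpu is a hypothetical stand-in for the plugin's GPU info type.
type gpu struct {
	name       string
	migEnabled bool
}

// announce adds either the full GPU or its MIG devices to devices, never both.
// The map value is just a presence flag in this sketch.
func announce(devices map[string]bool, g gpu, migdevs []string) {
	if !g.migEnabled {
		log.Printf("Adding device %s to allocatable devices", g.name)
		devices[g.name] = true
		return
	}
	// MIG mode is on: the parent GPU is deliberately not announced.
	if len(migdevs) == 0 {
		log.Printf("warning: GPU %s has MIG mode enabled but no configured MIG devices", g.name)
	}
	for _, m := range migdevs {
		log.Printf("Adding MIG device %s to allocatable devices (parent: %s)", m, g.name)
		devices[m] = true
	}
}
```

Calling `announce` for a MIG-enabled GPU with two MIG devices leaves exactly two entries in the map, and the parent's name is not among them.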
Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>
6ec44cc to b449d0b
	}

-	deviceInfo := &AllocatableDevice{
+	parentdev := &AllocatableDevice{
We don't need this unless the GPU does not have MIG enabled. Does it make sense to only instantiate this in that return path? (Not a blocker, though.)
Agree!
(not doing this here; this section will change quite a bit again in upcoming patches; in this PR it's OK to just do something that gets the test to pass)
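As a rough illustration of the reviewer's idea (not what this PR does), the parent device could be built inside the non-MIG branch so that no object is allocated for MIG-enabled GPUs. The names `gpuInfo`, `allocatableDevice`, and `addFullGPU` below are hypothetical stand-ins, not the plugin's actual definitions.

```go
package main

// gpuInfo and allocatableDevice are hypothetical stand-ins for the plugin's types.
type gpuInfo struct {
	name       string
	migEnabled bool
}

type allocatableDevice struct {
	gpu *gpuInfo
}

// addFullGPU registers the full-GPU device, constructing it only on the
// non-MIG path. It reports whether a device was added.
func addFullGPU(devices map[string]*allocatableDevice, g *gpuInfo) bool {
	if g.migEnabled {
		// MIG devices are handled elsewhere; no parent object is built here.
		return false
	}
	devices[g.name] = &allocatableDevice{gpu: g}
	return true
}
```

The trade-off is small: the construction is cheap either way, so this mainly keeps the variable's scope as narrow as its use.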
	if !gpuInfo.migEnabled {
		klog.Infof("Adding device %s to allocatable devices", gpuInfo.CanonicalName())
		devices[gpuInfo.CanonicalName()] = parentdev
		return nil
	}
Why do we not check this BEFORE we start iterating MIG devices?
You mean before l.discoverMigDevicesByGPU()?
Yes, that can make sense. (Not required for correct behavior, though, ack?)
Again, not doing that here; this section will change quite a bit again in upcoming patches.
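The reordering discussed here could look roughly like the sketch below, which checks the MIG flag first and only then runs discovery. It assumes a hypothetical `discoverFunc` callback in place of the real discovery method (whose signature is not shown in this thread), and the names `physicalGPU` and `collectDevices` are likewise made up for illustration.

```go
package main

// physicalGPU is a hypothetical stand-in for the plugin's GPU info type.
type physicalGPU struct {
	name       string
	migEnabled bool
}

// discoverFunc is a hypothetical callback standing in for MIG device discovery.
type discoverFunc func(g physicalGPU) ([]string, error)

// collectDevices returns the device names to announce for one GPU.
// Discovery runs only when MIG mode is enabled.
func collectDevices(g physicalGPU, discover discoverFunc) ([]string, error) {
	if !g.migEnabled {
		// Non-MIG GPU: announce the full GPU and skip discovery entirely.
		return []string{g.name}, nil
	}
	migdevs, err := discover(g)
	if err != nil {
		return nil, err
	}
	return migdevs, nil
}
```

As noted in the reply, the announced result is the same either way; the reordering only avoids a discovery call whose result would be ignored.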
/cherry-pick release-25.8
🤖 Backport PR created for
Fixes #719. Unskips the corresponding regression test.
From CI:
(logs)