@jgehrcke
Collaborator

@jgehrcke jgehrcke commented Dec 5, 2025

Fixes #719. Unskips the corresponding regression test.

From CI:

ok 13 static MIG: mutual exclusivity with physical GPU in 20508ms

(logs)
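
The invariant named in that test ("mutual exclusivity with physical GPU") can be sketched as a small check over a list of advertised devices. This is only an illustration: the `advertised` type and the `violations` helper are hypothetical and not part of the driver or its test suite, and the actual regression test is not shown in this conversation.

```go
package main

import "fmt"

// advertised is a hypothetical, simplified view of one advertised device.
// parent is empty for a physical GPU and names the parent GPU for a MIG device.
type advertised struct {
	name   string
	parent string
}

// violations returns the names of physical GPUs that are advertised
// together with at least one of their own MIG devices.
func violations(devs []advertised) []string {
	// First pass: record which physical GPUs are advertised.
	present := map[string]bool{}
	for _, d := range devs {
		if d.parent == "" {
			present[d.name] = true
		}
	}
	// Second pass: flag each parent that appears next to one of its MIG devices.
	seen := map[string]bool{}
	var out []string
	for _, d := range devs {
		if d.parent != "" && present[d.parent] && !seen[d.parent] {
			seen[d.parent] = true
			out = append(out, d.parent)
		}
	}
	return out
}

func main() {
	devs := []advertised{
		{name: "gpu-0"},
		{name: "gpu-1"},
		{name: "gpu-1-mig-0", parent: "gpu-1"},
	}
	fmt.Println(violations(devs))
}
```

An empty result means no physical GPU is advertised alongside its own MIG devices.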

@jgehrcke jgehrcke self-assigned this Dec 5, 2025
@jgehrcke jgehrcke moved this from Backlog to In Progress in Planning Board: k8s-dra-driver-gpu Dec 5, 2025
@jgehrcke jgehrcke added this to the v25.12.0 milestone Dec 5, 2025
@jgehrcke jgehrcke added the bug Issue/PR to expose/discuss/fix a bug label Dec 5, 2025
@jgehrcke jgehrcke requested a review from klueska December 5, 2025 17:31
Comment on lines 160 to 172
if gpuInfo.migEnabled {
	if len(migdevs) == 0 {
		// Likely unintentionally stranded capacity (misconfiguration).
		klog.Warningf("Physical GPU %s has MIG mode enabled but no configured MIG devices", gpuInfo.CanonicalName())
	}
	for _, mdev := range migdevs {
		klog.Infof("Adding MIG device %s to allocatable devices (parent: %s)", mdev.CanonicalName(), gpuInfo.CanonicalName())
		devices[mdev.CanonicalName()] = mdev
	}
} else {
	klog.Infof("Adding device %s to allocatable devices", gpuInfo.CanonicalName())
	devices[gpuInfo.CanonicalName()] = parentdev
}
Collaborator

Suggested change
	if !gpuInfo.migEnabled {
		klog.Infof("Adding device %s to allocatable devices", gpuInfo.CanonicalName())
		devices[gpuInfo.CanonicalName()] = parentdev
		return nil
	}
	// Likely unintentionally stranded capacity (misconfiguration).
	if len(migdevs) == 0 {
		klog.Warningf("Physical GPU %s has MIG mode enabled but no configured MIG devices", gpuInfo.CanonicalName())
	}
	for _, mdev := range migdevs {
		klog.Infof("Adding MIG device %s to allocatable devices (parent: %s)", mdev.CanonicalName(), gpuInfo.CanonicalName())
		devices[mdev.CanonicalName()] = mdev
	}
	return nil
}

Collaborator Author

ack, pushed change
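
For readers skimming the thread, the early-return shape of the suggestion can be sketched in isolation. This is a simplified illustration, not the driver's code: the `device` and `gpu` types and the `addAllocatable` function are hypothetical stand-ins, and plain `fmt` output replaces `klog`.

```go
package main

import "fmt"

// device is a hypothetical stand-in for the driver's AllocatableDevice.
type device struct {
	name   string
	parent string // empty for a physical GPU
}

// gpu is a hypothetical stand-in for the per-GPU info used in the PR.
type gpu struct {
	name       string
	migEnabled bool
}

// addAllocatable registers either the physical GPU or its MIG devices,
// never both. The non-MIG case returns early, so the MIG handling below
// needs no else branch.
func addAllocatable(devices map[string]device, g gpu, migdevs []device) {
	if !g.migEnabled {
		devices[g.name] = device{name: g.name}
		return
	}
	// MIG mode is on: the physical GPU itself is not registered.
	if len(migdevs) == 0 {
		fmt.Printf("warning: GPU %s has MIG mode enabled but no configured MIG devices\n", g.name)
	}
	for _, m := range migdevs {
		devices[m.name] = m
	}
}

func main() {
	devices := map[string]device{}
	addAllocatable(devices, gpu{name: "gpu-0"}, nil)
	addAllocatable(devices, gpu{name: "gpu-1", migEnabled: true}, []device{
		{name: "gpu-1-mig-0", parent: "gpu-1"},
		{name: "gpu-1-mig-1", parent: "gpu-1"},
	})
	for name := range devices {
		fmt.Println(name)
	}
}
```

In this sketch a MIG-enabled GPU with no MIG devices contributes nothing to the map, which matches the "stranded capacity" warning in the reviewed code.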

}

- deviceInfo := &AllocatableDevice{
+ parentdev := &AllocatableDevice{
Member

We don't need this unless the GPU does not have MIG enabled. Does it make sense to only instantiate this in that return path? (not a blocker though).

Collaborator Author

Agree!

(not doing this here; this section will change quite a bit again in upcoming patches; in this PR it's OK to just do something that gets the test to pass)
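
A rough sketch of what the deferred follow-up could look like, with the parent device built only in the non-MIG return path. The names here (`allocatable`, `newParentDevice`, `register`, and the `built` counter) are hypothetical and exist only to make the effect observable; they are not taken from the driver.

```go
package main

import "fmt"

// allocatable is a hypothetical stand-in for the driver's AllocatableDevice.
type allocatable struct{ name string }

// built counts constructed parent devices (for illustration only).
var built int

// newParentDevice is a hypothetical constructor for the full-GPU device.
func newParentDevice(name string) *allocatable {
	built++
	return &allocatable{name: name}
}

// register constructs the parent device only when it is actually needed,
// that is, only when the GPU does not have MIG mode enabled.
func register(devices map[string]*allocatable, name string, migEnabled bool, migdevs []*allocatable) {
	if !migEnabled {
		devices[name] = newParentDevice(name)
		return
	}
	for _, m := range migdevs {
		devices[m.name] = m
	}
}

func main() {
	devices := map[string]*allocatable{}
	register(devices, "gpu-0", true, []*allocatable{{name: "gpu-0-mig-0"}})
	register(devices, "gpu-1", false, nil)
	fmt.Println("parent devices built:", built, "registered:", len(devices))
}
```

The behavior is the same as building the object up front; the only difference is that no unused object is created for MIG-enabled GPUs.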

Comment on lines +160 to +164
if !gpuInfo.migEnabled {
	klog.Infof("Adding device %s to allocatable devices", gpuInfo.CanonicalName())
	devices[gpuInfo.CanonicalName()] = parentdev
	return nil
}
Member

Why do we not check this BEFORE we start iterating MIG devices?

Collaborator Author

@jgehrcke jgehrcke Dec 8, 2025

You mean before

l.discoverMigDevicesByGPU()

?

Yes, that can make sense. (It is not required for correct behavior, though, ack?)

Again, not doing that here -- this section will change quite a bit again in upcoming patches.
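
The reordering discussed in this thread could look roughly like the sketch below, where the MIG-enabled check happens before any MIG discovery work. It assumes discovery can be invoked per GPU, which may not match how `l.discoverMigDevicesByGPU()` is actually structured; `gpuInfo`, `discoverFunc`, and `enumerate` are hypothetical names used only for illustration.

```go
package main

import "fmt"

// gpuInfo is a hypothetical, reduced stand-in for the driver's GPU info.
type gpuInfo struct {
	name       string
	migEnabled bool
}

// discoverFunc is a hypothetical per-GPU MIG discovery callback.
type discoverFunc func(g gpuInfo) []string

// enumerate checks MIG mode first and only runs discovery for GPUs
// that have MIG mode enabled.
func enumerate(gpus []gpuInfo, discover discoverFunc) []string {
	var names []string
	for _, g := range gpus {
		if !g.migEnabled {
			// Non-MIG GPU: advertise the physical device, skip discovery.
			names = append(names, g.name)
			continue
		}
		names = append(names, discover(g)...)
	}
	return names
}

func main() {
	calls := 0
	discover := func(g gpuInfo) []string {
		calls++
		return []string{g.name + "-mig-0"}
	}
	names := enumerate([]gpuInfo{{name: "gpu-0"}, {name: "gpu-1", migEnabled: true}}, discover)
	fmt.Println(names, "discovery calls:", calls)
}
```

As noted in the reply above, this ordering saves work but does not change which devices end up advertised.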

@jgehrcke jgehrcke merged commit b457aa2 into NVIDIA:main Dec 8, 2025
16 checks passed
@github-project-automation github-project-automation bot moved this from In Progress to Closed in Planning Board: k8s-dra-driver-gpu Dec 8, 2025
@jgehrcke
Collaborator Author

jgehrcke commented Dec 8, 2025

/cherry-pick release-25.8

@github-actions

github-actions bot commented Dec 8, 2025

🤖 Backport PR created for release-25.8: #779 ⚠️ (has conflicts)

Labels

bug Issue/PR to expose/discuss/fix a bug cherry-pick/release-25.8

Projects

Development

Successfully merging this pull request may close these issues.

MIG-partitioned Nodes also have whole GPUs advertised in ResourceSlice

3 participants