Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
24 changes: 18 additions & 6 deletions cmd/gpu-kubelet-plugin/nvlib.go
Original file line number Diff line number Diff line change
Expand Up @@ -144,21 +144,33 @@ func (l deviceLib) enumerateGpusAndMigDevices(config *Config) (AllocatableDevice
return fmt.Errorf("error getting info for GPU %d: %w", i, err)
}

deviceInfo := &AllocatableDevice{
parentdev := &AllocatableDevice{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need this unless the gpu does not have MIG enabled. Does it make sense to only instantate this in that return path? (not a blocker though).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree!

(not doing this here; this section will change quite a bit again in upcoming patches; in this PR it's OK to just do something that gets the test to pass)

Gpu: gpuInfo,
}
devices[gpuInfo.CanonicalName()] = deviceInfo

migs, err := l.discoverMigDevicesByGPU(gpuInfo)
migdevs, err := l.discoverMigDevicesByGPU(gpuInfo)
if err != nil {
return fmt.Errorf("error discovering MIG devices for GPU %q: %w", gpuInfo.CanonicalName(), err)
}
if featuregates.Enabled(featuregates.PassthroughSupport) {
// If no MIG devices are found, allow VFIO devices.
gpuInfo.vfioEnabled = len(migs) == 0
gpuInfo.vfioEnabled = len(migdevs) == 0
}
for _, migDeviceInfo := range migs {
devices[migDeviceInfo.CanonicalName()] = migDeviceInfo

if !gpuInfo.migEnabled {
klog.Infof("Adding device %s to allocatable devices", gpuInfo.CanonicalName())
devices[gpuInfo.CanonicalName()] = parentdev
return nil
}
Comment on lines +160 to +164
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we not check this BEFORE we start iterating mig devices?

Copy link
Collaborator Author

@jgehrcke jgehrcke Dec 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean before

l.discoverMigDevicesByGPU()

?

Yes, that can make sense. (not required towards correct behavior, though, ack?)

Again, not doing that here -- this section will change quite a bit again in upcoming patches.


// Likely unintentionally stranded capacity (misconfiguration).
if len(migdevs) == 0 {
klog.Warningf("Physical GPU %s has MIG mode enabled but no configured MIG devices", gpuInfo.CanonicalName())
}

for _, mdev := range migdevs {
klog.Infof("Adding MIG device %s to allocatable devices (parent: %s)", mdev.CanonicalName(), gpuInfo.CanonicalName())
devices[mdev.CanonicalName()] = mdev
}

return nil
Expand Down
8 changes: 5 additions & 3 deletions tests/bats/test_gpu_mig.bats
Original file line number Diff line number Diff line change
Expand Up @@ -28,8 +28,6 @@ bats::on_failure() {
@test "static MIG: mutual exclusivity with physical GPU" {
mig_ensure_teardown_on_all_nodes

skip "expected to fail as of issue 719"

# (Re)install, also to refresh ResourceSlice objects.
local _iargs=("--set" "logVerbosity=6")
iupgrade_wait "${TEST_CHART_REPO}" "${TEST_CHART_VERSION}" _iargs
Expand Down Expand Up @@ -85,13 +83,17 @@ bats::on_failure() {
local dev_count_after=$(kubectl get resourceslices.resource.k8s.io -o yaml "$rsname" | yq '.spec.devices | length')
echo "devices announced (after): ${dev_count_after}"

# This detects
# The following check detects the bug described in
# https://github.com/NVIDIA/k8s-dra-driver-gpu/issues/719
if [ $dev_count_before != $dev_count_after ]; then
echo "the number of announced devices must stay the same"
return 1
fi

# Success: with one MIG device being announced and the overall number of
# devices being the same as before we now understand that one parent GPU is
# _not_ announced anymore (we can enhance precision by comparing UUID sets, if
# ever necessary).
mig_ensure_teardown_on_all_nodes
}