Add reboot testing for DSS (New) #2118

motjuste · 2025-09-18T13:23:10Z

Description

Note

This PR is best merged / reviewed after #2112

This PR adds tests for DSS across a machine reboot to the contributed Checkbox-DSS provider. Most of the additions are sibling jobs for existing jobs, plus an expanded test plan that orders their execution. Template jobs were switched to the Jinja2 template engine so that siblings could be added.

The test plan has been expanded so that the following steps run in order:

1. All existing jobs are executed, except purging DSS.
2. New long-living notebooks are created and verified, but not removed.
3. The machine is rebooted.
4. The test plan waits for the cluster to come back up and settle down.
5. The long-living notebooks are restarted and verified again, then removed.
6. All existing jobs are executed again, including purging DSS.

The tests now ensure that notebooks created in DSS are usable across reboots, irrespective of the Kubernetes cluster underneath.

We do, however, need to wait for the cluster to come back up after the reboot. We first wait until the cluster is accessible using kubectl. Then, depending on which GPU plugins are expected to be installed (based on which GPUs are available), we also wait for the respective Kubernetes daemonsets to finish restarting. Because of potential race conditions while the cluster is still restarting, we pad the checks with sleep calls in a few places.
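The two-stage wait described above could be sketched roughly as follows. This is not the code from this PR: the function names, the retry counts, the delays, and the `(namespace, name)` daemonset pairs are all assumptions made for illustration. Only `kubectl get nodes` and `kubectl rollout status` are standard kubectl commands.

```python
import subprocess
import time
from typing import Callable, Iterable, Sequence, Tuple


def run_quiet(cmd: Sequence[str]) -> int:
    """Run a command, discard its output, and return its exit code."""
    try:
        return subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        ).returncode
    except FileNotFoundError:
        return 127  # command not installed; treat as a failed attempt


def wait_until_ok(
    cmd: Sequence[str],
    attempts: int = 30,  # hypothetical retry budget
    delay: float = 10.0,  # hypothetical padding between attempts, in seconds
    run: Callable[[Sequence[str]], int] = run_quiet,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Retry cmd until it exits with 0 or the attempts are used up."""
    for attempt in range(attempts):
        if run(cmd) == 0:
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False


def wait_for_cluster(
    daemonsets: Iterable[Tuple[str, str]],
    run: Callable[[Sequence[str]], int] = run_quiet,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Wait for the API server, then for each expected GPU plugin daemonset."""
    # Stage 1: the cluster answers kubectl at all.
    if not wait_until_ok(["kubectl", "get", "nodes"], run=run, sleep=sleep):
        return False
    # Stage 2: every expected daemonset has finished rolling out.
    for namespace, name in daemonsets:
        cmd = [
            "kubectl", "rollout", "status", f"daemonset/{name}",
            "--namespace", namespace, "--timeout=60s",
        ]
        if not wait_until_ok(cmd, run=run, sleep=sleep):
            return False
    return True
```

Passing `run` and `sleep` as parameters keeps the retry logic testable without a live cluster. Which daemonsets to wait for would be decided by the caller from the GPUs detected on the machine.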

Furthermore, the long-living notebooks that were left running before the reboot need to be restarted using the dss stop and dss start commands. This is another way of handling the race condition while the cluster settles down. In this case, the notebook pods happen to get recreated before the NVIDIA GPU operator is properly back up, i.e. they are recreated without access to the NVIDIA GPU. This behaviour of the notebooks is correct: they do not require an NVIDIA GPU to run, but can use the GPU when the cluster can. Simply restarting the pods after the NVIDIA GPU operator is deployed is enough.
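A minimal sketch of that restart step, assuming the notebook name is passed as a positional argument to `dss stop` and `dss start`. The helper names, the pause length, and the notebook names in the usage note are hypothetical and are not taken from this PR.

```python
import subprocess
import time
from typing import Callable, Iterable, Sequence


def run_checked(cmd: Sequence[str]) -> None:
    """Run a command and raise CalledProcessError if it fails."""
    subprocess.run(cmd, check=True)


def restart_notebooks(
    names: Iterable[str],
    pause: float = 5.0,  # hypothetical short sleep around stop and start
    run: Callable[[Sequence[str]], None] = run_checked,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Stop and start each notebook so its pod is recreated with GPU access."""
    for name in names:
        # Assumption: the DSS CLI takes the notebook name positionally.
        run(["dss", "stop", name])
        sleep(pause)  # give the cluster a moment to tear the pod down
        run(["dss", "start", name])
        sleep(pause)  # give the new pod a moment before it is verified
```

For example, `restart_notebooks(["nb-one", "nb-two"])` would issue a stop and a start for each notebook in turn. This should only be called once the GPU operator is back up, so that the recreated pods can see the GPU.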

Resolved issues

CHECKBOX-1897.

Documentation

No changes to Checkbox's documentation. The usage of this provider does not change either.

Tests

Full GH workflow run on Canonical K8s: https://github.com/canonical/checkbox/actions/runs/17830308504
Full GH workflow run on Microk8s: https://github.com/canonical/checkbox/actions/runs/17830230779

motjuste added 10 commits September 15, 2025 10:08

Add stub reboot jobs

e2e4fa2

stub for now for testing

Add sibling jobs for verifying DSS status

96c11b2

Migrate template jobs to jinja2 engine

2833e66

Add sibling after reboot jobs for new notebooks

5aa49a1

Rename siblings and explicitly order test plan

aa13108

Using _after_reboot as suffix makes it get picked up earlier than wanted. We control the ordering of the before reboot and after reboot jobs explicitly now.

Rename notebooks created after reboot

3f20ed1

easier to parse it when it is part of the prefix

Add jobs for long-living notebooks

4c85ecf

These notebooks will live across the reboot

Implement proper rebooting and waiting

688f387

We need to wait for the GPU plugins to load as well, for which now we use kubectl

Add smaller sleeps around start stop to restart nb

1bb2bff

Retry for some time to get cluster status

e3df09e

motjuste marked this pull request as ready for review September 19, 2025 06:55

motjuste requested a review from a team as a code owner September 19, 2025 06:55

motjuste assigned fernando79513 Sep 19, 2025
