@motjuste motjuste commented Sep 18, 2025

Description

Note

This PR is best merged / reviewed after #2112

This PR adds tests for DSS across a machine reboot to the contributed Checkbox-DSS provider. Most of the additions are sibling jobs for existing jobs, plus an expanded test plan that orders their execution. Template jobs were switched to the Jinja2 template engine so that siblings can be added.

The test plan has been expanded so that jobs run in the following order:

  1. All existing jobs except purging DSS.
  2. New long-living notebooks are created and verified, but not removed.
  3. Machine is rebooted.
  4. Wait for the cluster to come back up and settle down.
  5. The long-living notebooks are restarted and verified again, then removed.
  6. All existing jobs are executed again, including purging DSS.

The tests now ensure that notebooks created in DSS are usable across reboots, irrespective of the Kubernetes cluster underneath.

We do need to wait for the cluster to come back up after the reboot. We first wait for the cluster to be accessible using kubectl. Then, depending on which GPU plugins are expected to have been installed (based on which GPUs are available), we also wait for the respective Kubernetes daemonsets to finish restarting. Because of potential race conditions while the cluster is still restarting, we pad the checks with sleep calls in a few places.
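The waiting logic described above can be sketched roughly as follows. This is an illustrative Python sketch, not the provider's actual code: the helper names, the timeouts, the settle delay, and the daemonset namespace and name passed in by the caller are all assumptions. Only `kubectl get nodes` and `kubectl rollout status` are standard kubectl commands.

```python
import subprocess
import time


def wait_until(check, timeout_s=600.0, interval_s=10.0,
               sleep=time.sleep, clock=time.monotonic):
    """Poll check() until it returns True or timeout_s elapses."""
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


def kubectl_ok(args, run=subprocess.run):
    """Return True if `kubectl <args>` exits with status 0."""
    try:
        result = run(["kubectl", *args], capture_output=True, text=True)
    except OSError:
        # kubectl itself could not be executed
        return False
    return result.returncode == 0


def wait_for_cluster(daemonsets, run=subprocess.run, settle_s=30.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Wait for the API server, then for each (namespace, name) daemonset.

    The daemonset list is supplied by the caller; which GPU plugin
    daemonsets to expect is an assumption left to the caller here.
    """
    if not wait_until(lambda: kubectl_ok(["get", "nodes"], run),
                      sleep=sleep, clock=clock):
        return False
    sleep(settle_s)  # padding against races while the cluster settles
    for namespace, name in daemonsets:
        rollout = ["rollout", "status", f"daemonset/{name}",
                   "--namespace", namespace, "--timeout=300s"]
        if not kubectl_ok(rollout, run):
            return False
    return True
```

Passing `run`, `sleep`, and `clock` as parameters keeps the polling loop testable without a real cluster.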

Furthermore, the long-living notebooks that were left running before the reboot need to be restarted using the dss stop and dss start commands. This handles another race condition while the cluster settles down: the notebook pods can get recreated before the NVIDIA GPU operator is properly back up, i.e. without access to the NVIDIA GPU. The notebooks' behaviour is correct: they do not require an NVIDIA GPU to run, but can use the GPU if the cluster can. Simply restarting the pods after the NVIDIA GPU operator is deployed is enough.
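As a rough illustration of the restart step, the sketch below runs dss stop followed by dss start for each notebook. The helper names and the notebook names are hypothetical, and the sketch assumes the commands take the notebook name as their only argument; it does not check that a notebook has fully stopped before starting it again, which the real jobs may need to handle.

```python
import subprocess


def restart_notebook(name, run=subprocess.run):
    """Stop and then start one DSS notebook, raising if either step fails."""
    for action in ("stop", "start"):
        # check=True makes subprocess.run raise CalledProcessError on failure
        run(["dss", action, name], check=True)


def restart_notebooks(names, run=subprocess.run):
    """Restart each notebook in turn (the names are supplied by the caller)."""
    for name in names:
        restart_notebook(name, run)
```

This would be called only after the GPU operator is confirmed to be back, so that the recreated pods can see the GPU.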

Resolved issues

CHECKBOX-1897.

Documentation

No changes to Checkbox's documentation. The usage of this provider does not change either.

Tests

stub for now for testing
Using _after_reboot as suffix makes it get picked up earlier than
wanted.  We control the ordering of the before reboot and after reboot
jobs explicitly now.
easier to parse it when it is part of the prefix
These notebooks will live across the reboot
We need to wait for the GPU plugins to load as well, for which now we
use kubectl
@motjuste motjuste marked this pull request as ready for review September 19, 2025 06:55
@motjuste motjuste requested a review from a team as a code owner September 19, 2025 06:55