Add reboot testing for DSS (New) #2118
Description
Note: This PR is best reviewed and merged after #2112.
This PR adds testing of DSS across a machine reboot in the contributed Checkbox-DSS provider. Most of the additions are sibling jobs for existing jobs and an expanded test plan that orders their execution. Template jobs were switched to the Jinja2 template engine to enable adding siblings.
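As a rough illustration of the switch, a Jinja2-based Checkbox template unit looks something like the sketch below; the resource name, job ids, category, and command are invented placeholders and do not correspond to the actual jobs in this PR.

```
unit: template
template-resource: dss_gpu
template-engine: jinja2
template-unit: job
id: dss/validate-{{ model }}-notebook-after-reboot
category_id: dss
plugin: shell
depends: dss/validate-{{ model }}-notebook
command: check_notebook.sh {{ model }}
_summary: Validate the {{ model }} notebook after rebooting the machine
```

With the Jinja2 engine, resource fields are referenced as `{{ field }}` rather than Python-format `{field}` placeholders.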
The test plan has been expanded so that the following jobs run:
The tests now ensure that notebooks created in DSS are usable across reboots, irrespective of the Kubernetes cluster underneath.
But we do need to wait for the cluster to come back up after the reboot. For this, we first wait for the cluster to be reachable with `kubectl`, and then, depending on which GPU plugins are expected to have been installed (based on which GPUs are available), we also wait for the respective Kubernetes daemonsets to finish restarting. Due to potential race conditions while the cluster is still restarting, we pad the checks with `sleep` calls in a few places.
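A minimal sketch of such a wait is shown below; the namespace, daemonset name, and timeouts are placeholders for illustration, not the exact values used by the provider.

```bash
#!/bin/bash
set -e

# Wait until the Kubernetes API server responds again after the reboot.
until kubectl get nodes > /dev/null 2>&1; do
    sleep 10
done

# Give the cluster a moment to settle before checking workloads.
sleep 30

# If an NVIDIA GPU is present, wait for its plugin daemonset to finish rolling out.
# The namespace and daemonset name below are illustrative placeholders.
if lspci | grep -qi nvidia; then
    kubectl -n gpu-operator-resources rollout status \
        daemonset/nvidia-device-plugin-daemonset --timeout=600s
fi
```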
Furthermore, the long-lived notebooks that were left running before the reboot need to be restarted using the `dss stop` and `dss start` commands. This is another way to handle the "race condition" while the cluster settles down: the notebook pods get recreated before the NVIDIA GPU operator is properly back up, i.e. the notebook pods get recreated without access to the NVIDIA GPU. The behaviour of the notebooks is correct: they do not require an NVIDIA GPU to run, but can use the GPU when the cluster can provide it. Simply restarting the pods after the NVIDIA GPU operator is deployed is enough.
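For example, restarting a long-lived notebook once the GPU operator is back could look like the following sketch (the notebook name is a placeholder):

```bash
# Recreate the notebook pod now that the NVIDIA GPU operator is back up,
# so it is scheduled with access to the GPU again.
dss stop my-notebook
dss start my-notebook
```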
Resolved issues
CHECKBOX-1897.
Documentation
No changes to Checkbox's documentation. There is no change in how this provider is used either.
Tests