
Add DSS integration tests (New) #1787


Open
wants to merge 4 commits into main

Conversation

motjuste
Contributor

@motjuste motjuste commented Mar 11, 2025

Description

  • In the install-deps script, add installation of git and python3-venv so that the DSS integration tests can be set up and run from their repo.
  • Add a job to clone the data-science-stack repo and set up a venv with tox for running the tests.
  • Add a job to run the CPU integration tests from the repo.
  • Add a job to run the NVIDIA GPU integration tests from the repo if the NVIDIA GPU addon was successfully enabled and validated in the K8s cluster (a sketch of these jobs follows this list).
  • Fix the script name for checking the NVIDIA GPU rollout.
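
As a rough illustration only, these jobs could boil down to shell commands along the following lines. The repo URL, clone path, and the CPU/GPU test selectors are assumptions rather than details taken from this PR; the tox environment name follows the tox -e integration invocation mentioned later in this thread.

```bash
#!/bin/bash
# Hypothetical sketch of the setup and test-run jobs described above.
# Repo URL, clone path, and the -k selectors are assumptions.
set -euo pipefail

DSS_REPO_DIR="${DSS_REPO_DIR:-$HOME/data-science-stack}"

# Job 1: clone the DSS repo and set up a venv with tox
# (this is why git and python3-venv are added to install-deps).
git clone https://github.com/canonical/data-science-stack "$DSS_REPO_DIR"
python3 -m venv "$DSS_REPO_DIR/.venv"
"$DSS_REPO_DIR/.venv/bin/pip" install tox

# Job 2: run the CPU integration tests from the repo.
cd "$DSS_REPO_DIR"
.venv/bin/tox -e integration -- -k "not gpu"

# Job 3: run the NVIDIA GPU integration tests; only reached if the NVIDIA
# GPU addon was enabled and validated in the K8s cluster by earlier jobs.
.venv/bin/tox -e integration -- -k "gpu"
```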

TODO

  • Parse the commit hash from the DSS snap to determine which version of the repo to check out

Resolved issues

Documentation

No changes to Checkbox documentation.

Tests

No new unit tests are added. A full run of the checkbox-provider in Testflinger can be found here.

The first job sets up the repo and venv, then one job runs the CPU integration tests, and the next runs the GPU integration tests if the NVIDIA GPU was enabled successfully.

codecov bot commented Mar 11, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 49.83%. Comparing base (da6b4e9) to head (272c394).
Report is 4 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1787   +/-   ##
=======================================
  Coverage   49.83%   49.83%           
=======================================
  Files         377      377           
  Lines       40719    40719           
  Branches     6851     6851           
=======================================
  Hits        20294    20294           
  Misses      19700    19700           
  Partials      725      725           
Flag Coverage Δ
provider-dss ∅ <ø> (∅)

Flags with carried forward coverage won't be shown.


@motjuste motjuste marked this pull request as ready for review March 11, 2025 11:42
@motjuste motjuste requested a review from a team as a code owner March 11, 2025 11:42
@motjuste motjuste requested a review from fernando79513 March 11, 2025 11:43
@motjuste
Contributor Author

@fernando79513, is there anything still holding up this PR? I am unfortunately sick today, but I hope to be able to look at fixing any issues tomorrow.

@fernando79513 fernando79513 self-assigned this Mar 13, 2025
@fernando79513
Collaborator

Hey @motjuste, what's the exact purpose of this PR?
I thought the objective of the checkbox-dss snap was to test the dss snap and make sure it could install/run on devices. I don't see how cloning the repo and installing it locally fits into that.
This also raises more questions:
What if the version of dss in the snap is not the same as what's in main in the repo? How do you ensure that the version you are testing in the snap is the same as the one we are cloning from the repo?

@motjuste
Contributor Author

In this PR, we clone the DSS repo and only run their integration tests. The integration tests expect that the DSS snap is installed and that a Kubernetes cluster is set up. We do not install DSS from this repo. I am sorry for the confusion.

@fernando79513
Collaborator

You are right, the DSS integration tests are basically calling the DSS snap in different scenarios. So what's the difference between these tests and the ones we are already running?
I think that still does not solve the versioning issue. Imagine we are testing the DSS snap on beta, but they include a new feature on main and add integration tests for it. We would be running integration tests for a feature that's not yet in beta.
If we really want to use the integration tests they define in their repo, we would need the same version of the repo that the DSS snap was created from, and I don't think that's trivial.
If they want to use checkbox-dss in the future to run their integration tests, that would be great, but I think these tests should not be in the same test plan as the rest.

Co-authored-by: Fernando Bravo <[email protected]>
@motjuste
Contributor Author

motjuste commented Mar 14, 2025

To be honest, our tests at the moment are more extensive than what the DSS integration tests cover. For example, the DSS integration tests currently do not cover Intel GPUs at all, but we do.

I am following a spec, KF114, that @deusebio et al. created for me. It does make sense to run the integration tests from the product team, especially since they will be in the best position to keep those tests up to date, e.g. with more granular tests that don't make sense to add to this Checkbox provider.

There's one pattern we can use for handling a potential version mismatch:

  • The repo with the integration tests has different branches for the versions of the product they can test. Here that repo would be the DSS one.
  • This branch is provided as an argument to the test runner. Here this would become an argument to the Checkbox provider.

The DSS repo currently doesn't provide separate branches. I believe the default main branch covers all currently published versions of the dss snap. If there is a divergence in the future, i.e. if tests start to differ for different versions of dss, the DSS team can start following the pattern I describe and we can update the provider accordingly.

@deusebio : please provide any comments.

@fernando79513 fernando79513 added the waiting-for-changes The review has been completed but the PR is waiting for changes from the author label Mar 19, 2025
@motjuste
Contributor Author

I had a discussion with @deusebio from the DSS team, and he was indeed in favour of adding these tests to the Checkbox provider. The DSS product team will be able to add more specific tests in the future without having to update this Checkbox provider.

To avoid version mismatch, there are two options:

  1. The CLI for running the Checkbox tests will accept an optional argument to choose which branch of the DSS repo to use for running the integration tests, defaulting to main.
  2. The Checkbox runner can parse the exact hash of the commit in the DSS repo from which the snap was created, and use that (see the sketch after this list).
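
For illustration, a minimal shell sketch of what either option could look like; the DSS_REPO_REF and DSS_COMMIT_HASH variable names, the repo URL, and the clone path are hypothetical, not taken from this PR.

```bash
#!/bin/bash
# Hypothetical sketch of the two options above; names and paths are illustrative.
set -euo pipefail

DSS_REPO_URL="https://github.com/canonical/data-science-stack"
DSS_REPO_DIR="$HOME/data-science-stack"

if [ -n "${DSS_COMMIT_HASH:-}" ]; then
    # Option 2: check out the exact commit the snap was built from.
    # `git clone --branch` only accepts branches/tags, so clone then checkout.
    git clone "$DSS_REPO_URL" "$DSS_REPO_DIR"
    git -C "$DSS_REPO_DIR" checkout "$DSS_COMMIT_HASH"
else
    # Option 1: an optional branch argument, defaulting to main.
    git clone --branch "${DSS_REPO_REF:-main}" --depth 1 "$DSS_REPO_URL" "$DSS_REPO_DIR"
fi
```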

@deusebio
Collaborator

deusebio commented Apr 17, 2025

Yes, @motjuste, thank you for the summary. I'm definitely in favor of externalizing the UAT tests into a repository managed by the product team, so that we can keep updating the tests and also implement new ones without the Checkbox workflow having to be modified.

The versioning mismatch can be fixed in either of the ways you propose. My preference would possibly be option 2, given that for DSS everything is self-contained in one repository. Charmed Kubeflow is a bit more complex (it is made up of multiple repos/charms/services) and we have a dedicated repo for UATs, hence the need to use branches, but this is not required here.

The commit hash could be parsed from the version (e.g. the version of DSS is 0.1-8742e6d3c0a5450c6dbc4ea3788a, where the part after "-" is the commit hash). Otherwise we could also store this info in DSS and provide an API, say dss version, to retrieve it. We can start with just parsing the version if that is simple enough, and move to the dss version entrypoint if needed in the future.
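
As a sketch only: assuming the version string is available, extracting the part after the first "-" is a one-liner in shell. How the version string is obtained (here via snap list) and the exact snap name are assumptions.

```bash
#!/bin/bash
# Sketch of parsing the commit hash from a DSS version string such as
# "0.1-8742e6d3c0a5450c6dbc4ea3788a". The snap name and the use of `snap list`
# to obtain the version are assumptions.
set -euo pipefail

DSS_VERSION="$(snap list data-science-stack | awk 'NR==2 {print $2}')"
DSS_COMMIT_HASH="${DSS_VERSION#*-}"   # strip everything up to the first "-"
echo "DSS repo commit to check out: $DSS_COMMIT_HASH"
```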

@motjuste
Contributor Author

@fernando79513 ... I am looking into parsing the commit hash ... but feel free to add any other comments you may have in the meantime

@fernando79513
Collaborator

I still don't think this should be a checkbox test.

The idea of having these tests in checkbox (and eventually on Test Observer) is to have control over some of the core functionalities of the DSS module across different versions.
That's why we have in Test Observer a way to report which tests have failed in previous versions.
If these tests are modified elsewhere, we won't have a way to know if a new test has been added or removed.

If you really want to use checkbox to run these integration tests, I would have a separate Testflinger job that does all the setup and calls tox -e integration, but all this setup should not be in a checkbox test IMO.
I don't like at all the idea of cloning repos and setting up the environment in a checkbox test.

And these integration tests should not be included in the dss-validation test plan.
What do you think @pieqq @Hook25?

@deusebio
Collaborator

My understanding was that the PR here was to provide the framework to enable testing, but use-case definition and implementation would still be on the product team. From your answer I gather this aims to be "independent" testing and you would own both the definition and the implementation of the tests. I don't have particular concerns with this, and it is definitely your call to make whether externalizing tests is appropriate, but the latitude on the support we can provide for evolving those tests will obviously be more limited (probably for good reasons, since they are independent tests and the product team has limited knowledge of the checkbox framework).

@fernando79513
Collaborator

I don't know who is going to be in charge of testing/implementing the tests, but I don't think having this in a checkbox test is a good approach with the current reporting system we have in place.

I see the point of running the integration tests on a set of devices, since these will probably be more up to date, but I think the setup should be done in a separate script (not in a checkbox test), and it should probably all be included in a separate Testflinger job.
But we should not include that inside the dss-validation test plan (which in my opinion should be tied to a specific dss snap revision).

@mz2
Collaborator

mz2 commented Apr 24, 2025

For the sake of reproducible test results, pulling any sort of floating test definition (one that changes depending on when a test job is run) into a Checkbox job indeed doesn't seem like a good idea. If the tests are executable in some other context and there's a reason to reference them as a git ref (a tag, or a precise commit hash) for example, I can understand that. But since Checkbox is how testing on real hardware in the labs is conducted, and since testing on real hardware seems essential for the relevant use cases, @deusebio, it seems reasonable to expect that your team would become familiar enough with this test tool to help maintain the coverage.

Happy to discuss in a meeting before or at the sprint if you do not find consensus otherwise.

@deusebio
Collaborator

I feel there are a few different points of discussion we should unpack:

  1. using floating references (e.g. branches) vs pinned commit hashes
  2. ownership of the tests and the process for defining, implementing and maintaining them
  3. checkbox requirements for snaps/charms/rocks (who is mandating the requirement and whether there is any specification for this)

Yes, I believe having a dedicated meeting in Frankfurt would definitely be very helpful to unpack those properly. I would prefer during the roadmap week if possible, since the engineering sprint is looking more hectic. Would the relevant stakeholders be there in the first week already? I'm sure we will be able to understand the constraints and the latitude for compromise we have. As a general point, I would like to centralize the business logic for user testing in one place, as re-implementing the same tests in different frameworks or repositories would substantially increase the burden of maintenance and of keeping things in sync, leaving aside the current skill gap. That's really my main concern and point here.

@mz2
Collaborator

mz2 commented Apr 24, 2025

Agreed on all of that, let's resolve at the sprint and conclude back here (product week is indeed the better option for us too).
