Skip to content

Conversation

hector-cao
Copy link
Collaborator

@hector-cao hector-cao commented Mar 14, 2025

Description

This MR adds tests for Intel Intel QuickAssist Technology (QAT) devices. QAT is an accelerator for crypto/compression operations, available on Intel server-class hardware equipped with XEON processors.

Resolved issues

Documentation

Tests

I run these tests on Intel hardwares with/without QAT devices.
On hardware without QAT support, no test will be run (just 1 test is skipped, no tests are generated from the job templates).

On a machine with 402xx QAT device (402xx driver), here is the sample output of the results:

==================================[ Results ]===================================
 ☑ : Hardware Manifest
 ☑ : Detect Intel QuickAssist Technology device (>= Gen4)
 ☑ : Check PF sysfs for 01:00.0
 ☑ : Check PF sysfs for 0b:00.0
 ☑ : Check PF sysfs for 81:00.0
 ☑ : Check PF sysfs for 8b:00.0
 ☑ : Check SR-IOV support 01:00.0
 ☑ : Check SR-IOV support 0b:00.0
 ☑ : Check SR-IOV support 81:00.0
 ☑ : Check SR-IOV support 8b:00.0
 ☑ : Check telemetry data in debugfs for 01:00.0
 ☑ : Check telemetry data in debugfs for 0b:00.0
 ☑ : Check telemetry data in debugfs for 81:00.0
 ☑ : Check telemetry data in debugfs for 8b:00.0
 ☑ : Check VFIO-PCI support 01:00.0
 ☑ : Check VFIO-PCI support 0b:00.0
 ☑ : Check VFIO-PCI support 81:00.0
 ☑ : Check VFIO-PCI support 8b:00.0
 ☑ : Bring up and down device 01:00.0
 ☑ : Bring up and down device 0b:00.0
 ☑ : Bring up and down device 81:00.0
 ☑ : Bring up and down device 8b:00.0
 ☑ : Attach devices list
 ☑ : Collect information about installed software packages
 ☑ : Run CPA symmetric crypto tests
 ☑ : Run CPA RSA tests
 ☑ : Run compression tests
 ☑ : Run CPA symmetric crypto tests (in standalone mode)
 ☑ : Run CPA RSA tests (in standalone mode)
 ☑ : Run compression tests (in standalone mode)

@hector-cao hector-cao force-pushed the dev-add-qat-tests branch 2 times, most recently from c762ecf to 6cacc6d Compare March 14, 2025 16:09
@hector-cao hector-cao changed the title providers/base : add tests for Intel QAT providers/base : add tests for Intel QAT (New) Mar 14, 2025
@hector-cao hector-cao force-pushed the dev-add-qat-tests branch 2 times, most recently from 076bda5 to 5650fda Compare March 14, 2025 16:19
@bladernr
Copy link
Collaborator

I made a couple inline comments... I'm interested in including this in the server suite. Mostly my questions are around the dependency on the detect job, which simply checks that someone ticked Y on a manifest entry (server suite doesn't use manifests, it uses resources).

Would it be reasonable for the detect job to pass if either condition is there (e.g. the resource job returns a pf: present or whatever if the QAT PFs are detected by qatctl.py, OR if the manifest is true)?

and add that extra bit to the resource similar to other resource constraints in use? so... something like this for the detect job:

requires: manifest.has_intel_qat == 'True' or pf.present == 'True'

And keep the rest of it as-is with the dependency on the detect job?

@hector-cao hector-cao force-pushed the dev-add-qat-tests branch 5 times, most recently from 8432a90 to a889a1d Compare March 17, 2025 13:08
@hector-cao
Copy link
Collaborator Author

I made a couple inline comments... I'm interested in including this in the server suite. Mostly my questions are around the dependency on the detect job, which simply checks that someone ticked Y on a manifest entry (server suite doesn't use manifests, it uses resources).

Would it be reasonable for the detect job to pass if either condition is there (e.g. the resource job returns a pf: present or whatever if the QAT PFs are detected by qatctl.py, OR if the manifest is true)?

and add that extra bit to the resource similar to other resource constraints in use? so... something like this for the detect job:

requires: manifest.has_intel_qat == 'True' or pf.present == 'True'

And keep the rest of it as-is with the dependency on the detect job?

Thanks @bladernr for your feedback !

Based on your comments, I re-designed the tests, please take a look and let me know things you would want to be improved.

@hector-cao hector-cao force-pushed the dev-add-qat-tests branch 2 times, most recently from d462f2b to 0561f5c Compare March 17, 2025 15:36
@pieqq
Copy link
Collaborator

pieqq commented May 12, 2025

@bladernr I just saw your previous comment :

Mostly my questions are around the dependency on the detect job, which simply checks that someone ticked Y on a manifest entry (server suite doesn't use manifests, it uses resources).

Manifest entries can be entered manually when running Checkbox, but they can also be pre-filled (it's just a filke stored in /var/tmp/checkbox-ng/machine-manifest.json).

Also, manifest entries and resources do not serve exactly the same purpose. With a manifest, you tell Checkbox that this device does have a given piece of hardware, or a specific feature enabled. Resources retrieve the information automatically from the system, which may lead in jobs being skipped if the resource script fails, or if the driver is not properly loaded and therefore the feature is not exposed to the user.

A typical example is has_wlan_adapter, which tells Checkbox whether or not the device has WiFi. The wireless/detect test will then try to find a wireless interface only if this manifest is set to True. This allows to catch situations where we know a device should have WiFi, but the driver failed to load.

The tutorial has a whole page about how manifests work, you can check it out.

Copy link
Collaborator

@pieqq pieqq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the long time to provide feedback! Please check my comments to see if they make sense.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you provide unit tests for this file?

Comment on lines +24 to +28
unit: template
template-resource: qat
template-engine: jinja2
template-unit: job
id: intel-qat-common/{{ available }}-attach-devices
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
unit: template
template-resource: qat
template-engine: jinja2
template-unit: job
id: intel-qat-common/{{ available }}-attach-devices
unit: template
template-resource: qat
template-unit: job
id: intel-qat-common/{available}-attach-devices

Template jobs use python string formatting by default. I don't think jinja2 is needed here (nor in any of the following template jobs in this file).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks ! you are right but I tend to use jinja as the default every time I have to deal with template job, I do not see any strong reason to switch to a less powerful template engine, can we consider using it even if it is not mandatory right now ?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We actually have a plan to remove the jinja2 templating option in the future, because it creates more problems than it solves.

In your template job, the only use of templates are to replace variables, so it's really a matter of replacing {{ variable }} with {variable}.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P.S.: you're not the only one to get tricked into using jinja2 template engine, so we created a document with a list of dos and don'ts about Checkbox test authoring, and one section covers this:

https://github.com/canonical/checkbox/tree/main/providers#checkbox-templates

Comment on lines +137 to +138
package.name == 'qatlib-examples'
package.name == 'qatlib-service'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How are these package made available on the system? Is it a package to pull from the official repos? Is it something that needs building?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These packages are available in the archive (universe) from Ubuntu 24.04 onwards

# switch all devices to crypto sym mode
printf "POLICY=0\nServicesEnabled=sym\n" | tee /etc/sysconfig/qat
systemctl restart qat
cpa_sample_code runTests=1
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is cpa_sample_code made available? (I assume it's part of the packages mentioned above?)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This tool is part of the qatlib-examples package

Comment on lines 104 to 110
rmmod vfio-pci || true
nb_vfio=$(qatctl.py status --devices {{ pf }} --vfio | wc -l)
[ "$nb_vfio" -le 0 ] || (echo "nb vfio devices should be <= 0" && exit 1)
# we have to pass the VF device ids
modprobe vfio-pci ids=8086:4941,8086:4943,8086:4945,8086:4947
nb_vfio=$(qatctl.py status --devices {{ pf }} --vfio | wc -l)
[ "$nb_vfio" -gt 0 ] || (echo "nb vfio devices should be > 0" && exit 1)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Several suggestions (some apply to other jobs in this PR too):

  1. Put these into a separate bash script in bin/, and start it with set -e so that it fails as soon as possible if anything goes wrong.
  2. a check flag could be added to the qatctl.py script to avoid relying on additional bash commands (such as [ "$nb_vfio" -le 0 ] || (echo "nb vfio devices should be <= 0" && exit 1))
  3. maybe this test could be split in 2 (the second test would depend on the first):
  • check that no VFIO files are present if the vfio-pci module is removed
  • check that VFIO are there when the module is reloaded
  1. if something goes wrong before you reload the vfio-pci module, all the tests running after will fail, so you probably have to make sure the modules are reloaded regardless of the outcome.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 - I would agree to put the command steps in a bash script if it can be reused some where else (via script arguments) or the command section is quite long, but in this case, there are only 6 steps, I m wondering if it is a good idea,

2 - Good idea !

3 - I m not totally convinced by this suggestion, even if we split this test into 2 tests, I don't think we want the second one to depend on the first one because that means the first one will unload the vfio_pci module and the second one will load it. In that case, can checkbox make sure that there is no test that is interleaved between those two tests and this test might need the vfio_pci to be loaded ?

Copy link
Collaborator

@pieqq pieqq Sep 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. It's up to you. However, adding set -e is required, otherwise you may run things after failed commands, which is probably not desired.
  2. 👍
  3. You're right, it's probably better to keep it coupled then.

Comment on lines +54 to +64
for pf in ${PFS}; do
driver_path=$(readlink /sys/bus/pci/devices/0000:"${pf}"/driver)
driver=$(basename "${driver_path}")
if [ "${driver}" == "4xxx" ] || [ "${driver}" == "420xx" ]; then
echo "pf: ${pf}"
echo "driver: ${driver}"
echo "available: qat"
echo ""
break
fi
done
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should probably be put into the python script directly (maybe as a qatctl.py resource command). Easier to test and to maintain.

# Hector Cao <[email protected]>

unit: job
id: qat_pf
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This resource job looks like it's doing exactly the same thing as qat below, except it doesn't add the available field. Consider removing this and rely on qat only.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is quite subtle but the 2 resources are not doing the samething, the qat resource only output the first QAT PF (physical device) (see the break in the control loop)

Copy link

codecov bot commented Jul 25, 2025

Codecov Report

❌ Patch coverage is 88.92128% with 38 lines in your changes missing coverage. Please review.
✅ Project coverage is 51.95%. Comparing base (30f498f) to head (a71d9a7).
⚠️ Report is 22 commits behind head on main.

Files with missing lines Patch % Lines
providers/base/bin/qatctl.py 88.92% 27 Missing and 11 partials ⚠️

❌ Your patch check has failed because the patch coverage (88.92%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1795      +/-   ##
==========================================
+ Coverage   51.66%   51.95%   +0.29%     
==========================================
  Files         386      387       +1     
  Lines       41492    41830     +338     
  Branches     7711     7714       +3     
==========================================
+ Hits        21435    21732     +297     
- Misses      19293    19346      +53     
+ Partials      764      752      -12     
Flag Coverage Δ
provider-base 28.23% <88.92%> (+1.32%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@hector-cao hector-cao force-pushed the dev-add-qat-tests branch 3 times, most recently from 9eb7d51 to 3e4b307 Compare July 31, 2025 14:45
@pieqq
Copy link
Collaborator

pieqq commented Sep 26, 2025

@hector-cao can you check the tox error?

FAIL: test_qat_dev_manager_list_devices (test_qatctl.TestQatDevManager)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/ubuntu/actions-runner/_work/_tool/Python/3.6.15/x64/lib/python3.6/unittest/mock.py", line 1183, in patched
    return func(*args, **keywargs)
  File "/home/ubuntu/actions-runner/_work/checkbox/checkbox/providers/base/tests/test_qatctl.py", line 388, in test_qat_dev_manager_list_devices
    self.assertIn("00:00.0 4xxx", printed_lines)
AssertionError: '00:00.0 4xxx' not found in ['args', 'args', 'args', 'args']

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants