Commit a3b4047

docs: add a README for the DRA prototype.

Signed-off-by: Krisztian Litkey <[email protected]>

1 parent f0c792b

File tree: 3 files changed, +258 −1 lines changed

README-DRA-driver-proto.md

Lines changed: 254 additions & 0 deletions
# Prototyping CPU DRA device abstraction / DRA-based CPU allocation

## Background

This prototype patch set bolts a DRA allocation frontend on top of the existing
topology-aware resource policy plugin. The main intention of this patch set
is to

- provide something practical to play around with for the [feasibility study](https://docs.google.com/document/d/1Tb_dC60YVCBr7cNYWuVLddUUTMcNoIt3zjd5-8rgug0/edit?tab=t.0#heading=h.iutbebngx80e) of enabling DRA-based CPU allocation,
- allow (relatively) easy experimentation with how to expose CPUs as DRA
  devices (IOW, test various CPU DRA attributes),
- allow testing how DRA-based CPU allocation (using non-trivial CEL
  expressions) would scale with cluster and cluster node size.

## Notes

This patched NRI plugin, especially in its current state and form, is
*not a proposal* for a first real DRA-based CPU driver.

## Prerequisites for Testing

To test this out in a cluster, make sure you have

1. DRA enabled in your cluster

   One way to ensure it is to bootstrap your cluster using an InitConfiguration
   with the following bits set:

   ```yaml
   apiVersion: kubeadm.k8s.io/v1beta4
   kind: InitConfiguration
   ...
   ---
   apiVersion: kubeadm.k8s.io/v1beta4
   kind: ClusterConfiguration
   apiServer:
     extraArgs:
       - name: feature-gates
         value: DynamicResourceAllocation=true,DRADeviceTaints=true,DRAAdminAccess=true,DRAPrioritizedList=true,DRAPartitionableDevices=true,DRAResourceClaimDeviceStatus=true
       - name: runtime-config
         value: resource.k8s.io/v1beta2=true,resource.k8s.io/v1beta1=true,resource.k8s.io/v1alpha3=true
   controllerManager:
     extraArgs:
       - name: feature-gates
         value: DynamicResourceAllocation=true,DRADeviceTaints=true
   scheduler:
     extraArgs:
       - name: feature-gates
         value: DynamicResourceAllocation=true,DRADeviceTaints=true,DRAAdminAccess=true,DRAPrioritizedList=true,DRAPartitionableDevices=true
   ---
   apiVersion: kubelet.config.k8s.io/v1beta1
   kind: KubeletConfiguration
   featureGates:
     DynamicResourceAllocation: true
   ```

2. CDI enabled in your runtime configuration
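
   For containerd as the runtime, CDI can be switched on in the CRI plugin
   configuration. A minimal sketch, assuming containerd 1.7 with a version 2
   config (containerd 2.x enables CDI by default, and the exact section header
   depends on your config version):

   ```toml
   # /etc/containerd/config.toml (config version 2)
   [plugins."io.containerd.grpc.v1.cri"]
     # Let containerd resolve CDI device names injected by the DRA driver.
     enable_cdi = true
     cdi_spec_dirs = ["/etc/cdi", "/var/run/cdi"]
   ```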

## Installation and Testing

Once you have your cluster properly set up, you can pull this into your
cluster for testing with something like this (the commas in the cpuset
values are escaped because `--set` would otherwise treat them as value
separators):

```bash
helm install --devel -n kube-system test oci://ghcr.io/klihub/nri-plugins/helm-charts/nri-resource-policy-topology-aware --version v0.9-dra-driver-unstable --set image.pullPolicy=Always --set extraEnv.OVERRIDE_SYS_ATOM_CPUS='2-5' --set extraEnv.OVERRIDE_SYS_CORE_CPUS='0\,1\,6-15'
```

Once the NRI plugin+DRA driver is up and running, you should see some CPUs
exposed as DRA devices. You can check the resource slices with the following
command:

```bash
[kli@n4c16-fedora-40-cloud-base-containerd ~]# kubectl get resourceslices
NAME                                                     NODE                                    DRIVER       POOL    AGE
n4c16-fedora-40-cloud-base-containerd-native.cpu-jxfkj   n4c16-fedora-40-cloud-base-containerd   native.cpu   pool0   4d2h
```

And the exposed devices like this:

```bash
[kli@n4c16-fedora-40-cloud-base-containerd ~]# kubectl get resourceslices -oyaml | less
apiVersion: v1
items:
- apiVersion: resource.k8s.io/v1beta2
  kind: ResourceSlice
  metadata:
    creationTimestamp: "2025-06-10T06:01:54Z"
    generateName: n4c16-fedora-40-cloud-base-containerd-native.cpu-
    generation: 1
    name: n4c16-fedora-40-cloud-base-containerd-native.cpu-jxfkj
    ownerReferences:
    - apiVersion: v1
      controller: true
      kind: Node
      name: n4c16-fedora-40-cloud-base-containerd
      uid: 90a99f1f-c1ca-4bea-8dbd-3cc821f744b1
    resourceVersion: "871388"
    uid: 4639d31f-e508-4b0a-8378-867f6c1c7cb1
  spec:
    devices:
    - attributes:
        cache0ID:
          int: 0
        cache1ID:
          int: 8
        cache2ID:
          int: 16
        cache3ID:
          int: 24
        cluster:
          int: 0
        core:
          int: 0
        coreType:
          string: P-core
        die:
          int: 0
        isolated:
          bool: false
        localMemory:
          int: 0
        package:
          int: 0
      name: cpu1
    - attributes:
        cache0ID:
          int: 1
        cache1ID:
          int: 9
        cache2ID:
          int: 17
        cache3ID:
          int: 24
        cluster:
          int: 2
        core:
          int: 1
        coreType:
          string: E-core
        die:
          int: 0
        isolated:
          bool: false
        localMemory:
          int: 0
        package:
          int: 0
      name: cpu2
    - attributes:
        cache0ID:
          int: 1
        cache1ID:
          int: 9
        cache2ID:
          int: 17
        cache3ID:
          int: 24
        cluster:
          int: 2
        core:
          ...
```
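
If you just want a quick summary of which devices have which core type, you
can post-process a dump of the slices instead of paging through the YAML. A
minimal sketch with an illustrative `list_cores` helper; the sample data is
inlined here for the sketch, in a cluster you would feed it the output of
`kubectl get resourceslices -oyaml`:

```shell
# Sample slice fragment inlined for illustration.
sample='devices:
- attributes:
    coreType:
      string: P-core
  name: cpu1
- attributes:
    coreType:
      string: E-core
  name: cpu2'

# Print the name of every device whose coreType matches the first argument.
list_cores() {
  printf '%s\n' "$sample" | awk -v want="$1" '
    /string:/ { type = $2 }      # remember the last seen coreType value
    /name:/   { if (type == want) print $2 }
  '
}

list_cores P-core   # prints: cpu1
```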

If everything looks fine and you do have CPUs available as DRA devices, you
can test DRA-based CPU allocation with something like this. This allocates
a single P-core for the container.

```yaml
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: any-cores
spec:
  spec:
    devices:
      requests:
      - name: cpu
        deviceClassName: native.cpu
---
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: p-cores
spec:
  spec:
    devices:
      requests:
      - name: cpu
        deviceClassName: native.cpu
        selectors:
        - cel:
            expression: device.attributes["native.cpu"].coreType == "P-core"
        count: 1
---
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: e-cores
spec:
  spec:
    devices:
      requests:
      - name: cpu
        deviceClassName: native.cpu
        selectors:
        - cel:
            expression: device.attributes["native.cpu"].coreType == "E-core"
        count: 1
---
apiVersion: v1
kind: Pod
metadata:
  name: pcore-test
  labels:
    app: pod
spec:
  containers:
  - name: ctr0
    image: busybox
    imagePullPolicy: IfNotPresent
    args:
    - /bin/sh
    - -c
    - trap 'exit 0' TERM; sleep 3600 & wait
    resources:
      requests:
        cpu: 1
        memory: 100M
      limits:
        cpu: 1
        memory: 100M
    claims:
    - name: claim-pcores
  resourceClaims:
  - name: claim-pcores
    resourceClaimTemplateName: p-cores
  terminationGracePeriodSeconds: 1
```
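
After deploying, one way to see the effect of the allocation is to check the
container's cpuset. A minimal sketch; the `cpus_allowed` helper is
illustrative and works on any Linux host:

```shell
# Print the set of CPUs the current process is allowed to run on. Run
# inside the test container, this shows the CPUs pinned by the allocation.
cpus_allowed() {
  awk '/^Cpus_allowed_list:/ { print $2 }' /proc/self/status
}
cpus_allowed
```

Against the example pod, something like
`kubectl exec pcore-test -c ctr0 -- grep Cpus_allowed_list /proc/1/status`
should show only the allocated P-core's CPU IDs.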

If you want to try a mixed native CPU + DRA-based allocation, try
increasing the CPU request and limit in the pod's spec to 1500m CPUs
and see what happens.

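
For the mixed experiment, the change amounts to bumping the CPU resources in
the pod spec, for instance like this (only the resources section shown, other
fields as in the example above):

```yaml
    resources:
      requests:
        cpu: 1500m
        memory: 100M
      limits:
        cpu: 1500m
        memory: 100M
```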
## Playing Around with CPU Abstractions

If you want to play around with this (for instance, to modify the exposed CPU
abstraction), the easiest way is to

1. [fork](https://github.com/containers/nri-plugins/fork) the [main NRI Reference Plugins](https://github.com/containers/nri-plugins) repo
2. enable GitHub Actions in your personal fork
3. make any changes you want (for instance, to alter the CPU abstraction, take a look at [cpu.DRA()](https://github.com/klihub/nri-plugins/blob/test/build/dra-driver/pkg/sysfs/dra.go))
4. push your changes to ssh://[email protected]/$YOUR_FORK/nri-plugins/refs/heads/test/build/dra-driver
5. wait for the image and Helm chart publishing actions to succeed
6. once done, pull the result into your cluster with something like `helm install --devel -n kube-system test oci://ghcr.io/$YOUR_GITHUB_USERID/nri-plugins/helm-charts/nri-resource-policy-topology-aware --version v0.9-dra-driver-unstable`

README.md

Lines changed: 3 additions & 0 deletions

[5]: https://containers.github.io/nri-plugins/stable/docs/memory/sgx-epc.html

See the [NRI plugins documentation](https://containers.github.io/nri-plugins/) for more information.

See the [DRA CPU driver prototype notes](README-DRA-driver-proto.md) for more information
about using the Topology Aware policy as a DRA CPU driver.

docs/conf.py

Lines changed: 1 addition & 1 deletion

# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
# This pattern also affects html_static_path and html_extra_path.
-exclude_patterns = ['build', 'build-aux', '_build', '.github', '_work', 'generate', 'README.md', 'TODO.md', 'SECURITY.md', 'CODE-OF-CONDUCT.md', 'docs/releases', 'test/e2e/README.md', 'docs/resource-policy/releases', 'docs/resource-policy/README.md', 'test/statistics-analysis/README.md', 'deployment/helm/*/*.md', '**/testdata']
+exclude_patterns = ['build', 'build-aux', '_build', '.github', '_work', 'generate', 'README.md', 'TODO.md', 'SECURITY.md', 'CODE-OF-CONDUCT.md', 'docs/releases', 'test/e2e/README.md', 'docs/resource-policy/releases', 'docs/resource-policy/README.md', 'test/statistics-analysis/README.md', 'deployment/helm/*/*.md', '**/testdata', 'README-DRA-driver-proto.md']

# -- Options for HTML output -------------------------------------------------