The NodeQuotaSync plugin is an extension for HNS.
It enables syncing the root subnamespace quota object and secondary subnamespaces quota objects with the node's allocatable resources in the cluster. It provides support for resources multiplier for over-commit and a reserved resources mechanism, making it easier to remove nodes from the cluster temporarily without affecting the allocatable subnamespaces resources.
- Auto-syncs the quota objects of the
root subnamespaceandsecondary subnamespaceswith the matching nodes' allocatable resources. - Configurable resources multiplier for over-commit.
- Reserved resources mechanism for removing nodes in a safe way.
- Select what type of resource to control
- Control through a Config CRD
Helm chart docs are available on charts/hns-nqs-plugin directory.
$ helm upgrade hns-nqs-plugin --install --namespace nodequotasync-system --create-namespace oci://ghcr.io/dana-team/helm-charts/hns-nqs-plugin --version <release>The NodeQuotaSync plugin is configured using the NodeQuotaConfig CRD:
apiVersion: dana.hns.io/v1alpha1
kind: NodeQuotaConfig
metadata:
name: example-nodequotaconfig
spec:
reservedHoursToLive: 24
controlledResources: ["cpu","ephermal-storage","memory","pods","nvidia.com/gpu"]
subnamespacesRoots:
- rootNamespace: cluster-root
secondaryRoots:
- labelSelector:
app: gpu
name: gpu
multipliers:
cpu: "2"
memory: "2"
- labelSelector:
app: cpu-workloads
name: cpu-workloads
multipliers:
memory: "4"subnamespaceRoots- defines the cluster's hierarchy;rootNamespace- represents the name of therootnamespace;secondaryRoots- the direct children of therootnamespaces with their corresponding node'slabelSelectorand multipliers.
The ReservedResources mechanism is a way to ensure there is no resources shortage when dealing with Nodes maintenance.
It works by giving the cluster's admins time to restore the node into the cluster without recalculating the cluster's resources and only removing the node's resources in a controlled way after a pre-set number of hours. This can be configured in the Config CR using the ReservedHoursToLive field.
When we remove one of the nodes from the cluster a ReservedResources field will be added to the CRD status:
reservedResources:
- Timestamp: '2023-07-09T07:04:27Z'
nodeGroup: cpu-workloads
resources:
cpu: 3500m
memory: '61847027712'
pods: '250'The cluster's resources will not be updated until the number of hours in the reservedHoursToLive will pass from (starting from Timestamp); afterwards, the node's resources will be removed.