Description
The problem is fixed in two PRs, but neither PR description includes a proper reproduction, so I am creating this issue to document one.
Steps to reproduce the issue
- Enable hierarchical queues for the capacity plugin.
- Set valid capability values on the root queue. On my local cluster:
kubectl get queues root -o yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  generation: 11
  name: root
spec:
  capability:
    cpu: "60"
    ephemeral-storage: "8108184468740"
    hugepages-1Gi: "1"
    hugepages-2Mi: "5"
    memory: 104885440Ki
    pods: "4400"
  ....
  reclaimable: false
  weight: 1
status:
  allocated:
    cpu: "0"
    memory: "0"
  reservation: {}
  state: Open
- In one of the child queues, set a capability value that exceeds the root queue's value (here cpu "61" against the root's "60"):
kubectl get queues default -o yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: default
spec:
  capability:
    cpu: "61"
  ....
status:
  allocated:
    cpu: "0"
    memory: "0"
  reservation: {}
  state: Open
- Add any load to the queue:
kubectl apply -f https://raw.githubusercontent.com/volcano-sh/volcano/master/example/job.yaml
- See the scheduler panic:
I1014 09:45:42.133691 2108 panic.go:787] End scheduling ...
E1014 09:45:42.133815 2108 panic.go:262] "Observed a panic" panic="runtime error: invalid memory address or nil pointer dereference" panicGoValue="\"invalid memory address or nil pointer dereference\"" stacktrace=<
goroutine 423 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x44d0e08, 0xc0004ed320}, {0x3bfc2c0, 0x5f77550})
/go/pkg/mod/k8s.io/[email protected]/pkg/util/runtime/runtime.go:132 +0xbc
k8s.io/apimachinery/pkg/util/runtime.handleCrash({0x44d1510, 0xc0002c8150}, {0x3bfc2c0, 0x5f77550}, {0x0, 0x0, 0xc000c73268?})
/go/pkg/mod/k8s.io/[email protected]/pkg/util/runtime/runtime.go:107 +0x116
k8s.io/apimachinery/pkg/util/runtime.HandleCrashWithContext({0x44d1510, 0xc0002c8150}, {0x0, 0x0, 0x0})
/go/pkg/mod/k8s.io/[email protected]/pkg/util/runtime/runtime.go:78 +0x5a
panic({0x3bfc2c0?, 0x5f77550?})
/usr/local/go/src/runtime/panic.go:787 +0x132
volcano.sh/volcano/pkg/scheduler/api.(*Resource).Clone(...)
/go/src/volcano.sh/volcano/pkg/scheduler/api/resource_info.go:143
volcano.sh/volcano/pkg/scheduler/plugins/capacity.(*queueAttr).Clone(0xc0004bb320)
/go/src/volcano.sh/volcano/pkg/scheduler/plugins/capacity/capacity.go:957 +0x72b
volcano.sh/volcano/pkg/scheduler/plugins/capacity.(*capacityPlugin).OnSessionOpen.func5(0xc000bef340)
/go/src/volcano.sh/volcano/pkg/scheduler/plugins/capacity/capacity.go:238 +0x285
volcano.sh/volcano/pkg/scheduler/framework.(*Session).PrePredicateFn(0xc00031cb08, 0xc000bef340)
/go/src/volcano.sh/volcano/pkg/scheduler/framework/session_plugins.go:786 +0x118
volcano.sh/volcano/pkg/scheduler/actions/preempt.(*Action).preempt(0xc00046eed0, 0xc00031cb08, 0xc000f7ee80, 0xc000bef340, 0xc000f7eea0, {0x44a0ec0, 0xc00011db58})
/go/src/volcano.sh/volcano/pkg/scheduler/actions/preempt/preempt.go:286 +0x7e
volcano.sh/volcano/pkg/scheduler/actions/preempt.(*Action).Execute(0xc00046eed0, 0xc00031cb08)
/go/src/volcano.sh/volcano/pkg/scheduler/actions/preempt/preempt.go:181 +0x1591
volcano.sh/volcano/pkg/scheduler.(*Scheduler).runOnce(0xc00014b800)
/go/src/volcano.sh/volcano/pkg/scheduler/scheduler.go:130 +0x36a
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1({0x1e036f6?, 0xc0002d1030?})
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:233 +0x13
k8s.io/apimachinery/pkg/util/wait.BackoffUntilWithContext.func1({0x44d1510?, 0xc0002c8150?}, 0x16e0d14?)
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:255 +0x51
k8s.io/apimachinery/pkg/util/wait.BackoffUntilWithContext({0x44d1510, 0xc0002c8150}, 0xc000c73f50, {0x44a1ae0, 0xc0004ce660}, 0x1)
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:256 +0xe5
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0x0?, {0x44a1ae0?, 0xc0004ce660?}, 0x0?, 0x0?)
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:233 +0x46
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc0006a2890, 0x2540be400, 0x0, 0x1, 0xc0002c8150)
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:210 +0x7f
k8s.io/apimachinery/pkg/util/wait.Until(...)
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:163
created by volcano.sh/volcano/pkg/scheduler.(*Scheduler).Run in goroutine 1
/go/src/volcano.sh/volcano/pkg/scheduler/scheduler.go:96 +0x1a7
>
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x370abcb]
goroutine 423 [running]:
k8s.io/apimachinery/pkg/util/runtime.handleCrash({0x44d1510, 0xc0002c8150}, {0x3bfc2c0, 0x5f77550}, {0x0, 0x0, 0xc000c73268?})
/go/pkg/mod/k8s.io/[email protected]/pkg/util/runtime/runtime.go:114 +0x1a9
k8s.io/apimachinery/pkg/util/runtime.HandleCrashWithContext({0x44d1510, 0xc0002c8150}, {0x0, 0x0, 0x0})
/go/pkg/mod/k8s.io/[email protected]/pkg/util/runtime/runtime.go:78 +0x5a
panic({0x3bfc2c0?, 0x5f77550?})
/usr/local/go/src/runtime/panic.go:787 +0x132
volcano.sh/volcano/pkg/scheduler/api.(*Resource).Clone(...)
/go/src/volcano.sh/volcano/pkg/scheduler/api/resource_info.go:143
volcano.sh/volcano/pkg/scheduler/plugins/capacity.(*queueAttr).Clone(0xc0004bb320)
/go/src/volcano.sh/volcano/pkg/scheduler/plugins/capacity/capacity.go:957 +0x72b
volcano.sh/volcano/pkg/scheduler/plugins/capacity.(*capacityPlugin).OnSessionOpen.func5(0xc000bef340)
/go/src/volcano.sh/volcano/pkg/scheduler/plugins/capacity/capacity.go:238 +0x285
volcano.sh/volcano/pkg/scheduler/framework.(*Session).PrePredicateFn(0xc00031cb08, 0xc000bef340)
/go/src/volcano.sh/volcano/pkg/scheduler/framework/session_plugins.go:786 +0x118
volcano.sh/volcano/pkg/scheduler/actions/preempt.(*Action).preempt(0xc00046eed0, 0xc00031cb08, 0xc000f7ee80, 0xc000bef340, 0xc000f7eea0, {0x44a0ec0, 0xc00011db58})
/go/src/volcano.sh/volcano/pkg/scheduler/actions/preempt/preempt.go:286 +0x7e
volcano.sh/volcano/pkg/scheduler/actions/preempt.(*Action).Execute(0xc00046eed0, 0xc00031cb08)
/go/src/volcano.sh/volcano/pkg/scheduler/actions/preempt/preempt.go:181 +0x1591
volcano.sh/volcano/pkg/scheduler.(*Scheduler).runOnce(0xc00014b800)
/go/src/volcano.sh/volcano/pkg/scheduler/scheduler.go:130 +0x36a
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1({0x1e036f6?, 0xc0002d1030?})
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:233 +0x13
k8s.io/apimachinery/pkg/util/wait.BackoffUntilWithContext.func1({0x44d1510?, 0xc0002c8150?}, 0x16e0d14?)
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:255 +0x51
k8s.io/apimachinery/pkg/util/wait.BackoffUntilWithContext({0x44d1510, 0xc0002c8150}, 0xc000c73f50, {0x44a1ae0, 0xc0004ce660}, 0x1)
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:256 +0xe5
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0x0?, {0x44a1ae0?, 0xc0004ce660?}, 0x0?, 0x0?)
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:233 +0x46
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc0006a2890, 0x2540be400, 0x0, 0x1, 0xc0002c8150)
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:210 +0x7f
k8s.io/apimachinery/pkg/util/wait.Until(...)
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:163
created by volcano.sh/volcano/pkg/scheduler.(*Scheduler).Run in goroutine 1
/go/src/volcano.sh/volcano/pkg/scheduler/scheduler.go:96 +0x1a7
I added the following log message to the PrePredicateFn to see where it fails:
ssn.AddPrePredicateFn(cp.Name(), func(task *api.TaskInfo) error {
	state := &capacityState{
		queueAttrs: make(map[api.QueueID]*queueAttr),
	}
	for _, queue := range cp.queueOpts {
		klog.V(5).Infof("Cloning queue <%s> with realCapability: <%v>"+
			" for task <%s/%s>.", queue.name, queue.realCapability, task.Namespace, task.Name)
		state.queueAttrs[queue.queueID] = queue.Clone()
	}
	ssn.GetCycleState(task.UID).Write(capacityStateKey, state)
	return nil
})
With that log line in place, the failing clone shows that realCapability is nil:
I1014 09:45:42.133422 2108 capacity.go:235] Cloning queue with realCapability: <<nil>> for task <default/test-job-default-nginx-0>.
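To illustrate why a nil realCapability leads to the SIGSEGV in the stack trace, here is a minimal sketch. The `Resource` type below is a simplified stand-in, not Volcano's actual `api.Resource`, and `SafeClone` and `clonePanics` are hypothetical helpers introduced only for this example. It assumes that Clone reads fields from its receiver, which is consistent with the trace but is not copied from the Volcano source.

```go
package main

import "fmt"

// Resource is a simplified stand-in for Volcano's api.Resource (assumption).
type Resource struct {
	MilliCPU float64
	Memory   float64
}

// Clone copies the receiver's fields. Reading r.MilliCPU dereferences r,
// so calling Clone on a nil *Resource panics at runtime.
func (r *Resource) Clone() *Resource {
	return &Resource{MilliCPU: r.MilliCPU, Memory: r.Memory}
}

// SafeClone is a hypothetical nil-tolerant variant: a nil receiver yields
// an empty Resource instead of a panic.
func (r *Resource) SafeClone() *Resource {
	if r == nil {
		return &Resource{}
	}
	return r.Clone()
}

// clonePanics reports whether calling Clone on r panics.
func clonePanics(r *Resource) (panicked bool) {
	defer func() {
		if recover() != nil {
			panicked = true
		}
	}()
	_ = r.Clone()
	return false
}

func main() {
	var realCapability *Resource // never initialized, as in the log line above

	fmt.Println("Clone on nil panics:", clonePanics(realCapability))
	fmt.Println("SafeClone on nil:", *realCapability.SafeClone())
}
```

A nil check inside the clone path only hides the symptom. Whether a nil-tolerant Clone is appropriate for the scheduler, or whether the field should always be initialized instead, is a design choice this sketch does not settle.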
Describe the results you received and expected
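Expected: the scheduler does not panic. Received: the scheduler panics with a nil pointer dereference.
I also managed to trigger this on my production cluster, where the scheduler pods end up in CrashLoopBackOff: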
kubectl get pods -n volcano-system
NAME READY STATUS RESTARTS AGE
volcano-admission-86cf67c97f-5szc8 1/1 Running 0 3d21h
volcano-admission-86cf67c97f-hkfns 1/1 Running 0 3d21h
volcano-admission-86cf67c97f-r4qh5 1/1 Running 0 3d21h
volcano-controllers-7c6df95d67-jm2w2 1/1 Running 0 3d21h
volcano-controllers-7c6df95d67-r5jmb 1/1 Running 0 3d21h
volcano-controllers-7c6df95d67-tl6rd 1/1 Running 0 3d21h
volcano-dashboard-7fc4c6b5cb-jwfg2 2/2 Running 0 72d
volcano-scheduler-6b78bf9f58-cns5q 0/1 CrashLoopBackOff 18 (2m52s ago) 3d7h
volcano-scheduler-6b78bf9f58-hbzg2 0/1 CrashLoopBackOff 18 (3m21s ago) 3d7h
volcano-scheduler-6b78bf9f58-q6hnm 0/1 CrashLoopBackOff 18 (2m25s ago) 3d7h
What version of Volcano are you using?
latest
Any other relevant information
The problem happens because newQueueAttr does not initialize the realCapability field when it is called from buildHierarchicalQueueAttrs:
https://github.com/volcano-sh/volcano/blob/master/pkg/scheduler/plugins/capacity/capacity.go#L526
https://github.com/volcano-sh/volcano/blob/master/pkg/scheduler/plugins/capacity/capacity.go#L698
The child queues are then validated by checkHierarchicalQueue:
https://github.com/volcano-sh/volcano/blob/master/pkg/scheduler/plugins/capacity/capacity.go#L598
https://github.com/volcano-sh/volcano/blob/master/pkg/scheduler/plugins/capacity/capacity.go#L752
With the bad child capability, the check fails and returns early here:
https://github.com/volcano-sh/volcano/blob/master/pkg/scheduler/plugins/capacity/capacity.go#L780
As a result, the code that would normally initialize the realCapability field afterwards is never executed:
https://github.com/volcano-sh/volcano/blob/master/pkg/scheduler/plugins/capacity/capacity.go#L797
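The sequence described above can be modeled in a few lines. This is a sketch under stated assumptions, not the plugin's real code: the `resource` and `queueAttr` types are reduced to a single CPU field, `buildHierarchy` is a hypothetical stand-in for the build-and-check path, and `newQueueAttrInitialized` is one possible mitigation that is not necessarily what the fixing PRs do.

```go
package main

import (
	"errors"
	"fmt"
)

// resource and queueAttr are simplified stand-ins for the plugin's types (assumption).
type resource struct{ milliCPU float64 }

type queueAttr struct {
	name           string
	capability     *resource
	realCapability *resource
}

// newQueueAttr mirrors the reported behavior: realCapability is left nil.
func newQueueAttr(name string, cpu float64) *queueAttr {
	return &queueAttr{name: name, capability: &resource{milliCPU: cpu}}
}

// newQueueAttrInitialized is a hypothetical variant that never leaves
// realCapability nil, so later clones cannot dereference a nil pointer.
func newQueueAttrInitialized(name string, cpu float64) *queueAttr {
	attr := newQueueAttr(name, cpu)
	attr.realCapability = &resource{}
	return attr
}

var errCapabilityExceedsParent = errors.New("child capability exceeds parent capability")

// buildHierarchy validates the child against its parent and only then
// assigns realCapability. The early return on a failed check skips the
// assignment, which is the situation described in this issue.
func buildHierarchy(parent, child *queueAttr) error {
	if child.capability.milliCPU > parent.capability.milliCPU {
		return errCapabilityExceedsParent
	}
	parent.realCapability = &resource{milliCPU: parent.capability.milliCPU}
	child.realCapability = &resource{milliCPU: child.capability.milliCPU}
	return nil
}

func main() {
	root := newQueueAttr("root", 60000)       // cpu: "60"
	child := newQueueAttr("default", 61000)   // cpu: "61", larger than root

	err := buildHierarchy(root, child)
	fmt.Println("hierarchy check error:", err)
	fmt.Println("child realCapability is nil:", child.realCapability == nil)

	safe := newQueueAttrInitialized("default", 61000)
	_ = buildHierarchy(root, safe)
	fmt.Println("initialized variant is nil:", safe.realCapability == nil)
}
```

Initializing the field in the constructor keeps every later consumer safe regardless of which validation path returns early, at the cost of reporting an empty capability for queues whose hierarchy check failed.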