[RFE] Collect additional information about all pods in the cluster #339

Closed
oarribas opened this issue Dec 1, 2022 · 9 comments

oarribas commented Dec 1, 2022

Collect in must-gather information similar to what oc describe nodes shows, such as CPU/memory requests and limits per pod, allocated resources on the nodes, and real resource usage by pods (and possibly also by containers).

Currently, information like the following is shown in an oc describe nodes output:

[...]
Non-terminated Pods:                         (100 in total)
  Namespace                                  Name                                                          CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                                  ----                                                          ------------  ----------  ---------------  -------------  ---
  axyz                                        daemonset-example-d4lvf                                       0 (0%)        0 (0%)      0 (0%)           0 (0%)         29h
  axyz                                        deployment-example-862f9x79n-z27bg                            0 (0%)        0 (0%)      0 (0%)           0 (0%)         12h
  axyz                                        job-example-mxg74                                             0 (0%)        0 (0%)      0 (0%)           0 (0%)         12h
  axyz                                        replication-example-vhz2v                                     0 (0%)        0 (0%)      0 (0%)           0 (0%)         12h
  axyz                                        replication-example-yxz4c                                     0 (0%)        0 (0%)      0 (0%)           0 (0%)         12h
  axyz                                        replication-example-njcv2                                     0 (0%)        0 (0%)      0 (0%)           0 (0%)         12h
  zyxw                                      backend-worker-1-b2bl4                                        150m (4%)     1 (28%)     50Mi (0%)        300Mi (4%)     7h2m
  zyxw                                      system-sphinx-1-xnvz1                                         80m (2%)      1 (28%)     250Mi (3%)       512Mi (7%)     12h
[...]
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests          Limits
  --------           --------          ------
  cpu                3364m (96%)       9780m (279%)
  memory             7118172800 (99%)  9694Mi (141%)
  ephemeral-storage  0 (0%)            0 (0%)
  hugepages-1Gi      0 (0%)            0 (0%)
  hugepages-2Mi      0 (0%)            0 (0%)

But in a must-gather, not all namespaces/pods are collected.

That information could help to identify overcommitted nodes, pods without requests/limits, etc.
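
As a rough, untested illustration of the overcommit part of that use case, per-node CPU requests could be summed from the pod list with a jq sketch along the following lines; it assumes CPU requests appear either as whole cores (e.g. "0.5") or as millicores (e.g. "250m"):

oc get pods -A --field-selector="status.phase!=Succeeded" -o json | jq -r '
  [ .items[]
    | . as $pod
    | .spec.containers[]
    # normalize the CPU request of each container to millicores ("0" when unset)
    | { node: $pod.spec.nodeName,
        cpu_m: (.resources.requests.cpu // "0"
                | if endswith("m") then (rtrimstr("m") | tonumber) else (tonumber * 1000) end) } ]
  | group_by(.node)
  # one line per node: node name and total requested CPU in millicores
  | .[] | "\(.[0].node)  \(map(.cpu_m) | add)m requested"'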

palonsoro commented Dec 1, 2022

A good way to do this would be to parse oc get pod --all-namespaces -o json with jq to produce a reduced JSON summary (kudos to @gmeghnag for the idea).

Not sure if jq is already included in the must-gather image, but it is included in the OCP4 repos, so it shouldn't be a problem.

gmeghnag commented Dec 1, 2022

Something like the following (it needs to be tested to check whether it is valid):

$ oc get pods -A --field-selector="status.phase!=Succeeded" -o json | jq '[.items[]| {"name": .metadata.name, node: .spec.nodeName, resources: .spec.containers[].resources}]' 

an example:

oc get pods -A --field-selector="status.phase!=Succeeded" -o json | jq '[.items[]| {"name": .metadata.name, node: .spec.nodeName, resources: .spec.containers[].resources}]'
[
  {
    "name": "openshift-apiserver-operator-85bc4dfdb4-zj6xn",
    "node": "ip-10-0-215-218.eu-central-1.compute.internal",
    "resources": {
      "requests": {
        "cpu": "10m",
        "memory": "50Mi"
      }
    }
  },
  {
    "name": "apiserver-6f8b7d589f-69kt4",
    "node": "ip-10-0-131-238.eu-central-1.compute.internal",
    "resources": {
      "requests": {
        "cpu": "100m",
        "memory": "200Mi"
      }
    }
  },
  ...

Or, if we want the same output filtered by node name, something like the following:

NODE=<NODE_NAME>
oc get pods -A --field-selector="status.phase!=Succeeded" -o json | jq --arg NODE "$NODE" '[.items[]| select(.spec.nodeName==$NODE) | {name: .metadata.name, node: .spec.nodeName, resources: .spec.containers[].resources}]'

@palonsoro

You would like to have the namespace in that output. It is perfectly possible to have more than one pod with the same name, especially if they come from StatefulSets or are created by some custom controller (or by hand).

@palonsoro

For the rest, it looks fine.

palonsoro commented Dec 1, 2022

I'd also suggest using the -c option of jq to produce compact output, and not wrapping the results inside an array. That way, one can use both jq and grep on the results (this is what the audit logs do, for reference).
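
For example, assuming the compact per-container lines were saved to a hypothetical file such as pods_resources.json, one could pre-filter with grep and still post-process the matching lines with jq:

# grep narrows the compact JSON lines to one namespace; jq then reshapes only the matches
grep '"namespace":"openshift-apiserver"' pods_resources.json | jq -c '{podName, containerName, resources}'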

gmeghnag commented Dec 1, 2022

I updated the query to also display the containerName:

oc get pods -A --field-selector="status.phase!=Succeeded" -o json | jq --arg NODE "$NODE" '.items[]| select(.spec.nodeName==$NODE) | . as $pod | .spec.containers[] | {node: $pod.spec.nodeName, namespace: $pod.metadata.namespace,  podName: $pod.metadata.name, containerName: .name, resources: .resources}' -c

an example:

oc get pods -A --field-selector="status.phase!=Succeeded" -o json | jq --arg NODE "$NODE" '.items[]| select(.spec.nodeName==$NODE) | . as $pod | .spec.containers[] | {node: $pod.spec.nodeName, namespace: $pod.metadata.namespace,  podName: $pod.metadata.name, containerName: .name, resources: .resources}' -c | head -5
{"node":"ip-10-0-131-238.eu-central-1.compute.internal","namespace":"openshift-apiserver","podName":"apiserver-6f8b7d589f-69kt4","containerName":"openshift-apiserver","resources":{"requests":{"cpu":"100m","memory":"200Mi"}}}
{"node":"ip-10-0-131-238.eu-central-1.compute.internal","namespace":"openshift-apiserver","podName":"apiserver-6f8b7d589f-69kt4","containerName":"openshift-apiserver-check-endpoints","resources":{"requests":{"cpu":"10m","memory":"50Mi"}}}
{"node":"ip-10-0-131-238.eu-central-1.compute.internal","namespace":"openshift-authentication","podName":"oauth-openshift-59795457bf-sbg4n","containerName":"oauth-openshift","resources":{"requests":{"cpu":"10m","memory":"50Mi"}}}
{"node":"ip-10-0-131-238.eu-central-1.compute.internal","namespace":"openshift-cluster-csi-drivers","podName":"aws-ebs-csi-driver-controller-676777c46f-2cqn5","containerName":"csi-driver","resources":{"requests":{"cpu":"10m","memory":"50Mi"}}}
{"node":"ip-10-0-131-238.eu-central-1.compute.internal","namespace":"openshift-cluster-csi-drivers","podName":"aws-ebs-csi-driver-controller-676777c46f-2cqn5","containerName":"driver-kube-rbac-proxy","resources":{"requests":{"cpu":"10m","memory":"20Mi"}}}

soltysh commented Dec 12, 2022

But in a must-gather, not all namespaces/pods are collected.

This is intentional: we are only focusing on control-plane-related data, which is required to diagnose the cluster state and help our customers resolve the problem. Also, collecting any kind of data spanning all namespaces would risk exposing various Personally Identifiable Information, which we would be required to remove from the collected data set, and that isn't a trivial task to undertake.
Lastly, every piece of data we scrape increases the overall size of the archive. That isn't a big deal when working in a cluster with a few nodes, but when you reach clusters with hundreds or thousands of nodes, the extra bytes make a significant difference. This forces us to justify any addition in terms of the balance between how much data we have to gather every time and what data we can request in follow-up engagements with our customers.

That information could help to identify overcommitted nodes, pods without requests/limits, etc.

That is a valid use case, but with the current capabilities OpenShift has, that kind of information would be much better suited to being exposed in OpenShift Insights, which is based on cluster metrics and can suggest actions a user might take to improve the stability and availability of their cluster.

Based on the above, as well as other information presented in this issue, I'm closing this as won't fix.

/close

@openshift-ci openshift-ci bot closed this as completed Dec 12, 2022
openshift-ci bot commented Dec 12, 2022

@soltysh: Closing this issue.

oarribas commented Apr 29, 2025

Related to this issue, RFE-7515 has been created to collect part of the data from this request, and PR #488 has also been opened.

cc @soltysh
