Troubleshooting

Log queries

Run the following queries on GCP Logs Explorer to check logs.

  • Lustre CSI driver controller server logs:

    resource.type="k8s_container"
    resource.labels.pod_name=~"lustre-csi-controller*"
    
  • Lustre CSI driver node server logs:

    resource.type="k8s_container"
    resource.labels.pod_name=~"lustre-csi-node*"
    
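If you prefer the command line, the same logs can be pulled with gcloud logging read. This is a minimal sketch: the filter strings mirror the Logs Explorer queries above, and the --limit and --freshness values are arbitrary examples.

    # Lustre CSI driver controller server logs
    gcloud logging read \
      'resource.type="k8s_container" AND resource.labels.pod_name=~"lustre-csi-controller*"' \
      --limit=50 --freshness=1h

    # Lustre CSI driver node server logs
    gcloud logging read \
      'resource.type="k8s_container" AND resource.labels.pod_name=~"lustre-csi-node*"' \
      --limit=50 --freshness=1h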

Pod event warnings

If your workload Pods cannot start up, run kubectl describe pod <your-pod-name> -n <your-namespace> to check the Pod events, then follow the troubleshooting guidance below that matches the event warning.
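
For example, assuming a Pod named my-workload-0 in namespace my-ns (placeholder names):

    # Show full Pod details, including recent events
    kubectl describe pod my-workload-0 -n my-ns

    # Or list only the events emitted for that Pod
    kubectl get events -n my-ns \
      --field-selector involvedObject.name=my-workload-0 \
      --sort-by=.lastTimestamp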

CSI driver enablement issues

  • Pod event warning examples:

    • MountVolume.MountDevice failed for volume "xxx" : kubernetes.io/csi: attacher.MountDevice failed to create newCsiDriverClient: driver name lustre.csi.storage.gke.io not found in the list of registered CSI drivers

    • MountVolume.SetUp failed for volume "xxx" : kubernetes.io/csi: mounter.SetUpAt failed to get CSI client: driver name lustre.csi.storage.gke.io not found in the list of registered CSI drivers

  • Solutions:

    This warning indicates that the CSI driver is either not installed or not yet running. Double-check whether the CSI driver is running in your cluster by referring to this section. If the cluster was recently scaled, updated, or upgraded, this warning is expected and should be transient, as the CSI driver Pods may take a few minutes to become fully functional after cluster operations.
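
    One way to verify this is to check that the CSIDriver object is registered and that the driver Pods are running. This is a sketch; the kube-system namespace and the Pod name prefixes are assumptions based on the log queries above.

      # Check that the CSI driver is registered in the cluster
      kubectl get csidriver lustre.csi.storage.gke.io

      # Check that the controller and node server Pods are running
      # (namespace is an assumption for the managed driver)
      kubectl get pods -n kube-system | grep lustre-csi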

MountVolume failures

AlreadyExists

  • Pod event warning examples:

    MountVolume.MountDevice failed for volume "xxx" : rpc error: code = AlreadyExists desc = A mountpoint with the same lustre filesystem name "xxx" already exists on node "xxx". Please mount different lustre filesystems
  • Solutions:

    Please try recreating the Lustre instance with a different filesystem name or use another Lustre instance with a unique filesystem name. Mounting multiple volumes from different Lustre instances with the same filesystem name on a single node is not supported because identical filesystem names result in the same major and minor device numbers, which conflicts with the shared mount architecture on a per-node basis.
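
    To see which Lustre filesystem names are already mounted on a node, you can connect to the node and list its Lustre mounts (a sketch; it requires shell access to the node, and the device column shows <MGS NID>:/<filesystem name> for each mount).

      # List all Lustre mounts on the node
      findmnt -t lustre

      # Alternatively
      mount -t lustre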

Internal

  • Pod event warning examples:

    MountVolume.MountDevice failed for volume "preprov-pv-wrongfs" : rpc error: code = Internal desc = Could not mount "10.90.2.4@tcp:/testlfs1" at "/var/lib/kubelet/plugins/kubernetes.io/csi/lustre.csi.storage.gke.io/639947affddca6d2ff04eac5ec9766c65dd851516ce34b3b44017babfc01b5dc/globalmount" on node gke-lustre-default-nw-6988-pool-1-acbefebf-jl1v: mount failed: exit status 2
    Mounting command: mount
    Mounting arguments: -t lustre 10.90.2.4@tcp:/testlfs1 /var/lib/kubelet/plugins/kubernetes.io/csi/lustre.csi.storage.gke.io/639947affddca6d2ff04eac5ec9766c65dd851516ce34b3b44017babfc01b5dc/globalmount
    Output: mount.lustre: mount 10.90.2.4@tcp:/testlfs1 at /var/lib/kubelet/plugins/kubernetes.io/csi/lustre.csi.storage.gke.io/639947affddca6d2ff04eac5ec9766c65dd851516ce34b3b44017babfc01b5dc/globalmount failed: No such file or directory
    Is the MGS specification correct?
    Is the filesystem name correct?
    If upgrading, is the copied client log valid? (see upgrade docs)
  • Solutions:

    This error means the filesystem name of the Lustre instance you are trying to mount is incorrect or does not exist. Double-check the filesystem name of the Lustre instance.
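
    If you are using Google Cloud Managed Lustre, one way to double-check the filesystem name is to describe the instance from the command line. This is a sketch: the instance name and location are placeholders, and depending on your gcloud version the command group may require a beta or alpha component.

      # Inspect the instance and verify its filesystem name field
      gcloud lustre instances describe my-lustre-instance \
        --location=us-central1-a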


  • Pod event warning examples:

    MountVolume.MountDevice failed for volume "preprov-pv-wrongfs" : rpc error: code = Internal desc = Could not mount "10.90.2.5@tcp:/testlfs" at "/var/lib/kubelet/plugins/kubernetes.io/csi/lustre.csi.storage.gke.io/639947affddca6d2ff04eac5ec9766c65dd851516ce34b3b44017babfc01b5dc/globalmount" on node gke-lustre-default-nw-6988-pool-1-acbefebf-jl1v: mount failed: exit status 5
    Mounting command: mount
    Mounting arguments: -t lustre 10.90.2.5@tcp:/testlfs /var/lib/kubelet/plugins/kubernetes.io/csi/lustre.csi.storage.gke.io/639947affddca6d2ff04eac5ec9766c65dd851516ce34b3b44017babfc01b5dc/globalmount
    Output: mount.lustre: mount 10.90.2.5@tcp:/testlfs at /var/lib/kubelet/plugins/kubernetes.io/csi/lustre.csi.storage.gke.io/639947affddca6d2ff04eac5ec9766c65dd851516ce34b3b44017babfc01b5dc/globalmount failed: Input/output error
    Is the MGS running?
  • Solutions:

    This error means your GKE cluster cannot connect to the Lustre instance using the specified IP and filesystem name. Ensure the instance IP is correct and that your GKE cluster is in the same VPC network as your Lustre instance.
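
    A quick way to compare the two networks is with gcloud (a sketch; the cluster name, instance name, and locations are placeholders, and the field names are assumptions about the resource formats).

      # VPC network of the GKE cluster
      gcloud container clusters describe my-cluster \
        --location=us-central1 --format="value(network)"

      # VPC network of the Lustre instance
      gcloud lustre instances describe my-lustre-instance \
        --location=us-central1-a --format="value(network)"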


  • Pod event warning examples:

    • MountVolume.MountDevice failed for volume "xxx" : rpc error: code = Internal desc = xxx
    • MountVolume.SetUp failed for volume "xxx" : rpc error: code = Internal desc = xxx
    • UnmountVolume.TearDown failed for volume "xxx" : rpc error: code = Internal desc = xxx
    • UnmountVolume.Unmount failed for volume "xxx" : rpc error: code = Internal desc = xxx
  • Solutions:

    Warnings not listed above that include the RPC error code Internal indicate unexpected issues in the CSI driver. Create a new issue on the GitHub project page and include your GKE cluster version, detailed workload information, and the Pod event warning message.
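
    A few commands that can help collect those details before filing the issue (a sketch; Pod, namespace, and cluster names are placeholders, and the kube-system namespace is an assumption for the managed driver):

      # Client and server (GKE cluster) versions
      kubectl version

      # Events and spec of the affected workload Pod
      kubectl describe pod my-workload-0 -n my-ns

      # Locate the CSI driver Pods, then fetch logs from the node server Pod
      # running on the same node as your workload
      kubectl get pods -n kube-system -o wide | grep lustre-csi
      kubectl logs -n kube-system <lustre-csi-node-pod-name>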