
Azure CSI volume does not format and mount new volume #8874

Closed
carlosrbcunha opened this issue Sep 11, 2020 · 9 comments

@carlosrbcunha

Nomad version

Nomad v0.12.4 (8efaee4)

Operating system and Environment details

Ubuntu 18.04.5 LTS

Issue

Nomad does not format and mount a new CSI volume

Reproduction steps

  • A new Azure disk is created via Terraform
  • The volume is registered in Nomad
  • A job is submitted that uses the previously registered volume

Volume definition

id = "teste1"
name = "teste1"
type = "csi"
external_id = "/subscriptions/11111111-2222-3333-4444-555555555555/resourceGroups/bifana-core-rg/providers/Microsoft.Compute/disks/bifana-core-teste1-disk"
plugin_id = "az-disk0"
access_mode = "single-node-writer"
attachment_mode = "file-system"
mount_options {
   fs_type = "ext4"untitled:Untitled-5
   mount_flags = ["ro"]
}
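
For reference, a minimal sketch of the registration step, assuming the definition above is saved as volume.hcl (the file name is illustrative):

# Register the volume with Nomad
nomad volume register volume.hcl

# Confirm the volume is visible and which plugin backs it
nomad volume status teste1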

Previously opened issues, also regarding CSI on Azure:

#7812

Plugin-azure-disk-node logs

nodeserver.go:237] Using default NodeGetCapabilities
utils.go:114] GRPC response: {"capabilities":[{"Type":{"Rpc":{"type":1}}},{"Type":{"Rpc":{"type":3}}},{"Type":{"Rpc":{"type":2}}}]}
utils.go:108] GRPC call: /csi.v1.Identity/Probe
utils.go:109] GRPC request: {}
utils.go:114] GRPC response: {"ready":{"value":true}}
utils.go:108] GRPC call: /csi.v1.Identity/Probe
utils.go:109] GRPC request: {}
utils.go:114] GRPC response: {"ready":{"value":true}}
utils.go:108] GRPC call: /csi.v1.Node/NodeGetCapabilities
utils.go:109] GRPC request: {}
nodeserver.go:237] Using default NodeGetCapabilities
utils.go:114] GRPC response: {"capabilities":[{"Type":{"Rpc":{"type":1}}},{"Type":{"Rpc":{"type":3}}},{"Type":{"Rpc":{"type":2}}}]}
utils.go:108] GRPC call: /csi.v1.Node/NodeStageVolume
utils.go:109] GRPC request: {"publish_context":{"LUN":"0"},"staging_target_path":"/csi/staging/teste1/rw-file-system-single-node-writer","volume_capability":{"AccessType":{"Mount":{"fs_type":"ext4","mount_flags":["ro"]}},"access_mode":{"mode":1}},"volume_id":"/subscriptions/11111111-2222-3333-4444-555555555555/resourceGroups/bifana-core-rg/providers/Microsoft.Compute/disks/bifana-core-teste1-disk"}
azure_common_linux.go:175] azureDisk - found /dev/disk/azure/scsi1/lun0 by sdc under /dev/disk/azure/scsi1/
nodeserver.go:121] NodeStageVolume: formatting /dev/disk/azure/scsi1/lun0 and mounting at /csi/staging/teste1/rw-file-system-single-node-writer with mount options([ro])
mount_linux.go:405] Attempting to determine if disk "/dev/disk/azure/scsi1/lun0" is formatted using blkid with args: ([-p -s TYPE -s PTTYPE -o export /dev/disk/azure/scsi1/lun0])
mount_linux.go:408] Output: "", err: exit status 2
utils.go:112] GRPC error: rpc error: code = Internal desc = could not format "/dev/disk/azure/scsi1/lun0"(lun: "0"), and mount it at "/csi/staging/teste1/rw-file-system-single-node-writer"
utils.go:108] GRPC call: /csi.v1.Node/NodeUnpublishVolume
utils.go:109] GRPC request: {"target_path":"/csi/per-alloc/ec0b8113-d042-5587-b592-4e85330a6a0a/teste1/rw-file-system-single-node-writer","volume_id":"/subscriptions/11111111-2222-3333-4444-555555555555/resourceGroups/bifana-core-rg/providers/Microsoft.Compute/disks/bifana-core-teste1-disk"}
nodeserver.go:225] NodeUnpublishVolume: unmounting volume /subscriptions/11111111-2222-3333-4444-555555555555/resourceGroups/bifana-core-rg/providers/Microsoft.Compute/disks/bifana-core-teste1-disk on /csi/per-alloc/ec0b8113-d042-5587-b592-4e85330a6a0a/teste1/rw-file-system-single-node-writer
mount_helper_common.go:33] Warning: Unmount skipped because path does not exist: /csi/per-alloc/ec0b8113-d042-5587-b592-4e85330a6a0a/teste1/rw-file-system-single-node-writer
nodeserver.go:230] NodeUnpublishVolume: unmount volume /subscriptions/11111111-2222-3333-4444-555555555555/resourceGroups/bifana-core-rg/providers/Microsoft.Compute/disks/bifana-core-teste1-disk on /csi/per-alloc/ec0b8113-d042-5587-b592-4e85330a6a0a/teste1/rw-file-system-single-node-writer successfully
utils.go:114] GRPC response: {}
utils.go:108] GRPC call: /csi.v1.Node/NodeUnstageVolume
utils.go:109] GRPC request: {"staging_target_path":"/csi/staging/teste1/rw-file-system-single-node-writer","volume_id":"/subscriptions/11111111-2222-3333-4444-555555555555/resourceGroups/bifana-core-rg/providers/Microsoft.Compute/disks/bifana-core-teste1-disk"}
nodeserver.go:144] NodeUnstageVolume: unmounting /csi/staging/teste1/rw-file-system-single-node-writer
mount_helper_common.go:65] Warning: "/csi/staging/teste1/rw-file-system-single-node-writer" is not a mountpoint, deleting
nodeserver.go:149] NodeUnstageVolume: unmount /csi/staging/teste1/rw-file-system-single-node-writer successfully

We can see the disk being attached to the VM and Nomad trying to format and mount it on the server, but something goes wrong when identifying whether the disk is formatted or not.

GRPC error: rpc error: code = Internal desc = could not format "/dev/disk/azure/scsi1/lun0"(lun: "0"), and mount it at "/csi/staging/teste1/rw-file-system-single-node-writer"

When the allocation fails, the job is retried on all the clients of the cluster (with the same errors), and the job ultimately fails completely.

@tgross
Member

tgross commented Sep 11, 2020

Hi @carlosrbcunha! Sorry to hear about that. Can you provide the logs from the node plugin? Usually something like nomad alloc logs -stderr :alloc_id will get these. The jobspec for the plugins might help as well. Also, can you verify for me that the CSI plugin is running with the Docker driver's privileged=true configuration?
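
A typical sequence for that (a sketch; the allocation ID is a placeholder):

# Find the node plugin's allocations
nomad plugin status az-disk0
# Pull the stderr stream from the node plugin's allocation
nomad alloc logs -stderr <alloc_id>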

@josemaia
Contributor

With regard to plugins, our client.hcl files contain the following:

	plugin "docker" {
		config {
			allow_privileged = true
		}
	}

as well as in the configuration of the node plugin:

      config {
        image = "mcr.microsoft.com/k8s/csi/azuredisk-csi"

        volumes = [
          "local/azure.json:/etc/kubernetes/azure.json"
        ]

        args = [
          "--nodeid=${attr.unique.hostname}-vm",
          "--endpoint=unix://csi/csi.sock",
          "--logtostderr",
          "--v=5",
        ]

        # node plugins must run as privileged jobs because they
        # mount disks to the host
        privileged = true
      }

Looking into getting some logs right now.

@carlosrbcunha
Author

Hi Tim,

I already sent the logs via email, but I am also attaching them here for you.
node_logs_csi.log

Regarding this error: when we browse /dev/disk/azure/scsi1 on the server, it contains only a link called lun0 that points to the disk device.
When running blkid with the args -p -s TYPE -s PTTYPE -o export /dev/disk/azure/scsi1/lun0, it does not return a TYPE at all (because the disk has no such attribute). We tested with a disk that was already partitioned and formatted and the result was the same (the node plugin behaved the same way), but in the /dev/disk/azure/scsi1 folder there was another link, lun0-part1, which returned TYPE=ext4 as it is supposed to.
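
To illustrate (a sketch; device paths and outputs are from our environment):

# Blank disk: blkid finds no filesystem signature, prints nothing, and exits 2
blkid -p -s TYPE -s PTTYPE -o export /dev/disk/azure/scsi1/lun0
echo $?    # 2

# Already-partitioned and formatted disk: the partition link reports a TYPE
blkid -p -s TYPE -s PTTYPE -o export /dev/disk/azure/scsi1/lun0-part1
# DEVNAME=/dev/disk/azure/scsi1/lun0-part1
# TYPE=ext4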

What is the expected behaviour here?
Something is wrong in this logic.

If you need more logs or tests, please ask.

Best Regards,
Carlos Cunha

@tgross
Member

tgross commented Sep 14, 2020

I see two sets of errors in the node client logs. First:

I0914 08:50:55.248915       1 azure_vmclient.go:124] Received error in vm.get.request: resourceID: /subscriptions/470bd2ee-0973-4f49-a719-dc721d0d6e4f/resourceGroups/bifana-core-rg/providers/Microsoft.Compute/virtualMachines/8afd69bb8976, error: Retriable: false, RetryAfter: 0s, HTTPStatusCode: 404, RawError: Retriable: false, RetryAfter: 0s, HTTPStatusCode: 404, RawError: {"error":{"code":"ResourceNotFound","message":"The Resource 'Microsoft.Compute/virtualMachines/8afd69bb8976' under resource group 'bifana-core-rg' was not found. For more details please go to https://aka.ms/ARMResourceNotFoundFix"}}
I0914 08:50:55.248946       1 azure_wrap.go:194] Virtual machine "8afd69bb8976" not found
W0914 08:50:55.248956       1 nodeserver.go:262] Failed to get zone from Azure cloud provider, nodeName: bifana-core-nomad-client-1-vm, error: instance not found
I0914 08:50:55.248961       1 nodeserver.go:281] got a matching size in getMaxDataDiskCount, VM Size: STANDARD_B2S, MaxDataDiskCount: 4
I0914 08:50:55.248967       1 utils.go:114] GRPC response: {"accessible_topology":{"segments":{"topology.disk.csi.azure.com/zone":""}},"max_volumes_per_node":4,"node_id":"bifana-core-nomad-client-1-vm"}

The Azure CSI plugin is getting a "resource not found" exception for the VM that it's trying to attach the volume to. I'm not much of an Azure expert, but my guess here is that the client node doesn't have the appropriate managed identity to be able to perform the query for the VM. The link in the logs (https://aka.ms/ARMResourceNotFoundFix) probably has more platform-specific clues for you.
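
As a quick sanity check from the node itself, you could compare the VM name Azure reports against what the plugin queries (a sketch; the IMDS endpoint is standard Azure, though the exact api-version here is an assumption):

# Ask the Azure Instance Metadata Service for this VM's compute name
curl -s -H Metadata:true "http://169.254.169.254/metadata/instance/compute/name?api-version=2019-06-01&format=text"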

But then we get:

I0914 08:59:10.871154       1 utils.go:108] GRPC call: /csi.v1.Node/NodeStageVolume
I0914 08:59:10.871166       1 utils.go:109] GRPC request: {"publish_context":{"LUN":"0"},"staging_target_path":"/csi/staging/teste1/rw-file-system-single-node-writer","volume_capability":{"AccessType":{"Mount":{"fs_type":"ext4","mount_flags":["ro"]}},"access_mode":{"mode":1}},"volume_id":"/subscriptions/470bd2ee-0973-4f49-a719-dc721d0d6e4f/resourceGroups/bifana-core-rg/providers/Microsoft.Compute/disks/bifana-core-teste1-disk"}
I0914 08:59:12.528770       1 azure_common_linux.go:175] azureDisk - found /dev/disk/azure/scsi1/lun0 by sdc under /dev/disk/azure/scsi1/
I0914 08:59:12.528790       1 nodeserver.go:121] NodeStageVolume: formatting /dev/disk/azure/scsi1/lun0 and mounting at /csi/staging/teste1/rw-file-system-single-node-writer with mount options([ro])
I0914 08:59:12.528804       1 mount_linux.go:405] Attempting to determine if disk "/dev/disk/azure/scsi1/lun0" is formatted using blkid with args: ([-p -s TYPE -s PTTYPE -o export /dev/disk/azure/scsi1/lun0])
I0914 08:59:12.619217       1 mount_linux.go:408] Output: "DEVNAME=/dev/disk/azure/scsi1/lun0\nPTTYPE=gpt\n", err: <nil>
I0914 08:59:12.619240       1 mount_linux.go:446] Disk /dev/disk/azure/scsi1/lun0 detected partition table type: gpt
W0914 08:59:12.619247       1 mount_linux.go:381] Configured to mount disk /dev/disk/azure/scsi1/lun0 as unknown data, probably partitions but current format is ext4, things might break
I0914 08:59:12.619252       1 mount_linux.go:394] Attempting to mount disk /dev/disk/azure/scsi1/lun0 in ext4 format at /csi/staging/teste1/rw-file-system-single-node-writer
I0914 08:59:12.619282       1 mount_linux.go:146] Mounting cmd (mount) with arguments (-t ext4 -o ro,defaults /dev/disk/azure/scsi1/lun0 /csi/staging/teste1/rw-file-system-single-node-writer)
E0914 08:59:12.630547       1 mount_linux.go:150] Mount failed: exit status 32
Mounting command: mount
Mounting arguments: -t ext4 -o ro,defaults /dev/disk/azure/scsi1/lun0 /csi/staging/teste1/rw-file-system-single-node-writer
Output: mount: /csi/staging/teste1/rw-file-system-single-node-writer: wrong fs type, bad option, bad superblock on /dev/sdc, missing codepage or helper program, or other error.

E0914 08:59:12.630575       1 utils.go:112] GRPC error: rpc error: code = Internal desc = could not format "/dev/disk/azure/scsi1/lun0"(lun: "0"), and mount it at "/csi/staging/teste1/rw-file-system-single-node-writer"

Combined with your findings above, I think you may be running into an issue with the Azure driver. It would probably be a good idea to see if you can open an issue with the azuredisk-csi-driver folks.

@josemaia
Contributor

josemaia commented Sep 14, 2020

8afd69bb8976 isn't the name of any VM we have, but it looks suspiciously like the end of a UUID (it doesn't match any of our Nomad clients, allocations, or servers, however).

edit: almost certainly a Docker container ID / container hostname
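
That would fit: a container's default hostname is the 12-character prefix of its container ID, which is exactly the shape of 8afd69bb8976. One way to confirm (the container ID is a placeholder):

docker inspect --format '{{ .Config.Hostname }}' <container_id>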

@carlosrbcunha
Author

I opened an issue in the azuredisk-csi-driver repository as you suggested.
For tracking:
kubernetes-sigs/azuredisk-csi-driver#539

@carlosrbcunha
Author

After getting some help from the azuredisk-csi-driver folks, we figured out what the problem was.
Having this mount_options block specified makes the driver fail while checking the format of the disk:

mount_options {
   fs_type = "ext4"
   mount_flags = ["ro"]
}

Removing this made it all work properly.
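
Concretely, we re-registered the volume with that block removed (a sketch, reusing the hypothetical volume.hcl from above; deregistering first may or may not be required):

nomad volume deregister teste1
nomad volume register volume.hcl    # same file, mount_options block removed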

Thanks for your help.

@tgross
Member

tgross commented Sep 16, 2020

Glad to hear you've worked it out!

@github-actions

github-actions bot commented Nov 2, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.
