
Azure CSI volume does not format and mount new volume #8874

Closed
carlosrbcunha opened this issue Sep 11, 2020 · 9 comments

@carlosrbcunha

Nomad version

Nomad v0.12.4 (8efaee4)

Operating system and Environment details

Ubuntu 18.04.5 LTS

Issue

Nomad does not format and mount a new CSI volume

Reproduction steps

  • A new Azure disk is created via Terraform
  • The volume is registered in Nomad
  • A job is submitted that uses the previously registered volume

Volume definition

id = "teste1"
name = "teste1"
type = "csi"
external_id = "/subscriptions/11111111-2222-3333-4444-555555555555/resourceGroups/bifana-core-rg/providers/Microsoft.Compute/disks/bifana-core-teste1-disk"
plugin_id = "az-disk0"
access_mode = "single-node-writer"
attachment_mode = "file-system"
mount_options {
   fs_type = "ext4"untitled:Untitled-5
   mount_flags = ["ro"]
}
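
For reference, a minimal sketch of the registration step, assuming the definition above is saved as volume.hcl (the file name is illustrative):

# Register the volume with Nomad
nomad volume register volume.hcl

# Confirm the volume is visible and which plugin backs it
nomad volume status teste1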

Previously opened issues, also regarding CSI on Azure:

#7812

Plugin-azure-disk-node logs

nodeserver.go:237] Using default NodeGetCapabilities
utils.go:114] GRPC response: {"capabilities":[{"Type":{"Rpc":{"type":1}}},{"Type":{"Rpc":{"type":3}}},{"Type":{"Rpc":{"type":2}}}]}
utils.go:108] GRPC call: /csi.v1.Identity/Probe
utils.go:109] GRPC request: {}
utils.go:114] GRPC response: {"ready":{"value":true}}
utils.go:108] GRPC call: /csi.v1.Identity/Probe
utils.go:109] GRPC request: {}
utils.go:114] GRPC response: {"ready":{"value":true}}
utils.go:108] GRPC call: /csi.v1.Node/NodeGetCapabilities
utils.go:109] GRPC request: {}
nodeserver.go:237] Using default NodeGetCapabilities
utils.go:114] GRPC response: {"capabilities":[{"Type":{"Rpc":{"type":1}}},{"Type":{"Rpc":{"type":3}}},{"Type":{"Rpc":{"type":2}}}]}
utils.go:108] GRPC call: /csi.v1.Node/NodeStageVolume
utils.go:109] GRPC request: {"publish_context":{"LUN":"0"},"staging_target_path":"/csi/staging/teste1/rw-file-system-single-node-writer","volume_capability":{"AccessType":{"Mount":{"fs_type":"ext4","mount_flags":["ro"]}},"access_mode":{"mode":1}},"volume_id":"/subscriptions/11111111-2222-3333-4444-555555555555/resourceGroups/bifana-core-rg/providers/Microsoft.Compute/disks/bifana-core-teste1-disk"}
azure_common_linux.go:175] azureDisk - found /dev/disk/azure/scsi1/lun0 by sdc under /dev/disk/azure/scsi1/
nodeserver.go:121] NodeStageVolume: formatting /dev/disk/azure/scsi1/lun0 and mounting at /csi/staging/teste1/rw-file-system-single-node-writer with mount options([ro])
mount_linux.go:405] Attempting to determine if disk "/dev/disk/azure/scsi1/lun0" is formatted using blkid with args: ([-p -s TYPE -s PTTYPE -o export /dev/disk/azure/scsi1/lun0])
mount_linux.go:408] Output: "", err: exit status 2
utils.go:112] GRPC error: rpc error: code = Internal desc = could not format "/dev/disk/azure/scsi1/lun0"(lun: "0"), and mount it at "/csi/staging/teste1/rw-file-system-single-node-writer"
utils.go:108] GRPC call: /csi.v1.Node/NodeUnpublishVolume
utils.go:109] GRPC request: {"target_path":"/csi/per-alloc/ec0b8113-d042-5587-b592-4e85330a6a0a/teste1/rw-file-system-single-node-writer","volume_id":"/subscriptions/11111111-2222-3333-4444-555555555555/resourceGroups/bifana-core-rg/providers/Microsoft.Compute/disks/bifana-core-teste1-disk"}
nodeserver.go:225] NodeUnpublishVolume: unmounting volume /subscriptions/11111111-2222-3333-4444-555555555555/resourceGroups/bifana-core-rg/providers/Microsoft.Compute/disks/bifana-core-teste1-disk on /csi/per-alloc/ec0b8113-d042-5587-b592-4e85330a6a0a/teste1/rw-file-system-single-node-writer
mount_helper_common.go:33] Warning: Unmount skipped because path does not exist: /csi/per-alloc/ec0b8113-d042-5587-b592-4e85330a6a0a/teste1/rw-file-system-single-node-writer
nodeserver.go:230] NodeUnpublishVolume: unmount volume /subscriptions/11111111-2222-3333-4444-555555555555/resourceGroups/bifana-core-rg/providers/Microsoft.Compute/disks/bifana-core-teste1-disk on /csi/per-alloc/ec0b8113-d042-5587-b592-4e85330a6a0a/teste1/rw-file-system-single-node-writer successfully
utils.go:114] GRPC response: {}
utils.go:108] GRPC call: /csi.v1.Node/NodeUnstageVolume
utils.go:109] GRPC request: {"staging_target_path":"/csi/staging/teste1/rw-file-system-single-node-writer","volume_id":"/subscriptions/11111111-2222-3333-4444-555555555555/resourceGroups/bifana-core-rg/providers/Microsoft.Compute/disks/bifana-core-teste1-disk"}
nodeserver.go:144] NodeUnstageVolume: unmounting /csi/staging/teste1/rw-file-system-single-node-writer
mount_helper_common.go:65] Warning: "/csi/staging/teste1/rw-file-system-single-node-writer" is not a mountpoint, deleting
nodeserver.go:149] NodeUnstageVolume: unmount /csi/staging/teste1/rw-file-system-single-node-writer successfully

We can see the disk being attached to the VM and Nomad trying to format and mount it on the server, but something goes wrong when identifying whether the disk is formatted or not.

GRPC error: rpc error: code = Internal desc = could not format "/dev/disk/azure/scsi1/lun0"(lun: "0"), and mount it at "/csi/staging/teste1/rw-file-system-single-node-writer"

When the allocation fails, the job is retried on all the clients of the cluster (with the same errors), and the job ultimately fails completely.

@tgross
Member

tgross commented Sep 11, 2020

Hi @carlosrbcunha! Sorry to hear about that. Can you provide the logs from the node plugin? Usually something like nomad alloc logs -stderr :alloc_id will get these. The jobspec for the plugins might help as well. Also, can you verify for me that the CSI plugin is running with the Docker driver's privileged=true configuration?
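
A typical sequence for that (a sketch; the allocation ID is a placeholder):

# Find the node plugin's allocations
nomad plugin status az-disk0
# Pull the stderr stream from the node plugin's allocation
nomad alloc logs -stderr <alloc_id>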

@josemaia
Contributor

With regard to plugins, our client.hcl files contain the following:

	plugin "docker" {
		config {
			allow_privileged = true
		}
	}

as well as in the configuration of the node plugin:

      config {
        image = "mcr.microsoft.com/k8s/csi/azuredisk-csi"

        volumes = [
          "local/azure.json:/etc/kubernetes/azure.json"
        ]

        args = [
          "--nodeid=${attr.unique.hostname}-vm",
          "--endpoint=unix://csi/csi.sock",
          "--logtostderr",
          "--v=5",
        ]

        # node plugins must run as privileged jobs because they
        # mount disks to the host
        privileged = true
      }

Looking into getting some logs right now.

@carlosrbcunha
Author

Hi Tim,

I already sent the logs via email, but I am also attaching them here for you.
node_logs_csi.log

Regarding this error: when we browse /dev/disk/azure/scsi1 on the server, it contains only a link called lun0 that points to the disk device.
When running blkid with the args -p -s TYPE -s PTTYPE -o export /dev/disk/azure/scsi1/lun0, it does not return a TYPE at all (because the disk has no such attribute). We tested with a disk that was already partitioned and formatted and the result was the same (the node plugin behaved the same way), but in the /dev/disk/azure/scsi1 folder there was another link, lun0-part1, which returned TYPE=ext4 as it is supposed to.
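
To illustrate (a sketch; device paths and outputs are from our environment):

# Blank disk: blkid finds no filesystem signature, prints nothing, and exits 2
blkid -p -s TYPE -s PTTYPE -o export /dev/disk/azure/scsi1/lun0
echo $?    # 2

# Already-partitioned and formatted disk: the partition link reports a TYPE
blkid -p -s TYPE -s PTTYPE -o export /dev/disk/azure/scsi1/lun0-part1
# DEVNAME=/dev/disk/azure/scsi1/lun0-part1
# TYPE=ext4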

What is the expected behaviour here?
Something is wrong in this logic.

If you need more logs or tests, please ask.

Best Regards,
Carlos Cunha

@tgross
Member

tgross commented Sep 14, 2020

I see two sets of errors in the node client logs. First:

I0914 08:50:55.248915       1 azure_vmclient.go:124] Received error in vm.get.request: resourceID: /subscriptions/470bd2ee-0973-4f49-a719-dc721d0d6e4f/resourceGroups/bifana-core-rg/providers/Microsoft.Compute/virtualMachines/8afd69bb8976, error: Retriable: false, RetryAfter: 0s, HTTPStatusCode: 404, RawError: Retriable: false, RetryAfter: 0s, HTTPStatusCode: 404, RawError: {"error":{"code":"ResourceNotFound","message":"The Resource 'Microsoft.Compute/virtualMachines/8afd69bb8976' under resource group 'bifana-core-rg' was not found. For more details please go to https://aka.ms/ARMResourceNotFoundFix"}}
I0914 08:50:55.248946       1 azure_wrap.go:194] Virtual machine "8afd69bb8976" not found
W0914 08:50:55.248956       1 nodeserver.go:262] Failed to get zone from Azure cloud provider, nodeName: bifana-core-nomad-client-1-vm, error: instance not found
I0914 08:50:55.248961       1 nodeserver.go:281] got a matching size in getMaxDataDiskCount, VM Size: STANDARD_B2S, MaxDataDiskCount: 4
I0914 08:50:55.248967       1 utils.go:114] GRPC response: {"accessible_topology":{"segments":{"topology.disk.csi.azure.com/zone":""}},"max_volumes_per_node":4,"node_id":"bifana-core-nomad-client-1-vm"}

The Azure CSI plugin is getting a "resource not found" exception for the VM that it's trying to attach the volume to. I'm not much of an Azure expert, but my guess here is that the client node doesn't have the appropriate managed identity to be able to perform the query for the VM. The link in the logs (https://aka.ms/ARMResourceNotFoundFix) probably has more platform-specific clues for you.
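
As a quick sanity check from the node itself, you could compare the VM name Azure reports against what the plugin queries (a sketch; the IMDS endpoint is standard Azure, though the exact api-version here is an assumption):

# Ask the Azure Instance Metadata Service for this VM's compute name
curl -s -H Metadata:true "http://169.254.169.254/metadata/instance/compute/name?api-version=2019-06-01&format=text"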

But then we get:

I0914 08:59:10.871154       1 utils.go:108] GRPC call: /csi.v1.Node/NodeStageVolume
I0914 08:59:10.871166       1 utils.go:109] GRPC request: {"publish_context":{"LUN":"0"},"staging_target_path":"/csi/staging/teste1/rw-file-system-single-node-writer","volume_capability":{"AccessType":{"Mount":{"fs_type":"ext4","mount_flags":["ro"]}},"access_mode":{"mode":1}},"volume_id":"/subscriptions/470bd2ee-0973-4f49-a719-dc721d0d6e4f/resourceGroups/bifana-core-rg/providers/Microsoft.Compute/disks/bifana-core-teste1-disk"}
I0914 08:59:12.528770       1 azure_common_linux.go:175] azureDisk - found /dev/disk/azure/scsi1/lun0 by sdc under /dev/disk/azure/scsi1/
I0914 08:59:12.528790       1 nodeserver.go:121] NodeStageVolume: formatting /dev/disk/azure/scsi1/lun0 and mounting at /csi/staging/teste1/rw-file-system-single-node-writer with mount options([ro])
I0914 08:59:12.528804       1 mount_linux.go:405] Attempting to determine if disk "/dev/disk/azure/scsi1/lun0" is formatted using blkid with args: ([-p -s TYPE -s PTTYPE -o export /dev/disk/azure/scsi1/lun0])
I0914 08:59:12.619217       1 mount_linux.go:408] Output: "DEVNAME=/dev/disk/azure/scsi1/lun0\nPTTYPE=gpt\n", err: <nil>
I0914 08:59:12.619240       1 mount_linux.go:446] Disk /dev/disk/azure/scsi1/lun0 detected partition table type: gpt
W0914 08:59:12.619247       1 mount_linux.go:381] Configured to mount disk /dev/disk/azure/scsi1/lun0 as unknown data, probably partitions but current format is ext4, things might break
I0914 08:59:12.619252       1 mount_linux.go:394] Attempting to mount disk /dev/disk/azure/scsi1/lun0 in ext4 format at /csi/staging/teste1/rw-file-system-single-node-writer
I0914 08:59:12.619282       1 mount_linux.go:146] Mounting cmd (mount) with arguments (-t ext4 -o ro,defaults /dev/disk/azure/scsi1/lun0 /csi/staging/teste1/rw-file-system-single-node-writer)
E0914 08:59:12.630547       1 mount_linux.go:150] Mount failed: exit status 32
Mounting command: mount
Mounting arguments: -t ext4 -o ro,defaults /dev/disk/azure/scsi1/lun0 /csi/staging/teste1/rw-file-system-single-node-writer
Output: mount: /csi/staging/teste1/rw-file-system-single-node-writer: wrong fs type, bad option, bad superblock on /dev/sdc, missing codepage or helper program, or other error.

E0914 08:59:12.630575       1 utils.go:112] GRPC error: rpc error: code = Internal desc = could not format "/dev/disk/azure/scsi1/lun0"(lun: "0"), and mount it at "/csi/staging/teste1/rw-file-system-single-node-writer"

Combined with your findings above, I think you may be running into an issue with the Azure driver. It would probably be a good idea to see if you can open an issue with the azuredisk-csi-driver folks.

@josemaia
Contributor

josemaia commented Sep 14, 2020

8afd69bb8976 isn't the name of any VM we have, but it looks suspiciously like the end of a UUID (it doesn't match any of our Nomad clients, allocations, or servers, however).

edit: almost certainly a Docker container ID / container hostname
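
That would fit: a container's default hostname is the 12-character prefix of its container ID, which is exactly the shape of 8afd69bb8976. One way to confirm (the container ID is a placeholder):

docker inspect --format '{{ .Config.Hostname }}' <container_id>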

@carlosrbcunha
Author

I opened an issue in the azuredisk-csi-driver repository as you suggested.
For tracking:
kubernetes-sigs/azuredisk-csi-driver#539

@carlosrbcunha
Author

After getting some help from the azuredisk-csi-driver folks, we figured out what the problem was.
Having this mount_options block specified makes the driver fail while checking the format of the disk:

mount_options {
   fs_type = "ext4"
   mount_flags = ["ro"]
}

Removing this made it all work properly.
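
Concretely, we re-registered the volume with that block removed (a sketch, reusing the hypothetical volume.hcl from above; deregistering first may or may not be required):

nomad volume deregister teste1
nomad volume register volume.hcl    # same file, mount_options block removed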

Thanks for your help.

@tgross
Member

tgross commented Sep 16, 2020

Glad to hear you've worked it out!

@github-actions

github-actions bot commented Nov 2, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.
