Volume Creation failed with - CnsFault error: CNS: Failed to initialize FcdService :Operation timed out #135

Closed
doctori opened this issue Feb 14, 2020 · 14 comments

doctori commented Feb 14, 2020

/kind bug

What happened:
In a topology-aware environment the CNS creation request doesn't respect the storage policy and sends the full list of datastores available on all the clusters of the vCenter:

 
 failed to create cns volume. createSpec: "(*types.CnsVolumeCreateSpec)(0xc00076a210)({
 DynamicData: (types.DynamicData) {
 },
 Name: (string) (len=40) "pvc-307433b6-7f89-4381-8598-45a685216b68",
 VolumeType: (string) (len=5) "BLOCK",
 Datastores: ([]types.ManagedObjectReference) (len=66 cap=128) {
 (types.ManagedObjectReference) Datastore:datastore-11809,
 (types.ManagedObjectReference) Datastore:datastore-21706,
 (types.ManagedObjectReference) Datastore:datastore-21707,
 (types.ManagedObjectReference) Datastore:datastore-21814,
 (types.ManagedObjectReference) Datastore:datastore-21815,
 (types.ManagedObjectReference) Datastore:datastore-21872,
 [... many more datastores ...]
 (types.ManagedObjectReference) Datastore:datastore-65240,
 (types.ManagedObjectReference) Datastore:datastore-65241,
 (types.ManagedObjectReference) Datastore:datastore-7012,
 (types.ManagedObjectReference) Datastore:datastore-7355
 },
 Metadata: (types.CnsVolumeMetadata) {
 DynamicData: (types.DynamicData) {
 },
 ContainerCluster: (types.CnsContainerCluster) {
 DynamicData: (types.DynamicData) {
 },
 ClusterType: (string) (len=10) "KUBERNETES",
 ClusterId: (string) (len=14) "my-cluster-id",
 VSphereUser: (string) (len=14) "my-account"
 },
 EntityMetadata: ([]types.BaseCnsEntityMetadata) <nil>
 },
 BackingObjectDetails: (*types.CnsBlockBackingDetails)(0xc000e34480)({
 CnsBackingObjectDetails: (types.CnsBackingObjectDetails) {
 DynamicData: (types.DynamicData) {
 },
 CapacityInMb: (int64) 15360
 },
 BackingDiskId: (string) ""
 }),
 Profile: ([]types.BaseVirtualMachineProfileSpec) (len=1 cap=1) {
 (*types.VirtualMachineDefinedProfileSpec)(0xc000a893c0)({
 VirtualMachineProfileSpec: (types.VirtualMachineProfileSpec) {
 DynamicData: (types.DynamicData) {
 }
 },
 ProfileId: (string) (len=36) "611e2fe8-1aca-4e91-8bfe-9f5da399e98e",
 ReplicationSpec: (*types.ReplicationSpec)(<nil>),
 ProfileData: (*types.VirtualMachineProfileRawData)(<nil>),
 ProfileParams: ([]types.KeyValue) <nil>
 })
 }
})
", fault: "(*types.CnsFault)(0xc000ad87c0)({
 Fault: (*types.BaseMethodFault)(0xc00086a7d0)(<nil>),
 LocalizedMessage: (string) (len=73) "CnsFault error: CNS: Failed to initialize FcdService :Operation timed out"
})
", opId: "b72d5c4c"
E0214 16:15:03.728337 1 vsphereutil.go:120] Failed to create disk pvc-307433b6-7f89-4381-8598-45a685216b68 with error CnsFault error: CNS: Failed to initialize FcdService :Operation timed out
E0214 16:15:03.728360 1 controller.go:200] Failed to create volume. Error: CnsFault error: CNS: Failed to initialize FcdService :Operation timed out

The FCD service times out because some datastores are not reachable; some clusters are not on 6.7u3.

What you expected to happen:
The CNS creation request should only hold datastores compatible with the storage policy.

How to reproduce it (as minimally and precisely as possible):

Create a storage policy covering only 2 datastores attached to two clusters with ESXi nodes on 6.7u3, within an environment that has multiple other clusters on ESXi 6.7u2 (or even lower).
Then create a k8s cluster with a StorageClass like this:

allowedTopologies:
- matchLabelExpressions:
  - key: failure-domain.beta.kubernetes.io/region
    values:
    - tokyo
  - key: failure-domain.beta.kubernetes.io/zone
    values:
    - tokyo-A
    - tokyo-B
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
  name: vmware-tier1
parameters:
  fstype: ext4
  storagepolicyname: k8s-tokyo-tier1
provisioner: csi.vsphere.vmware.com
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer

Anything else we need to know?:
This situation seems to have appeared after a few weeks of runtime (restarting the whole stack didn't fix the behaviour).

Environment:

  • csi-vsphere version: 1.0.2

  • vsphere-cloud-controller-manager version: 1.1.0

  • Kubernetes version: 1.16.4

  • vSphere version: 6.7.0 build 14368073

  • OS (e.g. from /etc/os-release): Ubuntu 18.04.2 LTS

  • Kernel (e.g. uname -a): Linux rebrand0026.hosting.cegedim.cloud 4.15.0-50-generic #54-Ubuntu SMP Mon May 6 18:46:08 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

  • Install tools: Rancher 2.3.4

  • Others:

k8s-ci-robot added the kind/bug label Feb 14, 2020
divyenpatel (Member) commented Feb 14, 2020

@doctori CNS does respect the storage policy specified in the request.

The CSI driver sends the list of shared datastores accessible in the requested zone/region, and the CNS backend in vCenter then chooses a compliant datastore from that list.

You can verify this in the vCenter CNS UI.

The issue I see is:

LocalizedMessage: (string) (len=73) "CnsFault error: CNS: Failed to initialize FcdService :Operation timed out"

Can you post the VC-side logs here?

To check the logs on the vCenter Server side, search for log entries with the cns prefix in the following locations:

  • To check the logs related to the provisioning workflow, use the /var/log/vmware/vsan-health/vsanvcmgmtd.log file on vCenter Server.

  • To check the logs related to authentication and permission issues, use the /var/log/vmware/vsan-health/vmware-vsan-health-service.log file on vCenter Server.

divyenpatel changed the title from "CSI controller doesn't respect storage policy" to "Volume Creation failed with - CnsFault error: CNS: Failed to initialize FcdService :Operation timed out" Feb 14, 2020
doctori (Author) commented Feb 14, 2020

This is from vsanvcmgmtd.log:

2020-02-14T16:45:03.735+01:00 verbose vsanvcmgmtd[12641] [vSAN@6876 sub=ServiceManager opId=b72d5c4c] CNS: initializing FcdService
2020-02-14T16:45:03.735+01:00 verbose vsanvcmgmtd[12641] [vSAN@6876 sub=HttpConnectionPool-000000 opId=b72d5c4c] [IncConnectionCount] Cannot increment number of connections to <cs p:00007f1e740071a0, TCP:localhost:443> as max connections (20) already in use
2020-02-14T16:45:09.758+01:00 verbose vsanvcmgmtd[29771] [vSAN@6876 sub=VsanSoapSvc] Unrecognized version URI "urn:vsan/"; using default handler for "urn:vsan/6.9.0"
2020-02-14T16:45:09.760+01:00 verbose vsanvcmgmtd[29771] [vSAN@6876 sub=CnsAccessChecker] CNS AuthZ: Check if the child entity is a datacenter: datacenter-2.
2020-02-14T16:45:09.760+01:00 verbose vsanvcmgmtd[29771] [vSAN@6876 sub=CnsAccessChecker] CNS AuthZ: Look up verify if datastore is present in the datacenter: datacenter-2.
2020-02-14T16:45:09.763+01:00 verbose vsanvcmgmtd[29771] [vSAN@6876 sub=CnsAccessChecker] CNS AuthZ: Getting privileges checklist for create.
2020-02-14T16:45:09.763+01:00 verbose vsanvcmgmtd[29771] [vSAN@6876 sub=CnsAccessChecker] CNS AuthZ: Checking System.Read privilege on entities.
2020-02-14T16:45:09.792+01:00 verbose vsanvcmgmtd[29771] [vSAN@6876 sub=CnsAccessChecker] CNS AuthZ: Checking Datastore.FileManagement privilege on entities.

So clearly there is something wrong on the vCenter side.
I thought the CSI controller was supposed to do the filtering.
A support request has been opened with VMware support.
Thank you for your help.

divyenpatel self-assigned this Feb 28, 2020
divyenpatel (Member) commented

@doctori Did you get help from VMware support regarding this?

Let us know if you need any help.

fejta-bot commented

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

k8s-ci-robot added the lifecycle/stale label May 28, 2020
doctori (Author) commented May 30, 2020

This was an issue on the vCenter side.
/close

k8s-ci-robot (Contributor) commented

@doctori: Closing this issue.

In response to this:

This was an issue on the vCenter side.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

dmc5179 commented Jun 11, 2020

@doctori What was the issue with vCenter? I get a slightly different error, but it might be related:

failed to provision volume with StorageClass "vmware-csi-sc": rpc error: code = Internal desc = Failed to create volume. Error: CnsFault error: CNS: Failed to initialize FcdService :Connection refused: The remote service is not running, OR is overloaded, OR a firewall is rejecting connections

doctori (Author) commented Jun 12, 2020

@dmc5179: you might want to check the connectivity between your CSI controller pod and vCenter; it looks like the controller can't talk to vCenter. Or you might want to raise the round-trip count to give the client a chance to get a response after a while (on slow vCenters this helps); see the sketch below.

dmc5179 commented Jun 12, 2020

@doctori I suspect the issue is that we have vCenter on a port other than the default 443. I had to update the OpenShift cluster config when deploying to ensure OpenShift understands how to talk to vCenter on the other port. When I deployed the CSI driver I made sure the configuration included the updated port. It's possible that somewhere the code doesn't use the config and is locked to 443, or that there is another place where I need to configure the CSI driver to use an alternate port for vCenter.

I'm going to set up a port-forwarding rule on one of our hosts (port 443 routed to vCenter) and see if the CSI driver is happier. If it is, then I might open an issue here about supporting alternate ports.

RaunakShah (Contributor) commented

@dmc5179 have you tried changing the port field in your csi-vsphere.conf file? That defaults to 443.

dmc5179 commented Jun 16, 2020

@RaunakShah @doctori We are adding the custom port to the main cloud-provider config map, which works: the default storage class comes online when we install the cluster. We're also adding the port to csi-vsphere.conf, but that doesn't appear to resolve the issue.

I've even gone so far as to create another host as a proxy, forwarding 443 on the proxy to 8443 on vCenter, and installed the cluster against the proxy. When I do that, and leave the port out of all the configs, the default storage class again works through the proxy (443 -> vCenter 8443), but the CSI driver fails with the exact same error message.

Perhaps there is a service that I need to enable somewhere in vCenter to support the CSI driver?

divyenpatel (Member) commented

We are adding the custom port to the main cloud-provider config map, which works: the default storage class comes online when we install the cluster. We're also adding the port to csi-vsphere.conf, but that doesn't appear to resolve the issue.

@dmc5179
After changing the configuration in the secret, did you restart the pod? We added the capability to watch for config changes in version 2.0.0 of the driver. If you are using v1.0.2 or an older driver, restarting the pod should pick up the port configuration from the secret.

dmc5179 commented Jun 16, 2020

@divyenpatel We've reinstalled the cluster several times, including with the port set in the config maps and secrets from the start. Interestingly enough, I get a different error if I set the storage policy name:

parameters:
  storagePolicyName: 'vSAN Default Storage Policy'

The PVC then reports:

csi.vsphere.vmware.com_vsphere-csi-controller-0_fff13437-b01f-11ea-bf4b-0a580a82001a  failed to provision volume with StorageClass "vmware-csi-sc": rpc error: code = Internal desc = Failed to create volume. Error: CnsFault error: CNS: Failed to initialize SpbmService :Connection refused: The remote service is not running, OR is overloaded, OR a firewall is rejecting connections.

I'm not sure yet what service I need to turn on in vCenter, or how, to enable the Storage Policy Based Management (SPBM) service.

dmc5179 commented Jun 17, 2020

So I figured out that I cannot use the storage policy in the storage class because I don't have a vSAN; we only have one ESXi host with vCenter. So I'm using the datastore in the storage class instead, which results in the FCD error:

CNS: Failed to initialize FcdService

How do I turn on the FCD service?

gnufied pushed a commit to gnufied/vsphere-csi-driver that referenced this issue Dec 11, 2024
…ency-openshift-4.19-ose-vmware-vsphere-csi-driver

OCPBUGS-45402: Updating ose-vmware-vsphere-csi-driver-container image to be consistent with ART for 4.19