Volume Creation failed with - CnsFault error: CNS: Failed to initialize FcdService :Operation timed out #135

Closed
doctori opened this issue Feb 14, 2020 · 14 comments

doctori commented Feb 14, 2020

/kind bug

What happened:
In a topology-aware environment the CNS creation request doesn't respect the storage policy and sends the full list of datastores available on all the clusters of the vCenter:

 
 failed to create cns volume. createSpec: "(*types.CnsVolumeCreateSpec)(0xc00076a210)({
 DynamicData: (types.DynamicData) {
 },
 Name: (string) (len=40) "pvc-307433b6-7f89-4381-8598-45a685216b68",
 VolumeType: (string) (len=5) "BLOCK",
 Datastores: ([]types.ManagedObjectReference) (len=66 cap=128) {
 (types.ManagedObjectReference) Datastore:datastore-11809,
 (types.ManagedObjectReference) Datastore:datastore-21706,
 (types.ManagedObjectReference) Datastore:datastore-21707,
 (types.ManagedObjectReference) Datastore:datastore-21814,
 (types.ManagedObjectReference) Datastore:datastore-21815,
 (types.ManagedObjectReference) Datastore:datastore-21872,
 [... many more datastores ...]
 (types.ManagedObjectReference) Datastore:datastore-65240,
 (types.ManagedObjectReference) Datastore:datastore-65241,
 (types.ManagedObjectReference) Datastore:datastore-7012,
 (types.ManagedObjectReference) Datastore:datastore-7355
 },
 Metadata: (types.CnsVolumeMetadata) {
 DynamicData: (types.DynamicData) {
 },
 ContainerCluster: (types.CnsContainerCluster) {
 DynamicData: (types.DynamicData) {
 },
 ClusterType: (string) (len=10) "KUBERNETES",
 ClusterId: (string) (len=14) "my-cluster-id",
 VSphereUser: (string) (len=14) "my-account"
 },
 EntityMetadata: ([]types.BaseCnsEntityMetadata) <nil>
 },
 BackingObjectDetails: (*types.CnsBlockBackingDetails)(0xc000e34480)({
 CnsBackingObjectDetails: (types.CnsBackingObjectDetails) {
 DynamicData: (types.DynamicData) {
 },
 CapacityInMb: (int64) 15360
 },
 BackingDiskId: (string) ""
 }),
 Profile: ([]types.BaseVirtualMachineProfileSpec) (len=1 cap=1) {
 (*types.VirtualMachineDefinedProfileSpec)(0xc000a893c0)({
 VirtualMachineProfileSpec: (types.VirtualMachineProfileSpec) {
 DynamicData: (types.DynamicData) {
 }
 },
 ProfileId: (string) (len=36) "611e2fe8-1aca-4e91-8bfe-9f5da399e98e",
 ReplicationSpec: (*types.ReplicationSpec)(<nil>),
 ProfileData: (*types.VirtualMachineProfileRawData)(<nil>),
 ProfileParams: ([]types.KeyValue) <nil>
 })
 }
})
", fault: "(*types.CnsFault)(0xc000ad87c0)({
 Fault: (*types.BaseMethodFault)(0xc00086a7d0)(<nil>),
 LocalizedMessage: (string) (len=73) "CnsFault error: CNS: Failed to initialize FcdService :Operation timed out"
})
", opId: "b72d5c4c"
E0214 16:15:03.728337 1 vsphereutil.go:120] Failed to create disk pvc-307433b6-7f89-4381-8598-45a685216b68 with error CnsFault error: CNS: Failed to initialize FcdService :Operation timed out
E0214 16:15:03.728360 1 controller.go:200] Failed to create volume. Error: CnsFault error: CNS: Failed to initialize FcdService :Operation timed out

The FCD service times out because some datastores are not reachable; some clusters are not on 6.7u3.

What you expected to happen:
The CNS creation request should only hold datastores compatible with the storage policy.

How to reproduce it (as minimally and precisely as possible):

Create a storage policy covering only 2 datastores attached to two clusters with ESXi nodes on 6.7u3, within an environment that has multiple other clusters on ESXi 6.7u2 (or even lower).
Then create a k8s cluster with a StorageClass like this:

allowedTopologies:
- matchLabelExpressions:
  - key: failure-domain.beta.kubernetes.io/region
    values:
    - tokyo
  - key: failure-domain.beta.kubernetes.io/zone
    values:
    - tokyo-A
    - tokyo-B
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
  name: vmware-tier1
parameters:
  fstype: ext4
  storagepolicyname: k8s-tokyo-tier1
provisioner: csi.vsphere.vmware.com
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer

Anything else we need to know?:
This situation seems to have appeared after a few weeks of runtime (restarting the whole stack didn't fix the behaviour).

Environment:

  • csi-vsphere version: 1.0.2

  • vsphere-cloud-controller-manager version: 1.1.0

  • Kubernetes version: 1.16.4

  • vSphere version: 6.7.0 build 14368073

  • OS (e.g. from /etc/os-release): Ubuntu 18.04.2 LTS

  • Kernel (e.g. uname -a): Linux rebrand0026.hosting.cegedim.cloud 4.15.0-50-generic #54-Ubuntu SMP Mon May 6 18:46:08 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

  • Install tools: Rancher 2.3.4

  • Others:

k8s-ci-robot added the kind/bug label Feb 14, 2020
divyenpatel (Member) commented Feb 14, 2020

@doctori CNS does respect the storage policy specified in the request.

The CSI driver sends the list of shared datastores accessible in the requested zone/region, and the CNS backend in vCenter then chooses a compliant datastore from that list.

You can verify this in the vCenter CNS UI.

The issue I see is:

LocalizedMessage: (string) (len=73) "CnsFault error: CNS: Failed to initialize FcdService :Operation timed out"

Can you post the VC-side logs here?

To check the logs on the vCenter Server side, search for log entries with the cns prefix in the following locations:

  • To check the logs related to the provisioning workflow, use the /var/log/vmware/vsan-health/vsanvcmgmtd.log file on vCenter Server.

  • To check the logs related to authentication and permission issues, use the /var/log/vmware/vsan-health/vmware-vsan-health-service.log file on vCenter Server.

divyenpatel changed the title from "CSI controller doesn't respect storage policy" to "Volume Creation failed with - CnsFault error: CNS: Failed to initialize FcdService :Operation timed out" Feb 14, 2020
doctori (Author) commented Feb 14, 2020

This is from vsanvcmgmtd.log:

2020-02-14T16:45:03.735+01:00 verbose vsanvcmgmtd[12641] [vSAN@6876 sub=ServiceManager opId=b72d5c4c] CNS: initializing FcdService
2020-02-14T16:45:03.735+01:00 verbose vsanvcmgmtd[12641] [vSAN@6876 sub=HttpConnectionPool-000000 opId=b72d5c4c] [IncConnectionCount] Cannot increment number of connections to <cs p:00007f1e740071a0, TCP:localhost:443> as max connections (20) already in use
2020-02-14T16:45:09.758+01:00 verbose vsanvcmgmtd[29771] [vSAN@6876 sub=VsanSoapSvc] Unrecognized version URI "urn:vsan/"; using default handler for "urn:vsan/6.9.0"
2020-02-14T16:45:09.760+01:00 verbose vsanvcmgmtd[29771] [vSAN@6876 sub=CnsAccessChecker] CNS AuthZ: Check if the child entity is a datacenter: datacenter-2.
2020-02-14T16:45:09.760+01:00 verbose vsanvcmgmtd[29771] [vSAN@6876 sub=CnsAccessChecker] CNS AuthZ: Look up verify if datastore is present in the datacenter: datacenter-2.
2020-02-14T16:45:09.763+01:00 verbose vsanvcmgmtd[29771] [vSAN@6876 sub=CnsAccessChecker] CNS AuthZ: Getting privileges checklist for create.
2020-02-14T16:45:09.763+01:00 verbose vsanvcmgmtd[29771] [vSAN@6876 sub=CnsAccessChecker] CNS AuthZ: Checking System.Read privilege on entities.
2020-02-14T16:45:09.792+01:00 verbose vsanvcmgmtd[29771] [vSAN@6876 sub=CnsAccessChecker] CNS AuthZ: Checking Datastore.FileManagement privilege on entities.

So clearly there is something wrong on the vCenter side.
I thought the CSI controller was supposed to do the filtering.
A support request has been opened with VMware support.
Thank you for your help.

divyenpatel self-assigned this Feb 28, 2020
divyenpatel (Member) commented

@doctori Did you get help from VMware support regarding this?

Let us know if you need any help.

fejta-bot commented

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

k8s-ci-robot added the lifecycle/stale label May 28, 2020
doctori (Author) commented May 30, 2020

This was an issue on the vCenter side.
/close

k8s-ci-robot (Contributor) commented

@doctori: Closing this issue.

In response to this:

This was an issue on the vCenter side.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

dmc5179 commented Jun 11, 2020

@doctori What was the issue with vCenter? I get a slightly different error, but it might be related:

failed to provision volume with StorageClass "vmware-csi-sc": rpc error: code = Internal desc = Failed to create volume. Error: CnsFault error: CNS: Failed to initialize FcdService :Connection refused: The remote service is not running, OR is overloaded, OR a firewall is rejecting connections

doctori (Author) commented Jun 12, 2020

@dmc5179: you might want to check the connectivity between your CSI controller pod and vCenter; it looks like the controller can't talk to vCenter. Or you might want to raise the round-trip count to give the client a chance to get a response after a while (on slow vCenters this helps); see the sketch below.

dmc5179 commented Jun 12, 2020

@doctori I suspect the issue is that we have vCenter on a port other than the default 443. I had to update the OpenShift cluster config when deploying to ensure OpenShift understands how to talk to vCenter on the other port. When I deployed the CSI driver I made sure the configuration included the updated port. It's possible that somewhere the code doesn't use the config and is locked to 443, or that there is another place where I need to configure the CSI driver to use an alternate port for vCenter.

I'm going to set up a port-forwarding rule on one of our hosts (port 443 routed to vCenter) and see if the CSI driver is happier. If it is, then I might open an issue here about supporting alternate ports.

RaunakShah (Contributor) commented

@dmc5179 have you tried changing the port field in your csi-vsphere.conf file? That defaults to 443.

dmc5179 commented Jun 16, 2020

@RaunakShah @doctori We are adding the custom port to the main cloud-provider config map, which works: the default storage class comes online when we install the cluster. We're also adding the port to csi-vsphere.conf, but that doesn't appear to resolve the issue.

I've even gone so far as to create another host as a proxy, forwarding 443 on the proxy to 8443 on vCenter, and installed the cluster against the proxy. When I do that, and leave the port out of all the configs, the default storage class again works through the proxy (443 -> vCenter 8443), but the CSI driver fails with the exact same error message.

Perhaps there is a service that I need to enable somewhere in vCenter to support the CSI driver?

divyenpatel (Member) commented

We are adding the custom port to the main cloud-provider config map, which works: the default storage class comes online when we install the cluster. We're also adding the port to csi-vsphere.conf, but that doesn't appear to resolve the issue.

@dmc5179
After changing the configuration in the secret, did you restart the pod? We added the capability to watch for config changes in version 2.0.0 of the driver. If you are using v1.0.2 or an older driver, restarting the pod should pick up the port configuration from the secret.

dmc5179 commented Jun 16, 2020

@divyenpatel We've reinstalled the cluster several times, including with the port set in the config maps and secrets from the start. Interestingly enough, I get a different error if I set the storage policy name:

parameters:
  storagePolicyName: 'vSAN Default Storage Policy'

The PVC then reports:

csi.vsphere.vmware.com_vsphere-csi-controller-0_fff13437-b01f-11ea-bf4b-0a580a82001a  failed to provision volume with StorageClass "vmware-csi-sc": rpc error: code = Internal desc = Failed to create volume. Error: CnsFault error: CNS: Failed to initialize SpbmService :Connection refused: The remote service is not running, OR is overloaded, OR a firewall is rejecting connections.

I'm not sure yet what service I need to turn on in vCenter, or how, to enable the Storage Policy Based Management (SPBM) service.

dmc5179 commented Jun 17, 2020

So I figured out that I cannot use the storage policy in the storage class because I don't have a vSAN; we only have one ESXi host with vCenter. So I'm using the datastore in the storage class instead, which results in the FCD error:

CNS: Failed to initialize FcdService

How do I turn on the FCD service?

gnufied pushed a commit to gnufied/vsphere-csi-driver that referenced this issue Dec 11, 2024
…ency-openshift-4.19-ose-vmware-vsphere-csi-driver

OCPBUGS-45402: Updating ose-vmware-vsphere-csi-driver-container image to be consistent with ART for 4.19