Fix an error when creating a volume with topology feature gate #157

dkoshkin · 2018-12-21T23:19:20Z

Is this a bug fix or adding new feature?
Bug fix

What is this PR about? / Why do we need it?

In the waitForVolume case "creating" incorrectly returned an error. This would then return from cloud.CreateDisk, causing a possible race condition. Where the function driver.CreateVolume could create another disk with the same com.amazon.aws.csi.volume tag. Eventually erroring with ErrMultiDisks. This also exposed a bug where the AvailabilityZone and fsType were not properly set. Which only surfaced when setting --feature-gates=Topology=true.
Added volumeBindingMode: WaitForFirstConsumer in the StorageClass because since the flag --feature-gates=Topology=true is set and it is required for that feature.

What testing is done?
Manual testing and new unit tests

Error volume vol-0fff33f4c0834292a is still being created returned

$ kubectl describe pvc

Events:
  Type       Reason                 Age                From                                                                    Message
  ----       ------                 ----               ----                                                                    -------
  Warning    ProvisioningFailed     39s                persistentvolume-controller                                             storageclass.storage.k8s.io "slow" not found
  Warning    ProvisioningFailed     29s                ebs.csi.aws.com_csi-provisioner-0_87553ac8-0499-11e9-a00f-06f965927db1  failed to provision volume with StorageClass "slow": rpc error: code = Internal desc = Could not create volume "pvc-0ba471eb-0482-11e9-b7b3-22df1a95de60": failed to get an available volume in EC2: volume vol-0fff33f4c0834292a is still being created
  Normal     ExternalProvisioning   18s (x3 over 33s)  persistentvolume-controller                                             waiting for a volume to be created, either by external provisioner "ebs.csi.aws.com" or manually created by system administrator
  Normal     Provisioning           14s (x2 over 33s)  ebs.csi.aws.com_csi-provisioner-0_87553ac8-0499-11e9-a00f-06f965927db1  External provisioner is provisioning volume for claim "default/claim1"
  Normal     ProvisioningSucceeded  14s                ebs.csi.aws.com_csi-provisioner-0_87553ac8-0499-11e9-a00f-06f965927db1  Successfully provisioned volumepvc-0ba471eb-0482-11e9-b7b3-22df1a95de60
Mounted By:  app-5f7dccccc6-wzn9g

Then when the PV is created in Kubernetes, values field was empty for matchExpressions:

$ kubectl get pv -o yaml

    nodeAffinity:
      required:
        nodeSelectorTerms:
        - matchExpressions:
          - key: topology.ebs.csi.aws.com/zone
            operator: In
            values:
            - ""

And the pod would not properly run:

$ kubectl describe pods

Events:
  Type     Reason                  Age                 From                                                Message
  ----     ------                  ----                ----                                                -------
  Warning  FailedScheduling        54s (x2 over 54s)   default-scheduler                                   pod has unbound immediate PersistentVolumeClaims
  Normal   Scheduled               30s                 default-scheduler                                   Successfully assigned default/app-5f7dccccc6-t9z8s to kube-node-1-kubelet.devkubernetes01.mesos
  Normal   SuccessfulAttachVolume  26s                 attachdetach-controller                             AttachVolume.Attach succeeded for volume "pvc-0ba471eb-0482-11e9-b7b3-22df1a95de60"
  Warning  FailedMount             20s (x25 over 22s)  kubelet, kube-node-1-kubelet.devkubernetes01.mesos  MountVolume.NodeAffinity check failed for volume "pvc-0ba471eb-0482-11e9-b7b3-22df1a95de60" : No matching NodeSelectorTerms

This only really surfaced itself when setting --feature-gates=Topology=true since the pod needs to have the proper nodeAffinity set in PV spec.

k8s-ci-robot · 2018-12-21T23:19:33Z

Hi @dkoshkin. Thanks for your PR.

I'm waiting for a kubernetes-sigs or kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

coveralls · 2018-12-21T23:25:05Z

Pull Request Test Coverage Report for Build 313

10 of 13 (76.92%) changed or added relevant lines in 3 files are covered.
No unchanged relevant lines lost coverage.
Overall coverage increased (+0.2%) to 50.525%

Changes Missing Coverage	Covered Lines	Changed/Added Lines	%
pkg/cloud/fakes.go	0	3	0.0%

Totals
Change from base Build 312:	0.2%
Covered Lines:	577
Relevant Lines:	1142

💛 - Coveralls

bertinatto · 2018-12-22T14:05:09Z

/ok-to-test

leakingtapan

In the waitForVolume case "creating" incorrectly returned an error. This would then return from cloud.CreateDisk, causing a possible race condition.

Good catch. We actually found the same bug when integrating integ test with Prow. And this bug is fixed at #156

Also are you testing on v1.13 cluster? Im wondering why we didn't catch this issue earlier.

pkg/driver/controller_test.go

dkoshkin · 2018-12-28T16:25:34Z

Also are you testing on v1.13 cluster? Im wondering why we didn't catch this issue earlier.

Yes testing with Kubernetes v1.13, and I'm guessing this wasn't caught before because volumeBindingMode: WaitForFirstConsumer was not in the storageClass and maybe wasn't tested extensively.
Also the first few E2E tests didn't use WaitForFirstConsumer so they weren't failing but will add one soon.

bertinatto

Good catch!

bertinatto · 2019-01-02T15:54:19Z

pkg/driver/controller.go

 	// volume exists already
 	if disk != nil {
+		disk.FsType = fsType


I'm wondering if it's necessary for us to return the FS type as an attribute. The reason is that we don't do any validation with it, we simple send it back to the CO.

For example, if the client sends in an invalid FS type, we return success. Later on it'll fail when staging, but I'm wondering if we should return an error right here. Another option would be to ignore it (other drivers seems to do that).

@leakingtapan, do you know if it's strictly necessary that we return this info back to the CO?

We need to pass fsType back since it is needed by NodeStageVolume later here.

Another option would be to ignore it (other drivers seems to do that).

In other driver, what's the behavior once a invalid fsType is ignored? Does it fall back to default option?

We need to pass fsType back since it is needed by NodeStageVolume later here.

From what I understand from CreateVolume specs, that wouldn't be required.

I took a quick look in GCP and it seems like they don't handle fsType as well. It'd be nice to test this.

It'd be nice to have this tested, but we can probably do that in another PR.

From what I understand from CreateVolume specs, that wouldn't be required.

It's not required in CreateVolume but then when the volume is formatted in NodeStageVolume the fsType is missing and it would just use the default right?

@bertinatto it is not required to be CSI compliant, however, it is required to be EBS in-tree complicant since fsType is supported there. See here. And this explains why it is not handled by GCE PD CSI driver since their in-tree doesn't have this. See here

@dkoshkin Yep. When fsType is missing the default one will be used in node stage phase here

So from a EBS point of view, we would still need fsType passed into NodeStageVolume from CreateVolume

leakingtapan · 2019-01-03T00:06:11Z

The change /lgtm

@dkoshkin lets take a look at @bertinatto 's comment before merge.

leakingtapan · 2019-01-03T06:36:08Z

/lgtm

leakingtapan · 2019-01-07T23:05:05Z

/retest

leakingtapan · 2019-01-07T23:36:32Z

@dkoshkin you might need to rebase to pick up the fixes for integ test

In the waitForVolume case "creating" incorrectly returned an error This exposed a bug where the AvailabilityZone and fsType were not properly set Which only surfaced when setting --feature-gates=Topology=true

leakingtapan · 2019-01-08T00:19:00Z

/lgtm
/approve

k8s-ci-robot · 2019-01-08T00:19:06Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dkoshkin, leakingtapan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [leakingtapan]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

…owner carry: add BZ component

k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Dec 21, 2018

k8s-ci-robot requested review from d-nishi and leakingtapan December 21, 2018 23:19

k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Dec 21, 2018

k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Dec 21, 2018

dkoshkin force-pushed the topology branch 2 times, most recently from 952e1f8 to bf0af10 Compare December 22, 2018 01:09

k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Dec 22, 2018

dkoshkin force-pushed the topology branch from bf0af10 to 46d48d7 Compare December 22, 2018 16:11

leakingtapan reviewed Dec 26, 2018

View reviewed changes

pkg/driver/controller_test.go Outdated Show resolved Hide resolved

dkoshkin force-pushed the topology branch from 46d48d7 to 0832484 Compare December 28, 2018 16:15

bertinatto reviewed Jan 2, 2019

View reviewed changes

leakingtapan self-assigned this Jan 3, 2019

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 3, 2019

Fix an error when creating a volume with topology feature gate

6bfd354

In the waitForVolume case "creating" incorrectly returned an error This exposed a bug where the AvailabilityZone and fsType were not properly set Which only surfaced when setting --feature-gates=Topology=true

dkoshkin force-pushed the topology branch from 0832484 to 6bfd354 Compare January 7, 2019 23:39

k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 7, 2019

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 8, 2019

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 8, 2019

k8s-ci-robot merged commit ff1fe8e into kubernetes-sigs:master Jan 8, 2019

dkoshkin deleted the topology branch January 10, 2019 19:38

huffmanca pushed a commit to huffmanca/aws-ebs-csi-driver that referenced this pull request Jul 23, 2020

Merge pull request kubernetes-sigs#157 from gnufied/add-bz-component-…

19b50d6

…owner carry: add BZ component

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix an error when creating a volume with topology feature gate #157

Fix an error when creating a volume with topology feature gate #157

dkoshkin commented Dec 21, 2018

k8s-ci-robot commented Dec 21, 2018

coveralls commented Dec 21, 2018 •

edited

Loading

bertinatto commented Dec 22, 2018

leakingtapan left a comment

dkoshkin commented Dec 28, 2018

bertinatto left a comment

bertinatto Jan 2, 2019

leakingtapan Jan 3, 2019

bertinatto Jan 3, 2019

dkoshkin Jan 3, 2019 •

edited

Loading

leakingtapan Jan 3, 2019

leakingtapan commented Jan 3, 2019

leakingtapan commented Jan 3, 2019

leakingtapan commented Jan 7, 2019

leakingtapan commented Jan 7, 2019

leakingtapan commented Jan 8, 2019

k8s-ci-robot commented Jan 8, 2019

Fix an error when creating a volume with topology feature gate #157

Fix an error when creating a volume with topology feature gate #157

Conversation

dkoshkin commented Dec 21, 2018

k8s-ci-robot commented Dec 21, 2018

coveralls commented Dec 21, 2018 • edited Loading

Pull Request Test Coverage Report for Build 313

💛 - Coveralls

bertinatto commented Dec 22, 2018

leakingtapan left a comment

Choose a reason for hiding this comment

dkoshkin commented Dec 28, 2018

bertinatto left a comment

Choose a reason for hiding this comment

bertinatto Jan 2, 2019

Choose a reason for hiding this comment

leakingtapan Jan 3, 2019

Choose a reason for hiding this comment

bertinatto Jan 3, 2019

Choose a reason for hiding this comment

dkoshkin Jan 3, 2019 • edited Loading

Choose a reason for hiding this comment

leakingtapan Jan 3, 2019

Choose a reason for hiding this comment

leakingtapan commented Jan 3, 2019

leakingtapan commented Jan 3, 2019

leakingtapan commented Jan 7, 2019

leakingtapan commented Jan 7, 2019

leakingtapan commented Jan 8, 2019

k8s-ci-robot commented Jan 8, 2019

coveralls commented Dec 21, 2018 •

edited

Loading

dkoshkin Jan 3, 2019 •

edited

Loading