Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[V2] fix: fix issue where file is read before written in e2e tests 🐞 #1326

Merged

Conversation

jmclong
Copy link

@jmclong jmclong commented May 10, 2022

What type of PR is this?

/kind bug
/kind flake

What this PR does / why we need it:
This change fixes a test issue in the several e2e tests, where we are looking for a file in pods to contain a certain string. The framework.LookForStringInPodExec method fails the test automatically if the command executed has a non-zero exit code. This can happen if the pod has not completed writing to the file.

There are some additional methods like framework.RunHostCmd that we might consider using, but these assume the pod is running Linux. Because of this, I decided to roll our own method.

Requirements:

Special notes for your reviewer:
I changed name from Exec => PollForStringInPodsExec to make it clearer that the command passed in will be executed multiple times.

There is some duplicated code in test/resources/deployment.go and test/resources/statefulset.go, but I think it is better to do this in a separate PR if we want to refactor that.

Fixes e2e test failure symptoms similar to this:

{ Failure /home/prow/go/src/sigs.k8s.io/azuredisk-csi-driver/test/e2e/dynamic_provisioning_test.go:436
Unexpected error:
    <exec.CodeExitError>: {
        Err: <*errors.errorString | 0xc0000a3740>{
            s: "error running /usr/local/bin/kubectl --server=https://kubetest-wuw9nwbh.westus2.cloudapp.azure.com --kubeconfig=/root/tmp1662410290/kubeconfig/kubeconfig.westus2.json --namespace=azuredisk-4663 exec azuredisk-volume-tester-p797m-6495d86c45-74zxb --namespace=azuredisk-4663 -- cmd /c type C:\\mnt\\test-1\\data.txt:\nCommand stdout:\n\nstderr:\nThe system cannot find the file specified.\r\ncommand terminated with exit code 1\n\nerror:\nexit status 1",
        },
        Code: 1,
    }
    error running /usr/local/bin/kubectl --server=https://kubetest-wuw9nwbh.westus2.cloudapp.azure.com --kubeconfig=/root/tmp1662410290/kubeconfig/kubeconfig.westus2.json --namespace=azuredisk-4663 exec azuredisk-volume-tester-p797m-6495d86c45-74zxb --namespace=azuredisk-4663 -- cmd /c type C:\mnt\test-1\data.txt:
    Command stdout:
    
    stderr:
    The system cannot find the file specified.
    command terminated with exit code 1
    
    error:
    exit status 1
occurred
/home/prow/go/src/sigs.k8s.io/azuredisk-csi-driver/vendor/k8s.io/kubernetes/test/e2e/framework/util.go:603}

Release note:

none

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/bug Categorizes issue or PR as related to a bug. kind/flake Categorizes issue or PR as related to a flaky test. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 10, 2022
@k8s-ci-robot
Copy link
Contributor

Hi @jmclong. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label May 10, 2022
@k8s-ci-robot k8s-ci-robot requested review from andyzhangx and edreed May 10, 2022 21:13
@jmclong jmclong changed the title fix: fix issue where file is read before written in e2e tests 🐞 [V2] fix: fix issue where file is read before written in e2e tests 🐞 May 10, 2022
@jmclong jmclong marked this pull request as ready for review May 10, 2022 21:45
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 10, 2022
@k8s-ci-robot k8s-ci-robot requested a review from ZeroMagic May 10, 2022 21:45
@edreed
Copy link
Collaborator

edreed commented May 10, 2022

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 10, 2022
@jmclong
Copy link
Author

jmclong commented May 11, 2022

First test failure was in
Dynamic Provisioning [single-az] should create a deployment object, write and read to it, delete the pod and write and read to it again [kubernetes.io/azure-disk] [disk.csi.azure.com] [Windows] [csi-azuredisk-scheduler-extender].

This is a test impacted by my change, but the issue appears unrelated. The first pod is deleted, but the second pod is never able to attach or mount the volume because it thinks it is still in use. If you look when the test is trying to clean up the PV on test teardown, it polls for the full 10 minutes. Ginkgo does not seem to report errors that occur during defers after the initial error.

May 11 00:07:29.291: INFO: At 0001-01-01 00:00:00 +0000 UTC - event for azuredisk-volume-tester-9dqfp-6666c46988-n6dpl: { } Scheduled: Successfully assigned azuredisk-1498/azuredisk-volume-tester-9dqfp-6666c46988-n6dpl to k8s-agentpool1-37254516-0
May 11 00:07:29.291: INFO: At 0001-01-01 00:00:00 +0000 UTC - event for azuredisk-volume-tester-9dqfp-6666c46988-sk4q9: { } Scheduled: Successfully assigned azuredisk-1498/azuredisk-volume-tester-9dqfp-6666c46988-sk4q9 to k8s-agentpool1-37254516-1
May 11 00:07:29.291: INFO: At 2022-05-10 23:51:34 +0000 UTC - event for azuredisk-volume-tester-9dqfp: {deployment-controller } ScalingReplicaSet: Scaled up replica set azuredisk-volume-tester-9dqfp-6666c46988 to 1
May 11 00:07:29.291: INFO: At 2022-05-10 23:51:34 +0000 UTC - event for azuredisk-volume-tester-9dqfp-6666c46988: {replicaset-controller } SuccessfulCreate: Created pod: azuredisk-volume-tester-9dqfp-6666c46988-n6dpl
May 11 00:07:29.291: INFO: At 2022-05-10 23:51:34 +0000 UTC - event for pvc-5x68d: {persistentvolume-controller } ExternalProvisioning: waiting for a volume to be created, either by external provisioner "disk.csi.azure.com" or manually created by system administrator
May 11 00:07:29.291: INFO: At 2022-05-10 23:51:34 +0000 UTC - event for pvc-5x68d: {disk.csi.azure.com_k8s-agentpool1-37254516-0_ae2584ae-f27f-4714-b649-ce129d0e36cc } Provisioning: External provisioner is provisioning volume for claim "azuredisk-1498/pvc-5x68d"
May 11 00:07:29.291: INFO: At 2022-05-10 23:51:34 +0000 UTC - event for pvc-5x68d: {persistentvolume-controller } WaitForFirstConsumer: waiting for first consumer to be created before binding
May 11 00:07:29.291: INFO: At 2022-05-10 23:51:36 +0000 UTC - event for pvc-5x68d: {disk.csi.azure.com_k8s-agentpool1-37254516-0_ae2584ae-f27f-4714-b649-ce129d0e36cc } ProvisioningSucceeded: Successfully provisioned volume pvc-0ba2a2e8-b74b-4f48-bbcb-24418260bf11
May 11 00:07:29.291: INFO: At 2022-05-10 23:51:37 +0000 UTC - event for azuredisk-volume-tester-9dqfp-6666c46988-n6dpl: {attachdetach-controller } SuccessfulAttachVolume: AttachVolume.Attach succeeded for volume "pvc-0ba2a2e8-b74b-4f48-bbcb-24418260bf11" 
May 11 00:07:29.291: INFO: At 2022-05-10 23:51:51 +0000 UTC - event for azuredisk-volume-tester-9dqfp-6666c46988-n6dpl: {kubelet k8s-agentpool1-37254516-0} Pulling: Pulling image "k8s.gcr.io/e2e-test-images/busybox:1.29-2"
May 11 00:07:29.291: INFO: At 2022-05-10 23:51:52 +0000 UTC - event for azuredisk-volume-tester-9dqfp-6666c46988-n6dpl: {kubelet k8s-agentpool1-37254516-0} Pulled: Successfully pulled image "k8s.gcr.io/e2e-test-images/busybox:1.29-2" in 428.757518ms
May 11 00:07:29.291: INFO: At 2022-05-10 23:51:52 +0000 UTC - event for azuredisk-volume-tester-9dqfp-6666c46988-n6dpl: {kubelet k8s-agentpool1-37254516-0} Started: Started container volume-tester
May 11 00:07:29.291: INFO: At 2022-05-10 23:51:52 +0000 UTC - event for azuredisk-volume-tester-9dqfp-6666c46988-n6dpl: {kubelet k8s-agentpool1-37254516-0} Created: Created container volume-tester
May 11 00:07:29.291: INFO: At 2022-05-10 23:51:54 +0000 UTC - event for azuredisk-volume-tester-9dqfp-6666c46988: {replicaset-controller } SuccessfulCreate: Created pod: azuredisk-volume-tester-9dqfp-6666c46988-sk4q9
May 11 00:07:29.291: INFO: At 2022-05-10 23:51:54 +0000 UTC - event for azuredisk-volume-tester-9dqfp-6666c46988-n6dpl: {kubelet k8s-agentpool1-37254516-0} Killing: Stopping container volume-tester
May 11 00:07:29.291: INFO: At 2022-05-10 23:51:54 +0000 UTC - event for azuredisk-volume-tester-9dqfp-6666c46988-sk4q9: {attachdetach-controller } FailedAttachVolume: Multi-Attach error for volume "pvc-0ba2a2e8-b74b-4f48-bbcb-24418260bf11" Volume is already used by pod(s) azuredisk-volume-tester-9dqfp-6666c46988-n6dpl
May 11 00:07:29.291: INFO: At 2022-05-10 23:53:57 +0000 UTC - event for azuredisk-volume-tester-9dqfp-6666c46988-sk4q9: {kubelet k8s-agentpool1-37254516-1} FailedMount: Unable to attach or mount volumes: unmounted volumes=[test-volume-1], unattached volumes=[kube-api-access-w24f5 test-volume-1]: timed out waiting for the condition
May 11 00:07:29.291: INFO: At 2022-05-10 23:56:16 +0000 UTC - event for azuredisk-volume-tester-9dqfp-6666c46988-sk4q9: {kubelet k8s-agentpool1-37254516-1} FailedMount: Unable to attach or mount volumes: unmounted volumes=[test-volume-1], unattached volumes=[test-volume-1 kube-api-access-w24f5]: timed out waiting for the condition

The second failure is in Dynamic Provisioning [multi-az] should receive FailedMount event with invalid mount options [kubernetes.io/azure-disk] [disk.csi.azure.com] [csi-azuredisk-scheduler-extender] .

This test finds the 'FailedMount' event, but the PV never gets deleted as it polls for the full 10 minutes.

@jmclong jmclong force-pushed the fix-sporadic-looking-for-file branch from 7a67d6c to a410368 Compare May 11, 2022 14:25
@jmclong jmclong requested a review from landreasyan May 11, 2022 14:27
@jmclong
Copy link
Author

jmclong commented May 11, 2022

Failures might be due (or related) to #1300? The PVs get released but never deleted.

@jmclong jmclong force-pushed the fix-sporadic-looking-for-file branch from a410368 to b0e613e Compare May 11, 2022 17:19
@edreed
Copy link
Collaborator

edreed commented May 11, 2022

/approve
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 11, 2022
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: edreed, jmclong, landreasyan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 11, 2022
@k8s-ci-robot k8s-ci-robot merged commit 708c938 into kubernetes-sigs:main_v2 May 11, 2022
@jmclong jmclong deleted the fix-sporadic-looking-for-file branch May 12, 2022 12:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. kind/flake Categorizes issue or PR as related to a flaky test. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants