
CI job pmem-csi-periodic success rate very low after Sep-8: connection refused #740

Closed
okartau opened this issue Sep 19, 2020 · 7 comments
Assignees: pohly
Labels: 0.8 needs to be fixed in 0.8.x

Comments

okartau (Contributor) commented Sep 19, 2020

After Sep 8 we started to see many failures: only 5 of 22 jobs have succeeded since then.
Do we already know the cause(s), or can we track down which changes may have contributed?

pohly changed the title from "CI job pmem-csi-periodic success rate very low after Sep-8" to "CI job pmem-csi-periodic success rate very low after Sep-8: connection refused" on Sep 21, 2020
pohly (Contributor) commented Sep 21, 2020

After running tests in a loop, I just caught one instance of slow (but not fatally slow) test startup with four minutes of "connection refused" errors:

  • "lvm-testing sanity PMEM-CSI can mount again after reboot" killed all pods because it had to reboot a node, which increments the restart counter at 12:11:05
  • "lvm-testing sanity PMEM-CSI can publish volume after a node driver restart" starts and begins waiting the driver at 14:11:15
  • it logs four "connection refused" errors, each one minute apart, before finally the missing socat pods get created at 12:16:40

So what happens here is that the daemonset controller takes a long time to create the pods.

Looking into the kube-controller-manager log, one can see:

E0921 12:11:07.532563       1 daemon_controller.go:292] default/pmem-csi-node-testing failed with : Internal error occurred: failed calling webhook "pod-hook.pmem-csi.intel.com": Post https://pmem-csi-scheduler.default.svc:443/pod/mutate?timeout=30s: dial tcp 10.110.59.69:443: connect: connection refused

Not surprising: we just killed the pod which provides the webhook... It looks like the controller then goes into exponential backoff, and depending on the exact timing, that can mean minutes pass before it tries again.
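
For illustration, here is a minimal sketch of how such an exponential per-item rate limiter behaves, using the client-go workqueue helper that Kubernetes controllers build their retry queues on. The base delay and cap are made-up values, not necessarily what the daemonset controller uses:

```go
// A minimal, illustrative sketch (not the actual kube-controller-manager
// code): an exponential per-item failure rate limiter from client-go,
// the mechanism controller workqueues use when retrying failed syncs.
// The base delay and cap below are assumptions.
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// Hypothetical values: 1s base delay, capped at 5 minutes.
	limiter := workqueue.NewItemExponentialFailureRateLimiter(1*time.Second, 5*time.Minute)

	item := "default/pmem-csi-node-testing" // the daemonset that failed to sync
	for i := 0; i < 10; i++ {
		// Each failed sync (e.g. another webhook "connection refused")
		// asks the limiter when to retry; the delay doubles per failure,
		// so a handful of failures already means multi-minute gaps.
		fmt.Printf("failure %2d -> retry after %v\n", i+1, limiter.When(item))
	}
	// Only a successful sync resets the backoff.
	limiter.Forget(item)
}
```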

The reason we see this more often now is that 35c1182#diff-84cb29332213948b2ef38a9d0c899903 introduced the container restart test and, with it, the code which kills all pods.

The solution could be two-fold:

  • make the webhook optional, i.e. allow pod scheduling while it is down
  • exclude the socat pods from the webhook

The second point is obviously correct and should solve the issue. Whether we need the first point is up for debate...
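
To make both options concrete, here is a hedged sketch of the corresponding mutating webhook configuration using the admissionregistration/v1 Go types: failurePolicy Ignore covers the first point (pod scheduling continues while the webhook is down), and an objectSelector covers the second (only pods carrying an opt-in label are mutated, so the socat pods never hit the webhook). The configuration name and the label are hypothetical; the webhook name, service, and path are taken from the error above, and the real PMEM-CSI deployment files may differ:

```go
// A hedged sketch of both options, expressed with the admissionregistration/v1
// Go types. The configuration name and the opt-in label are hypothetical;
// the webhook name, service, and path come from the error message above.
package deploy

import (
	admissionv1 "k8s.io/api/admissionregistration/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func strPtr(s string) *string { return &s }

func pmemWebhookConfig(caBundle []byte) *admissionv1.MutatingWebhookConfiguration {
	ignore := admissionv1.Ignore // option 1: pods still get scheduled while the webhook is down
	sideEffects := admissionv1.SideEffectClassNone
	return &admissionv1.MutatingWebhookConfiguration{
		ObjectMeta: metav1.ObjectMeta{Name: "pmem-csi-hook"},
		Webhooks: []admissionv1.MutatingWebhook{{
			Name:          "pod-hook.pmem-csi.intel.com",
			FailurePolicy: &ignore,
			// Option 2: only pods that opt in via a label are sent to the
			// webhook, so test helper pods like socat are excluded.
			// The label key/value are assumptions.
			ObjectSelector: &metav1.LabelSelector{
				MatchLabels: map[string]string{"pmem-csi.intel.com/webhook": "mutate"},
			},
			ClientConfig: admissionv1.WebhookClientConfig{
				Service: &admissionv1.ServiceReference{
					Namespace: "default",
					Name:      "pmem-csi-scheduler",
					Path:      strPtr("/pod/mutate"),
				},
				CABundle: caBundle,
			},
			Rules: []admissionv1.RuleWithOperations{{
				Operations: []admissionv1.OperationType{admissionv1.Create},
				Rule: admissionv1.Rule{
					APIGroups:   []string{""},
					APIVersions: []string{"v1"},
					Resources:   []string{"pods"},
				},
			}},
			SideEffects:             &sideEffects,
			AdmissionReviewVersions: []string{"v1"},
		}},
	}
}
```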

pohly (Contributor) commented Sep 22, 2020

Might have been fixed by PR #739

pohly added the "0.8 needs to be fixed in 0.8.x" label on Sep 22, 2020
pohly self-assigned this on Sep 22, 2020
okartau (Contributor, Author) commented Sep 22, 2020

The most recent job, 920, which I believe had the latest changes, timed out (4h run time) on all versions. This is a different failure profile from the previous runs.

pohly (Contributor) commented Sep 22, 2020

Testing on 1.17 and 1.18 had no failures and was making progress right until the jobs were killed, i.e. testing just ran too slowly.

On 1.19 (https://cloudnative-k8sci.southcentralus.cloudapp.azure.com/blue/rest/organizations/jenkins/pipelines/pmem-csi/branches/devel/runs/242/nodes/93/steps/110/log/?start=0), there was:

[2020-09-22T13:58:27.965Z] Sep 22 13:58:27.783: FAIL: validate driver after update and restart
[2020-09-22T13:58:27.965Z] Unexpected error:
[2020-09-22T13:58:27.965Z]     <*errors.errorString | 0xc003a7be80>: {
[2020-09-22T13:58:27.965Z]         s: "timed out waiting for deployment, last error: deployed driver different from expected deployment:\nlabel foo missing for \"pmem-csi-with-defaults-external-provisioner-cfg\" of type \"rbac.authorization.k8s.io/v1, Kind=Role\" in namespace \"default\"\nlabel foo missing for \"pmem-csi-with-defaults-external-provisioner-runner\" of type \"rbac.authorization.k8s.io/v1, Kind=ClusterRole\" in namespace \"\"\nlabel foo missing for \"pmem-csi-with-defaults-csi-provisioner-role-cfg\" of type \"rbac.authorization.k8s.io/v1, Kind=RoleBinding\" in namespace \"default\"\nlabel foo missing for \"pmem-csi-with-defaults-csi-provisioner-role\" of type \"rbac.authorization.k8s.io/v1, Kind=ClusterRoleBinding\" in namespace \"\"\nlabel foo missing for \"pmem-csi-with-defaults-controller\" of type \"/v1, Kind=ServiceAccount\" in namespace \"default\"\nlabel foo missing for \"pmem-csi-with-defaults-controller\" of type \"/v1, Kind=Service\" in namespace \"default\"\nlabel foo missing for \"pmem-csi-with-defaults-metrics\" of type \"/v1, Kind=Service\" in namespace \"default\"\nlabel foo missing for \"pmem-csi-with-defaults\" of type \"storage.k8s.io/v1beta1, Kind=CSIDriver\" in namespace \"\"",
[2020-09-22T13:58:27.965Z]     }
[2020-09-22T13:58:27.965Z]     timed out waiting for deployment, last error: deployed driver different from expected deployment:
[2020-09-22T13:58:27.965Z]     label foo missing for "pmem-csi-with-defaults-external-provisioner-cfg" of type "rbac.authorization.k8s.io/v1, Kind=Role" in namespace "default"
[2020-09-22T13:58:27.965Z]     label foo missing for "pmem-csi-with-defaults-external-provisioner-runner" of type "rbac.authorization.k8s.io/v1, Kind=ClusterRole" in namespace ""
[2020-09-22T13:58:27.965Z]     label foo missing for "pmem-csi-with-defaults-csi-provisioner-role-cfg" of type "rbac.authorization.k8s.io/v1, Kind=RoleBinding" in namespace "default"
[2020-09-22T13:58:27.965Z]     label foo missing for "pmem-csi-with-defaults-csi-provisioner-role" of type "rbac.authorization.k8s.io/v1, Kind=ClusterRoleBinding" in namespace ""
[2020-09-22T13:58:27.965Z]     label foo missing for "pmem-csi-with-defaults-controller" of type "/v1, Kind=ServiceAccount" in namespace "default"
[2020-09-22T13:58:27.965Z]     label foo missing for "pmem-csi-with-defaults-controller" of type "/v1, Kind=Service" in namespace "default"
[2020-09-22T13:58:27.965Z]     label foo missing for "pmem-csi-with-defaults-metrics" of type "/v1, Kind=Service" in namespace "default"
[2020-09-22T13:58:27.965Z]     label foo missing for "pmem-csi-with-defaults" of type "storage.k8s.io/v1beta1, Kind=CSIDriver" in namespace ""
[2020-09-22T13:58:27.965Z] occurred

@avalluri: this looks like #742
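
For context on the shape of this check, here is purely illustrative Go pseudocode: the test polls the deployed objects, reports every object that is missing the expected label, and gives up with "timed out waiting for deployment" once the timeout is reached. The object name, kind, and the label key "foo" mirror the log above; the helper names are made up and the real e2e test code in pmem-csi differs.

```go
// Illustrative only: the shape of a "wait until all deployed objects carry
// the expected label" check. Not the actual pmem-csi e2e code.
package main

import (
	"fmt"
	"time"
)

// object stands in for any deployed API object the test inspects.
type object struct {
	name, kind, namespace string
	labels                map[string]string
}

// missingLabel reports which objects lack the expected label key.
func missingLabel(objs []object, key string) []string {
	var missing []string
	for _, o := range objs {
		if _, ok := o.labels[key]; !ok {
			missing = append(missing, fmt.Sprintf(
				"label %s missing for %q of type %q in namespace %q",
				key, o.name, o.kind, o.namespace))
		}
	}
	return missing
}

// waitForLabels polls until all objects carry the label or the timeout hits.
func waitForLabels(get func() []object, key string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		missing := missingLabel(get(), key)
		if len(missing) == 0 {
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("timed out waiting for deployment, last error: deployed driver different from expected deployment:\n%v", missing)
		}
		time.Sleep(time.Second)
	}
}

func main() {
	objs := []object{{
		name:      "pmem-csi-with-defaults-controller",
		kind:      "/v1, Kind=Service",
		namespace: "default",
		labels:    map[string]string{}, // "foo" never gets (re)applied
	}}
	fmt.Println(waitForLabels(func() []object { return objs }, "foo", 3*time.Second))
}
```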

pohly (Contributor) commented Sep 22, 2020

Testing on 1.17 and 1.18 had no failures and was making progress right until the jobs were killed, i.e. testing just ran too slowly.

In these two jobs, there's just a single "Still waiting" message, i.e. driver startup typically happened in less than a minute. Looks like the commits from PR #739 helped.

okartau (Contributor, Author) commented Sep 25, 2020

One of the recent merges (#743, #745) had a good impact, most likely the one increasing the timeout. The most recent job, 925, succeeded, with run times between 4 and 5 hours.

okartau (Contributor, Author) commented Sep 28, 2020

The 5 most recent runs were all good; closing this issue.

okartau closed this as completed on Sep 28, 2020