
CI job pmem-csi-periodic success rate very low after Sep-8: connection refused #740

Closed
okartau opened this issue Sep 19, 2020 · 7 comments
Assignees: pohly
Labels: 0.8 needs to be fixed in 0.8.x

Comments

okartau (Contributor) commented Sep 19, 2020

After Sep 8 we started to see many failures: only 5 of 22 jobs have succeeded since then.
Do we already know the cause(s), or can we track down which changes may have contributed?

pohly changed the title from "CI job pmem-csi-periodic success rate very low after Sep-8" to "CI job pmem-csi-periodic success rate very low after Sep-8: connection refused" on Sep 21, 2020
pohly (Contributor) commented Sep 21, 2020

After running tests in a loop, I just caught one instance of slow (but not fatally slow) test startup with four minutes of "connection refused" errors:

  • "lvm-testing sanity PMEM-CSI can mount again after reboot" killed all pods because it had to reboot a node, which increments the restart counter at 12:11:05
  • "lvm-testing sanity PMEM-CSI can publish volume after a node driver restart" starts and begins waiting the driver at 14:11:15
  • it logs four "connection refused" errors, each one minute apart, before finally the missing socat pods get created at 12:16:40

So what happens here is that the daemonset controller takes a long time to create the pods.

Looking into the kube-controller-manager log, one can see:

E0921 12:11:07.532563       1 daemon_controller.go:292] default/pmem-csi-node-testing failed with : Internal error occurred: failed calling webhook "pod-hook.pmem-csi.intel.com": Post https://pmem-csi-scheduler.default.svc:443/pod/mutate?timeout=30s: dial tcp 10.110.59.69:443: connect: connection refused

Not surprising: we just killed the pod which provides the webhook... It looks like the controller then goes into exponential backoff, and depending on the exact timing, that can mean minutes pass before it tries again.
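
For illustration, here is a minimal sketch of how such an exponential per-item rate limiter behaves, using the client-go workqueue helper that Kubernetes controllers build their retry queues on. The base delay and cap are made-up values, not necessarily what the daemonset controller uses:

```go
// A minimal, illustrative sketch (not the actual kube-controller-manager
// code): an exponential per-item failure rate limiter from client-go,
// the mechanism controller workqueues use when retrying failed syncs.
// The base delay and cap below are assumptions.
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// Hypothetical values: 1s base delay, capped at 5 minutes.
	limiter := workqueue.NewItemExponentialFailureRateLimiter(1*time.Second, 5*time.Minute)

	item := "default/pmem-csi-node-testing" // the daemonset that failed to sync
	for i := 0; i < 10; i++ {
		// Each failed sync (e.g. another webhook "connection refused")
		// asks the limiter when to retry; the delay doubles per failure,
		// so a handful of failures already means multi-minute gaps.
		fmt.Printf("failure %2d -> retry after %v\n", i+1, limiter.When(item))
	}
	// Only a successful sync resets the backoff.
	limiter.Forget(item)
}
```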

The reason we see this more often now is that 35c1182#diff-84cb29332213948b2ef38a9d0c899903 introduced the container restart test and, with it, the code which kills all pods.

The solution could be two-fold:

  • make the webhook optional, i.e. allow pod scheduling while it is down
  • exclude the socat pods from the webhook

The second point is obviously correct and should solve the issue. Whether we need the first point is up for debate...
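
To make both options concrete, here is a hedged sketch of the corresponding mutating webhook configuration using the admissionregistration/v1 Go types: failurePolicy Ignore covers the first point (pod scheduling continues while the webhook is down), and an objectSelector covers the second (only pods carrying an opt-in label are mutated, so the socat pods never hit the webhook). The configuration name and the label are hypothetical; the webhook name, service, and path are taken from the error above, and the real PMEM-CSI deployment files may differ:

```go
// A hedged sketch of both options, expressed with the admissionregistration/v1
// Go types. The configuration name and the opt-in label are hypothetical;
// the webhook name, service, and path come from the error message above.
package deploy

import (
	admissionv1 "k8s.io/api/admissionregistration/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func strPtr(s string) *string { return &s }

func pmemWebhookConfig(caBundle []byte) *admissionv1.MutatingWebhookConfiguration {
	ignore := admissionv1.Ignore // option 1: pods still get scheduled while the webhook is down
	sideEffects := admissionv1.SideEffectClassNone
	return &admissionv1.MutatingWebhookConfiguration{
		ObjectMeta: metav1.ObjectMeta{Name: "pmem-csi-hook"},
		Webhooks: []admissionv1.MutatingWebhook{{
			Name:          "pod-hook.pmem-csi.intel.com",
			FailurePolicy: &ignore,
			// Option 2: only pods that opt in via a label are sent to the
			// webhook, so test helper pods like socat are excluded.
			// The label key/value are assumptions.
			ObjectSelector: &metav1.LabelSelector{
				MatchLabels: map[string]string{"pmem-csi.intel.com/webhook": "mutate"},
			},
			ClientConfig: admissionv1.WebhookClientConfig{
				Service: &admissionv1.ServiceReference{
					Namespace: "default",
					Name:      "pmem-csi-scheduler",
					Path:      strPtr("/pod/mutate"),
				},
				CABundle: caBundle,
			},
			Rules: []admissionv1.RuleWithOperations{{
				Operations: []admissionv1.OperationType{admissionv1.Create},
				Rule: admissionv1.Rule{
					APIGroups:   []string{""},
					APIVersions: []string{"v1"},
					Resources:   []string{"pods"},
				},
			}},
			SideEffects:             &sideEffects,
			AdmissionReviewVersions: []string{"v1"},
		}},
	}
}
```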

pohly (Contributor) commented Sep 22, 2020

Might have been fixed by PR #739

pohly added the "0.8 needs to be fixed in 0.8.x" label on Sep 22, 2020
pohly self-assigned this on Sep 22, 2020
okartau (Contributor, Author) commented Sep 22, 2020

The most recent job, 920, which I believe had the latest changes, timed out (4h run time) on all versions. This is a different failure profile from the previous runs.

pohly (Contributor) commented Sep 22, 2020

Testing on 1.17 and 1.18 had no failures and was making progress right until the jobs were killed, i.e. testing just ran too slowly.

On 1.19 (https://cloudnative-k8sci.southcentralus.cloudapp.azure.com/blue/rest/organizations/jenkins/pipelines/pmem-csi/branches/devel/runs/242/nodes/93/steps/110/log/?start=0), there was:

[2020-09-22T13:58:27.965Z] Sep 22 13:58:27.783: FAIL: validate driver after update and restart
[2020-09-22T13:58:27.965Z] Unexpected error:
[2020-09-22T13:58:27.965Z]     <*errors.errorString | 0xc003a7be80>: {
[2020-09-22T13:58:27.965Z]         s: "timed out waiting for deployment, last error: deployed driver different from expected deployment:\nlabel foo missing for \"pmem-csi-with-defaults-external-provisioner-cfg\" of type \"rbac.authorization.k8s.io/v1, Kind=Role\" in namespace \"default\"\nlabel foo missing for \"pmem-csi-with-defaults-external-provisioner-runner\" of type \"rbac.authorization.k8s.io/v1, Kind=ClusterRole\" in namespace \"\"\nlabel foo missing for \"pmem-csi-with-defaults-csi-provisioner-role-cfg\" of type \"rbac.authorization.k8s.io/v1, Kind=RoleBinding\" in namespace \"default\"\nlabel foo missing for \"pmem-csi-with-defaults-csi-provisioner-role\" of type \"rbac.authorization.k8s.io/v1, Kind=ClusterRoleBinding\" in namespace \"\"\nlabel foo missing for \"pmem-csi-with-defaults-controller\" of type \"/v1, Kind=ServiceAccount\" in namespace \"default\"\nlabel foo missing for \"pmem-csi-with-defaults-controller\" of type \"/v1, Kind=Service\" in namespace \"default\"\nlabel foo missing for \"pmem-csi-with-defaults-metrics\" of type \"/v1, Kind=Service\" in namespace \"default\"\nlabel foo missing for \"pmem-csi-with-defaults\" of type \"storage.k8s.io/v1beta1, Kind=CSIDriver\" in namespace \"\"",
[2020-09-22T13:58:27.965Z]     }
[2020-09-22T13:58:27.965Z]     timed out waiting for deployment, last error: deployed driver different from expected deployment:
[2020-09-22T13:58:27.965Z]     label foo missing for "pmem-csi-with-defaults-external-provisioner-cfg" of type "rbac.authorization.k8s.io/v1, Kind=Role" in namespace "default"
[2020-09-22T13:58:27.965Z]     label foo missing for "pmem-csi-with-defaults-external-provisioner-runner" of type "rbac.authorization.k8s.io/v1, Kind=ClusterRole" in namespace ""
[2020-09-22T13:58:27.965Z]     label foo missing for "pmem-csi-with-defaults-csi-provisioner-role-cfg" of type "rbac.authorization.k8s.io/v1, Kind=RoleBinding" in namespace "default"
[2020-09-22T13:58:27.965Z]     label foo missing for "pmem-csi-with-defaults-csi-provisioner-role" of type "rbac.authorization.k8s.io/v1, Kind=ClusterRoleBinding" in namespace ""
[2020-09-22T13:58:27.965Z]     label foo missing for "pmem-csi-with-defaults-controller" of type "/v1, Kind=ServiceAccount" in namespace "default"
[2020-09-22T13:58:27.965Z]     label foo missing for "pmem-csi-with-defaults-controller" of type "/v1, Kind=Service" in namespace "default"
[2020-09-22T13:58:27.965Z]     label foo missing for "pmem-csi-with-defaults-metrics" of type "/v1, Kind=Service" in namespace "default"
[2020-09-22T13:58:27.965Z]     label foo missing for "pmem-csi-with-defaults" of type "storage.k8s.io/v1beta1, Kind=CSIDriver" in namespace ""
[2020-09-22T13:58:27.965Z] occurred

@avalluri: this looks like #742
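
For context on the shape of this check, here is purely illustrative Go pseudocode: the test polls the deployed objects, reports every object that is missing the expected label, and gives up with "timed out waiting for deployment" once the timeout is reached. The object name, kind, and the label key "foo" mirror the log above; the helper names are made up and the real e2e test code in pmem-csi differs.

```go
// Illustrative only: the shape of a "wait until all deployed objects carry
// the expected label" check. Not the actual pmem-csi e2e code.
package main

import (
	"fmt"
	"time"
)

// object stands in for any deployed API object the test inspects.
type object struct {
	name, kind, namespace string
	labels                map[string]string
}

// missingLabel reports which objects lack the expected label key.
func missingLabel(objs []object, key string) []string {
	var missing []string
	for _, o := range objs {
		if _, ok := o.labels[key]; !ok {
			missing = append(missing, fmt.Sprintf(
				"label %s missing for %q of type %q in namespace %q",
				key, o.name, o.kind, o.namespace))
		}
	}
	return missing
}

// waitForLabels polls until all objects carry the label or the timeout hits.
func waitForLabels(get func() []object, key string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		missing := missingLabel(get(), key)
		if len(missing) == 0 {
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("timed out waiting for deployment, last error: deployed driver different from expected deployment:\n%v", missing)
		}
		time.Sleep(time.Second)
	}
}

func main() {
	objs := []object{{
		name:      "pmem-csi-with-defaults-controller",
		kind:      "/v1, Kind=Service",
		namespace: "default",
		labels:    map[string]string{}, // "foo" never gets (re)applied
	}}
	fmt.Println(waitForLabels(func() []object { return objs }, "foo", 3*time.Second))
}
```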

pohly (Contributor) commented Sep 22, 2020

Testing on 1.17 and 1.18 had no failures and was making progress right until the jobs were killed, i.e. testing just ran too slowly.

In these two jobs, there's just a single "Still waiting" message, i.e. driver startup typically happened in less than a minute. Looks like the commits from PR #739 helped.

okartau (Contributor, Author) commented Sep 25, 2020

One of the recent merges (#743, #745) had a good impact, most likely the one increasing the timeout. The most recent job, 925, succeeded, with run times between 4 and 5 hours.

okartau (Contributor, Author) commented Sep 28, 2020

The 5 most recent runs were all good; closing this issue.

okartau closed this as completed on Sep 28, 2020