CI job pmem-csi-periodic success rate very low after Sep-8: connection refused #740
After Sep-8th we started to see many failures: only 5 of the 22 jobs since then have succeeded. Do we already know the cause(s), and/or can we track down the changes that may have contributed?
After running tests in a loop, I just caught one instance of slow (but not fatally slow) test startup with four minutes of "connection refused" errors:
So what happens here is that the daemonset controller takes a long time to create the pods. Looking into the kube-controller-manager log, one can see:
Not surprising: we just killed the pod which provides the webhook... It looks like the controller then goes into exponential backoff, and depending on the exact timing, that can lead to minutes passing before it tries again. The reason we see this more often now is that 35c1182#diff-84cb29332213948b2ef38a9d0c899903 introduced the container restart test and, with it, the code which kills all pods. The solution could be two-fold:
The second point is obviously correct and should solve the issue. Whether we need the first point is up for debate...
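To make the timing concrete, here is a minimal sketch (in Go) of how a controller's per-item retry delay grows under consecutive failures. It assumes the daemonset controller's workqueue uses client-go's default exponential failure rate limiter (5ms base delay, 1000s cap, as in `workqueue.DefaultControllerRateLimiter`); the queue item name is purely illustrative:

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// Per-item exponential backoff: delay = baseDelay * 2^failures, capped
	// at maxDelay. 5ms and 1000s are client-go's defaults; whether the
	// daemonset controller's queue is configured exactly this way is an
	// assumption here.
	limiter := workqueue.NewItemExponentialFailureRateLimiter(5*time.Millisecond, 1000*time.Second)

	// "pmem-csi-node" is a hypothetical queue item used for illustration.
	for i := 1; i <= 20; i++ {
		fmt.Printf("failure %2d: next retry in %v\n", i, limiter.When("pmem-csi-node"))
	}
}
```

Running this shows the delay doubling from 5ms until it hits the 1000s cap; around the 17th consecutive failure the next retry is already more than five minutes away, which would explain the minutes-long gaps between pod creation attempts.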
Might have been fixed by PR #739.
The most recent job, 920, which I believe had the latest changes, timed out (4h run time) in all versions. This is a different failure profile from previous runs.
Testing on 1.17 and 1.18 had no failures and was making progress right until the jobs were killed, i.e. testing just ran too slowly. On 1.19 (https://cloudnative-k8sci.southcentralus.cloudapp.azure.com/blue/rest/organizations/jenkins/pipelines/pmem-csi/branches/devel/runs/242/nodes/93/steps/110/log/?start=0), there was:
In these two jobs, there's just a single "Still waiting" message, i.e. driver startup typically happened in less than a minute. Looks like the commits from PR #739 helped.
The five most recent runs were all good; closing the issue.