Reduce e2e test flakes #824
PR e2e Smoke test (probably actually the reboot test): 2018-01-08 https://jenkins-kube-lifecycle.prod.coreos.systems/job/bootkube-pr-e2e/683/console
/assign @colemickens
@ericchiang: GitHub didn't allow me to assign the following users: colemickens. Note that only kubernetes-incubator members can be assigned. In response to this:
> /assign @colemickens
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Even with Cole's PR we're still seeing flakes in TestReboot and, consequently, TestSmoke (#864). When reboots fail it's really hard to debug the underlying issue. I think it'd be worth re-investing in efforts to make our log collection better (#783). We also might want to add retries to TestSmoke, or ensure during TestReboot that the control plane stays up for some amount of time before marking the test successful; a sketch of the latter follows below.
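A minimal sketch of that "stays up for some amount of time" idea, in Go (the e2e suite's language); the probe callback, polling interval, and thresholds are assumptions for illustration, not the actual bootkube test code:

```go
package e2e

import (
	"fmt"
	"time"
)

// waitForSustainedHealth polls probe every interval and only returns nil once
// the probe has succeeded `required` times in a row, so a control plane that
// flaps up and down right after a reboot doesn't count as recovered.
func waitForSustainedHealth(probe func() error, interval time.Duration, required int, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	consecutive := 0
	for time.Now().Before(deadline) {
		if err := probe(); err != nil {
			consecutive = 0 // any failure resets the streak
		} else {
			consecutive++
			if consecutive >= required {
				return nil
			}
		}
		time.Sleep(interval)
	}
	return fmt.Errorf("control plane was not healthy for %d consecutive probes within %v", required, timeout)
}
```

TestReboot could then pass a probe that hits the apiserver and require, say, a minute of consecutive successes rather than a single one.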
I think I have a handle on the TestReboot flakes. TestReboot errors in the TestDeleteAPI test case: the apiserver is deleted out from under the checkpointer, and the API server never comes back. From what I understand of the process, the checkpointer would normally restore the apiserver, but it cannot because it is stuck in a loop requesting secrets and failing. There is also a problem in TestDeleteAPI itself: it waits for any API server to respond, when it should instead wait for the control plane to be ready again, which is fixed in this PR: #892. /cc @diegs checkpointer logs:
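For reference, a rough sketch of the "wait for the control plane to be ready again" check described for #892, using client-go; the kube-system namespace and the tier=control-plane label selector are assumptions about the bootkube manifests, not code from that PR:

```go
package e2e

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// waitForControlPlaneReady polls until every control-plane pod is Running and
// Ready, instead of returning as soon as any single apiserver answers.
func waitForControlPlaneReady(ctx context.Context, client kubernetes.Interface, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		pods, err := client.CoreV1().Pods("kube-system").List(ctx, metav1.ListOptions{
			LabelSelector: "tier=control-plane", // assumed label on bootkube control-plane pods
		})
		if err == nil && len(pods.Items) > 0 && allReady(pods.Items) {
			return nil
		}
		time.Sleep(10 * time.Second)
	}
	return fmt.Errorf("control plane not ready within %v", timeout)
}

// allReady reports whether every pod is Running with a PodReady condition.
func allReady(pods []corev1.Pod) bool {
	for _, p := range pods {
		if p.Status.Phase != corev1.PodRunning {
			return false
		}
		ready := false
		for _, c := range p.Status.Conditions {
			if c.Type == corev1.PodReady && c.Status == corev1.ConditionTrue {
				ready = true
				break
			}
		}
		if !ready {
			return false
		}
	}
	return true
}
```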
I looked into the failure case you described a little more, and the checkpointer is working as intended. The issue is indeed triggered by deleting the apiserver, as TestDeleteAPI does. Specifically, what I observed was:
We could possibly mitigate this issue by increasing the checkpoint grace period, but that feels like a band-aid solution. The real fix is to find out why the controller-manager is crash-looping when starting up against a self-hosted apiserver. These are the logs I saw:
OK, some more experimenting: extending the grace period (currently 3 minutes) allows everything else to recover. It just takes a while after a reboot for kube-flannel to get healthy (it needs an apiserver), and other pods seem to be keyed off that. Given the kubelet's restart / exponential-backoff cycle, this can end up taking a little while. However, this also means that the other tests (e.g. Smoke) finish while the cluster is still recovering. But it does recover. I propose:
The downside of increasing the grace period is that sometimes we'll run an old pod for longer than we technically should. But since the checkpoints are only activated in adverse conditions, I think this is OK.
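To make that tradeoff concrete, a simplified sketch of the grace-period logic being discussed (illustrative names, not the actual pod-checkpointer code): a checkpoint is only garbage-collected once its parent pod has been gone from the apiserver for longer than the grace period, which is what keeps the checkpointed apiserver alive through a slow reboot:

```go
package checkpoint

import "time"

// shouldRemoveCheckpoint decides whether a checkpointed pod can be
// garbage-collected. parentSeen reports whether the parent pod currently
// exists in the apiserver; lastSeen is when the parent was last observed.
func shouldRemoveCheckpoint(parentSeen bool, lastSeen time.Time, gracePeriod time.Duration) bool {
	// While the parent is present, the checkpoint is just a standby copy; keep it.
	if parentSeen {
		return false
	}
	// The parent is gone. Only remove the checkpoint once it has been gone
	// longer than the grace period (e.g. 3 minutes per this thread), giving a
	// rebooting cluster time to bring the real pod back.
	return time.Since(lastSeen) > gracePeriod
}
```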
When nodes reboot, such as in the TestReboot e2e test case, it can take a while for the cluster to stabilize due to the dependency chain between the apiserver, flannel, the controller manager, and so on. If the controller manager was in the middle of doing something (e.g. rolling the apiserver) when a reboot occurred, we need to ensure that the controller manager becomes healthy again, which requires keeping the checkpointed apiserver up. The downside is that this may run pods considerably longer than they ought to run. However, this is a failure-recovery scenario, and running an old pod is not a huge violation of k8s semantics (daemonsets strive for one-at-a-time semantics but don't guarantee it). This should alleviate the flakes observed in kubernetes-retired#824.
@rphillips yeah, waiting for no checkpointed pods is part of it. I would replace the hacky test (it's basically useless now) https://github.com/kubernetes-incubator/bootkube/blob/master/e2e/deleteapi_test.go#L36 with a more targeted test:
I can write a quick PR for this.
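For reference, a rough sketch of the "no checkpointed pods remain" half of that targeted test; the checkpoint-of annotation key is an assumption about how the pod-checkpointer marks its checkpoints, and this would be combined with a control-plane readiness wait like the one sketched earlier:

```go
package e2e

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// checkpointOfAnnotation is assumed to be the annotation the pod-checkpointer
// puts on its checkpointed copies to point back at the parent pod.
const checkpointOfAnnotation = "checkpointer.alpha.coreos.com/checkpoint-of"

// waitForNoCheckpoints polls until no pod in kube-system carries the
// checkpoint annotation, i.e. the real self-hosted pods have all taken back
// over from their checkpointed copies.
func waitForNoCheckpoints(ctx context.Context, client kubernetes.Interface, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		pods, err := client.CoreV1().Pods("kube-system").List(ctx, metav1.ListOptions{})
		if err == nil {
			remaining := 0
			for _, p := range pods.Items {
				if _, ok := p.Annotations[checkpointOfAnnotation]; ok {
					remaining++
				}
			}
			if remaining == 0 {
				return nil
			}
		}
		time.Sleep(10 * time.Second)
	}
	return fmt.Errorf("checkpointed pods still present after %v", timeout)
}
```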
e2e tests have been much less flaky lately. Closing.
PR e2e
TestCheckpointer/UnscheduleCheckpointer
2018-01-08 https://jenkins-kube-lifecycle.prod.coreos.systems/job/bootkube-pr-e2e/687/console
PR e2e Calico
Between builds 228 and 259 inclusive, 12 builds failed (~35%).
TestCheckpointer/UnscheduleCheckpointer
2018-01-08 https://jenkins-kube-lifecycle.prod.coreos.systems/job/bootkube-pr-e2e-calico/271/console
2018-01-08 https://jenkins-kube-lifecycle.prod.coreos.systems/job/bootkube-pr-e2e-calico/259/console
2017-12-20 https://jenkins-kube-lifecycle.prod.coreos.systems/job/bootkube-pr-e2e-calico/243/console
2017-12-20 https://jenkins-kube-lifecycle.prod.coreos.systems/job/bootkube-pr-e2e-calico/241/console
2017-12-19 https://jenkins-kube-lifecycle.prod.coreos.systems/job/bootkube-pr-e2e-calico/236/console
2017-12-19 https://jenkins-kube-lifecycle.prod.coreos.systems/job/bootkube-pr-e2e-calico/233/console
2017-12-19 https://jenkins-kube-lifecycle.prod.coreos.systems/job/bootkube-pr-e2e-calico/230/console
2017-12-18 https://jenkins-kube-lifecycle.prod.coreos.systems/job/bootkube-pr-e2e-calico/228/console
TestDeleteAPI
2018-01-02 https://jenkins-kube-lifecycle.prod.coreos.systems/job/bootkube-pr-e2e-calico/253/console
Network failures
2018-01-02 https://jenkins-kube-lifecycle.prod.coreos.systems/job/bootkube-pr-e2e-calico/254/console
Unit test failures
2017-12-20 https://jenkins-kube-lifecycle.prod.coreos.systems/job/bootkube-pr-e2e-calico/239/console
Log collection
2017-12-19 https://jenkins-kube-lifecycle.prod.coreos.systems/job/bootkube-pr-e2e-calico/232/console
Control plane failed
2017-12-18 https://jenkins-kube-lifecycle.prod.coreos.systems/job/bootkube-pr-e2e-calico/229/console
Test Smoke (probably actually reboot test)
2018-01-08 https://jenkins-kube-lifecycle.prod.coreos.systems/job/bootkube-pr-e2e/683/console
2018-01-12 https://jenkins-kube-lifecycle.prod.coreos.systems/job/bootkube-pr-e2e/707/console