This repository has been archived by the owner on Jul 30, 2021. It is now read-only.

Reduce e2e test flakes #824

Closed
ericchiang opened this issue Jan 8, 2018 · 11 comments

ericchiang commented Jan 8, 2018

PR e2e

TestCheckpointer/UnscheduleCheckpointer

2018-01-08 https://jenkins-kube-lifecycle.prod.coreos.systems/job/bootkube-pr-e2e/687/console

PR e2e Calico

Between builds 228 and 259 inclusive, 12 builds failed (~35%).

TestCheckpointer/UnscheduleCheckpointer

2018-01-08 https://jenkins-kube-lifecycle.prod.coreos.systems/job/bootkube-pr-e2e-calico/271/console
2018-01-08 https://jenkins-kube-lifecycle.prod.coreos.systems/job/bootkube-pr-e2e-calico/259/console
2017-12-20 https://jenkins-kube-lifecycle.prod.coreos.systems/job/bootkube-pr-e2e-calico/243/console
2017-12-20 https://jenkins-kube-lifecycle.prod.coreos.systems/job/bootkube-pr-e2e-calico/241/console
2017-12-19 https://jenkins-kube-lifecycle.prod.coreos.systems/job/bootkube-pr-e2e-calico/236/console
2017-12-19 https://jenkins-kube-lifecycle.prod.coreos.systems/job/bootkube-pr-e2e-calico/233/console
2017-12-19 https://jenkins-kube-lifecycle.prod.coreos.systems/job/bootkube-pr-e2e-calico/230/console
2017-12-18 https://jenkins-kube-lifecycle.prod.coreos.systems/job/bootkube-pr-e2e-calico/228/console

TestDeleteAPI

2018-01-02 https://jenkins-kube-lifecycle.prod.coreos.systems/job/bootkube-pr-e2e-calico/253/console

Network failures

2018-01-02 https://jenkins-kube-lifecycle.prod.coreos.systems/job/bootkube-pr-e2e-calico/254/console

Unit test failures

2017-12-20 https://jenkins-kube-lifecycle.prod.coreos.systems/job/bootkube-pr-e2e-calico/239/console

Log collection

2017-12-19 https://jenkins-kube-lifecycle.prod.coreos.systems/job/bootkube-pr-e2e-calico/232/console

Control plane failed

2017-12-18 https://jenkins-kube-lifecycle.prod.coreos.systems/job/bootkube-pr-e2e-calico/229/console

TestSmoke (probably actually the reboot test)

2018-01-08 https://jenkins-kube-lifecycle.prod.coreos.systems/job/bootkube-pr-e2e/683/console
2018-01-12 https://jenkins-kube-lifecycle.prod.coreos.systems/job/bootkube-pr-e2e/707/console


diegs commented Jan 8, 2018

PR e2e

Smoke test (probably actually the reboot test):

2018-01-08 https://jenkins-kube-lifecycle.prod.coreos.systems/job/bootkube-pr-e2e/683/console

@ericchiang

/assign @colemickens

@k8s-ci-robot

@ericchiang: GitHub didn't allow me to assign the following users: colemickens.

Note that only kubernetes-incubator members can be assigned.

In response to this:

/assign @colemickens

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@colemickens

/assign @colemickens

@ericchiang

Even with Cole's PR we're still seeing flakes in TestReboot and, consequently, TestSmoke. #864

When reboots fail it's really hard to debug the underlying issue. I think it'd be worth re-investing in efforts to make our log collection better (#783).

We also might want to add retries to TestSmoke, or ensure during TestReboot that the control plane is up for some amount of time before marking it successful.
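
Roughly what I have in mind for the second option (untested sketch; the client variable stands in for whatever clientset the e2e harness already builds, and the durations are arbitrary):

package e2e

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForStableControlPlane succeeds only once the apiserver has answered
// continuously for the whole stability window. Any failure resets the window,
// so a control plane that flaps right after the reboot won't pass.
func waitForStableControlPlane(client kubernetes.Interface, window, timeout time.Duration) error {
	var healthySince time.Time
	return wait.Poll(5*time.Second, timeout, func() (bool, error) {
		if _, err := client.Discovery().ServerVersion(); err != nil {
			healthySince = time.Time{} // reset the window on any failure
			return false, nil
		}
		if healthySince.IsZero() {
			healthySince = time.Now()
		}
		return time.Since(healthySince) >= window, nil
	})
}

TestReboot could then require something like waitForStableControlPlane(client, 2*time.Minute, 10*time.Minute) before declaring success, instead of passing on the first successful response.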


rphillips commented Feb 13, 2018

I think I have a handle on the TestReboot flakes. TestReboot errors following the TestDeleteAPI test case: the apiserver is deleted from the checkpointer, and the apiserver never comes back. From what I understand of the process, the checkpointer would normally restore the apiserver, but it cannot, because it is stuck in a loop requesting secrets and failing.

There is also a problem in TestDeleteAPI: it waits for any API server to return, when it should instead wait for the control plane to be ready again. That is fixed in PR #892.

/cc @diegs

checkpointer logs:

I0213 21:11:36.919101       1 main.go:47] Determining environment from downward API
I0213 21:11:36.919704       1 main.go:53] Trying to acquire the flock at "/var/run/lock/pod-checkpointer.lock"
I0213 21:11:36.919743       1 main.go:58] Starting checkpointer for node: ip-10-8-0-4.us-west-1.compute.internal
I0213 21:11:36.950157       1 runtime_service.go:22] Connecting to runtime service unix:///var/run/dockershim.sock
I0213 21:11:52.042476       1 process.go:197] Checkpoint kube-system/kube-apiserver-qqh5w (active) transitioning from state active -> state active (grace period)
I0213 21:12:56.302879       1 state.go:306] Grace period exceeded for state active (grace period)
I0213 21:12:56.303016       1 state.go:207] Grace period exceeded for state inactive (grace period)
I0213 21:12:56.303054       1 process.go:197] Checkpoint kube-system/kube-apiserver-qqh5w (active (grace period)) transitioning from state active (grace period) -> state remove
I0213 21:12:56.303120       1 process.go:268] Removing checkpoint of: kube-system/kube-apiserver-qqh5w
W0213 21:13:01.328985       1 apiserver.go:20] Unable to contact APIServer, skipping garbage collection: Get https://13.57.193.74:6443/api/v1/namespaces/kube-system/pods?fieldSelector=spec.nodeName%3Dip-10-8-0-4.us-west-1.compute.internal: dial tcp 13.57.193.74:6443: getsockopt: connection refused
W0213 21:13:06.408007       1 apiserver.go:20] Unable to contact APIServer, skipping garbage collection: Get https://13.57.193.74:6443/api/v1/namespaces/kube-system/pods?fieldSelector=spec.nodeName%3Dip-10-8-0-4.us-west-1.compute.internal: dial tcp 13.57.193.74:6443: getsockopt: connection refused
W0213 21:13:11.441253       1 apiserver.go:20] Unable to contact APIServer, skipping garbage collection: Get https://13.57.193.74:6443/api/v1/namespaces/kube-system/pods?fieldSelector=spec.nodeName%3Dip-10-8-0-4.us-west-1.compute.internal: dial tcp 13.57.193.74:6443: getsockopt: connection refused
W0213 21:13:16.469474       1 apiserver.go:20] Unable to contact APIServer, skipping garbage collection: Get https://13.57.193.74:6443/api/v1/namespaces/kube-system/pods?fieldSelector=spec.nodeName%3Dip-10-8-0-4.us-west-1.compute.internal: dial tcp 13.57.193.74:6443: getsockopt: connection refused
W0213 21:13:21.499719       1 apiserver.go:20] Unable to contact APIServer, skipping garbage collection: Get https://13.57.193.74:6443/api/v1/namespaces/kube-system/pods?fieldSelector=spec.nodeName%3Dip-10-8-0-4.us-west-1.compute.internal: dial tcp 13.57.193.74:6443: getsockopt: connection refused
W0213 21:13:26.525189       1 apiserver.go:20] Unable to contact APIServer, skipping garbage collection: Get https://13.57.193.74:6443/api/v1/namespaces/kube-system/pods?fieldSelector=spec.nodeName%3Dip-10-8-0-4.us-west-1.compute.internal: dial tcp 13.57.193.74:6443: getsockopt: connection refused
W0213 21:13:31.639555       1 apiserver.go:20] Unable to contact APIServer, skipping garbage collection: Get https://13.57.193.74:6443/api/v1/namespaces/kube-system/pods?fieldSelector=spec.nodeName%3Dip-10-8-0-4.us-west-1.compute.internal: dial tcp 13.57.193.74:6443: getsockopt: connection refused
W0213 21:13:36.666749       1 apiserver.go:20] Unable to contact APIServer, skipping garbage collection: Get https://13.57.193.74:6443/api/v1/namespaces/kube-system/pods?fieldSelector=spec.nodeName%3Dip-10-8-0-4.us-west-1.compute.internal: dial tcp 13.57.193.74:6443: getsockopt: connection refused
W0213 21:13:41.727572       1 apiserver.go:20] Unable to contact APIServer, skipping garbage collection: Get https://13.57.193.74:6443/api/v1/namespaces/kube-system/pods?fieldSelector=spec.nodeName%3Dip-10-8-0-4.us-west-1.compute.internal: dial tcp 13.57.193.74:6443: getsockopt: connection refused
W0213 21:13:46.755042       1 apiserver.go:20] Unable to contact APIServer, skipping garbage collection: Get https://13.57.193.74:6443/api/v1/namespaces/kube-system/pods?fieldSelector=spec.nodeName%3Dip-10-8-0-4.us-west-1.compute.internal: dial tcp 13.57.193.74:6443: getsockopt: connection refused
W0213 21:13:51.818417       1 apiserver.go:20] Unable to contact APIServer, skipping garbage collection: Get https://13.57.193.74:6443/api/v1/namespaces/kube-system/pods?fieldSelector=spec.nodeName%3Dip-10-8-0-4.us-west-1.compute.internal: dial tcp 13.57.193.74:6443: getsockopt: connection refused
W0213 21:13:56.851908       1 apiserver.go:20] Unable to contact APIServer, skipping garbage collection: Get https://13.57.193.74:6443/api/v1/namespaces/kube-system/pods?fieldSelector=spec.nodeName%3Dip-10-8-0-4.us-west-1.compute.internal: dial tcp 13.57.193.74:6443: getsockopt: connection refused
E0213 21:13:56.853150       1 process.go:246] Failed to checkpoint secrets for pod kube-system/pod-checkpointer-bpl4p: failed to checkpoint secret for pod kube-system/pod-checkpointer-bpl4p: failed to retrieve secret kube-system/pod-checkpointer-token-tc724: Get https://13.57.193.74:6443/api/v1/namespaces/kube-system/secrets/pod-checkpointer-token-tc724: dial tcp 13.57.193.74:6443: getsockopt: connection refused
W0213 21:14:01.886547       1 apiserver.go:20] Unable to contact APIServer, skipping garbage collection: Get https://13.57.193.74:6443/api/v1/namespaces/kube-system/pods?fieldSelector=spec.nodeName%3Dip-10-8-0-4.us-west-1.compute.internal: dial tcp 13.57.193.74:6443: getsockopt: connection refused
W0213 21:14:06.911519       1 apiserver.go:20] Unable to contact APIServer, skipping garbage collection: Get https://13.57.193.74:6443/api/v1/namespaces/kube-system/pods?fieldSelector=spec.nodeName%3Dip-10-8-0-4.us-west-1.compute.internal: dial tcp 13.57.193.74:6443: getsockopt: connection refused


diegs commented Feb 15, 2018

cc @rphillips @ericchiang

I looked into the failure case you described a little more, and the checkpointer is working as intended. The issue is indeed triggered by having the TestDeleteAPI and TestReboot cases run back to back, and adding the extra wait may help, but the root cause is something else.

Specifically, what I observed was:

  1. TestDeleteAPI deletes the pod but then finishes immediately (the wait logic there is not great, as the comment mentions).
  2. The suite moves on to the TestReboot case, but a new apiserver pod has not been scheduled by the controller-manager yet, so the cluster is still running on the old checkpointed apiserver (this is technically fine).
  3. The node reboots, the checkpointer starts the apiserver back up, and the kubelet starts running all the other pods. So far so good!
  4. However, for some reason the controller-manager gets into a crash-loop. So the new apiserver pod to replace the one that was deleted in step 1 is never created.
  5. Since the checkpointed apiserver is running off of a checkpoint of the old apiserver pod that is now gone, the checkpointer is actually just waiting for the grace period to expire in order to remove it. Once the grace period is up it cleans up that checkpoint (correct behavior), but now there are no apiservers.

We could possibly mitigate this issue by increasing the checkpoint grace period, but that feels like a band-aid solution. The real fix is to find out why the controller-manager is crash-looping when starting up against a self-hosted apiserver. These are the logs I saw:

[master +=] $ kubectl -n kube-system logs kube-controller-manager-d56d486cc-85jmx
I0215 23:15:16.744282       1 controllermanager.go:108] Version: v1.9.2
W0215 23:15:16.744396       1 client_config.go:529] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0215 23:15:16.749116       1 leaderelection.go:174] attempting to acquire leader lease...
E0215 23:15:38.095140       1 leaderelection.go:224] error retrieving resource lock kube-system/kube-controller-manager: endpoints "kube-controller-manager" is forbidden: User "system:serviceaccount:kube-system:kube-controller-manager" cannot get endpoints in the namespace "kube-system"
E0215 23:16:36.923111       1 leaderelection.go:224] error retrieving resource lock kube-system/kube-controller-manager: Get https://10.3.0.1:443/api/v1/namespaces/kube-system/endpoints/kube-controller-manager: dial tcp 10.3.0.1:443: getsockopt: connection refused
E0215 23:16:43.735233       1 leaderelection.go:224] error retrieving resource lock kube-system/kube-controller-manager: endpoints "kube-controller-manager" is forbidden: User "system:serviceaccount:kube-system:kube-controller-manager" cannot get endpoints in the namespace "kube-system"


diegs commented Feb 16, 2018

OK, some more experimenting: extending the grace period (currently 3 minutes in my testing) allows everything else to recover. After a reboot it just takes some time for kube-flannel to get healthy (it needs an apiserver), and other pods seem to be keyed off that. Given the kubelet's restart / exponential-backoff cycle, this can end up taking a little while.

However, this also means that the other tests (e.g. Smoke) finish while the cluster is still recovering. But it does recover.

I propose:

  1. Increase the grace period for the real checkpointer to 3 minutes instead of 30 seconds. This is the actual, meaningful fix for the root cause that can lead to a control-plane outage.
  2. Update TestDeleteAPI and TestReboot to have better wait conditions so that the cluster is fully recovered before each moves on (this probably means looking at specific pod names and such). This would also remove the cause of the flakiness, and isn't strictly needed given fix 1, but it feels more correct in terms of what the tests are trying to assert.

The downside of increasing the grace period is that sometimes we'll run an old pod for longer than it technically should. But since the checkpoints are only activated in adverse conditions I think this is ok.
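
For context, the removal decision that the grace period governs boils down to something like this (a much-simplified sketch of the behavior described above, not the actual checkpointer code):

package sketch

import "time"

// shouldRemoveCheckpoint illustrates why the checkpointed apiserver disappears
// in the failure above: once the parent pod has been gone from the apiserver
// for longer than the grace period, the checkpoint is garbage collected, even
// if nothing has been scheduled to replace it.
func shouldRemoveCheckpoint(parentPodExists bool, missingSince time.Time, gracePeriod time.Duration) bool {
	if parentPodExists {
		return false // parent pod is still known to the apiserver: keep the checkpoint
	}
	return time.Since(missingSince) > gracePeriod
}

With the current 30-second grace period that window can close before the rest of the control plane has recovered from a reboot; 3 minutes gives flannel and the controller-manager time to come back.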

diegs added a commit to diegs/bootkube that referenced this issue Feb 16, 2018
When nodes reboot, such as in the TestReboot e2e test case, it can take
a while for the cluster to stabilize due to the dependency chain between
the apiserver, flannel, the controller manager, and so on.

If the controller manager was in the middle of doing something (e.g.
rolling the apiserver) while a reboot occurs, we need to ensure that the
controller manager gets healthy again. This requires keeping the
checkpointed apiserver up.

The downside is that this may run pods considerably longer than they
ought to. However, this is a failure recovery scenario, and running an
old pod is not a huge violation of k8s semantics (daemonsets strive for
1-at-a-time semantics but don't guarantee it).

This should alleviate the flakes observed in kubernetes-retired#824.
@rphillips

@diegs latest master has an improved wait within the TestDeleteAPI code. Is there further improvement to this wait that you are considering?


diegs commented Feb 16, 2018

@rphillips yeah, waiting for no checkpointed pods is part of it.

I would replace the hacky test (it's basically useless now) https://github.com/kubernetes-incubator/bootkube/blob/master/e2e/deleteapi_test.go#L36 with a more targeted test:

  1. Record the apiserver pod's name before you delete it.
  2. Keep hitting the apiserver pods endpoint to make sure it's gone (and perhaps that the checkpoint is running).
  3. Then go into the controlPlaneReady loop.

I can write a quick PR for this.
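
Something along these lines (sketch only; newClient and controlPlaneReady stand for helpers the e2e suite already has, and their signatures plus the k8s-app=kube-apiserver label are assumptions):

package e2e

import (
	"testing"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
)

func TestDeleteAPI(t *testing.T) {
	client := newClient(t) // placeholder for the suite's existing client setup

	// 1. Record the current apiserver pod's name before deleting it.
	pods, err := client.CoreV1().Pods("kube-system").List(metav1.ListOptions{
		LabelSelector: "k8s-app=kube-apiserver", // label assumed, adjust to the manifests
	})
	if err != nil || len(pods.Items) == 0 {
		t.Fatalf("listing apiserver pods: %v", err)
	}
	oldName := pods.Items[0].Name

	if err := client.CoreV1().Pods("kube-system").Delete(oldName, &metav1.DeleteOptions{}); err != nil {
		t.Fatalf("deleting apiserver pod %s: %v", oldName, err)
	}

	// 2. Keep hitting the apiserver pods endpoint until the old pod is gone.
	err = wait.Poll(5*time.Second, 5*time.Minute, func() (bool, error) {
		pods, err := client.CoreV1().Pods("kube-system").List(metav1.ListOptions{
			LabelSelector: "k8s-app=kube-apiserver",
		})
		if err != nil {
			return false, nil // apiserver may be briefly unreachable while the checkpoint takes over
		}
		for _, p := range pods.Items {
			if p.Name == oldName {
				return false, nil
			}
		}
		return true, nil
	})
	if err != nil {
		t.Fatalf("old apiserver pod %s never went away: %v", oldName, err)
	}

	// 3. Then go into the controlPlaneReady loop (signature assumed).
	if err := controlPlaneReady(client); err != nil {
		t.Fatalf("control plane did not recover: %v", err)
	}
}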

@ericchiang

e2e tests have been much less flaky lately. Closing.
