This repository has been archived by the owner on Jul 30, 2021. It is now read-only.

Reduce e2e test flakes #824

Closed
ericchiang opened this issue Jan 8, 2018 · 11 comments

ericchiang commented Jan 8, 2018

PR e2e

TestCheckpointer/UnscheduleCheckpointer

2018-01-08 https://jenkins-kube-lifecycle.prod.coreos.systems/job/bootkube-pr-e2e/687/console

PR e2e Calico

Between builds 228 and 259 inclusive, 12 builds failed (~35%).

TestCheckpointer/UnscheduleCheckpointer

2018-01-08 https://jenkins-kube-lifecycle.prod.coreos.systems/job/bootkube-pr-e2e-calico/271/console
2018-01-08 https://jenkins-kube-lifecycle.prod.coreos.systems/job/bootkube-pr-e2e-calico/259/console
2017-12-20 https://jenkins-kube-lifecycle.prod.coreos.systems/job/bootkube-pr-e2e-calico/243/console
2017-12-20 https://jenkins-kube-lifecycle.prod.coreos.systems/job/bootkube-pr-e2e-calico/241/console
2017-12-19 https://jenkins-kube-lifecycle.prod.coreos.systems/job/bootkube-pr-e2e-calico/236/console
2017-12-19 https://jenkins-kube-lifecycle.prod.coreos.systems/job/bootkube-pr-e2e-calico/233/console
2017-12-19 https://jenkins-kube-lifecycle.prod.coreos.systems/job/bootkube-pr-e2e-calico/230/console
2017-12-18 https://jenkins-kube-lifecycle.prod.coreos.systems/job/bootkube-pr-e2e-calico/228/console

TestDeleteAPI

2018-01-02 https://jenkins-kube-lifecycle.prod.coreos.systems/job/bootkube-pr-e2e-calico/253/console

Network failures

2018-01-02 https://jenkins-kube-lifecycle.prod.coreos.systems/job/bootkube-pr-e2e-calico/254/console

Unit test failures

2017-12-20 https://jenkins-kube-lifecycle.prod.coreos.systems/job/bootkube-pr-e2e-calico/239/console

Log collection

2017-12-19 https://jenkins-kube-lifecycle.prod.coreos.systems/job/bootkube-pr-e2e-calico/232/console

Control plane failed

2017-12-18 https://jenkins-kube-lifecycle.prod.coreos.systems/job/bootkube-pr-e2e-calico/229/console

TestSmoke (probably actually the reboot test)

2018-01-08 https://jenkins-kube-lifecycle.prod.coreos.systems/job/bootkube-pr-e2e/683/console
2018-01-12 https://jenkins-kube-lifecycle.prod.coreos.systems/job/bootkube-pr-e2e/707/console


diegs commented Jan 8, 2018

PR e2e

Smoke test (probably actually the reboot test):

2018-01-08 https://jenkins-kube-lifecycle.prod.coreos.systems/job/bootkube-pr-e2e/683/console

@ericchiang

/assign @colemickens

@k8s-ci-robot

@ericchiang: GitHub didn't allow me to assign the following users: colemickens.

Note that only kubernetes-incubator members can be assigned.

In response to this:

/assign @colemickens

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@colemickens

/assign @colemickens

@ericchiang

Even with Cole's PR we're still seeing flakes in TestReboot and, consequently, TestSmoke. #864

When reboots fail it's really hard to debug the underlying issue. I think it'd be worth re-investing in efforts to make our log collection better (#783).

We also might want to add retries to TestSmoke, or ensure during TestReboot that the control plane is up for some amount of time before marking it successful.
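
Roughly what I have in mind for the second option (untested sketch; the client variable stands in for whatever clientset the e2e harness already builds, and the durations are arbitrary):

package e2e

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForStableControlPlane succeeds only once the apiserver has answered
// continuously for the whole stability window. Any failure resets the window,
// so a control plane that flaps right after the reboot won't pass.
func waitForStableControlPlane(client kubernetes.Interface, window, timeout time.Duration) error {
	var healthySince time.Time
	return wait.Poll(5*time.Second, timeout, func() (bool, error) {
		if _, err := client.Discovery().ServerVersion(); err != nil {
			healthySince = time.Time{} // reset the window on any failure
			return false, nil
		}
		if healthySince.IsZero() {
			healthySince = time.Now()
		}
		return time.Since(healthySince) >= window, nil
	})
}

TestReboot could then require something like waitForStableControlPlane(client, 2*time.Minute, 10*time.Minute) before declaring success, instead of passing on the first successful response.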


rphillips commented Feb 13, 2018

I think I have a handle on the TestReboot flakes. TestReboot errors following the TestDeleteAPI test case: the apiserver is deleted from the checkpointer, and the apiserver never comes back. From what I understand of the process, the checkpointer would normally restore the apiserver, but it cannot, because it is stuck in a loop requesting secrets and failing.

There is also a problem in TestDeleteAPI: it waits for any API server to return, when it should instead wait for the control plane to be ready again. That is fixed in PR #892.

/cc @diegs

checkpointer logs:

I0213 21:11:36.919101       1 main.go:47] Determining environment from downward API
I0213 21:11:36.919704       1 main.go:53] Trying to acquire the flock at "/var/run/lock/pod-checkpointer.lock"
I0213 21:11:36.919743       1 main.go:58] Starting checkpointer for node: ip-10-8-0-4.us-west-1.compute.internal
I0213 21:11:36.950157       1 runtime_service.go:22] Connecting to runtime service unix:///var/run/dockershim.sock
I0213 21:11:52.042476       1 process.go:197] Checkpoint kube-system/kube-apiserver-qqh5w (active) transitioning from state active -> state active (grace period)
I0213 21:12:56.302879       1 state.go:306] Grace period exceeded for state active (grace period)
I0213 21:12:56.303016       1 state.go:207] Grace period exceeded for state inactive (grace period)
I0213 21:12:56.303054       1 process.go:197] Checkpoint kube-system/kube-apiserver-qqh5w (active (grace period)) transitioning from state active (grace period) -> state remove
I0213 21:12:56.303120       1 process.go:268] Removing checkpoint of: kube-system/kube-apiserver-qqh5w
W0213 21:13:01.328985       1 apiserver.go:20] Unable to contact APIServer, skipping garbage collection: Get https://13.57.193.74:6443/api/v1/namespaces/kube-system/pods?fieldSelector=spec.nodeName%3Dip-10-8-0-4.us-west-1.compute.internal: dial tcp 13.57.193.74:6443: getsockopt: connection refused
W0213 21:13:06.408007       1 apiserver.go:20] Unable to contact APIServer, skipping garbage collection: Get https://13.57.193.74:6443/api/v1/namespaces/kube-system/pods?fieldSelector=spec.nodeName%3Dip-10-8-0-4.us-west-1.compute.internal: dial tcp 13.57.193.74:6443: getsockopt: connection refused
W0213 21:13:11.441253       1 apiserver.go:20] Unable to contact APIServer, skipping garbage collection: Get https://13.57.193.74:6443/api/v1/namespaces/kube-system/pods?fieldSelector=spec.nodeName%3Dip-10-8-0-4.us-west-1.compute.internal: dial tcp 13.57.193.74:6443: getsockopt: connection refused
W0213 21:13:16.469474       1 apiserver.go:20] Unable to contact APIServer, skipping garbage collection: Get https://13.57.193.74:6443/api/v1/namespaces/kube-system/pods?fieldSelector=spec.nodeName%3Dip-10-8-0-4.us-west-1.compute.internal: dial tcp 13.57.193.74:6443: getsockopt: connection refused
W0213 21:13:21.499719       1 apiserver.go:20] Unable to contact APIServer, skipping garbage collection: Get https://13.57.193.74:6443/api/v1/namespaces/kube-system/pods?fieldSelector=spec.nodeName%3Dip-10-8-0-4.us-west-1.compute.internal: dial tcp 13.57.193.74:6443: getsockopt: connection refused
W0213 21:13:26.525189       1 apiserver.go:20] Unable to contact APIServer, skipping garbage collection: Get https://13.57.193.74:6443/api/v1/namespaces/kube-system/pods?fieldSelector=spec.nodeName%3Dip-10-8-0-4.us-west-1.compute.internal: dial tcp 13.57.193.74:6443: getsockopt: connection refused
W0213 21:13:31.639555       1 apiserver.go:20] Unable to contact APIServer, skipping garbage collection: Get https://13.57.193.74:6443/api/v1/namespaces/kube-system/pods?fieldSelector=spec.nodeName%3Dip-10-8-0-4.us-west-1.compute.internal: dial tcp 13.57.193.74:6443: getsockopt: connection refused
W0213 21:13:36.666749       1 apiserver.go:20] Unable to contact APIServer, skipping garbage collection: Get https://13.57.193.74:6443/api/v1/namespaces/kube-system/pods?fieldSelector=spec.nodeName%3Dip-10-8-0-4.us-west-1.compute.internal: dial tcp 13.57.193.74:6443: getsockopt: connection refused
W0213 21:13:41.727572       1 apiserver.go:20] Unable to contact APIServer, skipping garbage collection: Get https://13.57.193.74:6443/api/v1/namespaces/kube-system/pods?fieldSelector=spec.nodeName%3Dip-10-8-0-4.us-west-1.compute.internal: dial tcp 13.57.193.74:6443: getsockopt: connection refused
W0213 21:13:46.755042       1 apiserver.go:20] Unable to contact APIServer, skipping garbage collection: Get https://13.57.193.74:6443/api/v1/namespaces/kube-system/pods?fieldSelector=spec.nodeName%3Dip-10-8-0-4.us-west-1.compute.internal: dial tcp 13.57.193.74:6443: getsockopt: connection refused
W0213 21:13:51.818417       1 apiserver.go:20] Unable to contact APIServer, skipping garbage collection: Get https://13.57.193.74:6443/api/v1/namespaces/kube-system/pods?fieldSelector=spec.nodeName%3Dip-10-8-0-4.us-west-1.compute.internal: dial tcp 13.57.193.74:6443: getsockopt: connection refused
W0213 21:13:56.851908       1 apiserver.go:20] Unable to contact APIServer, skipping garbage collection: Get https://13.57.193.74:6443/api/v1/namespaces/kube-system/pods?fieldSelector=spec.nodeName%3Dip-10-8-0-4.us-west-1.compute.internal: dial tcp 13.57.193.74:6443: getsockopt: connection refused
E0213 21:13:56.853150       1 process.go:246] Failed to checkpoint secrets for pod kube-system/pod-checkpointer-bpl4p: failed to checkpoint secret for pod kube-system/pod-checkpointer-bpl4p: failed to retrieve secret kube-system/pod-checkpointer-token-tc724: Get https://13.57.193.74:6443/api/v1/namespaces/kube-system/secrets/pod-checkpointer-token-tc724: dial tcp 13.57.193.74:6443: getsockopt: connection refused
W0213 21:14:01.886547       1 apiserver.go:20] Unable to contact APIServer, skipping garbage collection: Get https://13.57.193.74:6443/api/v1/namespaces/kube-system/pods?fieldSelector=spec.nodeName%3Dip-10-8-0-4.us-west-1.compute.internal: dial tcp 13.57.193.74:6443: getsockopt: connection refused
W0213 21:14:06.911519       1 apiserver.go:20] Unable to contact APIServer, skipping garbage collection: Get https://13.57.193.74:6443/api/v1/namespaces/kube-system/pods?fieldSelector=spec.nodeName%3Dip-10-8-0-4.us-west-1.compute.internal: dial tcp 13.57.193.74:6443: getsockopt: connection refused


diegs commented Feb 15, 2018

cc @rphillips @ericchiang

I looked into the failure case you described a little more, and the checkpointer is working as intended. The issue is indeed triggered by having the TestDeleteAPI and TestReboot cases run back to back, and adding the extra wait may help, but the root cause is something else.

Specifically, what I observed was:

  1. TestDeleteAPI deletes the pod but then finishes immediately (the wait logic there is not great, as the comment mentions).
  2. The suite moves on to the TestReboot case, but a new apiserver pod has not been scheduled by the controller-manager yet, so the cluster is still running on the old checkpointed apiserver (this is technically fine).
  3. The node reboots, the checkpointer starts the apiserver back up, and the kubelet starts running all the other pods. So far so good!
  4. However, for some reason the controller-manager gets into a crash-loop. So the new apiserver pod to replace the one that was deleted in step 1 is never created.
  5. Since the checkpointed apiserver is running off of a checkpoint of the old apiserver pod that is now gone, the checkpointer is actually just waiting for the grace period to expire in order to remove it. Once the grace period is up it cleans up that checkpoint (correct behavior), but now there are no apiservers.

We could possibly mitigate this issue by increasing the checkpoint grace period, but that feels like a band-aid solution. The real fix is to find out why the controller-manager is crash-looping when starting up against a self-hosted apiserver. These are the logs I saw:

[master +=] $ kubectl -n kube-system logs kube-controller-manager-d56d486cc-85jmx
I0215 23:15:16.744282       1 controllermanager.go:108] Version: v1.9.2
W0215 23:15:16.744396       1 client_config.go:529] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0215 23:15:16.749116       1 leaderelection.go:174] attempting to acquire leader lease...
E0215 23:15:38.095140       1 leaderelection.go:224] error retrieving resource lock kube-system/kube-controller-manager: endpoints "kube-controller-manager" is forbidden: User "system:serviceaccount:kube-system:kube-controller-manager" cannot get endpoints in the namespace "kube-system"
E0215 23:16:36.923111       1 leaderelection.go:224] error retrieving resource lock kube-system/kube-controller-manager: Get https://10.3.0.1:443/api/v1/namespaces/kube-system/endpoints/kube-controller-manager: dial tcp 10.3.0.1:443: getsockopt: connection refused
E0215 23:16:43.735233       1 leaderelection.go:224] error retrieving resource lock kube-system/kube-controller-manager: endpoints "kube-controller-manager" is forbidden: User "system:serviceaccount:kube-system:kube-controller-manager" cannot get endpoints in the namespace "kube-system"


diegs commented Feb 16, 2018

OK, some more experimenting: extending the grace period (currently 3 minutes in my testing) allows everything else to recover. After a reboot it just takes some time for kube-flannel to get healthy (it needs an apiserver), and other pods seem to be keyed off that. Given the kubelet's restart / exponential-backoff cycle, this can end up taking a little while.

However, this also means that the other tests (e.g. Smoke) finish while the cluster is still recovering. But it does recover.

I propose:

  1. Increase the grace period for the real checkpointer to 3 minutes instead of 30 seconds. This is the actual, meaningful fix for the root cause that can lead to a control-plane outage.
  2. Update TestDeleteAPI and TestReboot to have better wait conditions so that the cluster is fully recovered before each moves on (this probably means looking at specific pod names and such). This would also remove the cause of the flakiness, and isn't strictly needed given fix 1, but it feels more correct in terms of what the tests are trying to assert.

The downside of increasing the grace period is that sometimes we'll run an old pod for longer than it technically should. But since the checkpoints are only activated in adverse conditions I think this is ok.
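
For context, the removal decision that the grace period governs boils down to something like this (a much-simplified sketch of the behavior described above, not the actual checkpointer code):

package sketch

import "time"

// shouldRemoveCheckpoint illustrates why the checkpointed apiserver disappears
// in the failure above: once the parent pod has been gone from the apiserver
// for longer than the grace period, the checkpoint is garbage collected, even
// if nothing has been scheduled to replace it.
func shouldRemoveCheckpoint(parentPodExists bool, missingSince time.Time, gracePeriod time.Duration) bool {
	if parentPodExists {
		return false // parent pod is still known to the apiserver: keep the checkpoint
	}
	return time.Since(missingSince) > gracePeriod
}

With the current 30-second grace period that window can close before the rest of the control plane has recovered from a reboot; 3 minutes gives flannel and the controller-manager time to come back.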

diegs added a commit to diegs/bootkube that referenced this issue Feb 16, 2018
When nodes reboot, such as in the TestReboot e2e test case, it can take
a while for the cluster to stabilize due to the dependency chain between
the apiserver, flannel, the controller manager, and so on.

If the controller manager was in the middle of doing something (e.g.
rolling the apiserver) while a reboot occurs, we need to ensure that the
controller manager gets healthy again. This requires keeping the
checkpointed apiserver up.

The downside is that this may run pods considerably longer than they
ought to. However, this is a failure recovery scenario, and running an
old pod is not a huge violation of k8s semantics (daemonsets strive for
1-at-a-time semantics but don't guarantee it).

This should alleviate the flakes observed in kubernetes-retired#824.
@rphillips

@diegs latest master has an improved wait within the TestDeleteAPI code. Is there further improvement to this wait that you are considering?


diegs commented Feb 16, 2018

@rphillips yeah, waiting for no checkpointed pods is part of it.

I would replace the hacky test (it's basically useless now) https://github.com/kubernetes-incubator/bootkube/blob/master/e2e/deleteapi_test.go#L36 with a more targeted test:

  1. Record the apiserver pod's name before you delete it.
  2. Keep hitting the apiserver pods endpoint to make sure it's gone (and perhaps that the checkpoint is running).
  3. Then go into the controlPlaneReady loop.

I can write a quick PR for this.
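
Something along these lines (sketch only; newClient and controlPlaneReady stand for helpers the e2e suite already has, and their signatures plus the k8s-app=kube-apiserver label are assumptions):

package e2e

import (
	"testing"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
)

func TestDeleteAPI(t *testing.T) {
	client := newClient(t) // placeholder for the suite's existing client setup

	// 1. Record the current apiserver pod's name before deleting it.
	pods, err := client.CoreV1().Pods("kube-system").List(metav1.ListOptions{
		LabelSelector: "k8s-app=kube-apiserver", // label assumed, adjust to the manifests
	})
	if err != nil || len(pods.Items) == 0 {
		t.Fatalf("listing apiserver pods: %v", err)
	}
	oldName := pods.Items[0].Name

	if err := client.CoreV1().Pods("kube-system").Delete(oldName, &metav1.DeleteOptions{}); err != nil {
		t.Fatalf("deleting apiserver pod %s: %v", oldName, err)
	}

	// 2. Keep hitting the apiserver pods endpoint until the old pod is gone.
	err = wait.Poll(5*time.Second, 5*time.Minute, func() (bool, error) {
		pods, err := client.CoreV1().Pods("kube-system").List(metav1.ListOptions{
			LabelSelector: "k8s-app=kube-apiserver",
		})
		if err != nil {
			return false, nil // apiserver may be briefly unreachable while the checkpoint takes over
		}
		for _, p := range pods.Items {
			if p.Name == oldName {
				return false, nil
			}
		}
		return true, nil
	})
	if err != nil {
		t.Fatalf("old apiserver pod %s never went away: %v", oldName, err)
	}

	// 3. Then go into the controlPlaneReady loop (signature assumed).
	if err := controlPlaneReady(client); err != nil {
		t.Fatalf("control plane did not recover: %v", err)
	}
}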

@ericchiang

e2e tests have been much less flaky lately. Closing.
