master reboot test is failing #262
There are a few different issues at play here, but the core issue seems to be that the checkpointer relies on the kubelet to determine local pod state. Unfortunately, that kubelet API endpoint just reports state from the last time the kubelet was able to successfully contact an api-server; essentially, it is a cache of the last state reported to the API. We didn't have this same issue with the old checkpointer, because it would determine "api is running" by reaching out to an api-server directly, but that isn't reliable in multi-master setups, or for generic checkpointing.

This is a rather disappointing discovery, as it seems there is no easy way to determine local pod state if an api-server is not available. Some options moving forward:
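For context, here is a minimal sketch of the kind of local-state query described above, assuming the kubelet's read-only port (commonly 10255 at the time) and its `/pods` endpoint; the URL and the shape of the handling are assumptions, not the checkpointer's actual code:

```go
// Hypothetical sketch: ask the kubelet's read-only API for its view of
// local pods. Port and endpoint are assumptions; the response mirrors a
// v1.PodList. Note the caveat from the discussion above: this is only the
// kubelet's cached view, which may be stale if the api-server has been
// unreachable.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

func localPods() (map[string]interface{}, error) {
	client := &http.Client{Timeout: 3 * time.Second}
	resp, err := client.Get("http://127.0.0.1:10255/pods")
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	var podList map[string]interface{}
	if err := json.NewDecoder(resp.Body).Decode(&podList); err != nil {
		return nil, err
	}
	return podList, nil
}

func main() {
	pods, err := localPods()
	if err != nil {
		fmt.Println("kubelet unreachable:", err)
		return
	}
	fmt.Printf("kubelet reported a pod list with %d top-level fields\n", len(pods))
}
```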
I am not too familiar with the checkpointer itself and the constraints that surround it, but I believe that relying on liveness probes could be a decent solution. It is simple, users are familiar with the concept, and they would expect the checkpointer to rely on them. It is also user-customizable in a way. Additionally, the project already vendors Kubernetes, so the functions that achieve this could be re-used directly (less code to maintain, better integration overall). The downside is that if we do not want to add the requirement of defining probes, this is only part of the solution.
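As a rough illustration of what "relying on liveness probes" could look like from the checkpointer's side, here is a hedged sketch that issues an HTTP GET against a probe endpoint. The URL shown (`/healthz` on 8080) is an assumption; a real implementation would build it from the pod's livenessProbe definition, and, as noted below, a successful response only proves that *something* is answering on that port:

```go
// Hypothetical sketch: hit a container's HTTP liveness endpoint directly.
// The URL is an assumption. Note the limitation discussed in this thread:
// a 2xx/3xx response cannot tell us whether the responder is the active
// checkpoint or the real (parent) api-server, only that one of them is up.
package main

import (
	"fmt"
	"net/http"
	"time"
)

func probeHealthy(url string) bool {
	client := &http.Client{Timeout: 2 * time.Second}
	resp, err := client.Get(url)
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode >= 200 && resp.StatusCode < 400
}

func main() {
	fmt.Println("healthy:", probeHealthy("http://127.0.0.1:8080/healthz"))
}
```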
The other part of this issue is that api-server behavior changed in v1.5. In earlier versions, the api-server would continually retry binding to particular addresses (:443 & :8080); if they were already in use, it would just try again in 15 seconds. In v1.5, the api-server exits immediately if it is not able to listen on those addresses (and relies on external mechanisms such as systemd/kubelet/etc. to restart it).

In terms of the checkpointer, we need a reliable way to determine "the real api-server is running, or it is trying to run". The "trying to run" part is important, because in that situation we need to remove an active checkpoint so the real server can actually start (and bind on 443/8080). Even if we check the liveness probe, we don't know *what* we are checking (only that someone happens to be listening on 8080). It could be an active checkpoint, or it could be the active parent. All we can determine is that one of them happens to be running, but we can't make very reliable or actionable decisions based on just that information (I think adding the liveness check will work, just not in a particularly clean way). For example, the issue I was seeing was:
One other option which comes to mind after typing this: ensuring that the local docker state (of failed pods) exists for a longer window. I think this might help because I believe the kubelet determines what pods it needs to restart by inspecting information serialized into the docker containers (otherwise a reboot of a kubelet would mean all local state is lost until an api-server is available). So my hunch is that the api-server pod is being garbage collected in the window before the kubelet knows to restart it. If we leave that state around longer, we could have a better recovery window.
Will experiment with waiting for a file lock on API server start.
For posterity: there is only minimal state actually stored with the docker containers: https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/dockertools/labels.go#L70 So it's not actually possible to recover local state from this info alone (it is mapped to the internal kubelet pod state).

Another option @Quentin-M and I came up with is to have the api-server use file locks to coordinate between the parent and the checkpoint. This way we don't end up in failure loops when both are running but only one can successfully listen on the host address.
Should be closed by: #264
This commit represents a workaround for kubernetes-retired#262. By maintaining a file lock while the API server is running (either temporary or self-hosted), we prevent the self-hosted API server from starting and trying to bind ports, until the temporary one is stopped. Therefore, we avoid the loop where the self-hosted API server would crash as soon as it is brought up due to the ports already being bound by the stopping temporary server.
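A minimal sketch of the flock-based coordination described above, assuming a Linux host and an agreed-upon lock path (the path shown is hypothetical): whichever api-server instance holds the lock runs, and the other blocks until the lock is released.

```go
// Hypothetical sketch of the file-lock coordination: both the temporary and
// the self-hosted api-server wrappers try to take an exclusive lock on the
// same file before binding ports. The lock path is an assumption.
package main

import (
	"fmt"
	"os"
	"syscall"
)

const lockPath = "/var/lock/api-server.lock" // hypothetical path

func acquireLock() (*os.File, error) {
	f, err := os.OpenFile(lockPath, os.O_CREATE|os.O_RDWR, 0600)
	if err != nil {
		return nil, err
	}
	// LOCK_EX blocks until the other instance releases the lock,
	// i.e. until the temporary api-server has fully stopped.
	if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX); err != nil {
		f.Close()
		return nil, err
	}
	return f, nil
}

func main() {
	f, err := acquireLock()
	if err != nil {
		fmt.Println("could not take api-server lock:", err)
		os.Exit(1)
	}
	defer f.Close() // closing the fd releases the flock
	fmt.Println("lock held; safe to bind :443/:8080 and start the api-server")
}
```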
This seemed to slip into the master branch while the tests were temporarily broken over some of the self-hosted-flannel changes. If I reboot a master node, the api-server doesn't come back up. @aaronlevy is already on it and suspects what the problem is.