This repository has been archived by the owner on Feb 1, 2021. It is now read-only.

Proposal: Container Rescheduling #1488

Closed
aluzzardi opened this issue Dec 3, 2015 · 39 comments

@aluzzardi
Contributor

Background

The goal of this proposal is to reschedule containers automatically in case of node failure.

This is currently one of the top requested features for Swarm.

Configuration

The behavior should be user controllable and disabled by default since rescheduling can have nasty effects on stateful containers.

The user can select the policy at run time using the reschedule environment variable:

docker run -e reschedule:on-node-failure redis

Possible values for reschedule are:

  • no: Never reschedule the container (default).
  • on-node-failure: Reschedule the container whenever the node fails.

The reason this is more complicated than a plain yes/no is that in the future we might have more sophisticated rescheduling policies (for instance, we might want to reschedule containers in order to re-spread or re-pack them). Open question: Is this really necessary?

Rescheduling policies will be stored as a container label: com.docker.swarm.reschedule-policy
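For illustration only, here is a minimal Go sketch of how a scheduler might resolve the policy from that label; the type and function names are hypothetical, not the actual Swarm implementation. Presumably the -e reschedule:on-node-failure form above would be translated into this label by Swarm at container creation time.

package reschedule

// Label name proposed above for storing the policy on the container.
const policyLabel = "com.docker.swarm.reschedule-policy"

type Policy string

const (
    PolicyNo            Policy = "no"              // never reschedule (default)
    PolicyOnNodeFailure Policy = "on-node-failure" // reschedule when the node fails
)

// PolicyFromLabels reads the rescheduling policy stored on a container,
// defaulting to "no" when the label is absent or unrecognized.
func PolicyFromLabels(labels map[string]string) Policy {
    if labels[policyLabel] == string(PolicyOnNodeFailure) {
        return PolicyOnNodeFailure
    }
    return PolicyNo
}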

Persistence

Ideally, Swarm would store all containers (at least those that should be rescheduled) persistently. That way, the manager can figure out which containers are down and take action.

Unfortunately, we currently don't have shared state, and this feature has been postponed for a long time because of that.

Since this is one of the top requested features, I propose we take a different approach until we have shared state (that feature has been postponed for usability concerns - we don't want to make a kv store a dependency for Swarm).

By storing the rescheduling policy as a container label, we are able to reconstruct the desired state at startup time.

Since we are already storing constraints, affinities etc as container labels (exactly for this reason), the manager will have all the information it needs to perform rescheduling.

This means we can restart the manager as much as we want and it will resume rescheduling as expected.

However, the problem arises when a node goes down while the manager is not running: in that case, we won't "remember" that container even existed when the manager is started again.

This situation can be counter-balanced by using replication. The rescheduler would be running on the primary manager and, upon failure, the replica that gets elected primary would be taking over rescheduling responsibilities.

Since every manager is aware of the cluster state (containers & rescheduling policy), it means that as long as at least one manager is still running we won't forget about containers.
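To make the label-based reconstruction concrete, here is a minimal sketch, assuming hypothetical Container/Node types and reusing PolicyFromLabels from the sketch above: at startup (or after a failover), the manager would walk all known containers and collect the ones that ask for rescheduling and sit on an unhealthy node.

package reschedule

type Node struct {
    ID      string
    Healthy bool
}

type Container struct {
    ID     string
    Labels map[string]string
    Node   *Node
}

// ContainersToReschedule rebuilds the "desired state" purely from container
// labels: anything marked on-node-failure that currently sits on an unhealthy
// node is a candidate for rescheduling.
func ContainersToReschedule(containers []*Container) []*Container {
    var candidates []*Container
    for _, c := range containers {
        if PolicyFromLabels(c.Labels) == PolicyOnNodeFailure && !c.Node.Healthy {
            candidates = append(candidates, c)
        }
    }
    return candidates
}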

Failure detection

This functionality is already provided by cluster/engine.

It actively heartbeats every node in the cluster every X seconds; after Y consecutive failures, the node is marked as unhealthy.

Rescheduling can rely on the health status already available.
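As an illustration only (the real logic lives in Swarm's cluster/engine package and uses different names), the heartbeat/failure-count behavior described above amounts to something like:

package reschedule

import "time"

type engineMonitor struct {
    interval    time.Duration // "X seconds" between heartbeats
    maxFailures int           // "Y failures" before the node is marked unhealthy
    ping        func() error  // one heartbeat attempt against the engine
    failures    int
    healthy     bool
}

func (m *engineMonitor) run(stop <-chan struct{}) {
    ticker := time.NewTicker(m.interval)
    defer ticker.Stop()
    m.healthy = true
    for {
        select {
        case <-stop:
            return
        case <-ticker.C:
            if err := m.ping(); err != nil {
                m.failures++
                if m.failures >= m.maxFailures {
                    m.healthy = false // rescheduling keys off this flag
                }
            } else {
                m.failures = 0
                m.healthy = true
            }
        }
    }
}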

Resurrection

Eventually, a node may come back to life and re-join the cluster. If the node has containers that were rescheduled, we will end up with duplicates.

Swarm should monitor incoming nodes and, upon detecting a duplicate container, it should destroy the oldest one (keeping the most recently created container alive). This behavior could eventually be made configurable by the user (keep oldest, keep newest, ...), although we may want to avoid providing that option until we see a valid use case.
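A minimal sketch of that duplicate-resolution rule, with hypothetical fields (Name as the identity of the "same" container, Created as its creation timestamp):

package reschedule

import "sort"

type containerInfo struct {
    ID      string
    Name    string
    Created int64 // unix timestamp of container creation
}

// duplicatesToRemove groups containers by name and returns every copy except
// the most recently created one in each group.
func duplicatesToRemove(containers []containerInfo) []containerInfo {
    byName := map[string][]containerInfo{}
    for _, c := range containers {
        byName[c.Name] = append(byName[c.Name], c)
    }
    var remove []containerInfo
    for _, group := range byName {
        if len(group) < 2 {
            continue
        }
        // Sort newest first, then remove everything but the newest copy.
        sort.Slice(group, func(i, j int) bool { return group[i].Created > group[j].Created })
        remove = append(remove, group[1:]...)
    }
    return remove
}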

If duplicate containers were started with a --restart option, there is going to be a small window during which both containers are running at the same time. This can be a serious problem if only one instance of that container is supposed to run at any one time.

We could force all containers that have rescheduling enabled to never automatically restart. In that case, whenever a node joins, Swarm could decide to either start the containers or destroy them if they are duplicates.

However, there are many drawbacks to this approach:

  • Restart policies have to be handled by Swarm. This introduces high complexity since we'd have to re-implement things such as --restart=on-failure:5, which requires maintaining a lot of state
  • If the manager is down, containers won't start automatically. This is a serious issue since it could lead to outages. Up until now, if Swarm is down the engines continue to operate normally and this change would break that contract.
  • Swarm might miss events, leading to containers not getting properly restarted

Furthermore, it doesn't actually entirely solve the issue. If the node didn't actually die (e.g. it just froze for a while, there was a netsplit, networking temporarily dropped, ...) we will end up with duplicate containers running for a while anyway.

Given all the potential issues that might arise by handling the restart policy on the Swarm side and the fact that duplicate containers may end up running at the same time anyway, I suggest we do not interfere with --restart and document the fact that rescheduled containers may be running in parallel for a short time window.

Networking

When rescheduling containers, Swarm must handle multi-host networking properly.

The goal is for the new container to take over the previous one.

In an overlay network setup, this may involve:

  • Making sure the new container takes over the IP address of the old container
  • Ensuring service discovery works properly
  • Cutting the old container off the network before starting the new one. Even though we presume the node to be down, it might still be up and running. Disconnecting the old container would alleviate side effects of duplicate containers.
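Assuming the takeover were done with today's Docker Engine Go client (which postdates this thread; the network/container names and the reused IP below are placeholders), the IP takeover and cut-off steps could look roughly like this sketch:

package main

import (
    "context"
    "log"

    "github.com/docker/docker/api/types/network"
    "github.com/docker/docker/client"
)

func main() {
    ctx := context.Background()
    cli, err := client.NewClientWithOpts(client.FromEnv)
    if err != nil {
        log.Fatal(err)
    }

    const (
        overlayNet   = "mynet"         // placeholder overlay network name
        oldContainer = "old-container" // container on the (presumed) dead node
        newContainer = "new-container" // rescheduled replacement
        reusedIP     = "10.0.0.42"     // IP the new container should take over
    )

    // 1. Cut the old container off the network first (force, since its node
    //    may be unreachable or only partially failed).
    if err := cli.NetworkDisconnect(ctx, overlayNet, oldContainer, true); err != nil {
        log.Printf("disconnect old container: %v", err)
    }

    // 2. Attach the replacement with the same address so service discovery
    //    keeps resolving to the "same" endpoint.
    err = cli.NetworkConnect(ctx, overlayNet, newContainer, &network.EndpointSettings{
        IPAMConfig: &network.EndpointIPAMConfig{IPv4Address: reusedIP},
    })
    if err != nil {
        log.Fatal(err)
    }
}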
@aluzzardi
Contributor Author

/cc @docker/swarm-maintainers @mavenugo @mrjana @jpetazzo @dnephin

@chanwit
Contributor

chanwit commented Dec 3, 2015

I too chose this in the survey 👍

@cultureulterior

I will note that Amazon ECS (which I'm now using, because swarm did not deliver this feature in time) does rebalancing by disabling local container restarts. I also don't think having a KV store as a dependency is out of the question, as long as it is swappable.

@mrjana

mrjana commented Dec 3, 2015

@aluzzardi Is IP stability a hard requirement? As long as we remap the container name to a different IP, the service discovery part should handle mapping any new client requests to the new IP (and hence to the new container). Wouldn't that be enough?

@calind

calind commented Dec 4, 2015

For me this is the missing piece for running swarm in production.

I think that it's ok to have this feature dependent on a KV store, as the docker overlay network requires it and I don't see any point in running swarm without an overlay network (ok, this can be swapped, but the majority of implementations rely on a kv store).

@vieux vieux self-assigned this Dec 9, 2015
@clintkitson

I would add the volume driver portion in here as well. If there is a container that has external volumes attached and is requested on another host, then it should be the case that the same volumes are brought to the new host.

In the case of REX-Ray (rexray/rexray#190) it now has pre-emption built into most drivers. This means that the new requesting container runtime will cause a forceful mount which detaches it from any host that currently has it. The setting is currently a global setting at a driver level for us, but it would be an interesting addition to the volume plugins to allow a flag on mount that gets used by Swarm to tell it to pre-empt or force mount in the case of re-scheduling. Typically we wouldn't want to enable pre-emption since it is a safety feature to block mounting from multiple hosts or block detaching/attaching unintentionally. cc @cpuguy83

Drivers that don't have a forceful mount option or pre-emption will cause the containers that get requested on a new host to error, since their volume will not be able to be unmounted. The exception here depends on the storage platform. For example, EC2 and OpenStack disallow this by default. This makes sense for safety, as we want to make people be explicit about mounting a volume to multiple hosts or doing detach/attach operations.

@vieux
Contributor

vieux commented Dec 11, 2015

@clintonskitson thanks, I'll update the proposal to include some text about volumes

@cpuguy83
Contributor

How come you want to use an extra label?
Using the restart policies seems like it would be good enough for rescheduling.
restart=always -- always reschedule
restart=on-failure -- reschedule if the failure count is ok, node failure should not affect the failure count
restart=unless-stopped -- reschedule if the last state was not stopped

Also need to account for paused containers... I have a feeling that these should not be rescheduled ever.

@schmunk42

I'd have a question about this topic.

Currently the docker daemon handles restarting of containers. But isn't there a conflict between daemon and swarm manager when it comes to rescheduling?

The (already discussed) scenario I am referring to is: There's a node-failure, swarm master would reschedule containers to healthy nodes. Now the failed node gets healed...

It comes up and the daemon starts containers according to their restart policy, but they will be duplicated, since the swarm manager has already rescheduled them.

So, should restarting/rescheduling be handled exclusively by either the docker daemon or the swarm manager?

@abronan
Contributor

abronan commented Dec 12, 2015

@schmunk42 Agreed, I commented on that on the old proposal: #599 (comment)

We need to clean up one or the other, or the container will end up being duplicated.

@geovanisouza92

+1

@clintkitson

I wanted to throw in another idea re volumes here.

Volumes could also have to do with container placement. For example, if a volume is specified with multiple containers, then the assumption is that you would want to share data between those containers. Swarm would then place them on the same host.

A second case would be the --volumes-from flag. This should have similar functionality.

Otherwise the container being requested is going to fail to start for those volume drivers that cannot share volumes between hosts.


@aluzzardi
Contributor Author

@cpuguy83 I think --restart and re-scheduling are incompatible.

For instance, let's say that you start a mysql with --restart=always.

Unless you are using a distributed volume, you definitely DO NOT want Swarm to create a brand new mysql somewhere else with no data. And you definitely do not want Swarm to destroy that container when the machine finally comes back.

You might want to always restart but never re-schedule, or you might want to get both.

@aluzzardi
Contributor Author

As an alternative, we would be free to mess around with the restart policies such as --restart=reschedule:on-node-failure.

@cpuguy83
Contributor

@aluzzardi Is there something we can add to hint to swarm the scope of the volume (local vs global?)

@aluzzardi
Contributor Author

@cpuguy83 Well it's totally fine to re-schedule containers that have a local volume if the user says so

@cpuguy83
Contributor

In such a case it may be better to only support the explicit case of not rescheduling containers that do have a restart policy.
So, --restart=always -e reschedule=no
This way we can use the restart policy for normal use-cases.

Alternatively, maybe restart policies could be modified to accept conditions like --restart=always,!node-failure.
Engine would continue to care about only the values it currently does (always, on-failure, unless-stopped), and other things in the stack can add their own w/o affecting the actual runtime.
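A hedged sketch of that parsing idea, with hypothetical names (the engine's real restart-policy parser is a separate thing): split the flag value into the part the engine already understands and extra conditions that higher layers such as Swarm could interpret while the engine simply ignores them.

package reschedule

import "strings"

type restartSpec struct {
    EnginePolicy string   // "always", "on-failure", "unless-stopped", "no"
    Extra        []string // e.g. "!node-failure"; ignored by the engine
}

// parseRestartSpec splits a value like "always,!node-failure" into the
// engine-level policy and any extra, layer-specific conditions.
func parseRestartSpec(value string) restartSpec {
    parts := strings.Split(value, ",")
    spec := restartSpec{EnginePolicy: parts[0]}
    spec.Extra = append(spec.Extra, parts[1:]...)
    return spec
}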

@aluzzardi
Contributor Author

@cpuguy83 The thing is, if the user makes a mistake in setting those flags (or simply doesn't know) we're talking about nuking a production database

@aluzzardi aluzzardi added this to the 1.1.0 milestone Dec 15, 2015
@cpuguy83
Contributor

@aluzzardi In that case, I'd almost prefer to not reschedule containers with volumes (unless explicitly specified through some configuration) until we can figure out a way to make it just work with restart policies... but maybe there is no perfect world here.

Also wondering if there's a plan to have some delay once a host is marked as unhealthy to do rescheduling (or maybe that's just the health check itself).

@vieux
Contributor

vieux commented Dec 19, 2015

@cpuguy83 by default containers aren't rescheduled at all, volumes or not; it's already an opt-in feature.

we could add a delay; what would be the use case?

@cpuguy83
Contributor

@vieux I was thinking in favor of using --restart... but you are right.

For the delay, the intention would be to allow the supposedly down node a recovery period.

@HackToday
Contributor

@aluzzardi from your suggestion:

Given all the potential issues that might arise by handling the restart policy on the Swarm side and the fact that duplicate containers may end up running at the same time anyway, I suggest we do not interfere with --restart and document the fact that rescheduled containers may be running in parallel for a short time window.

Do you mean that the rescheduled containers are allowed to be duplicated for some time, and that finally Swarm would delete the oldest container and keep the newest one?

@vieux
Contributor

vieux commented Dec 23, 2015

@HackToday yes, or we might add a flag to decide which one we should keep (newest or oldest)

@cblomart

Speaking of restarts, rescheduling, and eventually rebalancing seems like speaking of different functionalities.

The way I look at it after reading these few posts is that rebalancing is another world, with considerations like what to do to minimize downtime and volume access (allow duplicates or not).

Restarts and reschedules are the key features that look like high availability.

Restarts might very well be more suitable for stateful services which would require specific volumes.
And it looks to me like reschedules might be more suitable for stateless services.

In the end only the one running the full stack can say what is best.

Restarts are handled at the docker level and reschedules more likely at the swarm level. Although, if you have different tanks linked together and you add a bucket of water to one, it will naturally spill to the other ones. In this sense docker could check with swarm whether it really is up to it to start a container and eventually leave the job to swarm to decide (certainly not mandatory).

Lastly, what I don't grasp is the network implications...
"make sure the new container takes over the ip of the old one"
This certainly looks like a blocking point for starting duplicates... idiotically, the inability to get back the same ip twice might block the restart of an already rescheduled container.

@dongluochen
Contributor

I think "make sure the new container takes over the ip of the old one" is unnecessary and may be harmful. Swarm do not specify the IP for the original container. It only attaches the container to an overlay network where IP is dynamically assigned. How this IP is used is up to user. The same logic applies to the new container. Generally speaking, distributed service should use names, not IPs.

Persisting an IP usually happens on VM live migration, where traffic shouldn't be interrupted. It needs accurate coordination where you have control over both the old and new VMs. That's not the case for failure rescheduling.

@ezrasilvera
Contributor

A few comments:
We should remember that when the master decides that a node "failed" it doesn't necessarily mean that the node is fully disconnected from the rest of the world!! There might be partial network failures which might lead (after the rebalancing) to two identical containers running together. This may result in various conflicts and errors: for example, the "failed" node/containers may still access the storage, causing data corruption, or create IP address or DNS conflicts. This may also happen due to transient network failures.

In order to safely perform such "rebalancing", the failed node (and/or the containers on that node) first needs to be "fenced". There are several approaches to performing such fencing:

  1. Power fencing - i.e., shut down the node
  2. Resource fencing - isolate the node/containers from the storage and network. This is done externally to the node (e.g., fence a node at the switch it is connected to)
  3. Self fencing - the node itself detects the failure and isolates/shuts itself down (this approach has many risks and probably doesn't fit the current Docker/Swarm environment)

@vieux
Contributor

vieux commented Jan 5, 2016

@dongluochen alright, let's make this optional

@ezrasilvera you are right; regarding the networking, it should be handled, since the container will be disconnected from the network by swarm (we can do this even if the node is unreachable)

@ezrasilvera
Contributor

One more comment - would we be able to explicitly activate the reschedule functionality for a "planned evacuation" (i.e., not as a result of a failure)? This might be helpful for planned maintenance, for example.

@vieux
Contributor

vieux commented Jan 5, 2016

@ezrasilvera it's planned in the UX but not in the initial PR.

At first we will only support on-node-failure as policy.

We could definitely imagine on-maintenance: if you combine this with the new node management @dongluochen wrote in #1569, we could extend it.

@devdems

devdems commented Feb 19, 2016

What if you have one node in one location and one in another, and the network between them fails? Would the node in the other location also run the container, as it would not see the other running container?

@michaelzangerle

When rescheduling containers, Swarm must handle multi-host networking properly.

Is there anything planned in the near future regarding rescheduling and multi-host networking?

@beverts312

Are there plans to allow an explicit "rebalance" of all eligible containers in the cluster?

  • A container could be considered eligible for a rebalance if it had the reschedule policy on-node-failure or on-explicit (or something like that).
  • This could be invoked via the swarm API

A potential use case would be that we add node(s) to the cluster and want to utilize the newly available resources without having to explicitly choose which containers to go there.

@viveky4d4v

I am eagerly waiting for this feature; it is really helpful in a clustered environment, whether small or big.
When can we expect this in docker swarm?

@nishanttotla
Contributor

@viveky4d4v are you referring to rebalancing?

@viveky4d4v

@nishanttotla Yes, rebalancing automatically when the node comes back to life. Kind of a resurrection.

Let's say I have one manager & one worker:

  • If the worker dies, the manager will schedule all containers on itself.
  • The worker comes back to life, but the manager doesn't rebalance containers, and all containers sit on the manager.
  • Now if the manager dies, our stack goes down even though the worker is up & running.

Ideally the manager should automatically balance the containers when the worker comes back to life by moving the oldest container back to the worker. I don't know how this can be done without downtime (in case you have just one application container).
OR
the scheduler could make a decision: in the case of a single application container it should not move it back to the worker, to avoid downtime, but it should do so if we have multiple replicas.

PS - I understand it's not an efficient way to use swarm, we should use at least 3 managers, but I got caught up in this situation so I thought I'd get some ideas from the community.

@nishanttotla
Contributor

@viveky4d4v I want to confirm that you mean Docker Swarm standalone (this project, docker/swarm) and not the new Swarm mode released in Docker 1.12. The manager/worker terminology and running container replicas (services) are features of Swarm mode, not this project.

@viveky4d4v

@nishanttotla: Yes, you are correct. MY BAD!

I will raise the issue at "https://github.com/docker/docker/issues/new"

@piotrminkina

I think this issue is implemented. See https://docs.docker.com/swarm/scheduler/rescheduling/

@nishanttotla
Contributor

@piotrminkina right, I think we can close this issue. New issues can and should be opened for issues with rescheduling.
