This repository has been archived by the owner on Feb 1, 2021. It is now read-only.

Proposal: Container Rescheduling #1488

Closed
aluzzardi opened this issue Dec 3, 2015 · 39 comments

@aluzzardi
Contributor

Background

The goal of this proposal is to reschedule containers automatically in case of node failure.

This is currently one of the top requested features for Swarm.

Configuration

The behavior should be user controllable and disabled by default since rescheduling can have nasty effects on stateful containers.

The user can select the policy at run time using the reschedule environment variable:

docker run -e reschedule:on-node-failure redis

Possible values for reschedule are:

  • no: Never reschedule the container (default).
  • on-node-failure: Reschedule the container whenever the node fails.

The reason this is more complicated than a plain yes/no is that in the future we might have more sophisticated rescheduling policies (for instance, we might want to reschedule containers in order to re-spread or re-pack them). Open question: Is this really necessary?

Rescheduling policies will be stored as a container label: com.docker.swarm.reschedule-policy
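For illustration only, here is a minimal Go sketch of how a scheduler might resolve the policy from that label; the type and function names are hypothetical, not the actual Swarm implementation. Presumably the -e reschedule:on-node-failure form above would be translated into this label by Swarm at container creation time.

package reschedule

// Label name proposed above for storing the policy on the container.
const policyLabel = "com.docker.swarm.reschedule-policy"

type Policy string

const (
    PolicyNo            Policy = "no"              // never reschedule (default)
    PolicyOnNodeFailure Policy = "on-node-failure" // reschedule when the node fails
)

// PolicyFromLabels reads the rescheduling policy stored on a container,
// defaulting to "no" when the label is absent or unrecognized.
func PolicyFromLabels(labels map[string]string) Policy {
    if labels[policyLabel] == string(PolicyOnNodeFailure) {
        return PolicyOnNodeFailure
    }
    return PolicyNo
}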

Persistence

Ideally, Swarm would store all containers (at least those that should be rescheduled) persistently. That way, the manager can figure out which containers are down and take action.

Unfortunately, we currently don't have shared state, and this feature has been postponed for a long time because of that.

Since this is one of the top requested features, I propose we take a different approach until we have shared state (that feature has been postponed for usability concerns - we don't want to make a kv store a dependency for Swarm).

By storing the rescheduling policy as a container label, we are able to reconstruct the desired state at startup time.

Since we are already storing constraints, affinities etc as container labels (exactly for this reason), the manager will have all the information it needs to perform rescheduling.

This means we can restart the manager as much as we want and it will resume rescheduling as expected.

However, the problem arises when a node goes down while the manager is not running: in that case, we won't "remember" that container even existed when the manager is started again.

This situation can be counter-balanced by using replication. The rescheduler would be running on the primary manager and, upon failure, the replica that gets elected primary would be taking over rescheduling responsibilities.

Since every manager is aware of the cluster state (containers & rescheduling policy), it means that as long as at least one manager is still running we won't forget about containers.
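To make the label-based reconstruction concrete, here is a minimal sketch, assuming hypothetical Container/Node types and reusing PolicyFromLabels from the sketch above: at startup (or after a failover), the manager would walk all known containers and collect the ones that ask for rescheduling and sit on an unhealthy node.

package reschedule

type Node struct {
    ID      string
    Healthy bool
}

type Container struct {
    ID     string
    Labels map[string]string
    Node   *Node
}

// ContainersToReschedule rebuilds the "desired state" purely from container
// labels: anything marked on-node-failure that currently sits on an unhealthy
// node is a candidate for rescheduling.
func ContainersToReschedule(containers []*Container) []*Container {
    var candidates []*Container
    for _, c := range containers {
        if PolicyFromLabels(c.Labels) == PolicyOnNodeFailure && !c.Node.Healthy {
            candidates = append(candidates, c)
        }
    }
    return candidates
}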

Failure detection

This functionality is already provided by cluster/engine.

It actively heartbeats every node in the cluster every X seconds; after Y consecutive failures, the node is marked as unhealthy.

Rescheduling can rely on the health status already available.
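As an illustration only (the real logic lives in Swarm's cluster/engine package and uses different names), the heartbeat/failure-count behavior described above amounts to something like:

package reschedule

import "time"

type engineMonitor struct {
    interval    time.Duration // "X seconds" between heartbeats
    maxFailures int           // "Y failures" before the node is marked unhealthy
    ping        func() error  // one heartbeat attempt against the engine
    failures    int
    healthy     bool
}

func (m *engineMonitor) run(stop <-chan struct{}) {
    ticker := time.NewTicker(m.interval)
    defer ticker.Stop()
    m.healthy = true
    for {
        select {
        case <-stop:
            return
        case <-ticker.C:
            if err := m.ping(); err != nil {
                m.failures++
                if m.failures >= m.maxFailures {
                    m.healthy = false // rescheduling keys off this flag
                }
            } else {
                m.failures = 0
                m.healthy = true
            }
        }
    }
}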

Resurrection

Eventually, a node may come back to life and re-join the cluster. If the node has containers that were rescheduled, we will end up with duplicates.

Swarm should monitor incoming nodes and, upon detecting a duplicate container, it should destroy the oldest one (keeping the most recently created container alive). This behavior could eventually be made configurable by the user (keep oldest, keep newest, ...), although we may want to avoid providing that option until we see a valid use case.
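A minimal sketch of that duplicate-resolution rule, with hypothetical fields (Name as the identity of the "same" container, Created as its creation timestamp):

package reschedule

import "sort"

type containerInfo struct {
    ID      string
    Name    string
    Created int64 // unix timestamp of container creation
}

// duplicatesToRemove groups containers by name and returns every copy except
// the most recently created one in each group.
func duplicatesToRemove(containers []containerInfo) []containerInfo {
    byName := map[string][]containerInfo{}
    for _, c := range containers {
        byName[c.Name] = append(byName[c.Name], c)
    }
    var remove []containerInfo
    for _, group := range byName {
        if len(group) < 2 {
            continue
        }
        // Sort newest first, then remove everything but the newest copy.
        sort.Slice(group, func(i, j int) bool { return group[i].Created > group[j].Created })
        remove = append(remove, group[1:]...)
    }
    return remove
}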

If duplicate containers were started with a --restart option, there is going to be a small window during which both containers are running at the same time. This can be a serious problem if only one instance of that container is supposed to run at any one time.

We could force all containers that have rescheduling enabled to never automatically restart. In that case, whenever a node joins, Swarm could decide to either start the containers or destroy them if they are duplicates.

However, there are many drawbacks to this approach:

  • Restart policies have to be handled by Swarm. This introduces high complexity since we'd have to re-implement things such as --restart=on-failure:5, which requires maintaining a lot of state
  • If the manager is down, containers won't start automatically. This is a serious issue since it could lead to outages. Up until now, if Swarm is down the engines continue to operate normally and this change would break that contract.
  • Swarm might miss events, leading to containers not getting properly restarted

Furthermore, it doesn't actually entirely solve the issue. If the node didn't actually die (e.g. it just froze for a while, there was a netsplit, networking temporarily dropped, ...) we will end up with duplicate containers running for a while anyway.

Given all the potential issues that might arise by handling the restart policy on the Swarm side and the fact that duplicate containers may end up running at the same time anyway, I suggest we do not interfere with --restart and document the fact that rescheduled containers may be running in parallel for a short time window.

Networking

When rescheduling containers, Swarm must handle multi-host networking properly.

The goal is for the new container to take over the previous one.

In an overlay network setup, this may involve:

  • Making sure the new container takes over the IP address of the old container
  • Ensuring service discovery works properly
  • Cutting the old container off the network before starting the new one. Even though we presume the node to be down, it might still be up and running. Disconnecting the old container would alleviate side effects of duplicate containers.
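Assuming the takeover were done with today's Docker Engine Go client (which postdates this thread; the network/container names and the reused IP below are placeholders), the IP takeover and cut-off steps could look roughly like this sketch:

package main

import (
    "context"
    "log"

    "github.com/docker/docker/api/types/network"
    "github.com/docker/docker/client"
)

func main() {
    ctx := context.Background()
    cli, err := client.NewClientWithOpts(client.FromEnv)
    if err != nil {
        log.Fatal(err)
    }

    const (
        overlayNet   = "mynet"         // placeholder overlay network name
        oldContainer = "old-container" // container on the (presumed) dead node
        newContainer = "new-container" // rescheduled replacement
        reusedIP     = "10.0.0.42"     // IP the new container should take over
    )

    // 1. Cut the old container off the network first (force, since its node
    //    may be unreachable or only partially failed).
    if err := cli.NetworkDisconnect(ctx, overlayNet, oldContainer, true); err != nil {
        log.Printf("disconnect old container: %v", err)
    }

    // 2. Attach the replacement with the same address so service discovery
    //    keeps resolving to the "same" endpoint.
    err = cli.NetworkConnect(ctx, overlayNet, newContainer, &network.EndpointSettings{
        IPAMConfig: &network.EndpointIPAMConfig{IPv4Address: reusedIP},
    })
    if err != nil {
        log.Fatal(err)
    }
}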
@aluzzardi
Contributor Author

/cc @docker/swarm-maintainers @mavenugo @mrjana @jpetazzo @dnephin

@chanwit
Contributor

chanwit commented Dec 3, 2015

I too chose this in the survey 👍

@cultureulterior

I will note that Amazon ECS (which I'm now using, because swarm did not deliver this feature in time) does rebalancing by disabling local container restarts. I also don't think having a KV store as a dependency is out of the question, as long as it is swappable.

@mrjana

mrjana commented Dec 3, 2015

@aluzzardi Is IP stability a hard requirement? As long as we remap the container name to a different IP, the service discovery part should handle mapping any new client requests to the new IP (and hence to the new container). Wouldn't that be enough?

@calind

calind commented Dec 4, 2015

For me this is the missing piece for running swarm in production.

I think that it's ok to have this feature dependent on a KV store, as the docker overlay network requires it and I don't see any point in running swarm without an overlay network (ok, this can be swapped, but the majority of implementations rely on a kv store).

@vieux vieux self-assigned this Dec 9, 2015
@clintkitson

I would add the volume driver portion in here as well. If there is a container that has external volumes attached and is requested on another host, then it should be the case that the same volumes are brought to the new host.

In the case of REX-Ray (rexray/rexray#190) it now has pre-emption built into most drivers. This means that the new requesting container runtime will cause a forceful mount which detaches it from any host that currently has it. The setting is currently a global setting at a driver level for us, but it would be an interesting addition to the volume plugins to allow a flag on mount that gets used by Swarm to tell it to pre-empt or force mount in the case of re-scheduling. Typically we wouldn't want to enable pre-emption since it is a safety feature to block mounting from multiple hosts or block detaching/attaching unintentionally. cc @cpuguy83

Drivers that don't have a forceful mount option or pre-emption will cause the containers that get requested on a new host to error, since their volume will not be able to be unmounted. The exception here depends on the storage platform. For example, EC2 and OpenStack disallow this by default. This makes sense for safety, as we want to make people be explicit about mounting a volume to multiple hosts or doing detach/attach operations.

@vieux
Contributor

vieux commented Dec 11, 2015

@clintonskitson thanks, I'll update the proposal to include some text about volumes

@cpuguy83
Contributor

How come you want to use an extra label?
Using the restart policies seems like it would be good enough for rescheduling.
restart=always -- always reschedule
restart=on-failure -- reschedule if the failure count is ok, node failure should not affect the failure count
restart=unless-stopped -- reschedule if the last state was not stopped

Also need to account for paused containers... I have a feeling that these should not be rescheduled ever.

@schmunk42

I'd have a question about this topic.

Currently the docker daemon handles restarting of containers. But isn't there a conflict between daemon and swarm manager when it comes to rescheduling?

The (already discussed) scenario I am referring to is: There's a node-failure, swarm master would reschedule containers to healthy nodes. Now the failed node gets healed...

It comes up and the daemon starts containers according to their restart policy, but they will be duplicated, since the swarm manager has already rescheduled them.

So, should restarting/rescheduling be handled exclusively by either the docker daemon or the swarm manager?

@abronan
Contributor

abronan commented Dec 12, 2015

@schmunk42 Agreed, I commented on that on the old proposal: #599 (comment)

We need to clean up one or the other, or the container will end up being duplicated.

@geovanisouza92

+1

@clintkitson

I wanted to throw in another idea re volumes here.

Volumes could also have to do with container placement. For example, if a volume is specified with multiple containers, then the assumption is that you would want to share data between those containers. Swarm would then place them on the same host.

A second case would be the --volumes-from flag. This should have similar functionality.

Otherwise the container being requested is going to fail to start for those volume drivers that cannot share volumes between hosts.


@aluzzardi
Contributor Author

@cpuguy83 I think --restart and re-scheduling are incompatible.

For instance, let's say that you start a mysql with --restart=always.

Unless you are using a distributed volume, you definitely DO NOT want Swarm to create a brand new mysql somewhere else with no data. And you definitely do not want Swarm to destroy that container when the machine finally comes back.

You might want to always restart but never re-schedule, or you might want to get both.

@aluzzardi
Contributor Author

As an alternative, we would be free to mess around with the restart policies such as --restart=reschedule:on-node-failure.

@cpuguy83
Contributor

@aluzzardi Is there something we can add to hint to swarm the scope of the volume (local vs global?)

@aluzzardi
Contributor Author

@cpuguy83 Well it's totally fine to re-schedule containers that have a local volume if the user says so

@cpuguy83
Contributor

In such a case it may be better to only support the explicit case of not rescheduling containers that do have a restart policy.
So, --restart=always -e reschedule=no
This way we can use the restart policy for normal use-cases.

Alternatively, maybe restart policies could be modified to accept conditions like --restart=always,!node-failure.
Engine would continue to care about only the values it currently does (always, on-failure, unless-stopped), and other things in the stack can add their own w/o affecting the actual runtime.
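A hedged sketch of that parsing idea, with hypothetical names (the engine's real restart-policy parser is a separate thing): split the flag value into the part the engine already understands and extra conditions that higher layers such as Swarm could interpret while the engine simply ignores them.

package reschedule

import "strings"

type restartSpec struct {
    EnginePolicy string   // "always", "on-failure", "unless-stopped", "no"
    Extra        []string // e.g. "!node-failure"; ignored by the engine
}

// parseRestartSpec splits a value like "always,!node-failure" into the
// engine-level policy and any extra, layer-specific conditions.
func parseRestartSpec(value string) restartSpec {
    parts := strings.Split(value, ",")
    spec := restartSpec{EnginePolicy: parts[0]}
    spec.Extra = append(spec.Extra, parts[1:]...)
    return spec
}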

@aluzzardi
Contributor Author

@cpuguy83 The thing is, if the user makes a mistake in setting those flags (or simply doesn't know) we're talking about nuking a production database

@aluzzardi aluzzardi added this to the 1.1.0 milestone Dec 15, 2015
@cpuguy83
Contributor

@aluzzardi In that case, I'd almost prefer to not reschedule containers with volumes (unless explicitly specified through some configuration) until we can figure out a way to make it just work with restart policies... but maybe there is no perfect world here.

Also wondering if there's a plan to have some delay once a host is marked as unhealthy to do rescheduling (or maybe that's just the health check itself).

@vieux
Contributor

vieux commented Dec 19, 2015

@cpuguy83 by default containers aren't rescheduled at all, volumes or not; it's already an opt-in feature.

we could add a delay; what would be the use case?

@cpuguy83
Contributor

@vieux I was thinking in favor of using --restart... but you are right.

For the delay, the intention would be to allow the supposedly down node a recovery period.

@HackToday
Contributor

@aluzzardi from your suggestion:

Given all the potential issues that might arise by handling the restart policy on the Swarm side and the fact that duplicate containers may end up running at the same time anyway, I suggest we do not interfere with --restart and document the fact that rescheduled containers may be running in parallel for a short time window.

Do you mean that the rescheduled containers are allowed to be duplicated for some time, and that finally Swarm would delete the oldest container and keep the newest one?

@vieux
Contributor

vieux commented Dec 23, 2015

@HackToday yes, or we might add a flag to decide which one we should keep (newest or oldest)

@cblomart

Speaking of restarts, rescheduling, and eventually rebalancing seems like speaking of different functionalities.

The way I look at it after reading these few posts is that rebalancing is another world, with considerations like what to do to minimize downtime and volume access (allow duplicates or not).

Restarts and reschedules are the key features that look like high availability.

Restarts might very well be more suitable for stateful services which would require specific volumes.
And it looks to me like reschedules might be more suitable for stateless services.

In the end only the one running the full stack can say what is best.

Restarts are handled at the docker level and reschedules more likely at the swarm level. Although, if you have different tanks linked together and you add a bucket of water to one, it will naturally spill to the other ones. In this sense docker could check with swarm whether it really is up to it to start a container and eventually leave the job to swarm to decide (certainly not mandatory).

Lastly, what I don't grasp is the network implications...
"make sure the new container takes over the ip of the old one"
This certainly looks like a blocking point for starting duplicates... idiotically, the inability to get back the same ip twice might block the restart of an already rescheduled container.

@dongluochen
Contributor

I think "make sure the new container takes over the ip of the old one" is unnecessary and may be harmful. Swarm do not specify the IP for the original container. It only attaches the container to an overlay network where IP is dynamically assigned. How this IP is used is up to user. The same logic applies to the new container. Generally speaking, distributed service should use names, not IPs.

Persisting an IP usually happens on VM live migration, where traffic shouldn't be interrupted. It needs accurate coordination where you have control over both the old and new VMs. That's not the case for failure rescheduling.

@ezrasilvera
Contributor

A few comments:
We should remember that when the master decides that a node "failed" it doesn't necessarily mean that the node is fully disconnected from the rest of the world!! There might be partial network failures which might lead (after the rebalancing) to two identical containers running together. This may result in various conflicts and errors: for example, the "failed" node/containers may still access the storage, causing data corruption, or create IP address or DNS conflicts. This may also happen due to transient network failures.

In order to safely perform such "rebalancing", the failed node (and/or the containers on that node) first needs to be "fenced". There are several approaches to performing such fencing:

  1. Power fencing - i.e., shut down the node
  2. Resource fencing - isolate the node/containers from the storage and network. This is done externally to the node (e.g., fence a node at the switch it is connected to)
  3. Self fencing - the node itself detects the failure and isolates/shuts itself down (this approach has many risks and probably doesn't fit the current Docker/Swarm environment)

@vieux
Contributor

vieux commented Jan 5, 2016

@dongluochen alright, let's make this optional

@ezrasilvera you are right; regarding the networking, it should be handled, since the container will be disconnected from the network by swarm (we can do this even if the node is unreachable)

@ezrasilvera
Contributor

One more comment - would we be able to explicitly activate the reschedule functionality for a "planned evacuation" (i.e., not as a result of a failure)? This might be helpful for planned maintenance, for example.

@vieux
Contributor

vieux commented Jan 5, 2016

@ezrasilvera it's planned in the UX but not in the initial PR.

At first we will only support on-node-failure as policy.

We could definitely imagine on-maintenance: if you combine this with the new node management @dongluochen wrote in #1569, we could extend it.

@devdems

devdems commented Feb 19, 2016

What if you have one node in one location and one in another, and the network between them fails? Would the node in the other location also run the container, as it would not see the other running container?

@michaelzangerle

When rescheduling containers, Swarm must handle multi-host networking properly.

Is there anything planned in the near future regarding rescheduling and multi-host networking?

@beverts312

Are there plans to allow an explicit "rebalance" of all eligible containers in the cluster?

  • A container could be considered eligible for a rebalance if it had the reschedule policy on-node-failure or on-explicit (or something like that).
  • This could be invoked via the swarm API

A potential use case would be that we add node(s) to the cluster and want to utilize the newly available resources without having to explicitly choose which containers to go there.

@viveky4d4v

I am eagerly waiting for this feature; it is really helpful in a clustered environment, whether small or big.
When can we expect this in docker swarm?

@nishanttotla
Contributor

@viveky4d4v are you referring to rebalancing?

@viveky4d4v

@nishanttotla Yes, rebalancing automatically when the node comes back to life. Kind of a resurrection.

Let's say I have one manager & one worker:

  • If the worker dies, the manager will schedule all containers on itself.
  • The worker comes back to life, but the manager doesn't rebalance containers, and all containers sit on the manager.
  • Now if the manager dies, our stack goes down even though the worker is up & running.

Ideally the manager should automatically balance the containers when the worker comes back to life by moving the oldest container back to the worker. I don't know how this can be done without downtime (in case you have just one application container).
OR
the scheduler could make a decision: in the case of a single application container it should not move it back to the worker, to avoid downtime, but it should do so if we have multiple replicas.

PS - I understand it's not an efficient way to use swarm, we should use at least 3 managers, but I got caught up in this situation so I thought I'd get some ideas from the community.

@nishanttotla
Contributor

@viveky4d4v I want to confirm that you mean Docker Swarm standalone (this project, docker/swarm) and not the new Swarm mode released in Docker 1.12. The manager/worker terminology and running container replicas (services) are features of Swarm mode, not this project.

@viveky4d4v

@nishanttotla: Yes, you are correct. MY BAD!

I will raise the issue at "https://github.com/docker/docker/issues/new"

@piotrminkina

I think this issue is implemented. See https://docs.docker.com/swarm/scheduler/rescheduling/

@nishanttotla
Contributor

@piotrminkina right, I think we can close this issue. New issues can and should be opened for issues with rescheduling.
