Proposal: Container Rescheduling #1488
Comments
I too chose this in the survey 👍
I will note that Amazon ECS (which I'm now using, because swarm did not deliver this feature in time) does rebalancing by disabling local container restarts. I also don't think having a KV store as a dependency is out of the question, as long as it is swappable.
@aluzzardi Is IP stability a hard requirement? As long as we remap the container name to a different IP, the service discovery part should handle mapping any new client requests to the new IP (and hence to the new container). Wouldn't that be enough?
For me this is the missing piece for running swarm in production. I think that it's ok to have this feature dependent on a KV store, as the docker overlay network requires it and I don't see any point in running swarm without an overlay network (ok, this can be swapped, but the majority of implementations rely on a kv store).
I would add the volume driver portion in here as well. If there is a container that has external volumes attached and is requested on another host, then it should be the case that the same volumes are brought to the new host. In the case of REX-Ray (rexray/rexray#190) it now has pre-emption built into most drivers. This means that the new requesting container runtime will cause a forceful mount, which detaches the volume from any host that currently has it. The setting is currently a global setting at a driver level for us, but it would be an interesting addition to the volume plugins to allow a flag on mount that gets used by Swarm to tell it to pre-empt or force mount in the case of re-scheduling. Typically we wouldn't want to enable pre-emption, since it is a safety feature to block mounting from multiple hosts or block detaching/attaching unintentionally. cc @cpuguy83
Drivers that don't have a forceful mount option or pre-emption will cause the containers that get requested on a new host to error, since their volume will not be able to be dismounted. The exception here depends on the storage platform. For example, EC2 and OpenStack disallow this by default. This makes sense for safety, as we want to make people be explicit about mounting a volume to multiple hosts or doing detach/attach operations.
@clintonskitson thanks, I'll update the proposal to include some text about volumes.
How come you want to use an extra label? Also, we need to account for paused containers... I have a feeling that these should not be rescheduled ever.
I'd have a question about this topic. Currently the docker daemon handles restarting of containers. But isn't there a conflict between the daemon and the swarm manager when it comes to rescheduling? The (already discussed) scenario I am referring to is: there's a node failure, and the swarm master reschedules containers to healthy nodes. Now the failed node gets healed... It comes up and the daemon starts the containers that have a restart policy set. So, should restarting/rescheduling be handled exclusively by either the docker daemon or the swarm manager?
@schmunk42 Agreed, I commented on that on the old proposal: #599 (comment) We need to clean up one or the other, or the container will end up being duplicated.
+1
I wanted to throw in another idea re volumes here. Volumes could also have to do with container placement. A second point would be the volumes-from flag, which should have similar handling. Otherwise the volume being requested is going to fail to start for those containers.
@cpuguy83 I think restarting and rescheduling are different things. For instance, let's say that you start a database container with a local volume. Unless you are using a distributed volume, you definitely DO NOT want Swarm to create a brand new, empty one on another node. You might want to always restart but never re-schedule, or you might want to get both.
As an alternative, we would be free to mess around with the restart policies.
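As a concrete illustration of that distinction, a stateful container with a local volume would typically opt into restarts but not rescheduling. A sketch only; the image, paths and the "not rescheduled by default" behavior follow the discussion in this thread:

```
# restart locally on failure, but do not opt into rescheduling (the proposed default)
docker run -d --name db --restart=always -v /srv/pgdata:/var/lib/postgresql/data postgres
```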
@aluzzardi Is there something we can add to hint to swarm the scope of the volume (local vs global)?
@cpuguy83 Well, it's totally fine to re-schedule containers that have a local volume if the user says so.
In such a case it may be better to only support the explicit case of not rescheduling containers that do have a restart policy. Alternatively, maybe restart policies could be modified to accept additional conditions.
@cpuguy83 The thing is, if the user makes a mistake in setting those flags (or simply doesn't know), we're talking about nuking a production database.
@aluzzardi In that case, I'd almost prefer not to reschedule containers with volumes (unless explicitly specified through some configuration) until we can figure out a way to make it just work with restart policies... but maybe there is no perfect world here. Also wondering if there's a plan to have some delay once a host is marked as unhealthy before doing rescheduling (or maybe that's just the health check itself).
@cpuguy83 By default containers aren't rescheduled at all, volumes or not; it's already an opt-in feature. We could add a delay, what would be the use case?
@vieux I was thinking in favor of using --restart... but you are right. For the delay, the intention would be to allow the supposedly down node a recovery period.
@aluzzardi from your suggestion:
Do you mean that the rescheduled containers are allowed to be duplicated for some time, and finally swarm would delete the oldest container and keep the newest one, is that right?
@HackToday yes, or we might add a flag to decide which one we should keep (newest or oldest).
Speaking of restarts, rescheduling and eventually rebalancing feels like speaking of different functionalities. The way I look at it after reading these few posts is that rebalancing is another world, with considerations like what to do to minimize downtime and volume access (allow duplicates or not). Restarts and reschedules are the key features that look like high availability. Restarts might very well be more suitable for stateful services which would require specific volumes. In the end, only the one running the full stack can say what is best. Restarts are handled at the docker level and reschedules more likely at the swarm level, although if you have different tanks linked together and you add a bucket of water to one, it will naturally spill to the other ones. In this sense docker could check with swarm whether it really is up to it to start a container and eventually leave the job to swarm to decide (certainly not mandatory). Lastly, what I don't grasp is the network implications...
I think "make sure the new container takes over the ip of the old one" is unnecessary and may be harmful. Swarm do not specify the IP for the original container. It only attaches the container to an overlay network where IP is dynamically assigned. How this IP is used is up to user. The same logic applies to the new container. Generally speaking, distributed service should use names, not IPs. Persisting IP usually happens on VM |
A few comments: in order to safely perform such "rebalancing", the failed node (and/or the containers on that node) first needs to be "fenced". There are several approaches we can take to perform such a fence.
@dongluochen Alright, let's make this optional. @ezrasilvera You are right; regarding the networking, it should be handled, since the container will be disconnected from the network by swarm (we can do this even if the node is unreachable).
One more comment - would we be able to explicitly activate the reschedule functionality for a "planned evacuation" (e.g., not as a result of failure)? This might be helpful for planned maintenance, for example.
@ezrasilvera It's planned in the UX but not in the initial PR. At first we will only support rescheduling on node failure, but we could definitely imagine supporting explicit evacuation later.
What if you have one node in one location and one in another, and then the network fails between them? Would the node in the other location also run the container, as it would not see the other running container?
Is there anything planned in the near future regarding rescheduling and multi-host networking?
Are there plans to allow an explicit "rebalance" of all eligible containers in the cluster?
A potential use case would be that we add node(s) to the cluster and want to utilize the newly available resources without having to explicitly choose which containers go there.
I am eagerly waiting for this feature; it is really helpful in a clustered environment, whether small or big.
@viveky4d4v are you referring to rebalancing?
@nishanttotla Yes, rebalancing automatically when the node comes back to life. A kind of resurrection. Let's say I have one manager & one worker.
Ideally the manager should automatically balance the containers when the worker comes back to life by moving the oldest container back to the worker. I don't know how this can be done without downtime (in case you have just one application container). PS - I understand it's not an efficient way to use swarm and we should use at least 3 managers, but I got caught up in this situation, so I thought I'd get some ideas from the community.
@viveky4d4v I want to confirm that you mean Docker Swarm standalone (this project), and not the swarm mode built into the Docker engine?
@nishanttotla: Yes, you are correct. MY BAD! I will raise the issue at https://github.com/docker/docker/issues/new
I think this issue is implemented. See https://docs.docker.com/swarm/scheduler/rescheduling/
@piotrminkina Right, I think we can close this issue. New issues can and should be opened for problems with rescheduling.
Background
The goal of this proposal is to reschedule containers automatically in case of node failure.
This is currently one of the top requested features for Swarm.
Configuration
The behavior should be user controllable and disabled by default since rescheduling can have nasty effects on stateful containers.
The user can select the policy at run time using the reschedule environment variable. Possible values for reschedule are off (the default, i.e. no rescheduling) and on-node-failure.
The reason this is more complicated than yes/no is that in the future we might have more complicated rescheduling policies (for instance, we might want to reschedule containers to re-spread or re-pack them). Open question: Is this really necessary?
Rescheduling policies will be stored as a container label: com.docker.swarm.reschedule-policy
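A minimal sketch of the proposed UX; the on-node-failure value matches the rescheduling docs linked later in this thread, but the exact syntax here is illustrative:

```
# opt a container into rescheduling via the environment variable...
docker run -d -e reschedule:on-node-failure redis
# ...or, per the proposal, set the label it is stored under directly (hypothetical equivalent)
docker run -d -l com.docker.swarm.reschedule-policy=on-node-failure redis
```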
Persistence
Ideally, Swarm would store all containers (at least those that should be rescheduled) persistently. That way, the manager can figure out which containers are down and take action.
Unfortunately, we currently don't have a shared state and this feature has been postponed because of that for a long time.
Since this is one of the top requested features, I propose we take a different approach until we have shared state (that feature has been postponed over usability concerns - we don't want to make a kv store a dependency for Swarm).
By storing the rescheduling policy as a container label, we are able to reconstruct the desired state at startup time.
Since we are already storing constraints, affinities etc as container labels (exactly for this reason), the manager will have all the information it needs to perform rescheduling.
This means we can restart the manager as much as we want and it will resume rescheduling as expected.
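For example, the desired policy can be recovered straight from the container itself; a sketch using the label name proposed above (the shipped implementation may use a different label):

```
# read the rescheduling policy stored as a label on the container
docker inspect --format '{{ index .Config.Labels "com.docker.swarm.reschedule-policy" }}' <container_name>
```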
However, the problem arises when a node goes down while the manager is not running: in that case, we won't "remember" that container even existed when the manager is started again.
This situation can be counter-balanced by using replication. The rescheduler would run on the primary manager and, upon failure, the replica that gets elected primary would take over rescheduling responsibilities.
Since every manager is aware of the cluster state (containers & rescheduling policy), as long as at least one manager is still running we won't forget about containers.
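For reference, this builds on the existing replicated-manager (high availability) setup; a minimal sketch, with addresses and the discovery backend as placeholders:

```
# two managers contend for the leader lock; the elected primary would run the rescheduler
swarm manage -H :4000 --replication --advertise <manager1_ip>:4000 consul://<consul_ip>:8500
swarm manage -H :4000 --replication --advertise <manager2_ip>:4000 consul://<consul_ip>:8500
```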
Failure detection
This functionality is already provided by cluster/engine. The Engine actively heartbeats nodes in the cluster every X seconds. After Y failures, the node is marked as unhealthy.
Rescheduling can rely on the health status already available.
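For reference, that health status is already visible to users today; a quick way to check it against a standalone Swarm manager (the address is a placeholder):

```
# the Swarm manager's info output lists each node with a Status line (e.g. Healthy or Unhealthy)
docker -H tcp://<swarm_manager_ip>:4000 info
```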
Resurrection
Eventually, a node may come back to life and re-join the cluster. If the node has containers that were rescheduled, we will end up with duplicates.
Swarm should monitor incoming nodes and, upon detecting a duplicate container, it should destroy the oldest one (keeping the most recently created container alive). This behavior could eventually be configurable by the user (keep oldest, keep newest, ...), although we may want to avoid providing that option until we see a valid use case.
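To illustrate the intended "keep the newest" behavior, this is roughly what an operator would do by hand today (the manager address and names are placeholders):

```
# list duplicates of a container name across the cluster with their creation times
docker -H tcp://<swarm_manager_ip>:4000 ps -a --filter name=redis_1 --format '{{.ID}}\t{{.Names}}\t{{.CreatedAt}}'
# remove the older copy, keeping the most recently created container
docker -H tcp://<swarm_manager_ip>:4000 rm -f <older_container_id>
```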
If duplicate containers were started with a --restart option, there is going to be a small window during which both containers are running at the same time. This can be a serious problem if only one instance of that container is supposed to run at any one point.
We could force all containers that have rescheduling enabled to never automatically restart. In that case, whenever a node joins, Swarm could decide to either start the containers or destroy them if they are duplicates.
However, there are many drawbacks to this approach: for instance, Swarm would have to re-implement restart policies such as --restart=on-failure:5, which require maintaining lots of state.
Furthermore, it doesn't actually entirely solve the issue. If the node didn't actually die (e.g. it just froze for a while, there was a netsplit, networking temporarily dropped, ...), we will end up with duplicate containers running for a while anyway.
Given all the potential issues that might arise by handling the restart policy on the Swarm side, and the fact that duplicate containers may end up running at the same time anyway, I suggest we do not interfere with --restart and document the fact that rescheduled containers may run in parallel for a short time window.
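In practice the two mechanisms would then simply compose; a sketch using the syntax proposed above:

```
# restart locally on failure, and reschedule elsewhere if the node itself dies
docker run -d --restart=on-failure:5 -e reschedule:on-node-failure redis
```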
Networking
When rescheduling containers, Swarm must handle multi-host networking properly.
The goal is for the new container to take over from the previous one.
In an overlay network setup, this may involve disconnecting the failed container from the network (which Swarm can do even if its node is unreachable) and making sure the new container takes over the IP of the old one.
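A hedged sketch of what that disconnect step could look like with existing CLI primitives (network and container names are placeholders; the actual mechanism would live inside Swarm):

```
# force-disconnect the dead container's endpoint from the overlay network
# (the --force flag works even when the container's host is unreachable)
docker network disconnect --force my-overlay failed_container
# attach the rescheduled replacement to the same network
docker network connect my-overlay rescheduled_container
```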