This repository has been archived by the owner on Feb 1, 2021. It is now read-only.

[experimental] Simple container rescheduling on node failure #1578

Merged (5 commits) on Jan 12, 2016


@vieux vieux commented Jan 4, 2016

See #1488 for details

When a node goes down, Swarm will try to reschedule its containers on another machine.

Depends on moby/moby#19001 for IP stability.
Depends on #1569 for proper cleanup of old containers.
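As a rough illustration of the behavior described above, the following is a minimal sketch, not the code from this PR. The `Node` and `Container` types, the `Reschedule` opt-in flag, and the "first healthy node" placement rule are all assumptions made for the example; the real scheduler applies its own placement strategy.

```go
package main

import "fmt"

// Container is a hypothetical, simplified container record.
type Container struct {
	Name       string
	Reschedule bool // assumed opt-in flag, e.g. derived from a reschedule policy
}

// Node is a hypothetical, simplified cluster node.
type Node struct {
	Name       string
	Healthy    bool
	Containers []Container
}

// firstHealthy returns the first healthy node other than exclude, or nil.
func firstHealthy(nodes []*Node, exclude *Node) *Node {
	for _, n := range nodes {
		if n != exclude && n.Healthy {
			return n
		}
	}
	return nil
}

// rescheduleFrom moves opted-in containers off a failed node and
// returns the names of the containers it moved. Containers that did
// not opt in, or that have nowhere to go, stay on the failed node.
func rescheduleFrom(failed *Node, nodes []*Node) []string {
	var moved []string
	var kept []Container
	for _, c := range failed.Containers {
		target := firstHealthy(nodes, failed)
		if !c.Reschedule || target == nil {
			kept = append(kept, c)
			continue
		}
		target.Containers = append(target.Containers, c)
		moved = append(moved, c.Name)
	}
	failed.Containers = kept
	return moved
}

func main() {
	n1 := &Node{Name: "node-1", Containers: []Container{{"web", true}, {"db", false}}}
	n2 := &Node{Name: "node-2", Healthy: true}
	fmt.Println(rescheduleFrom(n1, []*Node{n1, n2})) // prints [web]
}
```

In this sketch a container without the opt-in flag is left where it is, which matches the idea that rescheduling is driven by a per-container policy rather than applied to everything on the node.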

if label, ok := c.Labels[SwarmLabelNamespace+".reschedule-policy"]; ok {
	reschedulePolicy = label
}

// parse affinities/constraints from env (ex. docker run -e affinity:container==redis -e affinity:image==nginx -e constraint:region==us-east -e constraint:storage==ssd)

nit: update comment.
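For readers following the hunk above, the label lookup can be exercised in isolation roughly as follows. This is a sketch under assumptions: the value of `SwarmLabelNamespace`, the helper name `policyFromLabels`, and the policy strings are illustrative and not confirmed by this page.

```go
package main

import "fmt"

// SwarmLabelNamespace is an assumed value; the hunk only shows the identifier.
const SwarmLabelNamespace = "com.docker.swarm"

// policyFromLabels mirrors the lookup in the hunk: the label value
// wins when present, otherwise the fallback is returned.
// The function name and the fallback parameter are hypothetical.
func policyFromLabels(labels map[string]string, fallback string) string {
	reschedulePolicy := fallback
	if label, ok := labels[SwarmLabelNamespace+".reschedule-policy"]; ok {
		reschedulePolicy = label
	}
	return reschedulePolicy
}

func main() {
	// "on-node-failure" and "off" are example values, not taken from the PR.
	labels := map[string]string{
		"com.docker.swarm.reschedule-policy": "on-node-failure",
	}
	fmt.Println(policyFromLabels(labels, "off")) // label present
	fmt.Println(policyFromLabels(nil, "off"))    // no labels: fallback
}
```

Reading from a nil map is safe in Go, so the helper needs no extra guard for containers without labels.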

@vieux vieux force-pushed the rescheduling branch 2 times, most recently from ea7afd3 to b72d27f, on January 8, 2016 22:17
case "engine_disconnect":
	go w.rescheduleContainers(e.Engine)
case "die", "destroy", "kill", "oom", "start", "stop", "rename":
	go w.reschedulePendingContainers()

Any reason this triggers a whole-cluster reschedule instead of one for just this engine? I think there could be a lot of events in a big cluster. Isn't checking just this container enough?
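To make the trade-off in this question concrete, here is a minimal sketch that contrasts the dispatch shown in the hunk with a narrower, hypothetical variant. The `Event` and `Scope` types and both function names are assumptions for illustration; the real handlers launch goroutines rather than returning a value.

```go
package main

import "fmt"

// Event is a hypothetical, simplified cluster event.
type Event struct {
	Status   string
	EngineID string
}

// Scope describes how much of the cluster a handler would rescan.
type Scope struct {
	Kind     string // "engine", "cluster", or "none"
	EngineID string // set only when Kind is "engine"
}

func isContainerEvent(status string) bool {
	switch status {
	case "die", "destroy", "kill", "oom", "start", "stop", "rename":
		return true
	}
	return false
}

// dispatch follows the shape of the hunk: a disconnect is scoped to
// one engine, while any container event rescans pending containers
// across the whole cluster.
func dispatch(e Event) Scope {
	switch {
	case e.Status == "engine_disconnect":
		return Scope{Kind: "engine", EngineID: e.EngineID}
	case isContainerEvent(e.Status):
		return Scope{Kind: "cluster"}
	}
	return Scope{Kind: "none"}
}

// dispatchScoped is the hypothetical alternative raised in the review:
// container events only touch the engine that emitted them.
func dispatchScoped(e Event) Scope {
	if e.Status == "engine_disconnect" || isContainerEvent(e.Status) {
		return Scope{Kind: "engine", EngineID: e.EngineID}
	}
	return Scope{Kind: "none"}
}

func main() {
	e := Event{Status: "die", EngineID: "node-1"}
	fmt.Println(dispatch(e).Kind, dispatchScoped(e).Kind)
}
```

The narrower variant does less work per event, but it would presumably miss pending containers that become placeable because of a change on a different engine, which may be why the cluster-wide scan was chosen.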

@vieux vieux force-pushed the rescheduling branch 2 times, most recently from 9352993 to 0220f36, on January 8, 2016 23:02

mavenugo commented Jan 9, 2016

@vieux We are working on the forced cleanup fix in moby/libnetwork#862.
You can use my private Docker branch (https://github.com/mavenugo/docker/tree/epcleanup), where this is fully integrated with the Docker Engine API, which Swarm can use to reschedule a container and reclaim the locked resources. Would you like to give it a try in parallel while we make progress on getting these PRs merged in multiple repos?

aluzzardi and others added 3 commits January 11, 2016 15:59
Add rescheduling integration tests.

Signed-off-by: Andrea Luzzardi <[email protected]>
fix tests and keep swarm id
remove duplicate on node reconnect
explicit failure

Signed-off-by: Victor Vieux <[email protected]>
Signed-off-by: Victor Vieux <[email protected]>
@vieux vieux changed the title from "[experimental] [WIP] Container Rescheduling on node failure" to "[experimental] Simple container rescheduling on node failure" on Jan 12, 2016
@@ -0,0 +1,81 @@
# Docker Experimental Features


@vieux This looks like it shouldn't be here; it is a backup file.

Signed-off-by: Victor Vieux <[email protected]>
@moxiegirl
LGTM

@dongluochen
LGTM.

abronan commented Jan 12, 2016

LGTM

abronan added a commit that referenced this pull request Jan 12, 2016
[experimental] Simple container rescheduling on node failure
@abronan abronan merged commit e121338 into docker-archive:master Jan 12, 2016
@vieux vieux deleted the rescheduling branch January 12, 2016 23:00
@abronan abronan mentioned this pull request Jan 12, 2016
ChristianKniep pushed a commit to ChristianKniep/swarm that referenced this pull request Jul 27, 2017
[experimental] Simple container rescheduling on node failure
8 participants