[experimental] Simple container rescheduling on node failure #1578

vieux · 2016-01-04T22:08:41Z

See #1488 for details

When a node goes down, Swarm will try to reschedule its containers on another machine.

Depends on moby/moby#19001 for IP stability.
Depends on #1569 for proper cleanup of old containers.
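
As a rough illustration of the behavior described above, here is a minimal sketch of moving containers off a failed node. The `node` and `container` types and the `reschedule` function are hypothetical toy names, not Swarm's cluster types, and "first healthy node" stands in for whatever placement strategy the scheduler actually applies.

```go
package main

// node and container are hypothetical toy types used only for this sketch.
type node struct {
	name    string
	healthy bool
}

type container struct {
	name string
	node string // name of the node currently running the container
}

// reschedule moves every container placed on the failed node to the first
// healthy node other than the failed one. It returns the names of the
// containers that could not be placed because no healthy node was available.
func reschedule(failed string, nodes []node, containers []container) []string {
	var pending []string
	for i := range containers {
		if containers[i].node != failed {
			continue
		}
		placed := false
		for _, n := range nodes {
			if n.healthy && n.name != failed {
				containers[i].node = n.name
				placed = true
				break
			}
		}
		if !placed {
			pending = append(pending, containers[i].name)
		}
	}
	return pending
}
```

Containers that cannot be placed are returned rather than dropped, so a caller could retry them later when cluster state changes.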

dongluochen · 2016-01-07T01:30:07Z

cluster/config.go

+	if label, ok := c.Labels[SwarmLabelNamespace+".reschedule-policy"]; ok {
+		reschedulePolicy = label
+	}
+
 	// parse affinities/constraints from env (ex. docker run -e affinity:container==redis -e affinity:image==nginx -e constraint:region==us-east -e constraint:storage==ssd)


nit: update comment.
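
For readers skimming the hunk above, the label lookup can be sketched in isolation as follows. This is only an illustration: the value of `swarmLabelNamespace` is assumed (the hunk references the constant by name only), and the `reschedulePolicy` helper and its default argument are hypothetical.

```go
package main

// swarmLabelNamespace is an assumed value; the hunk only shows the constant's name.
const swarmLabelNamespace = "com.docker.swarm"

// reschedulePolicy returns the value of the reschedule-policy label if the
// container carries one, and def otherwise.
func reschedulePolicy(labels map[string]string, def string) string {
	if label, ok := labels[swarmLabelNamespace+".reschedule-policy"]; ok {
		return label
	}
	return def
}
```

With the assumed namespace, a container labeled `com.docker.swarm.reschedule-policy=on-node-failure` would yield `on-node-failure`, and an unlabeled container would yield whatever default the caller passes in. The policy value shown is itself an assumption for illustration.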

dongluochen · 2016-01-08T22:23:15Z

cluster/watchdog.go

+	case "engine_disconnect":
+		go w.rescheduleContainers(e.Engine)
+	case "die", "destroy", "kill", "oom", "start", "stop", "rename":
+		go w.reschedulePendingContainers()


Is there a reason this triggers a reschedule across the whole cluster instead of just this engine? A big cluster could generate a lot of these events. Wouldn't checking just this container be enough?
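
To make the dispatch under discussion easier to reason about, here is a simplified, synchronous sketch of the switch in the hunk above. The `action` type and `dispatch` function are hypothetical; the real handler launches goroutines rather than returning a value, and the reading of `reschedulePendingContainers` as "retry containers still waiting for placement" is inferred from its name only.

```go
package main

// action names the follow-up a watchdog would take for an event status.
type action int

const (
	ignore           action = iota // event is not relevant to rescheduling
	rescheduleEngine               // reschedule the disconnected engine's containers
	retryPending                   // retry containers still waiting for placement
)

// dispatch maps an event status string to the action taken in the hunk.
func dispatch(status string) action {
	switch status {
	case "engine_disconnect":
		return rescheduleEngine
	case "die", "destroy", "kill", "oom", "start", "stop", "rename":
		return retryPending
	default:
		return ignore
	}
}
```

Seen this way, every container lifecycle event in the second case maps to the same cluster-wide retry, which is the cost the question above is pointing at.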

mavenugo · 2016-01-09T21:45:36Z

@vieux we are working on the forced cleanup fix in moby/libnetwork#862.
You can use my private docker branch (https://github.com/mavenugo/docker/tree/epcleanup), where this is fully integrated with the Docker Engine API, which Swarm can use to reschedule a container while reclaiming the locked resources. Would you like to give it a try in parallel while we make progress on getting these PRs merged in multiple repos?

moxiegirl · 2016-01-12T02:12:10Z

experimental/README.md~

@@ -0,0 +1,81 @@
+# Docker Experimental Features


@vieux This looks like it shouldn't be here; it is a backup file.

moxiegirl · 2016-01-12T13:57:09Z

LGTM

dongluochen · 2016-01-12T22:33:52Z

LGTM.

abronan · 2016-01-12T23:00:19Z

LGTM

vieux added the area/API label Jan 4, 2016

vieux self-assigned this Jan 4, 2016

vieux added this to the 1.1.0 milestone Jan 4, 2016

GordonTheTurtle added the status/0-triage label Jan 4, 2016

vieux force-pushed the rescheduling branch from 4f796f6 to 2dffe17 Compare January 4, 2016 23:15

vieux added status/1-design-review priority/P2 and removed status/0-triage labels Jan 4, 2016

dongluochen reviewed Jan 7, 2016

vieux force-pushed the rescheduling branch 2 times, most recently from ea7afd3 to b72d27f Compare January 8, 2016 22:17

dongluochen reviewed Jan 8, 2016

vieux force-pushed the rescheduling branch 2 times, most recently from 9352993 to 0220f36 Compare January 8, 2016 23:02

mavenugo mentioned this pull request Jan 9, 2016

Force endpoint delete moby/libnetwork#862

Merged

mavenugo mentioned this pull request Jan 9, 2016

an option to disconnect an endpoint from a network forcefully docker/engine-api#29

Merged

vieux force-pushed the rescheduling branch from 0220f36 to c316586 Compare January 11, 2016 19:44

aluzzardi and others added 3 commits January 11, 2016 15:59

cluster: Support multiple event handlers.

56941d0

Signed-off-by: Andrea Luzzardi <[email protected]>

Add support for container rescheduling on node failure.

13f6021

Add rescheduling integration tests. Signed-off-by: Andrea Luzzardi <[email protected]>

add doc

78008f4

fix tests and keep swarm id remove duplicate on node reconnect explicit failure Signed-off-by: Victor Vieux <[email protected]>

vieux force-pushed the rescheduling branch from c316586 to 78008f4 Compare January 11, 2016 23:59

improve eventHandlers locking

a2018c1

Signed-off-by: Victor Vieux <[email protected]>

vieux changed the title from "[experimental] [WIP] Container Rescheduling on node failure" to "[experimental] Simple container rescheduling on node failure" Jan 12, 2016

moxiegirl reviewed Jan 12, 2016

move doc to experimental/

74dfe8b

Signed-off-by: Victor Vieux <[email protected]>

vieux force-pushed the rescheduling branch from 2c735bb to 74dfe8b Compare January 12, 2016 02:16

abronan added a commit that referenced this pull request Jan 12, 2016

Merge pull request #1578 from aluzzardi/rescheduling

e121338

[experimental] Simple container rescheduling on node failure

abronan merged commit e121338 into docker-archive:master Jan 12, 2016

vieux deleted the rescheduling branch January 12, 2016 23:00

abronan mentioned this pull request Jan 12, 2016

Proposal Node failover #755

Closed

ChristianKniep pushed a commit to ChristianKniep/swarm that referenced this pull request Jul 27, 2017

Merge pull request docker-archive#1578 from aluzzardi/rescheduling

15a3e1e

[experimental] Simple container rescheduling on node failure
