Node drain orchestrator #3951
Conversation
I can fix up most of this, so feel free to merge whenever.
    Healthy: helper.BoolToPtr(true),
}
updates = append(updates, newAlloc)
logger.Printf("Marked deployment health for alloc %q", alloc.ID)
Include node id
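A minimal sketch of what the suggested log line could look like, assuming the allocation's node ID is available as alloc.NodeID:

logger.Printf("Marked deployment health for alloc %q on node %q", alloc.ID, alloc.NodeID)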
    return
}

t.Fatalf("failed to get node allocs: %v", err)
Calling Fatalf shouldn't be done from a non-main goroutine: https://golang.org/pkg/testing/#T.FailNow
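One hedged way to address this (a sketch, not the PR's code; getNodeAllocs is a hypothetical stand-in for the failing lookup): t.Errorf and t.Logf are safe to call from other goroutines, so report the error there and return instead of calling Fatalf.

go func() {
	allocs, err := getNodeAllocs() // hypothetical stand-in for the real lookup
	if err != nil {
		t.Errorf("failed to get node allocs: %v", err) // Errorf is goroutine-safe; Fatalf is not
		return
	}
	if len(allocs) == 0 {
		t.Errorf("expected allocs for node")
	}
}()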
    WriteRequest: structs.WriteRequest{Region: "global"},
}
var resp structs.NodeAllocsResponse
require.Nil(t, msgpackrpc.CallWithCodec(codec, "Node.UpdateAlloc", req, &resp))
require functions shouldn't be called from a non-main goroutine: https://golang.org/pkg/testing/#T.FailNow
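A hedged sketch of the same idea for the require call, reusing the surrounding test's codec and req: ship the RPC error back over a channel and assert on the test goroutine, since require ultimately calls t.FailNow.

errCh := make(chan error, 1)
go func() {
	var resp structs.NodeAllocsResponse
	errCh <- msgpackrpc.CallWithCodec(codec, "Node.UpdateAlloc", req, &resp)
}()
require.NoError(t, <-errCh) // the assertion now runs on the test goroutine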
nomad/drainerv2/drain_heap.go (outdated)
batch   chan []string
nodes   map[string]time.Time
trigger chan string
l       sync.RWMutex
mu instead of l for readability
No point in making it an RWMutex since only one critical section is read-only and it's never called concurrently.
Changed, but there doesn't appear to be a canonical Go naming scheme for these, and l is used quite a bit throughout the code base.
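For illustration only, a sketch of the shape being suggested (fields copied from the snippet above; only the lock changes, and the sync and time imports are assumed):

type deadlineHeap struct {
	mu      sync.Mutex // plain Mutex: the lone read-only section isn't hit concurrently
	batch   chan []string
	nodes   map[string]time.Time
	trigger chan string
}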
nomad/drainerv2/drain_heap.go (outdated)
d := &deadlineHeap{
	ctx:            ctx,
	coalesceWindow: coalesceWindow,
	batch:          make(chan []string, 4),
Do we really want this buffered? If we block on sends, then in the unlikely event a send blocks for some time, wouldn't that nicely cause more deadlines to be coalesced into a single batch?
It would have to block for longer than the coalesce window, which would be insane, but there really isn't much benefit to the buffering either.
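For context, a hypothetical sketch of how an unbuffered batch channel still coalesces; the function name and structure are assumptions, not the PR's implementation, and the context and time imports are assumed. IDs arriving during the window are batched, and a blocked send simply pushes later arrivals into the next batch.

// coalesce gathers IDs from in and emits them as one slice on the unbuffered
// out channel once the coalesce window has elapsed.
func coalesce(ctx context.Context, in <-chan string, out chan<- []string, window time.Duration) {
	var pending []string
	var timerC <-chan time.Time // nil (blocks forever in select) until the window starts
	for {
		select {
		case <-ctx.Done():
			return
		case id := <-in:
			if len(pending) == 0 {
				timerC = time.NewTimer(window).C // first ID opens the window
			}
			pending = append(pending, id)
		case <-timerC:
			timerC = nil
			select {
			case out <- pending: // unbuffered: blocks until the receiver is ready
				pending = nil
			case <-ctx.Done():
				return
			}
		}
	}
}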
t.Parallel()
require := require.New(t)
h := NewDeadlineHeap(context.Background(), 1*time.Second)
require.Implements((*DrainDeadlineNotifier)(nil), h)
This whole test body could be replaced with var _ DrainDeadlineNotifier = &deadlineHeap{} and just let it be a compile error if the implementation doesn't match.
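The suggested compile-time assertion would sit at package scope and look like this; the interface and struct names come from the snippet above.

// Fails the build, rather than a test, if deadlineHeap stops satisfying the interface.
var _ DrainDeadlineNotifier = &deadlineHeap{}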
nomad/drainerv2/drainer.go (outdated)
}

// Flush the state to create the necessary objects
n.flush()
Shouldn't this only happen when enabling? I kind of wish all of the state manipulated by flush() was encapsulated in a sub-struct that was created when enable=true and given a context that's canceled on subsequent SetEnabled calls. I think that would remove the need for locking in this struct entirely? Anyway, no need to change now; not that much code is involved.
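A rough sketch of the sub-struct idea being floated here; the names drainerState and NodeDrainer.state are assumptions for illustration, not the PR's fields.

// drainerState holds everything flush() used to rebuild; it only exists while
// draining is enabled and dies with its context.
type drainerState struct {
	ctx    context.Context
	cancel context.CancelFunc
	// watchers, the deadline heap, and other per-enable state would live here
}

func (n *NodeDrainer) SetEnabled(enabled bool) {
	if n.state != nil {
		n.state.cancel() // tear down the previous generation
		n.state = nil
	}
	if !enabled {
		return
	}
	ctx, cancel := context.WithCancel(context.Background())
	n.state = &drainerState{ctx: ctx, cancel: cancel}
	// build the per-enable objects here (what flush() does today)
}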
type drainingNode struct {
	state *state.StateStore
	node  *structs.Node
	l     sync.RWMutex
This struct doesn't need any locking. It does not mutate any of its state.
It does, since there is concurrent access and the node can be updated. The node is then used to decide the deadline time and which allocs are deadlined.
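For illustration, a sketch of the access pattern described here (the method names are assumptions): one goroutine swaps in the updated node while others read it.

func (n *drainingNode) Update(node *structs.Node) {
	n.l.Lock()
	defer n.l.Unlock()
	n.node = node
}

func (n *drainingNode) Node() *structs.Node {
	n.l.RLock()
	defer n.l.RUnlock()
	return n.node
}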
nomad/drainerv2/watch_jobs.go (outdated)
state:      state,
jobs:       make(map[structs.JobNs]struct{}, 64),
drainCh:    make(chan *DrainRequest, 8),
migratedCh: make(chan []*structs.Allocation, 8),
Do these need to be buffered > 1? I'm not sure there's any benefit to letting it run ahead of receivers; won't it just be able to do more work at a time if it runs less often due to a blocked send?
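A hypothetical sketch of the "do more work per send" behaviour with an unbuffered channel (names are assumptions, not the PR's code): while the receiver is busy, migrated allocations accumulate and go out as one larger slice on the next send.

func forwardMigrated(ctx context.Context, in <-chan *structs.Allocation, out chan<- []*structs.Allocation) {
	var pending []*structs.Allocation
	for {
		// Only arm the send case when there is something to send; a nil
		// channel never proceeds in a select.
		var sendCh chan<- []*structs.Allocation
		if len(pending) > 0 {
			sendCh = out
		}
		select {
		case <-ctx.Done():
			return
		case alloc := <-in:
			pending = append(pending, alloc)
		case sendCh <- pending:
			pending = nil
		}
	}
}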
)

// DrainingNodeWatcher is the interface for watching for draining nodes.
type DrainingNodeWatcher interface{}
I'm having a hard time figuring out what this does...
Just so that the factory can return a non-concrete type so that we can inject our own during testing.
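A sketch of the injection pattern being described; the factory type and constructor names here are assumptions for illustration. Production wiring returns the real watcher, while tests return a fake that trivially satisfies the empty interface.

type DrainingNodeWatcherFactory func(ctx context.Context, s *state.StateStore) DrainingNodeWatcher

// Hypothetical production factory.
func defaultWatcherFactory(ctx context.Context, s *state.StateStore) DrainingNodeWatcher {
	return newNodeDrainWatcher(ctx, s) // assumed real constructor
}

// Tests can inject their own stand-in.
type fakeWatcher struct{} // records calls, returns canned data, etc.

func testWatcherFactory(w DrainingNodeWatcher) DrainingNodeWatcherFactory {
	return func(context.Context, *state.StateStore) DrainingNodeWatcher { return w }
}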
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
This PR introduces a new node drain orchestrator that drains allocations off of nodes while attempting to minimize service downtime.