Node drain orchestrator #3951
Conversation
I can fix up most of this, so feel free to merge whenever.
    Healthy: helper.BoolToPtr(true),
}
updates = append(updates, newAlloc)
logger.Printf("Marked deployment health for alloc %q", alloc.ID)
Include node id
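A minimal sketch of what the suggested log line could look like, assuming the allocation's node ID is available as alloc.NodeID:

logger.Printf("Marked deployment health for alloc %q on node %q", alloc.ID, alloc.NodeID)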
    return
}

t.Fatalf("failed to get node allocs: %v", err)
Calling Fatalf shouldn't be done from a non-main goroutine: https://golang.org/pkg/testing/#T.FailNow
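One hedged way to address this (a sketch, not the PR's code; getNodeAllocs is a hypothetical stand-in for the failing lookup): t.Errorf and t.Logf are safe to call from other goroutines, so report the error there and return instead of calling Fatalf.

go func() {
	allocs, err := getNodeAllocs() // hypothetical stand-in for the real lookup
	if err != nil {
		t.Errorf("failed to get node allocs: %v", err) // Errorf is goroutine-safe; Fatalf is not
		return
	}
	if len(allocs) == 0 {
		t.Errorf("expected allocs for node")
	}
}()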
    WriteRequest: structs.WriteRequest{Region: "global"},
}
var resp structs.NodeAllocsResponse
require.Nil(t, msgpackrpc.CallWithCodec(codec, "Node.UpdateAlloc", req, &resp))
require functions shouldn't be called from a non-main goroutine: https://golang.org/pkg/testing/#T.FailNow
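A hedged sketch of the same idea for the require call, reusing the surrounding test's codec and req: ship the RPC error back over a channel and assert on the test goroutine, since require ultimately calls t.FailNow.

errCh := make(chan error, 1)
go func() {
	var resp structs.NodeAllocsResponse
	errCh <- msgpackrpc.CallWithCodec(codec, "Node.UpdateAlloc", req, &resp)
}()
require.NoError(t, <-errCh) // the assertion now runs on the test goroutine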
nomad/drainerv2/drain_heap.go (outdated)
batch   chan []string
nodes   map[string]time.Time
trigger chan string
l       sync.RWMutex
mu instead of l for readability
No point in making it an RWMutex since only one critical section is read-only and it's never called concurrently.
Changed, but there doesn't appear to be a canonical Go naming scheme for these, and l is used quite a bit throughout the code base.
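For illustration only, a sketch of the shape being suggested (fields copied from the snippet above; only the lock changes, and the sync and time imports are assumed):

type deadlineHeap struct {
	mu      sync.Mutex // plain Mutex: the lone read-only section isn't hit concurrently
	batch   chan []string
	nodes   map[string]time.Time
	trigger chan string
}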
nomad/drainerv2/drain_heap.go (outdated)
d := &deadlineHeap{
	ctx:            ctx,
	coalesceWindow: coalesceWindow,
	batch:          make(chan []string, 4),
Do we really want this buffered? If we block on sends, then in the unlikely event a send blocks for some time, wouldn't that nicely cause more deadlines to be coalesced into a single batch?
It would have to block for longer than the coalesce window, which would be insane, but there really isn't much benefit to the buffering either.
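For context, a hypothetical sketch of how an unbuffered batch channel still coalesces; the function name and structure are assumptions, not the PR's implementation, and the context and time imports are assumed. IDs arriving during the window are batched, and a blocked send simply pushes later arrivals into the next batch.

// coalesce gathers IDs from in and emits them as one slice on the unbuffered
// out channel once the coalesce window has elapsed.
func coalesce(ctx context.Context, in <-chan string, out chan<- []string, window time.Duration) {
	var pending []string
	var timerC <-chan time.Time // nil (blocks forever in select) until the window starts
	for {
		select {
		case <-ctx.Done():
			return
		case id := <-in:
			if len(pending) == 0 {
				timerC = time.NewTimer(window).C // first ID opens the window
			}
			pending = append(pending, id)
		case <-timerC:
			timerC = nil
			select {
			case out <- pending: // unbuffered: blocks until the receiver is ready
				pending = nil
			case <-ctx.Done():
				return
			}
		}
	}
}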
t.Parallel()
require := require.New(t)
h := NewDeadlineHeap(context.Background(), 1*time.Second)
require.Implements((*DrainDeadlineNotifier)(nil), h)
This whole test body could be replaced with var _ DrainDeadlineNotifier = &deadlineHeap{} and just let it be a compile error if the implementation doesn't match.
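The suggested compile-time assertion would sit at package scope and look like this; the interface and struct names come from the snippet above.

// Fails the build, rather than a test, if deadlineHeap stops satisfying the interface.
var _ DrainDeadlineNotifier = &deadlineHeap{}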
nomad/drainerv2/drainer.go (outdated)
}

// Flush the state to create the necessary objects
n.flush()
Shouldn't this only happen when enabling? I kind of wish all of the state manipulated by flush() was encapsulated in a sub-struct that was created when enable=true and given a context that's canceled on subsequent SetEnabled calls. I think that would remove the need for locking in this struct entirely? Anyway, no need to change now; not that much code is involved.
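A rough sketch of the sub-struct idea being floated here; the names drainerState and NodeDrainer.state are assumptions for illustration, not the PR's fields.

// drainerState holds everything flush() used to rebuild; it only exists while
// draining is enabled and dies with its context.
type drainerState struct {
	ctx    context.Context
	cancel context.CancelFunc
	// watchers, the deadline heap, and other per-enable state would live here
}

func (n *NodeDrainer) SetEnabled(enabled bool) {
	if n.state != nil {
		n.state.cancel() // tear down the previous generation
		n.state = nil
	}
	if !enabled {
		return
	}
	ctx, cancel := context.WithCancel(context.Background())
	n.state = &drainerState{ctx: ctx, cancel: cancel}
	// build the per-enable objects here (what flush() does today)
}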
type drainingNode struct {
	state *state.StateStore
	node  *structs.Node
	l     sync.RWMutex
This struct doesn't need any locking. It does not mutate any of its state.
It does, since there is concurrent access and the node can be updated. The node is then used to decide the deadline time and which allocs are deadlined.
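For illustration, a sketch of the access pattern described here (the method names are assumptions): one goroutine swaps in the updated node while others read it.

func (n *drainingNode) Update(node *structs.Node) {
	n.l.Lock()
	defer n.l.Unlock()
	n.node = node
}

func (n *drainingNode) Node() *structs.Node {
	n.l.RLock()
	defer n.l.RUnlock()
	return n.node
}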
nomad/drainerv2/watch_jobs.go (outdated)
state:      state,
jobs:       make(map[structs.JobNs]struct{}, 64),
drainCh:    make(chan *DrainRequest, 8),
migratedCh: make(chan []*structs.Allocation, 8),
Do these need to be buffered > 1? I'm not sure there's any benefit to letting it run ahead of receivers; won't it just be able to do more work at a time if it runs less often due to a blocked send?
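A hypothetical sketch of the "do more work per send" behaviour with an unbuffered channel (names are assumptions, not the PR's code): while the receiver is busy, migrated allocations accumulate and go out as one larger slice on the next send.

func forwardMigrated(ctx context.Context, in <-chan *structs.Allocation, out chan<- []*structs.Allocation) {
	var pending []*structs.Allocation
	for {
		// Only arm the send case when there is something to send; a nil
		// channel never proceeds in a select.
		var sendCh chan<- []*structs.Allocation
		if len(pending) > 0 {
			sendCh = out
		}
		select {
		case <-ctx.Done():
			return
		case alloc := <-in:
			pending = append(pending, alloc)
		case sendCh <- pending:
			pending = nil
		}
	}
}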
)

// DrainingNodeWatcher is the interface for watching for draining nodes.
type DrainingNodeWatcher interface{}
I'm having a hard time figuring out what this does...
Just so that the factory can return a non-concrete type so that we can inject our own during testing.
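A sketch of the injection pattern being described; the factory type and constructor names here are assumptions for illustration. Production wiring returns the real watcher, while tests return a fake that trivially satisfies the empty interface.

type DrainingNodeWatcherFactory func(ctx context.Context, s *state.StateStore) DrainingNodeWatcher

// Hypothetical production factory.
func defaultWatcherFactory(ctx context.Context, s *state.StateStore) DrainingNodeWatcher {
	return newNodeDrainWatcher(ctx, s) // assumed real constructor
}

// Tests can inject their own stand-in.
type fakeWatcher struct{} // records calls, returns canned data, etc.

func testWatcherFactory(w DrainingNodeWatcher) DrainingNodeWatcherFactory {
	return func(context.Context, *state.StateStore) DrainingNodeWatcher { return w }
}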
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
This PR introduces a new node drain orchestrator that drains allocations off of nodes while attempting to minimize service downtime.