
Node drain orchestrator #3951

Merged: 11 commits merged into f-drainv2-node-drainer from f-drainer on Mar 8, 2018
Conversation

@dadgar (Contributor) commented Mar 8, 2018

This PR introduces a new node drain orchestrator to drain allocations off of nodes while attempting to minimize service downtime.

@schmichael (Member) left a comment:

I can fixup most of this so feel free to merge whenever.

Healthy: helper.BoolToPtr(true),
}
updates = append(updates, newAlloc)
logger.Printf("Marked deployment health for alloc %q", alloc.ID)
schmichael (Member):

Include node id

return
}

t.Fatalf("failed to get node allocs: %v", err)
schmichael (Member):

Calling Fatalf shouldn't be done from a non-main goroutine: https://golang.org/pkg/testing/#T.FailNow
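A minimal sketch (not this PR's test) of the pattern being pointed at, where fetchNodeAllocs and the package name are hypothetical stand-ins: the worker goroutine hands its error back over a channel, and only the goroutine running the test calls Fatal.

package drainer

import (
	"fmt"
	"testing"
)

// fetchNodeAllocs is a hypothetical stand-in for the RPC call made in the real test.
func fetchNodeAllocs() ([]string, error) { return []string{"alloc-1"}, nil }

func TestFetchNodeAllocs(t *testing.T) {
	errCh := make(chan error, 1)
	go func() {
		if _, err := fetchNodeAllocs(); err != nil {
			errCh <- fmt.Errorf("failed to get node allocs: %v", err)
			return
		}
		errCh <- nil
	}()

	if err := <-errCh; err != nil {
		t.Fatal(err) // safe here: we are back on the goroutine running the test
	}
}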

WriteRequest: structs.WriteRequest{Region: "global"},
}
var resp structs.NodeAllocsResponse
require.Nil(t, msgpackrpc.CallWithCodec(codec, "Node.UpdateAlloc", req, &resp))
schmichael (Member):

require functions shouldn't be called from a non-main goroutine: https://golang.org/pkg/testing/#T.FailNow
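As a sketch of one alternative, testify's assert package reports failures through t.Errorf, which the testing docs allow from any goroutine, so a non-fatal check can stay inside the goroutine while the test goroutine waits for it to finish. Here callUpdateAlloc and the package name are hypothetical stand-ins for the real RPC call.

package drainer

import (
	"testing"

	"github.com/stretchr/testify/assert"
)

// callUpdateAlloc is a hypothetical stand-in for
// msgpackrpc.CallWithCodec(codec, "Node.UpdateAlloc", req, &resp).
func callUpdateAlloc() error { return nil }

func TestUpdateAllocFromGoroutine(t *testing.T) {
	done := make(chan struct{})
	go func() {
		defer close(done)
		// assert (unlike require) does not call t.FailNow, so it is safe here.
		assert.NoError(t, callUpdateAlloc())
	}()
	<-done // ensure the goroutine finishes before the test returns
}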

batch chan []string
nodes map[string]time.Time
trigger chan string
l sync.RWMutex
schmichael (Member):

mu instead of l for readability

schmichael (Member):

no point in making a RWMutex since only 1 critical section is readonly and it's never called concurrently

dadgar (Contributor, Author):

Changed, but there doesn't appear to be a canonical Go naming scheme for these, and l is used quite a bit throughout the codebase.

d := &deadlineHeap{
ctx: ctx,
coalesceWindow: coalesceWindow,
batch: make(chan []string, 4),
schmichael (Member):

Do we really want this buffered? It seems like if we block on sends (in the unlikely event it blocks for some time), won't that nicely cause more deadlines to be coalesced into a single batch?

dadgar (Contributor, Author):

It would have to block for longer than the coalesce window... which would be insane, but there really isn't much benefit to the buffering either.
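A minimal sketch of the coalescing behavior under discussion (illustrative only, not the PR's deadlineHeap; the function and channel names are assumptions): with an unbuffered batch channel, a blocked send just lets more expired node IDs accumulate, so they are emitted as one larger batch once the receiver catches up.

package drainer

import "time"

// coalesceExpired collects expired node IDs and emits them in batches at most
// once per coalesceWindow; an unbuffered output channel means slow receivers
// simply produce bigger batches.
func coalesceExpired(expired <-chan string, coalesceWindow time.Duration) <-chan []string {
	batch := make(chan []string) // unbuffered on purpose
	go func() {
		defer close(batch)
		var pending []string
		ticker := time.NewTicker(coalesceWindow)
		defer ticker.Stop()
		for {
			select {
			case node, ok := <-expired:
				if !ok {
					return
				}
				pending = append(pending, node)
			case <-ticker.C:
				if len(pending) > 0 {
					batch <- pending // blocks until the consumer is ready
					pending = nil
				}
			}
		}
	}()
	return batch
}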

t.Parallel()
require := require.New(t)
h := NewDeadlineHeap(context.Background(), 1*time.Second)
require.Implements((*DrainDeadlineNotifier)(nil), h)
schmichael (Member):

This whole test body could be replaced with var _ DrainDeadlineNotifier = &deadlineHeap{} and just let it be a compile error if the implementation doesn't match.
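For reference, the suggested compile-time check is a single package-level declaration; it fails the build if *deadlineHeap ever stops satisfying DrainDeadlineNotifier.

// Compile-time interface check suggested above.
var _ DrainDeadlineNotifier = &deadlineHeap{}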

}

// Flush the state to create the necessary objects
n.flush()
schmichael (Member):

Shouldn't this only happen when enabling?

I kind of wish all of the state manipulated by flush() was encapsulated in a sub-struct that was created when enable=true and given a context that's canceled on subsequent SetEnabled calls. I think that would remove the need for locking in this struct entirely?

Anyway, no need to change now. Not that much code is involved.
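A rough sketch of the encapsulation being described, where drainerState, nodeDrainer, and their fields are illustrative names rather than the PR's types: the state that flush() currently resets lives in a sub-struct created only when draining is enabled, and its context is canceled by the next SetEnabled call, so the previous run's goroutines observe ctx.Done() and exit.

package drainer

import "context"

type drainerState struct {
	ctx    context.Context
	cancel context.CancelFunc
	// ...per-run state previously reset by flush()...
}

type nodeDrainer struct {
	state *drainerState
}

func (d *nodeDrainer) SetEnabled(enabled bool) {
	if d.state != nil {
		d.state.cancel() // stop the previous run
		d.state = nil
	}
	if !enabled {
		return
	}
	ctx, cancel := context.WithCancel(context.Background())
	d.state = &drainerState{ctx: ctx, cancel: cancel}
	// start this run's watchers/goroutines with d.state.ctx here
}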

type drainingNode struct {
state *state.StateStore
node *structs.Node
l sync.RWMutex
schmichael (Member):

This struct doesn't need any locking. It does not mutate any of its state.

dadgar (Contributor, Author):

It does, since there is concurrent access and the node can be updated. The node is then used to decide the deadline time, whether draining is done, and the deadlined allocs.
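A small sketch of why the lock matters (illustrative names only; drainNode stands in for *structs.Node): an Update can swap the node concurrently with readers such as DeadlineTime, so both sides take the mutex.

package drainer

import (
	"sync"
	"time"
)

type drainNode struct {
	deadline time.Time
}

type drainingNodeSketch struct {
	mu   sync.RWMutex
	node *drainNode
}

// Update replaces the node, e.g. when its drain spec changes.
func (n *drainingNodeSketch) Update(node *drainNode) {
	n.mu.Lock()
	defer n.mu.Unlock()
	n.node = node
}

// DeadlineTime reads the current node under the lock.
func (n *drainingNodeSketch) DeadlineTime() time.Time {
	n.mu.RLock()
	defer n.mu.RUnlock()
	return n.node.deadline
}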

state: state,
jobs: make(map[structs.JobNs]struct{}, 64),
drainCh: make(chan *DrainRequest, 8),
migratedCh: make(chan []*structs.Allocation, 8),
schmichael (Member):

Do these need to be buffered >1? I'm not sure there's any benefit to letting it run ahead of receivers; won't it just be able to do more work at a time if it runs less often due to a blocked send?

)

// DrainingNodeWatcher is the interface for watching for draining nodes.
type DrainingNodeWatcher interface{}
schmichael (Member):

I'm having a hard time figuring out what this does...

dadgar (Contributor, Author):

Just so that the factory can return a non-concrete type so that we can inject our own during testing.
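A small sketch of that pattern (the factory and watcher names here are illustrative, not the PR's exact signatures): because the factory returns the interface type, tests can inject a fake in place of the real watcher.

package drainer

// DrainingNodeWatcher is the interface for watching for draining nodes.
type DrainingNodeWatcher interface{}

// nodeWatcherFactory is a hypothetical factory type: production code wires in
// the real constructor, tests wire in one that returns a fake.
type nodeWatcherFactory func() DrainingNodeWatcher

type realWatcher struct{}
type fakeWatcher struct{}

func newRealWatcher() DrainingNodeWatcher { return &realWatcher{} }
func newFakeWatcher() DrainingNodeWatcher { return &fakeWatcher{} }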

@dadgar merged commit 0db7e8d into f-drainv2-node-drainer Mar 8, 2018
@dadgar deleted the f-drainer branch March 8, 2018 21:28
@github-actions bot commented:

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions bot locked as resolved and limited conversation to collaborators Mar 11, 2023