server: hook up and test node decommissioning #17272
Conversation
Force-pushed from ad7e058 to 7b6b09d.
Thanks! Was it intentional that you squashed all of @neeral's work from #17157 into your big commit? I got a chance to try this out this morning, and the system config issue is a gossip issue that I think was possible even before decommissioning. Assuming node 1 is the leaseholder for the system config range and then gets decommissioned, it looks like this is what's happening:
This doesn't appear to repro every time when I'm manually messing around with decommissioning nodes, so I don't yet fully understand what causes the new leaseholder to sometimes gossip the system config and sometimes not, but I'll look into it and send a fix.
Force-pushed from 7b6b09d to b2a142c.
Rebased on top of master and added a test for #17304.
With @a-robinson's help, this indeed seems to pass locally. Waiting for reviews.
Yeah, there was a bunch of churn between the two and it ended up just being easier (plus it doesn't hurt if all of the
I'll factor out the first commit, actually, so don't review that.
Force-pushed from a99c981 to ceec51f.
This is required for cockroachdb#17272 and merits a separate review.
@a-robinson I don't wanna punish you for helping me out with this PR, but now that it passes, do you mind reviewing it?
Reviewed 22 of 22 files at r1.

pkg/acceptance/decommission_test.go, line 208 at r1 (raw file):
What happened to the test case I added that tests decommissioning a down-but-not-dead node?

pkg/acceptance/cluster/cluster.go, line 46 at r1 (raw file):
Nit, but the comment should explain what the return parameters are in cases like this when it isn't obvious.

pkg/cli/flags.go, line 99 at r1 (raw file):
Minor, but this error would be more helpful if it listed the valid values.
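For illustration, here is a minimal, hedged Go sketch of an error message that spells out the accepted --wait values; the constant names and the `validateWait` helper are hypothetical stand-ins, not the PR's actual code.

```go
package main

import (
	"fmt"
	"strings"
)

// Hypothetical names for the --wait modes; the PR's actual constants may differ.
const (
	waitAll  = "all"
	waitLive = "live"
	waitNone = "none"
)

var validWaitModes = []string{waitAll, waitLive, waitNone}

// validateWait returns an error that lists the accepted values instead of
// only echoing the rejected input.
func validateWait(val string) error {
	for _, m := range validWaitModes {
		if val == m {
			return nil
		}
	}
	return fmt.Errorf("invalid value %q for --wait; valid values are: %s",
		val, strings.Join(validWaitModes, ", "))
}

func main() {
	if err := validateWait("some"); err != nil {
		fmt.Println(err) // invalid value "some" for --wait; valid values are: all, live, none
	}
}
```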
pkg/cli/flags.go, line 432 at r1 (raw file):
Mind moving this up nearer to the other

pkg/cli/node.go, line 200 at r1 (raw file):
Should this be `decommission [<nodeID> ...]` given that it's valid to not provide a node ID? Or should there be a different command for getting the decommission status of all nodes? If we keep it such that not providing a node ID just prints all nodes' statuses, the help text needs to be updated.

pkg/cli/node.go, line 209 at r1 (raw file):
Not that important, but is there any method to the madness of how these functions were ordered? Normally I'd expect to see either top-down or bottom-up, but this appears to be a mixture.

pkg/cli/node.go, line 257 at r1 (raw file):
Feel free to ignore as a style nit, but it'd make more sense to parse the node IDs just once, outside this retry loop (and without the error wrapping below), rather than on every iteration.

pkg/cli/node.go, line 257 at r1 (raw file):
Along the same lines, it's also a bit odd that we're asking the server to decommission the nodes every time. If one user tries to decommission a node, then a different user sees the node being decommissioned and wants to recommission it, this retry loop could override the second user's attempt to recommission it. Calling the
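As a rough illustration of the flow being suggested here (parse the node IDs once, issue the state-changing request once, then only poll), a hedged Go sketch follows. The `decommissionClient` interface, `nodeStatus` type, and method names are hypothetical stand-ins, not CockroachDB's actual admin RPCs.

```go
package main

import (
	"context"
	"fmt"
	"strconv"
	"time"
)

// nodeStatus and decommissionClient are hypothetical stand-ins for the real
// admin API; they exist only to illustrate the control flow.
type nodeStatus struct {
	NodeID       int
	ReplicaCount int
}

type decommissionClient interface {
	// MarkDecommissioning flags the given nodes as decommissioning.
	MarkDecommissioning(ctx context.Context, nodeIDs []int) error
	// Status reports progress without changing any state.
	Status(ctx context.Context, nodeIDs []int) ([]nodeStatus, error)
}

func runDecommission(ctx context.Context, c decommissionClient, args []string) error {
	// Parse the node IDs a single time, outside any retry loop.
	nodeIDs := make([]int, 0, len(args))
	for _, arg := range args {
		id, err := strconv.Atoi(arg)
		if err != nil {
			return fmt.Errorf("invalid node ID %q: %v", arg, err)
		}
		nodeIDs = append(nodeIDs, id)
	}

	// Issue the state-changing request once, so that later polling can't
	// silently override a recommission issued by someone else in the meantime.
	if err := c.MarkDecommissioning(ctx, nodeIDs); err != nil {
		return err
	}

	// From here on, only poll read-only status until the replicas are gone.
	for {
		statuses, err := c.Status(ctx, nodeIDs)
		if err != nil {
			return err
		}
		remaining := 0
		for _, s := range statuses {
			remaining += s.ReplicaCount
		}
		if remaining == 0 {
			fmt.Println("all target nodes report zero replicas")
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(time.Second):
		}
	}
}

func main() {
	// Wiring up a real client is out of scope for this sketch.
	_ = runDecommission
}
```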
pkg/cli/node.go, line 273 at r1 (raw file):
This statement is a bit strong given that we haven't done any validation of whether the replicas from the decommissioned node(s) have been properly upreplicated. We do mention this in the help text for the

pkg/cli/node.go, line 317 at r1 (raw file):
Again, this method ordering is... odd.

pkg/cli/cliflags/flags.go, line 493 at r1 (raw file):
s/either/any/

pkg/cmd/benchmark/main.go, line 32 at r1 (raw file):
Was this intentional?

pkg/cmd/github-pull-request-make/main.go, line 41 at r1 (raw file):
Was this intentional?

pkg/server/admin.go, line 1154 at r1 (raw file):
s/may don't/may not/

pkg/server/admin.go, line 1201 at r1 (raw file):
It'd be nice to use the liveness object returned above to get a more consistent view of each node's liveness rather than re-fetching it here. If getting ahold of the clock is too much work, though, then this is ok.

pkg/server/serverpb/admin.proto, line 346 at r1 (raw file):
I'd rather just be strict and make it an error to send an empty

pkg/server/serverpb/admin.proto, line 353 at r1 (raw file):
I think this comment is more confusing than clarifying (it also is a response to a

pkg/server/serverpb/admin.proto, line 607 at r1 (raw file):
Comment needs fixing.

pkg/storage/node_liveness.go, line 184 at r1 (raw file):
Why not return an error when this fails? It seems like a weird user experience that a decommission request can fail but return a 200.

pkg/storage/node_liveness.go, line 287 at r1 (raw file):
These

In short,

Comments from Reviewable
Review status: 21 of 22 files reviewed at latest revision, 20 unresolved discussions, some commit checks failed.

pkg/storage/node_liveness.go, line 287 at r1 (raw file): Previously, a-robinson (Alex Robinson) wrote…
I checked it out a little closer, and the existing calls look ok because they're in methods that only mess with self's liveness record. It's not safe in this context, though, since the node ID is being provided and might not be our own.

Comments from Reviewable
TFTR! Review status: 21 of 22 files reviewed at latest revision, 20 unresolved discussions, some commit checks failed.

pkg/acceptance/decommission_test.go, line 208 at r1 (raw file): Previously, a-robinson (Alex Robinson) wrote…
I dropped the commit and you've since re-added it.

pkg/acceptance/cluster/cluster.go, line 46 at r1 (raw file): Previously, a-robinson (Alex Robinson) wrote…
Not a nit if you ask me - fixed!

pkg/cli/flags.go, line 99 at r1 (raw file): Previously, a-robinson (Alex Robinson) wrote…
Done.

pkg/cli/flags.go, line 432 at r1 (raw file): Previously, a-robinson (Alex Robinson) wrote…
Done.

pkg/cli/node.go, line 200 at r1 (raw file): Previously, a-robinson (Alex Robinson) wrote…
Updated. I think it's worth doing another iteration over the API. Filed #17419.

pkg/cli/node.go, line 209 at r1 (raw file): Previously, a-robinson (Alex Robinson) wrote…
I just tried cleaning it up, checked with

pkg/cli/node.go, line 257 at r1 (raw file): Previously, a-robinson (Alex Robinson) wrote…
I don't think the behavior you're suggesting is necessarily better. The current behavior is that as long as I'm running

pkg/cli/node.go, line 257 at r1 (raw file): Previously, a-robinson (Alex Robinson) wrote…
Done.

pkg/cli/node.go, line 273 at r1 (raw file): Previously, a-robinson (Alex Robinson) wrote…
Good point, how's this?

pkg/cli/node.go, line 317 at r1 (raw file): Previously, a-robinson (Alex Robinson) wrote…
🏳️

pkg/cli/cliflags/flags.go, line 493 at r1 (raw file): Previously, a-robinson (Alex Robinson) wrote…
Done.

pkg/cmd/benchmark/main.go, line 32 at r1 (raw file): Previously, a-robinson (Alex Robinson) wrote…
Yes, though it's a complete drive-by.

pkg/cmd/github-pull-request-make/main.go, line 41 at r1 (raw file): Previously, a-robinson (Alex Robinson) wrote…
Yes, though it's a complete drive-by.

pkg/server/admin.go, line 1154 at r1 (raw file): Previously, a-robinson (Alex Robinson) wrote…
Done.

pkg/server/admin.go, line 1201 at r1 (raw file): Previously, a-robinson (Alex Robinson) wrote…
Done.

pkg/server/serverpb/admin.proto, line 346 at r1 (raw file): Previously, a-robinson (Alex Robinson) wrote…
The only path that calls this is

pkg/server/serverpb/admin.proto, line 353 at r1 (raw file): Previously, a-robinson (Alex Robinson) wrote…
Done.

pkg/server/serverpb/admin.proto, line 607 at r1 (raw file): Previously, a-robinson (Alex Robinson) wrote…
Done.

pkg/storage/node_liveness.go, line 184 at r1 (raw file): Previously, a-robinson (Alex Robinson) wrote…
Returning an error when we fail due to an epoch increment is just asking for flaky tests. Or what are you suggesting?

pkg/storage/node_liveness.go, line 287 at r1 (raw file): Previously, a-robinson (Alex Robinson) wrote…
Oops, I forgot this was in the diff. I think I added this because nodes that were marked as decommissioning would clear their decommissioning status on the next heartbeat, but of course you're right that this is completely wrong. I'll take another look.

Comments from Reviewable
Just a couple open comments in the node liveness changes. Reviewed 10 of 10 files at r3, 1 of 1 files at r4, 6 of 10 files at r5.

pkg/acceptance/decommission_test.go, line 93 at r5 (raw file):
I meant to remove this statement since having it here makes the test case I added degenerate to the case of a dead node, but it looks like I left it in. It's already been repeated where it belongs further down. Could you remove this one for me?

pkg/cli/flags.go, line 99 at r1 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
You don't have to change it in this case, but as a general practice, using the actual variables rather than hard-coding the values would be less likely to break if we ever change them.

pkg/cli/node.go, line 273 at r1 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Much better, thanks!

pkg/server/serverpb/admin.proto, line 346 at r1 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Makes sense. I think it's alright because the default will have

pkg/storage/node_liveness.go, line 184 at r1 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
But it's in a retry loop where we appear to handle failures due to losing the race with a concurrent update (via checking for

To be more specific, I'm referring to changing the

I think we're coming at this from slightly different perspectives, judging by this comment and the one in admin.proto about how

Comments from Reviewable
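To make the error split being discussed concrete, here is a hedged Go sketch of a retry helper that keeps retrying only when a compare-and-swap race is lost and hands every other failure back to the caller; the sentinel error and helper are hypothetical, not the actual node liveness code.

```go
package main

import (
	"errors"
	"fmt"
)

// errConditionFailed stands in for "the record changed underneath us".
var errConditionFailed = errors.New("condition failed")

// retryOnRace runs attempt until it succeeds, fails with a non-retryable
// error, or exhausts maxAttempts.
func retryOnRace(maxAttempts int, attempt func() error) error {
	var err error
	for i := 0; i < maxAttempts; i++ {
		err = attempt()
		if err == nil {
			return nil
		}
		if !errors.Is(err, errConditionFailed) {
			// A real failure: propagate it instead of swallowing it.
			return err
		}
		// Lost the race; the attempt closure reloads state and tries again.
	}
	return fmt.Errorf("giving up after %d attempts: %w", maxAttempts, err)
}

func main() {
	attempts := 0
	err := retryOnRace(3, func() error {
		attempts++
		if attempts < 2 {
			return errConditionFailed // simulate losing the race once
		}
		return nil
	})
	fmt.Println(err, attempts) // <nil> 2
}
```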
Still looking at the node_liveness changes. What seems clear so far is that while that change was wrong, it also got the test passing. Now it hangs. Digging.

Review status: 18 of 22 files reviewed at latest revision, 6 unresolved discussions, some commit checks failed.

pkg/acceptance/decommission_test.go, line 93 at r5 (raw file): Previously, a-robinson (Alex Robinson) wrote…
Done.

pkg/storage/node_liveness.go, line 184 at r1 (raw file): Previously, a-robinson (Alex Robinson) wrote…
Ah, sorry, I didn't realize you were referring to the error below; your suggestion sounds reasonable. Let's see what happens with tests (the

There will likely be flakes with the new patch, but I'll get to them. Shouldn't be hard.

Comments from Reviewable
Force-pushed from 47e4dab to 13c3f3d.
Ok, so after removing the weirdness I added in the node liveness code and patching up the acceptance test a bit more, it looks like it's passing reliably, at least locally. Let's see what CI has to say. Code-wise, I'd say this is good for one last look, but hopefully it's ready now.
Reviewed 1 of 10 files at r5, 11 of 12 files at r6, 1 of 1 files at r8.

pkg/storage/node_liveness.go, line 287 at r1 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Forget to address this? Or are you planning on leaving the FIXMEs in?

pkg/storage/node_liveness.go, line 188 at r6 (raw file):
Does this really work right without re-grabbing the latest liveness in each loop iteration? It seems like if this fails once due to a heartbeat or epoch increment, it would never succeed on later retries.

Comments from Reviewable
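For illustration, a minimal, hedged sketch of the pattern being asked for above: reload the latest record inside the loop so a retry after a concurrent heartbeat or epoch bump can actually succeed. The `liveness` type and `livenessStore` interface are hypothetical, not the real liveness record or storage API.

```go
package main

import "errors"

// errConditionFailed stands in for a conditional-put failure.
var errConditionFailed = errors.New("condition failed")

// liveness is a hypothetical record; the real one has more fields.
type liveness struct {
	Epoch           int
	Decommissioning bool
}

type livenessStore interface {
	Get(nodeID int) (liveness, error)
	// CompareAndSet replaces the stored record only if it still equals old.
	CompareAndSet(nodeID int, old, updated liveness) error
}

func setDecommissioning(s livenessStore, nodeID int) error {
	for {
		// Reload on every iteration; a stale copy of `old` can never win the
		// compare-and-set again once someone else has written.
		old, err := s.Get(nodeID)
		if err != nil {
			return err
		}
		updated := old
		updated.Decommissioning = true
		switch err := s.CompareAndSet(nodeID, old, updated); {
		case err == nil:
			return nil
		case errors.Is(err, errConditionFailed):
			continue // a heartbeat or epoch bump raced us; retry with fresh state
		default:
			return err
		}
	}
}

func main() {} // illustrative only; a concrete livenessStore is out of scope here
```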
Ignore the first commit -- that's cockroachdb#16968.

See cockroachdb#6198. This implements the "management portion" of

It leans heavily on @neeral's WIP PR cockroachdb#17157.

- add `node decommission [--wait=all|live|none] [nodeID1 nodeID2 ...]`
- add `cockroach quit --decommission`
- add a comprehensive acceptance test that puts all of these to good use.

It works surprisingly well but, as you'd expect, there are kinks. Specifically, in the acceptance test, the invocation `quit --decommission` tends to hang for extended periods of time, sometimes "forever". In the most recent run, this was traced to the fact that the lease holder for a replica remaining on a decommissioning node had *no* ZoneConfig in gossip, which effectively disables its leaseholder replication checks. It is not clear whether this is related to decommissioning in the first place, though the leaseholder node was itself decommissioned, recommissioned, and restarted when this occurred.

The acceptance test also requires at least four nodes to work, and it takes around 10 minutes, so we may only want to run a reduced version during regular CI, with the long one running nightly.

The invocation for the failing acceptance test is:

```
make acceptance TESTS=Decom TESTFLAGS='-v -show-logs -nodes=4' TESTTIMEOUT=20m
```

(If the test runs and fails with the localcluster shim complaining about unexpected events, that's because I haven't had a chance to tell it about the node we're intentionally `--quit`ting yet -- or rather, to test what I did to tell it about that.)

cc @a-robinson

Closes cockroachdb#17157.
Tests the fix in cockroachdb#17304 for issue cockroachdb#17288
Review status: all files reviewed at latest revision, 5 unresolved discussions, all commit checks successful.

pkg/storage/node_liveness.go, line 287 at r1 (raw file): Previously, a-robinson (Alex Robinson) wrote…
I managed to forget to push the removal of those two. Gone now, though.

pkg/storage/node_liveness.go, line 188 at r6 (raw file): Previously, a-robinson (Alex Robinson) wrote…
Yikes, good catch!

Comments from Reviewable