
server: hook up and test node decommissioning #17272

Merged
merged 3 commits into cockroachdb:master from decommission-cli
Aug 5, 2017

Conversation

tbg
Member

@tbg tbg commented Jul 27, 2017

See #6198. This implements the "management portion" of node decommissioning.
It leans heavily on @neeral's WIP PR #17157.

  • add `node decommission [--wait=all|live|none] [nodeID1 nodeID2 ...]`
  • add `cockroach quit --decommission`
  • add a comprehensive acceptance test that puts all of these to good use.

It works surprisingly well but, as you'd expect, there are kinks. Specifically,
in the acceptance test, the invocation `quit --decommission` tends to hang for
extended periods of time, sometimes "forever". In the most recent run, this was
traced to the fact that the lease holder for a replica remaining on
a decommissioning node had no ZoneConfig in gossip, which effectively
disables its leaseholder replication checks. It is not clear whether this is
related to decommissioning in the first place, though the leaseholder node was
itself decommissioned, recommissioned, and restarted when this occurred.

The acceptance test also requires at least four nodes to work, and it takes
around 10 minutes, so we may want to run only a reduced version during
regular CI, with the long one running nightly.

The invocation for the failing acceptance test is:

make acceptance TESTS=Decom TESTFLAGS='-v -show-logs -nodes=4' TESTTIMEOUT=20m

(if the test runs and fails with the localcluster shim complaining about
unexpected events, that's because I haven't yet had a chance to tell it
about the node we're intentionally `--quit`ting, or rather, to test what
I did to tell it about that)

cc @a-robinson

Closes #17157.

@cockroach-teamcity
Member

This change is Reviewable

@tbg tbg force-pushed the decommission-cli branch 2 times, most recently from ad7e058 to 7b6b09d on July 28, 2017 04:35
@a-robinson
Contributor

Thanks! Was it intentional that you squashed all of @neeral's work from #17157 into your big commit?

I got a chance to try this out this morning, and the system config issue is a gossip issue that I think was possible even before decommissioning. Assuming node 1 is the leaseholder for the system config range and then gets decommissioned, it looks like this is what's happening:

  1. n1 takes lease for system config range at some point and gossips the system config
  2. n1 gets decommissioned and transfers all its ranges/leases away
  3. Some other node (let's say n2) takes the lease on the system config range, but doesn't gossip it because it's equivalent to the existing system config info in gossip, and we only re-gossip it when it's changed.
  4. n1 gets brought back up and reconnects to gossip. However, because the source node for the gossiped system config is n1, all other nodes assume they don't need to send their copy to n1. This means that n1 won't receive the system config until something causes the system config's value to change.

This doesn't appear to repro every time when I'm manually messing around decommissioning nodes, so I don't yet fully understand what causes the new leaseholder to sometimes gossip the system config and sometimes not, but I'll look into it and send a fix.
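To make the failure mode concrete, here is a minimal standalone Go sketch (toy types and names, not the real gossip API) of the "only re-gossip when changed" check from step 3 that leaves the stale source node on the record:

```go
package main

import (
	"bytes"
	"fmt"
)

// gossipStore is a toy stand-in for the cluster-wide gossip state; the
// real implementation also tracks the source node of each record, which
// is what goes stale in step 4.
var gossipStore = map[string][]byte{}

// maybeGossipSystemConfig sketches step 3: the new leaseholder writes to
// gossip only when its copy differs from what is already there. If the
// values are equal, nothing is written, so the record keeps its old
// source node (n1) and a restarted n1 never gets it pushed back.
func maybeGossipSystemConfig(cfg []byte) {
	if existing, ok := gossipStore["system-db"]; ok && bytes.Equal(existing, cfg) {
		return // unchanged: skip re-gossiping
	}
	gossipStore["system-db"] = cfg
}

func main() {
	maybeGossipSystemConfig([]byte("cfg-v1")) // n1 gossips while holding the lease
	maybeGossipSystemConfig([]byte("cfg-v1")) // n2 takes the lease: no-op, source stays n1
	fmt.Println(string(gossipStore["system-db"]))
}
```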

@a-robinson
Contributor

Rebased on top of master and added a test for #17304

@tbg tbg force-pushed the decommission-cli branch from b2a142c to c889a5c on July 31, 2017 15:30
@tbg
Member Author

tbg commented Jul 31, 2017

With @a-robinson's help, this indeed seems to pass locally. Waiting for reviews.

Thanks! Was it intentional that you squashed all of @neeral's work from #17157 into your big commit?

Yeah, there was a bunch of churn between the two and it ended up just being easier (plus it doesn't hurt if all of the git blame is on me). I gave credit in the commit message instead.

@tbg
Member Author

tbg commented Jul 31, 2017

I'll factor out the first commit, actually, so don't review that.

@tbg tbg force-pushed the decommission-cli branch from c889a5c to f3d8b5d on July 31, 2017 16:26
@tbg tbg force-pushed the decommission-cli branch 2 times, most recently from a99c981 to ceec51f on July 31, 2017 16:48
tbg added a commit to tbg/cockroach that referenced this pull request Jul 31, 2017
This is required for

cockroachdb#17272

and merits a separate review.
@tbg tbg force-pushed the decommission-cli branch from ceec51f to 561c66a on July 31, 2017 21:12
@tbg tbg requested a review from a-robinson August 1, 2017 14:02
@tbg tbg force-pushed the decommission-cli branch from 561c66a to e0650e6 on August 1, 2017 14:09
@tbg
Member Author

tbg commented Aug 1, 2017

@a-robinson I don't wanna punish you for helping me out with this PR, but now that it passes, do you mind reviewing it?

@a-robinson
Contributor

Reviewed 22 of 22 files at r1.
Review status: all files reviewed at latest revision, 20 unresolved discussions, some commit checks failed.


pkg/acceptance/decommission_test.go, line 208 at r1 (raw file):

	}

	// It is being decommissioned in absentia, meaning that its replicas are

What happened to the test case I added that tests decommissioning a down-but-not-dead node?


pkg/acceptance/cluster/cluster.go, line 46 at r1 (raw file):

	// dismantle the cluster.
	AssertAndStop(context.Context, testing.TB)
	// ExecRoot executes the given command with super-user privileges.

Nit, but the comment should explain what the return parameters are in cases like this when it isn't obvious.


pkg/cli/flags.go, line 99 at r1 (raw file):

		*s = nodeDecommissionWaitNone
	default:
		return errors.New("invalid value")

Minor, but this error would be more helpful if it listed the valid values


pkg/cli/flags.go, line 432 at r1 (raw file):

	// Quit command.
	boolFlag(quitCmd.Flags(), &serverDecommission, cliflags.Decommission, false)

Mind moving this up nearer to the other quitCmd flags?


pkg/cli/node.go, line 200 at r1 (raw file):

var decommissionNodeCmd = &cobra.Command{
	Use:   "decommission <nodeID1> [<nodeID2> ...]",

Should this be `"decommission [<nodeID1> <nodeID2> ...]"` given that it's valid to not provide a node ID? Or should there be a different command for getting the decommission status of all nodes?

If we keep it such that not providing a node ID just prints all nodes' statuses, the help text needs to be updated.
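For reference, a sketch of the optional-argument shape this suggests (hypothetical code, not the PR's actual command definition):

```go
package cli

import "github.com/spf13/cobra"

// Square brackets in cobra's Use string mark the node IDs as optional,
// and the help text spells out the no-argument behavior.
var decommissionNodeCmd = &cobra.Command{
	Use:   "decommission [<nodeID1> <nodeID2> ...]",
	Short: "decommission the given node(s), or print status for all nodes",
	Long: `Marks the given nodes as decommissioning.
With no node IDs, prints the decommissioning status of every node.`,
	RunE: func(cmd *cobra.Command, args []string) error {
		// args may be empty here; the no-argument case falls through
		// to printing all nodes' statuses.
		return nil
	},
}
```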


pkg/cli/node.go, line 209 at r1 (raw file):

}

func setDecommission(

Not that important, but is there any method to the madness of how these functions were ordered? Normally I'd expect to see either top-down or bottom-up, but this appears to be a mixture.


pkg/cli/node.go, line 257 at r1 (raw file):

	}
	for r := retry.StartWithCtx(ctx, opts); r.Next(); {
		resp, err := setDecommission(ctx, c, args, true)

Feel free to ignore as a style nit, but it'd make more sense to parse the node IDs just once, outside this retry loop (and without the error wrapping below), rather than on every iteration.


pkg/cli/node.go, line 257 at r1 (raw file):

	}
	for r := retry.StartWithCtx(ctx, opts); r.Next(); {
		resp, err := setDecommission(ctx, c, args, true)

Along the same lines, it's also a bit odd that we're asking the server to decommission the nodes every time. If one user tries to decommission a node, then a different user sees the node being decommissioned and wants to recommission it, this retry loop could override the second user's attempt to recommission it.

Calling the DecommissionStatus API in the retry loop instead of the Decommission API may be preferable.
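A minimal sketch of this suggested alternative, using a hypothetical pared-down client interface rather than the real serverpb API: set the flag once, then only poll.

```go
package cli

import (
	"context"
	"time"
)

// statusResponse is a pared-down stand-in for serverpb.DecommissionStatusResponse.
type statusResponse struct {
	ReplicaCount       int
	AllDecommissioning bool
}

// adminClient is a hypothetical slice of the admin RPC surface.
type adminClient interface {
	Decommission(ctx context.Context, nodeIDs []string) (statusResponse, error)
	DecommissionStatus(ctx context.Context, nodeIDs []string) (statusResponse, error)
}

// runDecommissionWait sets the decommissioning flag exactly once, then
// only polls status inside the wait loop, so the loop cannot override
// a concurrent recommission issued by someone else.
func runDecommissionWait(ctx context.Context, c adminClient, nodeIDs []string) error {
	if _, err := c.Decommission(ctx, nodeIDs); err != nil {
		return err
	}
	for {
		resp, err := c.DecommissionStatus(ctx, nodeIDs)
		if err != nil {
			return err
		}
		if resp.ReplicaCount == 0 && resp.AllDecommissioning {
			return nil // targets drained; safe to report completion
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(time.Second): // simple fixed backoff for the sketch
		}
	}
}
```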


pkg/cli/node.go, line 273 at r1 (raw file):

		}
		if replicaCount == 0 && allDecommissioning {
			fmt.Fprintln(os.Stdout, "The target nodes may now be removed from the cluster.")

This statement is a bit strong given that we haven't done any validation of whether the replicas from the decommissioned node(s) have been properly upreplicated. We do mention this in the help text for the --wait flag, but this message is more visible. Should we warn if there were down nodes that we didn't wait on?


pkg/cli/node.go, line 317 at r1 (raw file):

}

func printDecommissionStatus(resp serverpb.DecommissionStatusResponse) error {

Again, this method ordering is... odd


pkg/cli/cliflags/flags.go, line 493 at r1 (raw file):

		Description: `
Specifies when to return after having marked the targets as decommissioning.
Takes either of the following values:

s/either/any/


pkg/cmd/benchmark/main.go, line 32 at r1 (raw file):

	"time"

	"golang.org/x/net/context"

Was this intentional?


pkg/cmd/github-pull-request-make/main.go, line 41 at r1 (raw file):

	"time"

	"golang.org/x/net/context"

Was this intentional?


pkg/server/admin.go, line 1154 at r1 (raw file):

	ctx context.Context, req *serverpb.DecommissionStatusRequest,
) (*serverpb.DecommissionStatusResponse, error) {
	// Get the number of replicas on each node. We *may* don't need all of them,

s/may don't/may not/


pkg/server/admin.go, line 1201 at r1 (raw file):

			Draining:        l.Draining,
		}
		if live, err := s.server.nodeLiveness.IsLive(nodeID); err == nil && live {

It'd be nice to use the liveness object returned above to get a more consistent view of each node's liveness rather than re-fetching it here. If getting ahold of the clock is too much work, though, then this is ok.


pkg/server/serverpb/admin.proto, line 346 at r1 (raw file):

// all nodes specified by 'node_id' to the value of 'decommissioning'.
//
// If no 'node_id' is given, targets the recipient node.

I'd rather just be strict and make it an error to send an empty node_ids unless we have reason to be more lenient. Seems nicer than letting people accidentally decommission a node.
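A sketch of what that strictness could look like, with a hypothetical stand-in for the proto-generated type:

```go
package server

import "errors"

// decommissionRequest is a hypothetical stand-in for the proto message.
type decommissionRequest struct {
	NodeIDs         []int32
	Decommissioning bool
}

// validateDecommissionRequest rejects an empty node_id list outright
// rather than silently targeting the recipient node.
func validateDecommissionRequest(req *decommissionRequest) error {
	if len(req.NodeIDs) == 0 {
		return errors.New("no node_id specified; refusing to implicitly target the recipient node")
	}
	return nil
}
```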


pkg/server/serverpb/admin.proto, line 353 at r1 (raw file):

}

// DecommissionStatusResponse is the response to a successful DecommissionRequest and

I think this comment is more confusing than clarifying (it also is a response to a DecommissionStatusRequest, for which "after having processing the request" doesn't make any sense). Mind just removing it?


pkg/server/serverpb/admin.proto, line 607 at r1 (raw file):

  }

  // Decommission puts the node into the specified decommissioning state

Comment needs fixing.


pkg/storage/node_liveness.go, line 184 at r1 (raw file):

// SetDecommissioning runs a best-effort attempt of marking the the liveness
// record as decommissioning.
func (nl *NodeLiveness) SetDecommissioning(

Why not return an error when this fails? It seems like a weird user experience that a decommission request can fail but return a 200.


pkg/storage/node_liveness.go, line 287 at r1 (raw file):

		return err
	}
	nl.setSelf(newLiveness)

These setSelf calls look wrong to me, but they're also used everywhere else throughout the file. I wonder if they're all broken too...

In short, setSelf blindly updates nl.mu.self without having any idea of whether the liveness record is actually for itself. If node x decommissions node y, it seems like node x would overwrite its own local liveness record with node y's. It'll get corrected next time it tries to heartbeat its liveness record, but that still seems bad.


Comments from Reviewable

@a-robinson
Contributor

Review status: 21 of 22 files reviewed at latest revision, 20 unresolved discussions, some commit checks failed.


pkg/storage/node_liveness.go, line 287 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

These setSelf calls look wrong to me, but they're also used everywhere else throughout the file. I wonder if they're all broken too...

In short, setSelf blindly updates nl.mu.self without having any idea of whether the liveness record is actually for itself. If node x decommissions node y, it seems like node x would overwrite its own local liveness record with node y's. It'll get corrected next time it tries to heartbeat its liveness record, but that still seems bad.

I checked it out a little closer, and the existing calls look ok because they're in methods that only mess with self's liveness record. It's not safe in this context, though, since the node ID is being provided and might not be our own.
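For illustration, a standalone sketch (toy types and hypothetical field names) of a guard that would make such a call safe by only caching records that belong to the local node:

```go
package storage

import "sync"

// liveness is a pared-down stand-in for the real liveness record.
type liveness struct {
	NodeID int32
	Epoch  int64
}

// NodeLiveness is a toy version of the real type, holding only what
// the sketch needs.
type NodeLiveness struct {
	ownNodeID int32
	mu        struct {
		sync.Mutex
		self liveness
	}
}

// setSelf, guarded: only cache the record when it actually belongs to
// the local node, so decommissioning node y from node x cannot clobber
// x's own cached record.
func (nl *NodeLiveness) setSelf(l liveness) {
	nl.mu.Lock()
	defer nl.mu.Unlock()
	if l.NodeID == nl.ownNodeID {
		nl.mu.self = l
	}
}
```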


Comments from Reviewable

@tbg
Member Author

tbg commented Aug 3, 2017

TFTR!


Review status: 21 of 22 files reviewed at latest revision, 20 unresolved discussions, some commit checks failed.


pkg/acceptance/decommission_test.go, line 208 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

What happened to the test case I added that tests decommissioning a down-but-not-dead node?

I dropped the commit and you've since re-added it.


pkg/acceptance/cluster/cluster.go, line 46 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Nit, but the comment should explain what the return parameters are in cases like this when it isn't obvious.

Not a nit if you ask me - fixed!


pkg/cli/flags.go, line 99 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Minor, but this error would be more helpful if it listed the valid values

Done.


pkg/cli/flags.go, line 432 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Mind moving this up nearer to the other quitCmd flags?

Done.


pkg/cli/node.go, line 200 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Should this be `"decommission [<nodeID1> <nodeID2> ...]"` given that it's valid to not provide a node ID? Or should there be a different command for getting the decommission status of all nodes?

If we keep it such that not providing a node ID just prints all nodes' statuses, the help text needs to be updated.

Updated. I think it's worth doing another iteration over the API. Filed #17419


pkg/cli/node.go, line 209 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Not that important, but is there any method to the madness of how these functions were ordered? Normally I'd expect to see either top-down or bottom-up, but this appears to be a mixture.

I just tried cleaning it up, checked with zone.go and realized that there likely isn't a method to the madness. I'll leave as-is for now.
Ideally we'd just split all these files up a bunch.


pkg/cli/node.go, line 257 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Along the same lines, it's also a bit odd that we're asking the server to decommission the nodes every time. If one user tries to decommission a node, then a different user sees the node being decommissioned and wants to recommission it, this retry loop could override the second user's attempt to recommission it.

Calling the DecommissionStatus API in the retry loop instead of the Decommission API may be preferable.

I don't think the behavior you're suggesting is necessarily better. The current behavior is that as long as I'm running decommission --wait, I want to actually see these nodes decommissioned, even if someone interferes. We could change it, but I prefer to leave as is until we have discussed which one we want.


pkg/cli/node.go, line 257 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Feel free to ignore as a style nit, but it'd make more sense to parse the node IDs just once, outside this retry loop (and without the error wrapping below), rather than on every iteration.

Done.


pkg/cli/node.go, line 273 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

This statement is a bit strong given that we haven't done any validation of whether the replicas from the decommissioned node(s) have been properly upreplicated. We do mention this in the help text for the --wait flag, but this message is more visible. Should we warn if there were down nodes that we didn't wait on?

Good point, how's this?


pkg/cli/node.go, line 317 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Again, this method ordering is... odd

🏳️


pkg/cli/cliflags/flags.go, line 493 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

s/either/any/

Done.


pkg/cmd/benchmark/main.go, line 32 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Was this intentional?

Yes, though it's a complete drive-by.


pkg/cmd/github-pull-request-make/main.go, line 41 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Was this intentional?

Yes, though it's a complete drive-by.


pkg/server/admin.go, line 1154 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

s/may don't/may not/

Done.


pkg/server/admin.go, line 1201 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

It'd be nice to use the liveness object returned above to get a more consistent view of each node's liveness rather than re-fetching it here. If getting ahold of the clock is too much work, though, then this is ok.

Done.


pkg/server/serverpb/admin.proto, line 346 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

I'd rather just be strict and make it an error to send an empty node_ids unless we have reason to be more lenient. Seems nicer than letting people accidentally decommission a node.

The only path that calls this is quit --decommission.


pkg/server/serverpb/admin.proto, line 353 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

I think this comment is more confusing than clarifying (it also is a response to a DecommissionStatusRequest, for which "after having processing the request" doesn't make any sense). Mind just removing it?

Done.


pkg/server/serverpb/admin.proto, line 607 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Comment needs fixing.

Done.


pkg/storage/node_liveness.go, line 184 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Why not return an error when this fails? It seems like a weird user experience that a decommission request can fail but return a 200.

Returning an error when we fail for epoch increment is just asking for flaky tests. Or what are you suggesting?


pkg/storage/node_liveness.go, line 287 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

I checked it out a little closer, and the existing calls look ok because they're in methods that only mess with self's liveness record. It's not safe in this context, though, since the node ID is being provided and might not be our own.

Oops, I forgot this was in the diff. I think I added this because nodes that were marked as decommissioning would clear their decommissioning status on the next heartbeat, but of course you're right that this is completely wrong. I'll take another look.


Comments from Reviewable

@tbg tbg force-pushed the decommission-cli branch from 5accfed to 308ff50 on August 3, 2017 16:25
@a-robinson
Contributor

:lgtm:, just a couple open comments in the node liveness changes


Reviewed 10 of 10 files at r3, 1 of 1 files at r4, 6 of 10 files at r5.
Review status: 18 of 22 files reviewed at latest revision, 6 unresolved discussions, some commit checks failed.


pkg/acceptance/decommission_test.go, line 93 at r5 (raw file):

			t.Fatal(err)
		}
		if _, err := db.ExecContext(ctx, "SET CLUSTER SETTING server.time_until_store_dead = '15s'"); err != nil {

I meant to remove this statement since having it here makes the test case I added degenerate to the case of a dead node, but it looks like I left it in. It's already been repeated where it belongs further down. Could you remove this one for me?


pkg/cli/flags.go, line 99 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Done.

You don't have to change it in this case, but as a general practice, using the actual variables rather than hard-coding the values would make this less likely to break if we ever change them.
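A sketch of that practice, with hypothetical string constants standing in for the real typed flag values:

```go
package cli

import "fmt"

// Hypothetical constants standing in for the wait-mode values; in the
// real code these are typed flag values.
const (
	nodeDecommissionWaitAll  = "all"
	nodeDecommissionWaitLive = "live"
	nodeDecommissionWaitNone = "none"
)

// invalidWaitValueError builds the message from the canonical values,
// so the error text cannot drift if the set of values ever changes.
func invalidWaitValueError(got string) error {
	return fmt.Errorf("invalid value %q (valid values: %s, %s, %s)",
		got, nodeDecommissionWaitAll, nodeDecommissionWaitLive, nodeDecommissionWaitNone)
}
```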


pkg/cli/node.go, line 273 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Good point, how's this?

Much better, thanks!


pkg/server/serverpb/admin.proto, line 346 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

The only path that calls this is quit --decommission.

Makes sense. I think it's alright because the default will have decommissioning set to false. Otherwise I'd worry about a stray empty RPC having such a large effect.


pkg/storage/node_liveness.go, line 184 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Returning an error when we fail for epoch increment is just asking for flaky tests. Or what are you suggesting?

But it's in a retry loop where we appear to handle failures due to losing the race with a concurrent update (via checking for errChangeDecommissioningFailed).

To be more specific, I'm referring to changing the log.Errorf(ctx, "unable to mark node as decommissioning: %s", err) line to be return errors.Wrap(err, "unable to mark node as decommissioning")? Are you saying that would still be too flaky?

I think we're coming at this from slightly different perspectives, judging by this comment and the one in admin.proto about how DecommissionRequest should behave if no nodeID is specified. I'm thinking about the decommissioning API as if people may use it directly without going through our CLI. While our CLI is designed to work with the Decommission API call returning success even if it wasn't able to start the decommissioning process (because of its internal retry loop), clients written by outsiders may assume a successful response means they don't have to send any further Decommission calls, just DecommissionStatus calls.


Comments from Reviewable

@tbg
Member Author

tbg commented Aug 3, 2017

Still looking at the node_liveness changes. What seems clear so far is that while that change was wrong, it also got the test passing. Now it hangs. Digging.


Review status: 18 of 22 files reviewed at latest revision, 6 unresolved discussions, some commit checks failed.


pkg/acceptance/decommission_test.go, line 93 at r5 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

I meant to remove this statement since having it here makes the test case I added degenerate to the case of a dead node, but it looks like I left it in. It's already been repeated where it belongs further down. Could you remove this one for me?

Done.


pkg/storage/node_liveness.go, line 184 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

But it's in a retry loop where we appear to handle failures due to losing the race with a concurrent update (via checking for errChangeDecommissioningFailed).

To be more specific, I'm referring to changing the log.Errorf(ctx, "unable to mark node as decommissioning: %s", err) line to be return errors.Wrap(err, "unable to mark node as decommissioning")? Are you saying that would still be too flaky?

I think we're coming at this from slightly different perspectives, judging by this comment and the one in admin.proto about how DecommissionRequest should behave if no nodeID is specified. I'm thinking about the decommissioning API as if people may use it directly without going through our CLI. While our CLI is designed to work with the Decommission API call returning success even if it wasn't able to start the decommissioning process (because of its internal retry loop), clients written by outsiders may assume a successful response means they don't have to send any further Decommission calls, just DecommissionStatus calls.

Ah, sorry, I didn't know you were referring to the error below; your suggestion sounds reasonable. Let's see what happens with tests (the ErrNoLivenessRecord will happen in tests, but ignoring it silently doesn't seem right and I can't retry it after this change).

There will likely be flakes with the new patch, but I'll get to them. Shouldn't be hard.


Comments from Reviewable

@tbg tbg force-pushed the decommission-cli branch 2 times, most recently from 47e4dab to 13c3f3d on August 4, 2017 16:40
@tbg
Member Author

tbg commented Aug 4, 2017

Ok, so after removing the weirdness I added in the node liveness code and patching up the acceptance test a bit more, it looks like it's passing reliably, at least locally. Let's see what CI has to say. Code-wise, I'd say this is good for one last look but hopefully it's ready now.

@a-robinson
Contributor

Reviewed 1 of 10 files at r5, 11 of 12 files at r6, 1 of 1 files at r8.
Review status: all files reviewed at latest revision, 5 unresolved discussions, all commit checks successful.


pkg/storage/node_liveness.go, line 287 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Oops, I forgot this was in the diff. I think I added this because nodes that were marked as decommissioning would clear their decommissioning status on the next heartbeat, but of course you're right that this is completely wrong. I'll take another look.

Forget to address this? Or are you planning on leaving the FIXMEs in?


pkg/storage/node_liveness.go, line 188 at r6 (raw file):

) error {
	ctx = nl.ambientCtx.AnnotateCtx(ctx)
	liveness, err := nl.GetLiveness(nodeID)

Does this really work right without re-grabbing the latest liveness in each loop iteration? It seems like if this fails once due to a heartbeat or epoch increment that it would never succeed on later retries.
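A standalone sketch of the re-fetch-inside-the-loop shape this comment is pointing at, with hypothetical helpers in place of the real lookup and compare-and-swap machinery:

```go
package storage

import (
	"errors"
	"fmt"
)

// livenessRecord is a toy stand-in for the real liveness proto.
type livenessRecord struct {
	NodeID          int32
	Epoch           int64
	Decommissioning bool
}

var errChangeDecommissioningFailed = errors.New("lost race with concurrent liveness update")

// getLiveness and updateLiveness are hypothetical stand-ins for the
// real lookup and CAS helpers.
func getLiveness(nodeID int32) (livenessRecord, error) { return livenessRecord{NodeID: nodeID}, nil }
func updateLiveness(old, updated livenessRecord) error { return nil }

// setDecommissioning re-fetches the liveness record on every attempt,
// so a concurrent heartbeat or epoch increment can't make the CAS fail
// forever against a stale copy grabbed before the loop.
func setDecommissioning(nodeID int32, target bool) error {
	for {
		old, err := getLiveness(nodeID) // inside the loop, not before it
		if err != nil {
			return err
		}
		updated := old
		updated.Decommissioning = target
		if err := updateLiveness(old, updated); err != nil {
			if errors.Is(err, errChangeDecommissioningFailed) {
				continue // raced with another update; retry with a fresh record
			}
			return fmt.Errorf("unable to mark node as decommissioning: %w", err)
		}
		return nil
	}
}
```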


Comments from Reviewable

tbg and others added 3 commits August 4, 2017 17:53
Ignore the first commit -- that's cockroachdb#16968.

See cockroachdb#6198. This implements the "management portion" of node decommissioning.
It leans heavily on @neeral's WIP PR cockroachdb#17157.

- add `node decommission [--wait=all|live|none] [nodeID1 nodeID2 ...]`
- add `cockroach quit --decommission`
- add a comprehensive acceptance test that puts all of these to good use.

It works surprisingly well but, as you'd expect, there are kinks. Specifically,
in the acceptance test, the invocation `quit --decommission` tends to hang for
extended periods of time, sometimes "forever". In the most recent run, this was
traced to the fact that the lease holder for a replica remaining on
a decommissioning node had *no* ZoneConfig in gossip, which effectively
disables its leaseholder replication checks. It is not clear whether this is
related to decommissioning in the first place, though the leaseholder node was
itself decommissioned, recommissioned, and restarted when this occurred.

The acceptance test also requires at least four nodes to work, and it takes
around 10 minutes, so we may want to run only a reduced version during
regular CI, with the long one running nightly.

The invocation for the failing acceptance test is:

```
make acceptance TESTS=Decom TESTFLAGS='-v -show-logs -nodes=4' TESTTIMEOUT=20m
```

(if the test runs and fails with the localcluster shim complaining about
unexpected events, that's because I haven't yet had a chance to tell it
about the node we're intentionally `--quit`ting, or rather, to test what
I did to tell it about that)

cc @a-robinson

Closes cockroachdb#17157.
@tbg
Member Author

tbg commented Aug 4, 2017

Review status: all files reviewed at latest revision, 5 unresolved discussions, all commit checks successful.


pkg/storage/node_liveness.go, line 287 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Forget to address this? Or are you planning on leaving the FIXMEs in?

I managed to forget to push the removal of those two. Gone now, though.


pkg/storage/node_liveness.go, line 188 at r6 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Does this really work right without re-grabbing the latest liveness in each loop iteration? It seems like if this fails once due to a heartbeat or epoch increment that it would never succeed on later retries.

Yikes, good catch!


Comments from Reviewable

@tbg tbg force-pushed the decommission-cli branch from 13c3f3d to 5a6be9b on August 4, 2017 21:53
@tbg tbg requested review from a team August 4, 2017 21:53
@tbg tbg merged commit 438e699 into cockroachdb:master Aug 5, 2017
@tbg tbg deleted the decommission-cli branch August 5, 2017 02:47