
fix indefinite retrying for cockroach quit when quorum is lost #14620 #14708

Merged
1 commit merged on Apr 18, 2017

Conversation

xphoniex
Contributor

@xphoniex xphoniex commented Apr 7, 2017

The error log is still misleading; should we call it a graceful shutdown when a hard shutdown has been initiated?

timed out, trying again
ok
initiating graceful shutdown of server
server drained and shutdown completed

@cockroach-teamcity
Member

This change is Reviewable

@a-robinson
Contributor

Thanks for the contribution, @xphoniex! Assigning @asubiotto for review since he's been looking into this as well, I believe.

@a-robinson a-robinson requested a review from asubiotto April 7, 2017 16:56
@tamird
Contributor

tamird commented Apr 7, 2017

This doesn't look right to me - we want the node to stop, not for the client to give up.

@xphoniex
Contributor Author

xphoniex commented Apr 7, 2017

@tamird, it doesn't give up; it proceeds to initiate a hard shutdown once the one-minute window has passed. I checked on my machine and all processes had shut down.

@tamird
Contributor

tamird commented Apr 7, 2017

Ah, my mistake. Looks like you have two commits here where there should be one?

Contributor

@asubiotto asubiotto left a comment

Thanks a lot for the contribution! You can squash the two commits into one using git rebase -i HEAD~2 as outlined here.
Additionally, you can automatically close the issue this refers to by putting Fixes #14620 on its own line in the commit message (run git commit --amend).
It would also be nice to provide a little more information about what is being done here in the commit message (e.g. "client initiates a hard shutdown on the server after a minute...").

pkg/cli/start.go Outdated
fmt.Fprintf(
os.Stdout, "graceful shutdown failed: %s\nproceeding with hard shutdown\n", err,
)
ec := make(chan error, 1)
Contributor

By convention, we name these errChan.

pkg/cli/start.go Outdated
}()
select {
case err := <-ec:
if err != nil {
Contributor

It'd be nice to avoid else statements as follows:

if err != nil {
    if _, ok := err.(errTryHardShutdown); ok {
        fmt.Printf("graceful shutdown failed: %s; proceeding with hard shutdown\n", err)
        break
    }
    return err
}
return nil

pkg/cli/start.go Outdated
case err := <-ec:
if err != nil {
if _, ok := err.(errTryHardShutdown); ok {
fmt.Fprintf(
Contributor

fmt.Printf can be used here instead of Fprintf and I would rather use a semicolon than a newline between "graceful shutdown..." and "proceeding..." (see above)

pkg/cli/start.go Outdated
}
} else {
return nil
case <-time.After(time.Second * 60):
Contributor

Use time.Minute.

pkg/cli/start.go Outdated
} else {
return nil
case <-time.After(time.Second * 60):
fmt.Fprintf(
Contributor

Use Println and put the output on the same line as the call. The output should also be "timed out; proceeding with hard shutdown", since we aren't draining gracefully.
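
Taken together, these suggestions (the errChan name, time.Minute, Println, and the semicolon-separated message) give the select roughly the following shape. This is only a sketch: quitWithTimeout, drainServer, and hardShutdown are hypothetical stand-ins, and the errTryHardShutdown definition is assumed from the snippets above rather than copied from pkg/cli/start.go.

package cli

import (
	"fmt"
	"time"
)

// errTryHardShutdown mirrors the error type referenced in the review snippets;
// its definition here is an assumption for illustration only.
type errTryHardShutdown struct{ error }

// quitWithTimeout sketches the flow under review: attempt a graceful drain,
// and fall back to a hard shutdown on errTryHardShutdown or after one minute.
func quitWithTimeout(drainServer, hardShutdown func() error) error {
	errChan := make(chan error, 1) // named errChan per the review convention
	go func() {
		errChan <- drainServer()
	}()
	select {
	case err := <-errChan:
		if err != nil {
			if _, ok := err.(errTryHardShutdown); ok {
				fmt.Printf("graceful shutdown failed: %s; proceeding with hard shutdown\n", err)
				break
			}
			return err
		}
		return nil
	case <-time.After(time.Minute): // time.Minute rather than time.Second * 60
		fmt.Println("timed out; proceeding with hard shutdown")
	}
	return hardShutdown()
}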

@asubiotto
Contributor

asubiotto commented Apr 10, 2017

One thing I forgot to mention: initiating graceful shutdown of server etc. is printed because the stopper.ShouldStop() case is being hit here and we keep going on to print the messages on the server. This case simply waits for a signal that the stopper is stopped externally, and I therefore think that we should simply wait on <-stopper.IsStopped() if we hit that case and then return.
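
A rough sketch of that suggestion in the server's shutdown select (the other cases are elided, and the comment reflects the reasoning above rather than the exact code that ended up in this PR):

select {
// ... signal and error cases elided ...
case <-stopper.ShouldStop():
	// The stopper was stopped externally (e.g. through the quit endpoint),
	// so draining is already being handled elsewhere. Wait until it has
	// fully stopped and then return, instead of falling through to the
	// graceful-shutdown log messages, which would be misleading here.
	<-stopper.IsStopped()
	return nil
}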

@xphoniex xphoniex force-pushed the master branch 2 times, most recently from 3334f96 to 5a35680 on April 11, 2017 15:05
@xphoniex
Contributor Author

@asubiotto it should be okay now.

@xphoniex
Contributor Author

Also, there's no guideline for writing the doc. Is this something I should do? :)

@tamird
Contributor

tamird commented Apr 11, 2017

@xphoniex there's something not right with these commits.

@xphoniex
Contributor Author

xphoniex commented Apr 11, 2017

@tamird can you please check again :)

@asubiotto
Contributor

asubiotto commented Apr 11, 2017

The changes look good. The TeamCity build is failing because of formatting issues in pkg/cli/start.go. You can run gofmt -w pkg/cli/start.go to fix this.

Note that "Fixes #14620" only works if it's on its own line. While you're modifying the commit message it would be good to fix the spelling of quorum and prepend the first line with the package name to know which package is affected by the commit as follows:

cli: fix indefinite retrying for `cockroach quit` when quorum is lost

Fixes #14620

add a timeout of 1 minute inside runQuit, after which a hard shutdown is initiated

Don't worry about the docs-todo label.

Contributor

@asubiotto asubiotto left a comment

LGTM

@asubiotto
Contributor

asubiotto commented Apr 11, 2017

Thanks for this PR @xphoniex. Could you add the wait on <-stopper.IsStopped() in this case?

@andreimatei
Contributor

@xphoniex You'll have to resolve conflicts with #14775. Sorry about that.

I think you shouldn't mark #14620 as resolved; we should keep it open and make a node be able to drain satisfactorily even when there's no quorum. Perhaps you can update the title of the issue to that when this PR merges.
I think in the future we'll probably want cockroach quit to be less liberal about brutally killing the process in the face of drain failures or timeouts. Depending on a flag, people should probably be able to rely on it to do proper draining, as draining becomes a tool for different kinds of machine or datacenter migrations.


Review status: 0 of 1 files reviewed at latest revision, 5 unresolved discussions, all commit checks successful.



@bdarnell
Contributor

I think in the future we'll probably want cockroach quit to be less liberal about brutally killing the process in the face of drain failures or timeouts. Depending on a flag, people should probably be able to rely on it to do proper draining, as draining becomes a tool for different kinds of machine or datacenter migrations.

I disagree. A node that fails to die when you ask it to is more disruptive to a rollout process than one that fails to release all its leases (remember that graceful shutdown can never be 100% guaranteed because processes can always die for unrelated reasons). We should take care that this only happens in exceptional situations, but when those situations arise it's better to die abruptly than wait.

@andreimatei
Contributor

Regardless of what we want cockroach quit to do, I think that one of the draining phases hanging indefinitely because the node was unable to write an updated liveness record is funky. Do we agree on this?

About the cli quit command, what I had in mind was the kind of draining that moves ranges away and waits for the up-replication to be done.

@bdarnell
Contributor

Regardless of what we want cockroach quit to do, I think that one of the draining phases hanging indefinitely because the node was unable to write an updated liveness record is funky. Do we agree on this?

Yes. But if it's happening because this is the last node being shut down and the cluster has lost quorum, there's nothing we can do but put a timeout on that liveness write. That's what I mean by it being better to die abruptly than wait indefinitely.

About the cli quit command, what I had in mind was the kind of draining that moves ranges away and waits for the up-replication to be done.

We should have some way to do that, but not the default cockroach quit. The default assumption should be that the node is being taken down for something like a software update or reboot and will be coming back up with its data. In this case we want to relinquish all leases but keep the ranges so we can reuse them when we come back up.
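
To make the earlier point about putting a timeout on the liveness write concrete, here is a minimal, generic sketch (not code from this PR; the write callback is a hypothetical stand-in for whatever update can block when quorum is lost):

package cli

import (
	"context"
	"time"
)

// writeWithTimeout bounds a potentially-blocking write with a deadline. If the
// write cannot make progress (e.g. quorum is lost), the context expires and the
// caller can fall back to a hard shutdown instead of retrying indefinitely.
func writeWithTimeout(
	ctx context.Context, timeout time.Duration, write func(context.Context) error,
) error {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()
	return write(ctx)
}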

@xphoniex xphoniex changed the title from "fix indefinite retrying for cockroach quit when quorom is lost #14620" to "fix indefinite retrying for cockroach quit when quorum is lost #14620" on Apr 12, 2017
@xphoniex
Contributor Author

xphoniex commented Apr 12, 2017

@andreimatei the conflict is where checkNodeRunning is returning an error, right? I want to be sure I'm not missing anything else.

@andreimatei
Contributor

I just meant that there's going to be a merge conflict; I wasn't trying to suggest something in particular to pay attention to.

I'm not sure what your second question is referring to :)


Review status: 0 of 1 files reviewed at latest revision, 5 unresolved discussions, all commit checks successful.



@xphoniex
Contributor Author

@andreimatei I just rebased my fork and it didn't give me a merge conflict :/

@xphoniex xphoniex force-pushed the master branch 5 times, most recently from ec963be to 93cc469 on April 13, 2017 21:50
@andreimatei
Contributor

Review status: 0 of 1 files reviewed at latest revision, 7 unresolved discussions, all commit checks successful.


pkg/cli/start.go, line 516 at r3 (raw file):

		return err
	case <-stopper.ShouldStop():
		<-stopper.IsStopped()

I'm confused about this new code. So before, we were initiating draining on stopper.ShouldStop(). Now, on ShouldStop, we seemingly block until IsStopped without doing the draining, and then we return directly, but we print a message about draining being done. What's the intention / what's going on?


pkg/cli/start.go, line 519 at r3 (raw file):

		const msgDone = "server drained and shutdown completed"
		log.Infof(shutdownCtx, msgDone)
		fmt.Fprintln(os.Stdout, msgDone)

@knz how come no acceptance tests failed because of this unexpected new message? Is it surprising?



@knz
Contributor

knz commented Apr 14, 2017

Reviewed 1 of 1 files at r3.
Review status: all files reviewed at latest revision, 7 unresolved discussions, all commit checks successful.


pkg/cli/start.go, line 516 at r3 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I'm confused about this new code. So before, we were initiating draining on stopper.ShouldStop(). Now, on ShouldStop, we seemingly block until IsStopped without doing the draining, and then we return directly, but we print a message about draining being done. What's the intention / what's going on?

Andrei, I think this just means that once IsStopped()'s channel unblocks, the server has finished draining.
But yes, I also find it difficult to understand locally; it probably deserves a comment.


pkg/cli/start.go, line 519 at r3 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

@knz how come no acceptance tests failed because of this unexpected new message? Is it surprising?

Why would they fail? The only tests that check this are in cli/interactive_tests and they only require specific substrings to be present and in a specific order. The substrings ("initiating graceful shutdown", "shutdown completed") still occur in the right order, so nothing changes from these tests' perspective.



@asubiotto
Contributor

Review status: all files reviewed at latest revision, 7 unresolved discussions, all commit checks successful.


pkg/cli/start.go, line 516 at r3 (raw file):

Previously, knz (kena) wrote…

Andrei I think this just means that once IsStopped()'s channel unblocks, that means the server has finished draining.
But yes I also find it difficult to understand locally, it probably deserves a comment.

This ShouldStop check is for any external requests to stop (for example the quit endpoint, which hits Drain in pkg/server/admin.go). We previously would print initiating graceful shutdown of server and then wait for the stopper to be stopped, after which we would print server drained and shutdown completed even though we potentially got a request to perform a hard shutdown through the quit endpoint. I suggested adding this wait to avoid printing confusing log messages. I would remove outputting msgDone here, @xphoniex, and not print anything out (I think the responsibility for that should lie with the quit endpoint in this case), but add a comment as to why this wait is happening. Something to work on is sharing more logic (and log messages) between these two endpoints, but for now I think it's better not to output misleading log messages.



@xphoniex
Contributor Author

@andreimatei We were not initiating draining on ShouldStop() before. It was there for the Server to escape the select and log relevant messages, which were confusing in this case; see posts #1 & #8 here.

By the time we hit the ShouldStop() case, we've already drained the server. (@asubiotto please correct me if I'm wrong.)

Fixes cockroachdb#14620

added a timeout of 1 minute inside runQuit, after which a hard shutdown is initiated