
fix indefinite retrying for cockroach quit when quorum is lost #14620 #14708

Merged
1 commit merged on Apr 18, 2017

Conversation

xphoniex
Contributor

@xphoniex xphoniex commented Apr 7, 2017

The error log is still misleading; should we call it a graceful shutdown when a hard shutdown has been initiated?

timed out, trying again
ok
initiating graceful shutdown of server
server drained and shutdown completed

@cockroach-teamcity
Member

This change is Reviewable

@a-robinson
Contributor

Thanks for the contribution, @xphoniex! Assigning @asubiotto for review since he's been looking into this as well, I believe.

@a-robinson a-robinson requested a review from asubiotto April 7, 2017 16:56
@tamird
Contributor

tamird commented Apr 7, 2017

This doesn't look right to me - we want the node to stop, not for the client to give up.

@xphoniex
Contributor Author

xphoniex commented Apr 7, 2017

@tamird, it doesn't give up; it proceeds to initiate a hard shutdown once the one-minute window has passed. I checked on my machine and all processes had shut down.

@tamird
Contributor

tamird commented Apr 7, 2017

Ah, my mistake. Looks like you have two commits here where there should be one?

Contributor

@asubiotto asubiotto left a comment

Thanks a lot for the contribution! You can squash the two commits into one using git rebase -i HEAD~2 as outlined here.
Additionally, you can automatically close the issue this refers to by putting Fixes #14620 on its own line in the commit message (run git commit --amend).
It would also be nice to provide a little more information about what is being done here in the commit message (e.g. "client initiates a hard shutdown on the server after a minute...").

pkg/cli/start.go Outdated
fmt.Fprintf(
os.Stdout, "graceful shutdown failed: %s\nproceeding with hard shutdown\n", err,
)
ec := make(chan error, 1)
Contributor

By convention, we name these errChan.

pkg/cli/start.go Outdated
}()
select {
case err := <-ec:
if err != nil {
Contributor

It'd be nice to avoid else statements as follows:

if err != nil {
    if _, ok := err.(errTryHardShutdown); ok {
        fmt.Printf("graceful shutdown failed: %s; proceeding with hard shutdown\n", err)
        break
    }
    return err
}
return nil

pkg/cli/start.go Outdated
case err := <-ec:
if err != nil {
if _, ok := err.(errTryHardShutdown); ok {
fmt.Fprintf(
Contributor

fmt.Printf can be used here instead of Fprintf and I would rather use a semicolon than a newline between "graceful shutdown..." and "proceeding..." (see above)

pkg/cli/start.go Outdated
}
} else {
return nil
case <-time.After(time.Second * 60):
Contributor

Use time.Minute.

pkg/cli/start.go Outdated
} else {
return nil
case <-time.After(time.Second * 60):
fmt.Fprintf(
Contributor

Use Println and put the output on the same line as the call. The output should also be "timed out; proceeding with hard shutdown", since we aren't draining gracefully.
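
Taken together, these suggestions (the errChan name, time.Minute, Println, and the semicolon-separated message) give the select roughly the following shape. This is only a sketch: quitWithTimeout, drainServer, and hardShutdown are hypothetical stand-ins, and the errTryHardShutdown definition is assumed from the snippets above rather than copied from pkg/cli/start.go.

package cli

import (
	"fmt"
	"time"
)

// errTryHardShutdown mirrors the error type referenced in the review snippets;
// its definition here is an assumption for illustration only.
type errTryHardShutdown struct{ error }

// quitWithTimeout sketches the flow under review: attempt a graceful drain,
// and fall back to a hard shutdown on errTryHardShutdown or after one minute.
func quitWithTimeout(drainServer, hardShutdown func() error) error {
	errChan := make(chan error, 1) // named errChan per the review convention
	go func() {
		errChan <- drainServer()
	}()
	select {
	case err := <-errChan:
		if err != nil {
			if _, ok := err.(errTryHardShutdown); ok {
				fmt.Printf("graceful shutdown failed: %s; proceeding with hard shutdown\n", err)
				break
			}
			return err
		}
		return nil
	case <-time.After(time.Minute): // time.Minute rather than time.Second * 60
		fmt.Println("timed out; proceeding with hard shutdown")
	}
	return hardShutdown()
}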

@asubiotto
Contributor

asubiotto commented Apr 10, 2017

One thing I forgot to mention: initiating graceful shutdown of server etc. is printed because the stopper.ShouldStop() case is being hit here and we keep going on to print the messages on the server. This case simply waits for a signal that the stopper is stopped externally, and I therefore think that we should simply wait on <-stopper.IsStopped() if we hit that case and then return.
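
A rough sketch of that suggestion in the server's shutdown select (the other cases are elided, and the comment reflects the reasoning above rather than the exact code that ended up in this PR):

select {
// ... signal and error cases elided ...
case <-stopper.ShouldStop():
	// The stopper was stopped externally (e.g. through the quit endpoint),
	// so draining is already being handled elsewhere. Wait until it has
	// fully stopped and then return, instead of falling through to the
	// graceful-shutdown log messages, which would be misleading here.
	<-stopper.IsStopped()
	return nil
}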

@xphoniex xphoniex force-pushed the master branch 2 times, most recently from 3334f96 to 5a35680 on April 11, 2017 15:05
@xphoniex
Contributor Author

@asubiotto it should be okay now.

@xphoniex
Contributor Author

Also, there's no guideline for writing the doc. Is this something I should do? :)

@tamird
Contributor

tamird commented Apr 11, 2017

@xphoniex there's something not right with these commits.

@xphoniex
Contributor Author

xphoniex commented Apr 11, 2017

@tamird can you please check again :)

@asubiotto
Contributor

asubiotto commented Apr 11, 2017

The changes look good. The TeamCity build is failing because of formatting issues in pkg/cli/start.go. You can run gofmt -w pkg/cli/start.go to fix this.

Note that "Fixes #14620" only works if it's on its own line. While you're modifying the commit message it would be good to fix the spelling of quorum and prepend the first line with the package name to know which package is affected by the commit as follows:

cli: fix indefinite retrying for `cockroach quit` when quorum is lost

Fixes #14620

add a timeout of 1 minute inside runQuit, after which a hard shutdown is initiated

Don't worry about the docs-todo label.

Contributor

@asubiotto asubiotto left a comment

LGTM

@asubiotto
Contributor

asubiotto commented Apr 11, 2017

Thanks for this PR @xphoniex. Could you add the wait on <-stopper.IsStopped() in this case?

@andreimatei
Contributor

@xphoniex You'll have to resolve conflicts with #14775. Sorry about that.

I think you shouldn't mark #14620 as resolved; we should keep it open and make a node be able to drain satisfactorily even when there's no quorum. Perhaps you can update the title of the issue to that when this PR merges.
I think in the future we'll probably want cockroach quit to be less liberal about brutally killing the process in the face of drain failures or timeouts. Depending on a flag, people should probably be able to rely on it to do proper draining, as draining becomes a tool for different kinds of machine or datacenter migrations.


Review status: 0 of 1 files reviewed at latest revision, 5 unresolved discussions, all commit checks successful.



@bdarnell
Contributor

I think in the future we'll probably want cockroach quit to be less liberal about brutally killing the process in the face of drain failures or timeouts. Depending on a flag, people should probably be able to rely on it to do proper draining, as draining becomes a tool for different kinds of machine or datacenter migrations.

I disagree. A node that fails to die when you ask it to is more disruptive to a rollout process than one that fails to release all its leases (remember that graceful shutdown can never be 100% guaranteed because processes can always die for unrelated reasons). We should take care that this only happens in exceptional situations, but when those situations arise it's better to die abruptly than wait.

@andreimatei
Contributor

Regardless of what we want cockroach quit to do, I think that one of the draining phases hanging indefinitely because the node was unable to write an updated liveness record is funky. Do we agree on this?

About the cli quit command, what I had in mind was the kind of draining that moves ranges away and waits for the up-replication to be done.

@bdarnell
Contributor

Regardless of what we want cockroach quit to do, I think that one of the draining phases hanging indefinitely because the node was unable to write an updated liveness record is funky. Do we agree on this?

Yes. But if it's happening because this is the last node being shut down and the cluster has lost quorum, there's nothing we can do but put a timeout on that liveness write. That's what I mean by it being better to die abruptly than wait indefinitely.

About the cli quit command, what I had in mind was the kind of draining that moves ranges away and waits for the up-replication to be done.

We should have some way to do that, but not the default cockroach quit. The default assumption should be that the node is being taken down for something like a software update or reboot and will be coming back up with its data. In this case we want to relinquish all leases but keep the ranges so we can reuse them when we come back up.
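
To make the earlier point about putting a timeout on the liveness write concrete, here is a minimal, generic sketch (not code from this PR; the write callback is a hypothetical stand-in for whatever update can block when quorum is lost):

package cli

import (
	"context"
	"time"
)

// writeWithTimeout bounds a potentially-blocking write with a deadline. If the
// write cannot make progress (e.g. quorum is lost), the context expires and the
// caller can fall back to a hard shutdown instead of retrying indefinitely.
func writeWithTimeout(
	ctx context.Context, timeout time.Duration, write func(context.Context) error,
) error {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()
	return write(ctx)
}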

@xphoniex xphoniex changed the title from "fix indefinite retrying for cockroach quit when quorom is lost #14620" to "fix indefinite retrying for cockroach quit when quorum is lost #14620" on Apr 12, 2017
@xphoniex
Contributor Author

xphoniex commented Apr 12, 2017

@andreimatei the conflict is where checkNodeRunning is returning an error, right? I want to be sure I'm not missing anything else.

@andreimatei
Contributor

I just meant that there's going to be a merge conflict; I wasn't trying to suggest something in particular to pay attention to.

I'm not sure what your second question is referring to :)


Review status: 0 of 1 files reviewed at latest revision, 5 unresolved discussions, all commit checks successful.



@xphoniex
Contributor Author

@andreimatei I just rebased my fork and it didn't give me a merge conflict :/

@xphoniex xphoniex force-pushed the master branch 5 times, most recently from ec963be to 93cc469 on April 13, 2017 21:50
@andreimatei
Contributor

Review status: 0 of 1 files reviewed at latest revision, 7 unresolved discussions, all commit checks successful.


pkg/cli/start.go, line 516 at r3 (raw file):

		return err
	case <-stopper.ShouldStop():
		<-stopper.IsStopped()

I'm confused about this new code. So before, we were initiating draining on stopper.ShouldStop(). Now, on ShouldStop, we seemingly block until IsStopped without doing the draining, and then we return directly, but we print a message about draining being done. What's the intention / what's going on?


pkg/cli/start.go, line 519 at r3 (raw file):

		const msgDone = "server drained and shutdown completed"
		log.Infof(shutdownCtx, msgDone)
		fmt.Fprintln(os.Stdout, msgDone)

@knz how come no acceptance tests failed because of this unexpected new message? Is it surprising?



@knz
Contributor

knz commented Apr 14, 2017

Reviewed 1 of 1 files at r3.
Review status: all files reviewed at latest revision, 7 unresolved discussions, all commit checks successful.


pkg/cli/start.go, line 516 at r3 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I'm confused about this new code. So before, we were initiating draining on stopper.ShouldStop(). Now, on ShouldStop, we seemingly block until IsStopped without doing the draining, and then we return directly, but we print a message about draining being done. What's the intention / what's going on?

Andrei, I think this just means that once IsStopped()'s channel unblocks, the server has finished draining.
But yes, I also find it difficult to understand locally; it probably deserves a comment.


pkg/cli/start.go, line 519 at r3 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

@knz how come no acceptance tests failed because of this unexpected new message? Is it surprising?

Why would they fail? The only tests that check this are in cli/interactive_tests and they only require specific substrings to be present and in a specific order. The substrings ("initiating graceful shutdown", "shutdown completed") still occur in the right order, so nothing changes from these tests' perspective.



@asubiotto
Contributor

Review status: all files reviewed at latest revision, 7 unresolved discussions, all commit checks successful.


pkg/cli/start.go, line 516 at r3 (raw file):

Previously, knz (kena) wrote…

Andrei I think this just means that once IsStopped()'s channel unblocks, that means the server has finished draining.
But yes I also find it difficult to understand locally, it probably deserves a comment.

This ShouldStop check is for any external requests to stop (for example the quit endpoint, which hits Drain in pkg/server/admin.go). We previously would print initiating graceful shutdown of server and then wait for the stopper to be stopped, after which we would print server drained and shutdown completed even though we potentially got a request to perform a hard shutdown through the quit endpoint. I suggested adding this wait to avoid printing confusing log messages. I would remove outputting msgDone here, @xphoniex, and not print anything out (I think the responsibility for that should lie with the quit endpoint in this case), but add a comment as to why this wait is happening. Something to work on is sharing more logic (and log messages) between these two endpoints, but for now I think it's better not to output misleading log messages.



@xphoniex
Contributor Author

@andreimatei We were not initiating draining on ShouldStop() before. It was there for the Server to escape the select and log relevant messages, which were confusing in this case; see posts #1 & #8 here.

By the time we hit the ShouldStop() case, we've already drained the server. (@asubiotto please correct me if I'm wrong.)

Fixes cockroachdb#14620

added a timeout of 1 minute inside runQuit, after which a hard shutdown is initiated