fix indefinite retrying for cockroach quit
when quorum is lost #14620
#14708
Thanks for the contribution, @xphoniex! Assigning @asubiotto for review since he's been looking into this as well, I believe.
This doesn't look right to me - we want the node to stop, not for the client to give up.
@tamird, it doesn't give up; it proceeds to initiate a hard shutdown once the one-minute window has passed. I checked on my machine and all processes had been shut down.
Ah, my mistake. Looks like you have two commits here where there should be one?
Thanks a lot for the contribution! You can squash the two commits into one using `git rebase -i HEAD~2` as outlined here.
Additionally, you can automatically close the issue this refers to by putting `Fixes #14620` on its own line in the commit message (run `git commit --amend`).
It would also be nice to provide a little more information about what is being done here in the commit message (e.g. "client initiates a hard shutdown on the server after a minute...").
pkg/cli/start.go
Outdated
fmt.Fprintf(
	os.Stdout, "graceful shutdown failed: %s\nproceeding with hard shutdown\n", err,
)
ec := make(chan error, 1)
By convention, we name these `errChan`.
pkg/cli/start.go
Outdated
}()
select {
case err := <-ec:
	if err != nil {
It'd be nice to avoid else statements as follows:
	if err != nil {
		if _, ok := err.(errTryHardShutdown); ok {
			fmt.Printf("graceful shutdown failed: %s; proceeding with hard shutdown\n", err)
			break
		}
		return err
	}
	return nil
pkg/cli/start.go
Outdated
case err := <-ec:
	if err != nil {
		if _, ok := err.(errTryHardShutdown); ok {
			fmt.Fprintf(
`fmt.Printf` can be used here instead of `Fprintf`, and I would rather use a semicolon than a newline between "graceful shutdown..." and "proceeding..." (see above).
pkg/cli/start.go
Outdated
	}
} else {
	return nil
case <-time.After(time.Second * 60):
Use `time.Minute`.
pkg/cli/start.go
Outdated
} else {
	return nil
case <-time.After(time.Second * 60):
	fmt.Fprintf(
Use `Println` and put the output on the same line as the call. The output should also be `timed out; proceeding with hard shutdown`, since we aren't draining gracefully.
One thing I forgot to mention: |
3334f96 to 5a35680
@asubiotto it should be okay now.
Also, there's no guideline for writing the doc. Is this something I should do? :)
@xphoniex there's something not right with these commits.
@tamird can you please check again :)
The changes look good. The TeamCity build is failing because of formatting issues in Note that "Fixes #14620" only works if it's on its own line. While you're modifying the commit message it would be good to fix the spelling of quorum and prepend the first line with the package name to know which package is affected by the commit as follows:
Don't worry about the |
LGTM
@xphoniex You'll have to resolve conflicts with #14775. Sorry about that. I think you shouldn't mark #14620 as resolved; we should keep it open and make a node be able to drain satisfactorily even when there's no quorum. Perhaps you can update the title of the issue to that when this PR merges.
Review status: 0 of 1 files reviewed at latest revision, 5 unresolved discussions, all commit checks successful.
Comments from Reviewable
I disagree. A node that fails to die when you ask it to is more disruptive to a rollout process than one that fails to release all its leases (remember that graceful shutdown can never be 100% guaranteed because processes can always die for unrelated reasons). We should take care that this only happens in exceptional situations, but when those situations arise it's better to die abruptly than wait.
Regardless of what we want About the cli quit command, what I had in mind was the kind of draining that moves ranges away and waits for the up-replication to be done. |
Yes. But if it's happening because this is the last node being shut down and the cluster has lost quorum, there's nothing we can do but put a timeout on that liveness write. That's what I mean by it being better to die abruptly than wait indefinitely.
We should have some way to do that, but not the default |
cockroach quit when quorom is lost #14620 to cockroach quit when quorum is lost #14620
@andreimatei the conflict is when checkNodeRunning is returning error, right? want to be sure I'm not missing anything else
I just meant that there's going to be a merge conflict; I wasn't trying to suggest something in particular to pay attention to. I'm not sure what your second question is referring to :)
Review status: 0 of 1 files reviewed at latest revision, 5 unresolved discussions, all commit checks successful.
@andreimatei I just rebased my fork and it didn't give me a merge conflict :/
ec963be to 93cc469
Review status: 0 of 1 files reviewed at latest revision, 7 unresolved discussions, all commit checks successful.
pkg/cli/start.go, line 516 at r3 (raw file):
I'm confused about this new code. So before we were initiating draining on
pkg/cli/start.go, line 519 at r3 (raw file):
@knz how come no acceptance tests failed because of this unexpected new message? Is it surprising?
Reviewed 1 of 1 files at r3.
pkg/cli/start.go, line 516 at r3 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
Andrei I think this just means that once IsStopped()'s channel unblocks, that means the server has finished draining.
pkg/cli/start.go, line 519 at r3 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
Why would they fail? The only tests that check this are in
Review status: all files reviewed at latest revision, 7 unresolved discussions, all commit checks successful.
pkg/cli/start.go, line 516 at r3 (raw file):
Previously, knz (kena) wrote…
This
@andreimatei We were not initiating draining on By the time we hit the |
Fixes cockroachdb#14620
Added a timeout of 1 minute inside runQuit, after which a hard shutdown is initiated.
The error log is still misleading: should we call it a graceful shutdown when a hard shutdown has been initiated?