Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

stability: Stuck ranges with lots of raft elections #26257

Closed
bdarnell opened this issue May 30, 2018 · 2 comments
Closed

stability: Stuck ranges with lots of raft elections #26257

bdarnell opened this issue May 30, 2018 · 2 comments
Assignees
Labels
A-kv-replication Relating to Raft, consensus, and coordination. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-robot Originated from a bot. S-1-stability Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting

Comments

@bdarnell
Copy link
Contributor

cyan is currently in a pretty bad state: It's not serving any SQL qps (it looks like the problem first started on May 23 with a build from the master branch, and has been continuing much of the time since then). Restarting all the nodes gets it working for about an hour, then it stops again. It doesn't appear to recover on its own.

While it's in the broken state, all raft messages except vote, prevote, their responses, and TimeoutNow drop to near zero, while vote-related traffic is much higher than normal. The number of quiesced ranges increases to about 95% of the total ranges (when things are working, it's more like 60%). The problem ranges page shows about a dozen ranges with either no raft leader or leader-not-leaseholder.

On these ranges, it appears that something is going wrong with quiescence. We see this cycle repeat itself:

  • Node 1 is the leader and leaseholder. Nothing appears to be happening on the range; no writes are getting proposed.
  • Nodes 2 and 3 log "not quiescing; not leader" periodically, indicating that they are unquiesced.
  • Either node 2 or node 3 times out and starts an election, indicating that they are not receiving heartbeats.
  • The election succeeds; node 2 (or 3) becomes leader. Then that node sees it is not the leaseholder and transfers leadership back to node 1
  • Node 1's election succeeds and the cycle repeats.

There is no evidence of network or raft transport problems (this was my first suspicion, blaming the recent grpc upgrade). Each node is able to win an election and is able to communicate bidirectionally with each of its peers.

Node 1 never logs any quiescing or unquiescing messages for this replica. An unquiesced replica will periodically log (at a high V level), so this suggests that node 1's replica is quiesced throughout this process, but somehow failed to send a quiesce heartbeat to the followers (or it failed to unquiesce when it should have, and is somehow sleepwalking through these elections).

#24956 (merged may 21) is the most likely culprit I can see. I don't have any theories as to why this is only affecting cyan.

bdarnell added a commit to bdarnell/cockroach that referenced this issue May 31, 2018
I suspect that cockroachdb#26257 is caused by the unquiescedReplicas map
introduced in cockroachdb#24956 getting out of sync with the per-replica
quiescent flag. Add debug pages to help us see if that's happening.

Release note: None
bdarnell added a commit to bdarnell/cockroach that referenced this issue May 31, 2018
I suspect that cockroachdb#26257 is caused by the unquiescedReplicas map
introduced in cockroachdb#24956 getting out of sync with the per-replica
quiescent flag. Add debug pages to help us see if that's happening.

Release note: None
craig bot pushed a commit that referenced this issue May 31, 2018
26269: server,ui: Add debugging for quiesced ranges r=bdarnell a=bdarnell

I suspect that #26257 is caused by the unquiescedReplicas map
introduced in #24956 getting out of sync with the per-replica
quiescent flag. Add debug pages to help us see if that's happening.

Release note: None

Co-authored-by: Ben Darnell <[email protected]>
@bdarnell
Copy link
Contributor Author

Omega has been hit by what appears to be the same issue (a day and a half after being upgraded to the alpha candidate).

I've got a hypothesis that the unquiesced map introduced in #24956 is getting out of sync with the per-replica quiescent flag. I've added some debugging for this on master (#26269) which I'll test on cyan. If this is the case, I'll probably revert #24956 for the alpha.

@bdarnell bdarnell added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-robot Originated from a bot. S-1-stability Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting A-kv-replication Relating to Raft, consensus, and coordination. labels May 31, 2018
@bdarnell bdarnell self-assigned this May 31, 2018
@bdarnell
Copy link
Contributor Author

Confirmed on cyan that we have leaders with replicas where the quiescent flag is true but they are not present in the unquiesced map. I'm going to revert #24956 and also cherry-pick the revert onto the alpha candidate.

bdarnell added a commit to bdarnell/cockroach that referenced this issue May 31, 2018
This reverts the main effect of cockroachdb#24956. The supporting infrastructure
is left in place to minimize merge conflicts and to aid in diagnosing
why the maps get out of sync.

Fixes cockroachdb#26257

Release note: None
craig bot pushed a commit that referenced this issue May 31, 2018
26291: storage: Tick all replicas, not just unquiesced ones r=bdarnell a=bdarnell

This reverts the main effect of #24956. The supporting infrastructure
is left in place to minimize merge conflicts and to aid in diagnosing
why the maps get out of sync.

Fixes #26257

Release note: None

Co-authored-by: Ben Darnell <[email protected]>
@craig craig bot closed this as completed in #26291 May 31, 2018
bdarnell added a commit to bdarnell/cockroach that referenced this issue May 31, 2018
There are two paths by which a replica can become unquiesced, and only
one of them updated the unquiesced map.

Updates cockroachdb#26257

This is probably the underlying issue there, although I'm still
looking for potential synchronization problems coming from the use of
two different locks.

Release note: None
craig bot pushed a commit that referenced this issue May 31, 2018
26265: sql: stop using a txn for getting the index backfill timestamp r=andreimatei a=andreimatei

Index backfills use AOST so that their reads are less intrusive.
Before this patch, the timestamp for them reads was chosen in a bizarre
way (unless I'm missing something) - we were creating a txn for the soul
purpose of getting its timestamp. That txn was used to save the job
details with the timestamp in them, but I don't think using a txn for
that is necessary.
This patch gets a timestamp directly from the node's clock.

Besides seeming bizarre, the old code had a bug - it was incorrectly
using the job.WithTxn() API. The code was:

       if err := sc.db.Txn(ctx, func(ctx context.Context, txn *client.Txn) error {
               details := *sc.job.WithTxn(txn).Payload().Details.(*jobs.Payload_SchemaChange).SchemaChange
               if details.ReadAsOf == (hlc.Timestamp{}) {
                       details.ReadAsOf = txn.CommitTimestamp()
                       if err := sc.job.WithTxn(txn).SetDetails(ctx, details); err != nil {
                               return errors.Wrapf(err, "failed to store readAsOf on job %d", *sc.job.ID())
                       }
               }
               sc.readAsOf = details.ReadAsOf
               return nil
       }); err != nil {
               return err
       }

The first WithTxn() call is just bad - it mutates the job and persists
the txn in it, for no reason. If the second WithTxn() call runs, then
the effects of the first one will be negated - the SetDetails() call
will indirectly wipe the txn. However, if the 2nd call didn't run (which
happens, for example, when the txn gets restarted), then the job ends up
holding on to a committed transaction, which will cause it to fail when
it tries to do some further operations.

Fixes #25680

Release note(bug fix): Fixed a bug causing index creation to fail under
rare circustances.

26297: storage: Fix a discrepancy between quiescent flag and unquiesced map r=petermattis a=bdarnell

There are two paths by which a replica can become unquiesced, and only
one of them updated the unquiesced map.

Updates #26257

This is probably the underlying issue there, although I'm still
looking for potential synchronization problems coming from the use of
two different locks.

Release note: None

Co-authored-by: Andrei Matei <[email protected]>
Co-authored-by: Ben Darnell <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-kv-replication Relating to Raft, consensus, and coordination. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-robot Originated from a bot. S-1-stability Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting
Projects
None yet
Development

No branches or pull requests

1 participant