storage: very slow restart of local cluster #26391
Comments
Trace from a 2.5 minute
Actually, it looks like it's different. I've rebased on top of HEAD (5d2feac) and am still seeing the same behavior. The trace suggests that what's happening is that the SQL plan for the query is hitting each range serially with a
I'm still seeing the problem even after reverting your 4 recent unquiesced replica changes. Do you have any thoughts on what to look into? The initial election for each raft group appears to hit a stalemate where each replica votes for itself: And then I guess it takes until This would mostly be a non-issue if batches were sent to more/all ranges in parallel (which I'd at least expect for
Wait, never mind. After reverting all of #24920 and restarting a couple times, the queries now succeed quickly after the restart. I'm not sure why my first couple attempts after #24920 were still taking over a minute, but it may be because I wasn't waiting as long as I had been before between stopping and restarting the cluster. Are you up for taking over? The repro steps are: create a 3 node local cluster, create a simple table with 100 ranges in it as shown in the initial comment, kill the cluster, wait until the last node has finished "gracefully" shutting down, then restart it and run
And yet they're all in the follower state, not candidate. I think what's happening is that the range is being awakened for an incoming MsgVote, but before it processes that message, it initiates its own campaign (committing its vote for this term and sending out its own MsgVotes), then it has to reject the incoming vote request because it's already voted for itself.
I've written this up as a roachtest in this commit. It only fails about half the time for me (in local mode on my laptop), making it annoying to figure out whether my change has made a difference.
I think I'm seeing the same thing restarting a (remote) three node cluster with 100k ranges: The kv load I'm running hovers at 200qps and is slowly climbing, but it used to do over 1000qps before I restarted the cluster. It eventually works itself out, with "other" messages dropping to basically zero:
When the incoming message is a MsgVote (which is likely the case for the first message received by a quiesced follower), immediate campaigning will cause the election to fail. This is similar to reverting commit 44e3977, but only disables campaigning in one location. Fixes cockroachdb#26391 Release note: None
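The idea in the commit message can be sketched roughly as follows. This is an illustrative model only, not the actual patch: `wakeReason`, `unquiesce`, and `receiveVote` are hypothetical names, and the assumption is simply that a replica woken by an incoming raft message skips the immediate campaign while other wake-up paths still campaign.

```go
package main

import "fmt"

// wakeReason is a hypothetical tag for why a quiesced replica wakes up.
type wakeReason int

const (
	wakeForLocalProposal wakeReason = iota
	wakeForRaftMessage
)

// replica is a hypothetical, heavily simplified raft follower.
type replica struct {
	id       int
	term     int
	votedFor int // 0 means no vote cast in the current term
	quiesced bool
}

// campaign bumps the term and votes for self.
func (r *replica) campaign() {
	r.term++
	r.votedFor = r.id
}

// grantVote applies the one-vote-per-term rule.
func (r *replica) grantVote(from, term int) bool {
	if term < r.term {
		return false
	}
	if term > r.term {
		r.term, r.votedFor = term, 0
	}
	if r.votedFor == 0 || r.votedFor == from {
		r.votedFor = from
		return true
	}
	return false
}

// unquiesce wakes the replica. It campaigns right away only when the
// wake-up was NOT caused by an incoming raft message, so that a
// pending MsgVote can still be granted.
func (r *replica) unquiesce(reason wakeReason) {
	if !r.quiesced {
		return
	}
	r.quiesced = false
	if reason != wakeForRaftMessage {
		r.campaign()
	}
}

// receiveVote models a MsgVote arriving at a possibly quiesced follower.
func (r *replica) receiveVote(from, term int) bool {
	r.unquiesce(wakeForRaftMessage)
	return r.grantVote(from, term)
}

func main() {
	follower := &replica{id: 2, term: 5, quiesced: true}
	fmt.Println("vote granted:", follower.receiveVote(1, 6))

	proposer := &replica{id: 3, term: 5, quiesced: true}
	proposer.unquiesce(wakeForLocalProposal)
	fmt.Println("proposer term after wake:", proposer.term)
}
```

With the guard in place, the follower in this model grants the incoming vote instead of rejecting it, while a replica that wakes for its own reasons still starts an election.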
26441: distsql: add NewFinalIterator to the rowIterator interface r=asubiotto a=asubiotto

Some implementations of the rowIterator interface would destroy rows as they were iterated over to free memory eagerly. NewFinalIterator is introduced in this change to provide non-reusable behavior and NewIterator is explicitly described as reusable. A reusable iterator has been added to the memRowContainer to satisfy these new interface semantics.

Release note: None

26463: storage: Disable campaign-on-wake when receiving raft messages r=bdarnell a=bdarnell

When the incoming message is a MsgVote (which is likely the case for the first message received by a quiesced follower), immediate campaigning will cause the election to fail. This is similar to reverting commit 44e3977, but only disables campaigning in one location.

Fixes #26391

Release note: None

26469: lint: Fix a file descriptor leak r=bdarnell a=bdarnell

This leak is enough to cause `make lintshort` to fail when run under the default file descriptor limit on macOS (256).

Release note: None

26470: build: Pin go.uuid to the version currently in use r=bdarnell a=bdarnell

Updates #26332

Release note: None

Co-authored-by: Alfonso Subiotto Marqués
Co-authored-by: Ben Darnell
Tried this again for #26448 -- sometimes 1s, but a run just now took 2m16s... I don't think we can call this closed.
Hmm, very confusing. I wrote a bash script to repro this, but the last couple of runs were all pretty fast. (note: edited the script to insert a sleep before creating the table, to allow for upreplication, didn't seem to change anything).
What I was seeing here is likely along the lines of #30613 and we're fixing all of these bugs right now.
I created a 3-node local cluster using roachdemo on top of 98b1ceb with minor unrelated modifications to the SQL layer. I created a simple table with 100 ranges but no data in it (`create table foo (k int primary key, v int); alter table foo split at select generate_series(0, 10000, 100);`).

I stopped the cluster for about a minute then restarted it, and after restarting it any query that tried to touch the table took multiple minutes to complete (e.g. `SHOW TESTING_RANGES FOR TABLE foo;` or `SELECT * FROM foo;`). After the first one succeeds, additional such queries complete in well under a second.

The cluster is hardly using any CPU, so it's not spinning on a bunch of real work.
The most suspicious bit I've noticed so far is a lot of raft election traffic until the moment when things start working:
Also showing the slow startup is the very slow ramp-up of the replicas and leaseholders metrics:
Maybe relevant is an "Invalid Lease" for range 9, the eventlog range, although that seems unlikely, particularly since it remains "Invalid" even after the cluster has started responding to queries quickly.