acceptance: qps drops to 0 under block_writer load #8192
Just jumping in to point out that …
While we were testing whether the various processes were alive, we weren't testing whether queries were being processed. Now, we ensure that QPS is non-zero. This would've caught cockroachdb#8192 automatically (I just happened to notice the 0 QPS manually). Also added a load test for photos.
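(For context, a minimal sketch of that kind of check, not the actual acceptance-test code; `assertNonZeroQPS` and `queryCount` are hypothetical names standing in for however the test reads a node's cumulative query counter.)

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// assertNonZeroQPS samples a cumulative query counter twice and returns an
// error if no queries were processed in between. queryCount is a hypothetical
// helper that returns a monotonically increasing total of executed statements.
func assertNonZeroQPS(queryCount func() (int64, error), window time.Duration) error {
	before, err := queryCount()
	if err != nil {
		return err
	}
	time.Sleep(window)
	after, err := queryCount()
	if err != nil {
		return err
	}
	if after <= before {
		return errors.New("qps was zero over the sample window")
	}
	fmt.Printf("approx QPS: %.1f\n", float64(after-before)/window.Seconds())
	return nil
}

func main() {
	// Simulated counter standing in for a real per-node metric.
	var total int64
	counter := func() (int64, error) { total += 100; return total, nil }
	if err := assertNonZeroQPS(counter, time.Second); err != nil {
		panic(err)
	}
}
```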
After enabling SQL tracing, I've caught several traces for SQL queries that take longer than 2 minutes (!). Here's one of those traces. This error happens upon every retry:
Somehow we're repeatedly directing the query to a replica that's not the range leaseholder? cc @bdarnell
This block of errors seems to happen with every retry (with some variation):
We preferentially send RPCs to the healthy replicas. I saw that problem once during some of my testing but never investigated further. Perhaps that is fouling things up here. I think you'll need to add some more logging/tracing to test this hypothesis.
We're trying the lease holder (or there is no lease holder), but it's not responding within the 3 second timeout, so we never officially notice that that node is the lease holder (it wouldn't matter anyway, since 3s would be enough for a lease to expire). We're cycling through all the replicas on each attempt. So the main question is why the lease holder is timing out all the time. (we could probably also reorganize our retry loops here to hold the original request to the lease holder open instead of cancelling it and restarting every cycle).
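(A rough sketch of the "hold the original request open" idea. Everything here, `replica`, `sendFn`, `sendHoldingLeaderOpen`, is a hypothetical stand-in rather than DistSender's actual API; it only illustrates the retry-loop shape being suggested.)

```go
package kvretry

import (
	"context"
	"time"
)

type replica string

// sendFn stands in for the per-replica RPC; the signature is hypothetical.
type sendFn func(ctx context.Context, r replica) (string, error)

type result struct {
	resp string
	err  error
}

// sendHoldingLeaderOpen fires the request at the presumed lease holder once
// and keeps it outstanding for the life of the parent context, while cycling
// through the other replicas with a short per-attempt timeout. The first
// successful response wins, instead of cancelling the lease holder's request
// on every retry cycle.
func sendHoldingLeaderOpen(
	ctx context.Context, leader replica, others []replica, send sendFn, perAttempt time.Duration,
) (string, error) {
	leaderCh := make(chan result, 1)
	go func() {
		resp, err := send(ctx, leader) // not cancelled between retry cycles
		leaderCh <- result{resp, err}
	}()

	for {
		for _, r := range others {
			attemptCtx, cancel := context.WithTimeout(ctx, perAttempt)
			done := make(chan result, 1)
			go func(r replica) {
				resp, err := send(attemptCtx, r)
				done <- result{resp, err}
			}(r)
			select {
			case res := <-leaderCh:
				cancel()
				return res.resp, res.err
			case res := <-done:
				cancel()
				if res.err == nil {
					return res.resp, nil
				}
			case <-attemptCtx.Done():
				// Per-attempt timeout (or parent cancellation); move on to the
				// next replica while the lease holder request stays open.
				cancel()
			}
		}
		if err := ctx.Err(); err != nil {
			return "", err
		}
	}
}
```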
This issue seems to correlate to two other issues:
I'll take a look at #1 first unless someone has a better idea.
The second issue is related to preemptive snapshots (replica ID == 0).
It also turns out that at least one node in the test cluster reliably panics or slows to a halt (~0 QPS) when I …
Really? My stress tool performs that exact action (…).
The first 3 times I performed the … That said, the QPS blip is more than a blip.
The blips I see are for 2-10 seconds. What you're describing is significantly more severe.
PS: Mentioning this because it would be an interesting clue if the problem occurs on real networks but not when the nodes are running on the loopback network.
The continuous load test also has a …
The …
So, it seems both sides are timing out. Why though? Still digging...
Also preceding these …
This same range is involved in the …
True, though having multiple writers was causing enough issues that I've …
Ah, the difference is running 4 nodes vs running 3. I switched to running a 4 node cluster and it hit the 0 QPS problem in 1m57s starting from a clean slate.
After just rebasing, when queries take >2 minutes to execute, we get bursts of these messages:
So, it looks like the range doesn't have a leader for a long time?
At the same time as the above transaction, there are replica changes going on with range 39:
Is some kind of cache or gossip info not being invalidated? |
The trace is showing that we're timing out trying to figure out who the lease holder is when talking to replica …
If a replica is removed and then re-added, clients may have a stale copy of the range descriptor that still points to the replica under its old identity, so that's one way for an uninitialized replica to get traffic. If it's still … So one scenario that I think is possible is that replica … The …
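(To make the stale-descriptor scenario concrete, a small conceptual sketch, not the actual range descriptor cache: a cached entry keeps directing traffic at a removed or re-added replica until an error-driven eviction forces a fresh lookup. All names below are hypothetical.)

```go
package kvcache

import "sync"

// RangeDescriptor is a simplified stand-in for the cached routing info
// (range membership); the real descriptor carries replica IDs and addresses.
type RangeDescriptor struct {
	RangeID  int64
	Replicas []string
}

// DescriptorCache illustrates why a stale entry keeps sending traffic to a
// removed (or re-added under a new ID) replica until it is evicted.
type DescriptorCache struct {
	mu      sync.Mutex
	entries map[int64]RangeDescriptor
	lookup  func(rangeID int64) (RangeDescriptor, error) // authoritative lookup
}

// Get returns the cached descriptor, falling back to a fresh lookup.
func (c *DescriptorCache) Get(rangeID int64) (RangeDescriptor, error) {
	c.mu.Lock()
	d, ok := c.entries[rangeID]
	c.mu.Unlock()
	if ok {
		return d, nil // possibly stale
	}
	d, err := c.lookup(rangeID)
	if err != nil {
		return RangeDescriptor{}, err
	}
	c.mu.Lock()
	c.entries[rangeID] = d
	c.mu.Unlock()
	return d, nil
}

// Evict drops the entry once a response indicates the replica no longer
// serves the range, so the next attempt does a fresh lookup.
func (c *DescriptorCache) Evict(rangeID int64) {
	c.mu.Lock()
	delete(c.entries, rangeID)
	c.mu.Unlock()
}
```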
Speaking of …
Yes, the range cache is a black box in terms of tracing. I've already added …
With some additional logging, it looks like somehow, some lease requests are taking a very long time. Here's how successful lease requests tend to look:
Here's an unhealthy lease renewal request for that same range:
Note that this lease request took 56 seconds to complete. This period overlaps with queries involving this range that took a bit over 60 seconds to finish. During this lease-less time, queries to range 10 get stuck in a vicious retry loop because there's no lease holder to execute the … Now, the question is: why did this lease request take 56 seconds?
Interestingly, an …
followed by a 3 second timeout:
Why and where are we timing out this replica change operation? I've noticed that some successful Raft operations take close to 3 seconds under load, so it seems easy to trigger change replica failures.
Lease requests are raft commands. So there needs to be a raft leader for the range in question and there needs to be a quorum of nodes available to execute the command. If you turn on raft logging …
Another possibility is that we do have a leader for the range, but are not quickly re-proposing the lease request command to it. See …
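(A bare-bones sketch of what "re-proposing" means here, not CockroachDB's actual pending-command bookkeeping; the types and method names are illustrative assumptions.)

```go
package kvpropose

import "sync"

// proposal is a stand-in for a pending raft command, e.g. a lease request.
type proposal struct {
	id   int64
	data []byte
}

// pendingProposals tracks commands that have been handed to raft but not yet
// applied. On a leadership change (or a periodic tick), everything still
// pending is proposed again instead of being left to languish until a
// client-side timeout fires.
type pendingProposals struct {
	mu      sync.Mutex
	pending map[int64]proposal
	propose func(proposal) error // submits the command to the raft group
}

func (p *pendingProposals) add(pr proposal) error {
	p.mu.Lock()
	p.pending[pr.id] = pr
	p.mu.Unlock()
	return p.propose(pr)
}

func (p *pendingProposals) applied(id int64) {
	p.mu.Lock()
	delete(p.pending, id)
	p.mu.Unlock()
}

// reproposeAll is what would run when the range notices a new raft leader.
func (p *pendingProposals) reproposeAll() {
	p.mu.Lock()
	all := make([]proposal, 0, len(p.pending))
	for _, pr := range p.pending {
		all = append(all, pr)
	}
	p.mu.Unlock()
	for _, pr := range all {
		_ = p.propose(pr) // best effort; each one stays tracked as pending
	}
}
```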
In the span of a minute, the term for a problematic range is increasing by 32. There seems to be a rapid-fire succession of Raft elections for the same range, roughly one every two seconds. So, we're throwing away a lot of unstable log entries during that time and shifting the lease holder a ton.
Can you post that …
Here are some logs for a test run that had this issue (SQL tracing is on):
In these logs, range 11 is a problematic range that has frequent elections, sometimes multiple within a second.
Running with a …
Here are logs with a 1 second Raft interval: I don't see election storms any more, but there are still instances of leaderless ranges resulting in dropped Raft proposals, and there are still lease holder changes (albeit much slower).
It's worth noting that the 0 QPS periods are far rarer with a 1s raft tick interval. However, the 3-4 minute pauses, while rarer, still happen after 1-2 hours of load. So, it's a possibility that queued-up heartbeats are behind the election storms I was seeing with the default raft tick interval. Definitely open to suggestions on what to look at next. Was thinking of going back to the default tick interval and digging further.
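(For reference, a rough sketch of how the effective election timeout scales with the tick interval; the tick counts and intervals below are illustrative assumptions, not CockroachDB's exact settings.)

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// In etcd-style raft, a follower calls an election after roughly
	// electionTicks ticks without hearing from the leader, so the effective
	// timeout is tickInterval * electionTicks.
	const electionTicks = 15
	for _, tick := range []time.Duration{200 * time.Millisecond, time.Second} {
		fmt.Printf("tick=%v -> election timeout ~%v\n", tick, electionTicks*tick)
	}
	// If ticking falls behind (e.g. heartbeats/ticks queue up behind slow raft
	// processing), several wall-clock intervals can pass between delivered
	// ticks, which makes spurious elections much more likely at a small tick
	// interval than at 1s.
}
```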
Let's try throttling rebalancing. I'm suspicious about how we rebalance by adding and then immediately removing a replica without waiting for the new replica to be brought up to date. Is there some scenario where a range can't achieve quorum until the new replica comes up to date?
Here is a better hack to try:
This will simulate what will happen if we waited for preemptive snapshots to apply. I've run a test for 5m with no blips using this patch.
Will try this today. As I mentioned in person to you, I believe that we need end-to-end tracing.
Agreed. I'm not suggesting my patch as a fix, just another means of narrowing in on the problem. If excessive rebalancing is causing the issue, it could focus the diagnostic efforts.
I tried a build with #8604 applied, and the election storm & 2-3 minute queries still happen after 4-5 minutes on two separate runs. |
Next task is to dig into the voting mechanism and understand what's triggering all these elections, and why we get so many of these messages, which overlap with the issues:
Perhaps related to what you're looking into, but when sending a large snapshot I'm not seeing any activity (msgApp, msgHeartbeat, etc.) for a range for over a second, which then triggers an election. On the node sending the snapshot the logs contain:
On another node I see:
Hmm, this is curious. I see a 3 second gap in the logs for that first node. Not only is there no activity for range 7, but there is no logging activity, period. On other nodes I still see activity during that 3 second gap. I wonder what is happening there.
Interesting. I could only find one instance of that in the logs for my …
Raft ready logging (enabled via -vmodule=raft=5) is extraordinarily expensive when logging a large snapshot. In particular, if an outgoing message contained a snapshot or raft.Ready.Snapshot was non-empty, we would see multiple seconds being spent formatting the log message only for most of it to be thrown away. These delays horked up the raft processing goroutine. Added warning logs when raft ready processing and raft ticking take longer than a second. When that happens something bad is usually going on elsewhere. See cockroachdb#8192.
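(A minimal sketch of the two mitigations that commit message describes, using only the standard library rather than CockroachDB's logging package; `describeReady`, `processReady`, and the verbosity check are illustrative stand-ins.)

```go
package main

import (
	"fmt"
	"log"
	"time"
)

// verbosity stands in for the -vmodule level; only format the expensive
// description when it will actually be emitted.
var verbosity = 0

// describeReady summarizes a raft Ready-like value. Snapshot contents are
// reported by size rather than formatted in full, since formatting a large
// snapshot can itself take multiple seconds.
func describeReady(numEntries, snapshotBytes int) string {
	snap := "no snapshot"
	if snapshotBytes > 0 {
		snap = fmt.Sprintf("snapshot present (%d bytes, contents elided)", snapshotBytes)
	}
	return fmt.Sprintf("%d entries, %s", numEntries, snap)
}

// processReady wraps one unit of raft work and warns when it runs long, so
// stalls in the raft processing goroutine show up in the logs.
func processReady(work func()) {
	if verbosity >= 5 {
		log.Printf("raft ready: %s", describeReady(3, 64<<20))
	}
	start := time.Now()
	work()
	if elapsed := time.Since(start); elapsed > time.Second {
		log.Printf("warning: raft ready processing took %s", elapsed)
	}
}

func main() {
	processReady(func() { time.Sleep(10 * time.Millisecond) })
}
```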
So, it looks like Pete's earlier theory about the quorum being 3 is correct:
This is combined with some kind of blip:
This is enough to prevent forward progress with the election. I lost some logs from supervisord's default log truncation, so I'll continue to dig further to figure out what was going on during this blip.
#8613 should help with this, since it delays changing the quorum requirement until the preemptive snapshot has been applied.
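(The quorum arithmetic behind that theory, as a tiny sketch under the usual majority rule; the commentary in the code is an interpretation of the scenario above, not output from the cluster.)

```go
package main

import "fmt"

// quorum is the number of replicas that must acknowledge a raft entry.
func quorum(replicas int) int { return replicas/2 + 1 }

func main() {
	for _, n := range []int{3, 4, 5} {
		fmt.Printf("%d replicas -> quorum %d\n", n, quorum(n))
	}
	// During a rebalance the range briefly has 4 replicas, so quorum jumps
	// from 2 to 3. If the newly added replica is still waiting on its
	// snapshot, only the original 3 can acknowledge writes, and a blip on any
	// one of them stalls the range until the snapshot applies.
}
```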
There's still an election storm with the latest master (0d11f31), though it does seem quite a bit better now. I'll verify more tomorrow. It looks like …
After that, there's a series of 18 contested elections for the same range over 10 seconds (it seems like it can take a while for elections to converge when quorum=3). After the dust settles, a range lease command must be committed before write operations can succeed for that range. All this can take a while and causes …
These anomalies seem to have been greatly reduced given the recent burst of Raft improvements. Will continue to monitor and update later.
The last 5 runs of the …
Using this command:
block_writer
qps drops to 0 and stays there after 1-3 minutes of load. There are some stuck tasks:
Some Raft requests are taking a while:
I see some interesting log entries after about 10 seconds of load:
Then I see a lot of these entries:
then lots of:
and some of these:
Full logs (node.0 has the most errors):
node.0.cockroach.stderr.txt
node.1.cockroach.stderr.txt
node.2.cockroach.stderr.txt
node.3.cockroach.stderr.txt
writer.0.block_writer.stdout.txt