Severe performance degradation when shutting one of 3 nodes down. #19236
Thanks once again, @arctica :) This is indeed not expected. We'll take a close look at this soon (next week, most likely), but if we aren't able to easily reproduce it then the following information would be useful:
Assigning to myself to look early next week, but cc @cuongdo in case you have someone else in mind.
Hi @a-robinson, thanks for taking an interest in this :) Good to hear this is unexpected indeed. Please find attached the screenshots of the 4 pages you asked for. Logs from the node that was shut down:
The only weird thing that jumps out to me is that the node seems to still have gotten some range leases during shutdown.
@a-robinson I assigned you to #17611, which seems largely aligned with this issue.
As mentioned on #19262, the load generator as written doesn't handle errors well - in my first attempt to reproduce this, the QPS dropped to 0 because the load generator received an ambiguous error and quit.

Also, I assume that the three nodes had very low latency network connections between them? We've seen this before in somewhat similar situations in widely distributed clusters, but I'm having trouble reproducing with 3 VMs in the same GCE zone after replacing the error handling.

That probably just means that I need to try harder to get a repro case; I just want to make sure I know exactly what was happening when you ran into it.
When this degradation happened in my latest test (where the graphs are from), the load generator did not quit, hence there still being queries running. The latency between the machines is around 0.2ms; they are dedicated machines in the same datacenter. Yes, when I restarted the third node, speed came back up without doing anything to the load generator. Note that my table has very small range min/max values set (64 KiB) in order to make it split into many ranges. That might be a deciding factor. You can see in the graphs that there were over 5k ranges.
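For reference, a range size that small is applied through a zone config on v1.1; a minimal sketch, assuming an insecure cluster and a table named `test.kv` (both placeholders, and the exact byte values here are illustrative), would look roughly like this:

```sh
# zone.yaml -- shrink the target range size to 64 KiB (the value described above)
# so the table splits into many small ranges.
cat > zone.yaml <<EOF
range_min_bytes: 1024
range_max_bytes: 65536
EOF

# Apply the zone config to the table (database/table name and --insecure are assumptions).
cockroach zone set test.kv --insecure -f zone.yaml
```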
I've just tested this with UPDATE and SELECT queries; the same degradation happens in all cases. I'm running it with:
Alright, so now that I have splits working at much smaller sizes than before (#19339), I've been able to reproduce the performance degradation you reported and I know what's going on.

The core issue is #9446. To summarize, we have an optimization that disables the periodic network heartbeats in ranges that aren't receiving traffic. This optimization gets disabled if any of the replicas of a range are behind, which is the case when one of the replicas is offline.

It's worth noting that this issue is at its absolute worst in a 3 node cluster. We won't remove the replicas that are on the dead node in a 3 node cluster because there's no other node to move them to, which means the cluster gets stuck in this state indefinitely. On a larger cluster, the problem would go away after 5+ minutes, since we'd start creating new copies of the data from the dead node after 5 minutes.

The reason I wasn't seeing this before doing all the splits is that the amount of heartbeat traffic scales with the number of ranges in the cluster. I've copied the most relevant graphs below, but to prove that this was really the issue I built a custom binary that never quiesces replicas. With that custom binary, I see the exact same behavior with 3 nodes running as I was seeing with the standard binary when one of the nodes is down.

I'd say we'll need to bump the priority of #9446, at least to avoid permanent situations like this in 3 node clusters. cc @petermattis
@a-robinson Great to see you were able to reproduce this and, on top of that, figure out what the issue is. Also good to hear that a cluster with >3 nodes wouldn't have this issue; I can afford more than 3 nodes when going into production, but I like to test some extreme conditions first. It makes sense that a 3 node cluster with 3 replicas runs into tough situations when one node goes down. But it sounds like that's fixable, so all good.
It would still have performance degradation; it would just be less intense and would go away after the node is considered dead (which happens 5 minutes after it goes down). Note that the reason it's so bad in this case is that you split your data into so many small ranges -- if you had let it split more naturally, or not reduced the range size by quite so much, you wouldn't see nearly as bad of a performance hit. For example, when I was running with 850 ranges (with my
Rare performance degradation over a timeframe of 5 minutes is fine for this use case, though I would expect a clean, graceful shutdown to avoid that delay. I'll try to rerun the test with bigger ranges. The reason I set such a low value was that initially, when the table had very few ranges, performance was very bad and only picked up once there were a couple dozen ranges or so. Due to the very small row sizes it would take a long time to split. In this specific case (and I'm sure in many others as well), it would actually be great if one could specify a number of ranges to create instead of range sizes.
Early results of a re-test with a fresh table and a range size of 2 MiB indicate that performance is much improved over the v1.0 and v1.1 runs I tested before. Now QPS start right away at nearly the level I saw before with many ranges. QPS are consistently between 1k and 1.5k, with the average maybe around 1.2k.

I'm seeing one weird thing right now: the Replication page still shows 150k ranges even though the old table has been dropped and recreated. The Databases page shows 8 ranges for the system db and currently 32 for the test db. Disk storage also still seems to be in use, and the Replicas graph shows 150k as well. Probably because I ctrl-c'd the DROP statement as it was taking a long time due to the load generator still running. Then I stopped the generator and saw the table got dropped. I'll probably have to open another issue for this.
Your best bet as of now (until #19154 gets done) is to use the `ALTER TABLE ... SPLIT AT` statement to get the table split into 6 roughly evenly sized ranges before you even start loading data. Based on my testing of your load in #18657, you shouldn't need very many ranges to max out performance, and going into the thousands of ranges appeared to make performance noticeably worse.
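In case it helps, a pre-split along those lines would look roughly like this in v1.1; this is only a sketch, and the table name and split keys are made up for illustration (the real split points would depend on the table's primary key distribution):

```sql
-- Assuming a hypothetical table test.kv with an integer primary key spread over [0, 600000),
-- five split points produce six roughly evenly sized ranges before any data is loaded.
ALTER TABLE test.kv SPLIT AT VALUES (100000), (200000), (300000), (400000), (500000);
```

The resulting ranges should then show up on the admin UI's Databases page once the splits complete.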
Well, unfortunately this didn't make the cut for 2.0. Our roadmap has us fixing this before the 2.1 release in the fall. I'm going to close this as a dupe of #9446, but please reopen if there's anything separate to do here.
I'm running a 3 node cluster of the recent v1.1 release with the load generator from #18657 running on node "master" (the other two nodes are called node1 and node2).
When I shut down node2, queries per second drop from around 1600 to a very low 50-250.
This is unexpected to me since I'd expect the two remaining healthy nodes to still be able to form a raft quorum and make normal progress.
Node2 was shut down gracefully and the admin UI doesn't show any unavailable ranges.
After bringing node2 back up, QPS get back to the same level as before the shutdown.
The table at the time of testing had over 6.6k ranges.
Log entries from node "master":