-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kv/sql: GC after TRUNCATE temporarily craters write throughput #62700
Comments
@nvanbenschoten was this run with the Pebble fix cockroachdb/pebble#1085? |
No, it was not. This was running v20.2.6. |
This was the original intention with the `SnapshotRequest_Strategy` structure and was prototyped in cockroachdb#25134, but we never pushed it over the finish line because we did not find cases where SST ingestion was disruptive enough to warrant the extra complexity. cockroachdb#62700 tells a different story.
I did some exploration on As expected, the GC is followed by a wave of range merges that are in some way correlated with the destabilization. I then ran a few experiments to get an idea of cause and effect, as well as to determine whether there are any short-term mitigations here. We had speculated that a wave of range merges was responsible for destabilizing the cluster due to the rebalancing load imposed by the merge themselves along with the rebalancing load that fell out of the sudden change in cluster balance. So first, I played with the In this run, GC was followed by a steady rate of range merges and almost no impact to foreground traffic. This was a pretty clear indication that range merges are in some way responsible for the instability we are seeing here - either directly or indirectly. We actually already knew this to some extent, as we had seen early that with range merges disabled entirely, the GC was mostly non-disruptive. Then, when we enabled range merges, the instability immediately hit. So this test once again confirmed this fact. It also demonstrated that reducing the aggressiveness of range merges is a good short-term mitigation for this problem. So to summarize, the following should make TRUNCATE much less disruptive, even in earlier versions of CRDB:
I then started digging into why a wave of range merges on a keyspace disjoint from the load was so disruptive. In this setup, we have very few ranges outside of those in the table being GCed, so the idea that range merges were leading to other rebalancing activity didn't quite add up. And yet, we still saw serious disruption due to range merges. So it seemed likely that the range merges and the empty range rebalancing they lead to was itself causing problems. One theory I had was that we've been failing to account for a fixed externality of Raft snapshots - the need to flush the LSM's overlapping memtables when ingesting a Raft snapshot. This is a cost that a Raft snapshot currently (since #38932) has to pay, regardless of how small the snapshot is. It is also a cost that may be passed on to concurrent writers. As detailed in the documentation on
I put together a prototype of a change we had originally (see #25134) intended to make when addressing #16954. Instead of always using external SST ingestion during Raft snapshot application, we would only do so for ranges above some size threshold. This allows us to use the LSM's standard write path when applying Raft snapshots below a size threshold, avoiding the memtable flush and "clean up" compactions of small SSTs. This prototype is hacked together in #63013. Using this prototype, I ran the test, once again with range merging set to maximum aggressiveness ( In the test, we saw that there was a small blip in throughput when the GC went into effect due to p99 latency jumping from 30ms to about 60ms. However, throughput immediately recovered even as all 10k ranges were merged away in seconds. I'd like to re-run this experiment a few times to confirm that I'm not fooling myself, but this looks very good and hints that we may be on to something with the theory about SST ingestion during snapshot application being the cause of foreground latency. One other thing I noticed when drafting this prototype is that we do not call |
Great sleuthing, @nvanbenschoten.
I don't recall if that was intentional, and I don't know if it would help. The snapshot pre-ingest delay stuff was added by the Bulk IO team and we may simply have never thought to add it to the Raft snapshot ingestion path. On the other hand, delaying Raft snapshot ingestion will block other operations on the replica. Could that be problematic? I'm too distant from this area of the code now to judge. Certainly seems like a possible easy thing to experiment with.
Not ingesting very small sstables sounds like a good idea, though I don't have a good idea for how to make a principled decision on the size threshold at which to do ingestion vs the standard write path. Perhaps @sumeerbhola has some ideas. We could also consider doing this at the Pebble layer. cockroachdb/pebble#25 is one possibility for making ingestions faster. Another is that a very small ingested sstable could be transformed in an addition to the memtable. The WAL entry would be a pointer to the ingested sstable and the "commit" would iterate over the sstable and add the entries to the memtable. The upside to this approach is that it feels like a storage level concern. The downside is that either cockroachdb/pebble#25 or another approach would not be backward compatible so we'd have to gate using that on a cluster version. |
Nice @nvanbenschoten!
Pebble also has the context to determine whether or not the files overlap the memtable. Maybe we we incorporate that into the heuristic of whether or not to apply the ingest as a memtable addition. |
I don't have an opinion regarding what to do regarding a backportable fix. But going forward, given the number of places we ingest sstables, and how sensitive customers are likely to be wrt high percentile latency hiccups in an OLTP setting, cockroachdb/pebble#25 seems like the right approach (regardless of sstable size). |
I am generally very in favor of doing this in Pebble, as SST ingestion is used in a number of places, like Sumeer is saying. I suspect that this is in some way related to the concerns we periodically hear about index creation being disruptive to foreground traffic. That said, it is at least worth mentioning the I/O we could save by avoiding SST ingestion for small Raft snapshots. Today, each Raft snapshot results in 4+ SSTs, most of which are very small. In situations like we see here, all of them are very small. So we end up writing out 4+ small SSTs to the filesystem, then ingesting them, then quickly compacting them away. A WriteBatch avoids this wasted I/O, so it should be more efficient. How much more efficient? I don't know - so it would be a premature optimization to do anything here if more general solutions exist. And the downside is of course the added code complexity of needing to maintain an SST ingestion path and a WriteBatch path for Raft snapshots. But a WriteBatch path may be a more reasonable backport target. |
This is a good point. We need to be mindful of what is a more reasonable backport target. The proposed improvement to Pebble's sstable ingestion will not be backportable. The creation of the 4 small sstables will also result in 4 fsyncs. The WriteBatch approach avoids that, doing a single fsync of the Pebble WAL. |
Unfortunately, I no longer believe that the analysis above is complete. About a week ago, I began working on a patch that would avoid the IngestExternalFiles calls when applying small Raft snapshots in a manner that I believed would be backportable. Unlike the first prototype, which built a WriteBatch instead of writing sst files as the Raft snapshot streamed in, this more targetted approach continued to write the sst files, but would then read them in and copy them into a WriteBatch and apply that if it determined that the WriteBatch would be reasonably sized. The patch is here. I then ran a few more tests to confirm that the patch did address the issue, as I had seen before with the slightly different approach to avoiding the sst ingestion. Unfortunately, I was not able to demonstrate the improvement. I did see that we were no longer rapidly flushing memtables during the GC period, so the patch was doing what it intended to. without patch with patch However, I once again saw throughput crater after a short period of time, eventually bottoming out at 0: So I went back to the original patch from above that I had seen demonstrate such an improvement. Sure enough, I had a hard time reproducing those results, as I was still seeing cases where throughput was cratering. So it became clear that the previous diagnosis was not correct. There is more to the story. I started exploring hardware metrics to try to correlate the degraded throughput with degredation of storage performance. There was certainly a correlation, as log commit latency often spiked during the GC period, as did the IOPS In Progress metric, but there wasn't much more to go off here. I came back to the issue this week. This time, I focused less on storage-level performance degradation and more on distribution-level operations that may be causing stalls. The throughput hit began to look too severe to be explained by degraded storage performance, even if log commit latencies were spiking into the 800ms range. In each of the previous reproductions, there had been a period of performance degradation by about 60%, which can be mostly explained by a jump in storage latency. However, this had always been followed by a full outage period where qps hit 0, often repeatedly, which was even more concerning. After a few tests, the following correlation caught my eye: Notice that throughput is inversely correlated with the combined occurrences of This was all a bit surprising, as the expectation was that merging and rebalancing were occurring on the dropped keyspace that was being GCed, but not on the keyspace of the active table which was seeing all the load. And yet, requests were hitting these errors. I began playing with cluster settings to try to disable all lease transfers. The theory was that due to the range merges on the inactive table, we were moving around the leases on the active table to balance qps and count. The thought was that storage was slowing down in response to lots of compactions due to the range merges on the inactive table, then we started moving around leases, the lease transfers would be quite slow because storage is slow (especially in the tail), so we would have seconds-long blips of unavailability because a range is unavailable between the time that it begins a lease transfer and the time that the lease transfer applies on the new leaseholder. In that way, a lease transfer is a lot like a range-wide lock. Since the load was randomly distributed across all ranges, the second a range became unavailable, everyone would pile up on it and qps across the cluster would go to 0. To test this out, I made an attempt to prevent all lease transfers. I tested with the following cluster settings:
Unfortunately, this also did not help. Next, I started thinking about the > SELECT table_id FROM crdb_internal.tables WHERE name = 'kv';
table_id
------------
57
> SELECT
"eventType", count(*)
FROM
system.rangelog
WHERE
crdb_internal.pretty_key(decode(info::JSONB->'UpdatedDesc'->>'start_key', 'base64'), 0) LIKE '/57%'
AND "timestamp" BETWEEN '2021-04-23 02:45:00' AND '2021-04-23 03:00:00'
GROUP BY
"eventType";
eventType | count
---------------+--------
split | 119
add_voter | 393
remove_voter | 399
merge | 100 What jumped out wasn't the rebalancing, but the range merges. Why on earth were we merging ranges in the active table? The answer to this lies in why these ranges existed in the first place - they were load-based splits! So when qps dropped on the cluster, the mergeQueue decided that the split points were no longer needed, so it removed them by merging ranges. This led to more of a slowdown, and more merges. And the merge cascade continued as the poor load struggled to keep up as it was bounced around between nodes and stalled on merges and lease transfers. Furthermore, as ranges merged, the load began to consolidate back on a few nodes instead of being well distributed across the cluster. There's some history here, as something similar was tracked in #41317. We closed that issue by introducing a new So to test this theory, I ran another test with What we see is that qps still drops by about 60% during the range merge load, as we had seen previously. However, there was no follow-up period of reverberating full throughput stalls, which were bad on this 16 node cluster and way worse on the 72 node cluster we saw before. Instead, qps dips while inactive ranges are merging and log commit latency grows, and then recovers immediately after log commit latency recovers. So it looks like the merge cascade of load-based splits can explain the extended unavailability we have seen in these clusters (this time, I triple checked 😃). This also explains why I somehow got lucky enough to fool myself up above. This "merge cascade on performance dip" behavior is quite unstable. Once you hit it, the system melts down. But you also won't hit it if you never see an initial performance dip. So for one reason or anything, maybe because the memtable stall theory wasn't totally off-base, my experiment with that change never triggered the beginning of the cascade - slowing down qps just enough to allow a single range in the active table to merge - so they never saw performance collapse. So there are a few takeaways from this: First, we should revisit #41317. Load-based splitting is a great tool, but if performance depends on the splits that it creates and the splits disappear the moment load slows down, it creates a very unstable situation. Second, we should determine what short-term suggestions we can make with this new understanding. My guess is that the following would serve the current user who is running into this quite well, as it would prevent load-based splits created between the TRUNCATE and GC from being merged away during any temporary instability during the GC: SET CLUSTER SETTING kv.range_split.by_load_merge_delay = <2 x gc.ttlseconds> Finally, we should continue investigating the degraded (but not stalled) performance while ranges on the inactive table are merging. Unlike with the more confounding stalls, we see a clear correlation between the reduced throughput and all of (disk read bytes, disk write bytes, disk read IOPS, log commit tail latency, and LSM compactions). So on the surface, this looks less bizarre. This may be a result of the test making range merges extra aggressive, but there is likely something to learn here. |
(from a past life) In addition to that, do we take ongoing count and latency of merges into account when deciding whether to start a new merge. If we took resource overload signals into account before deciding to merge, we would potentially fare better.
It is unfortunate that we cannot differentiate between low load because the external load is actually low, versus the closed loop load generators are experiencing a slow system and therefore the load rate has reduced. Is there an explicit queue somewhere that we could measure at the |
I was thinking something similar when I read @nvanbenschoten's write-up. Splitting a range based on load to that range makes sense. For merging, simply looking at the load on the ranges to be merged seems insufficient. I think a coarse switch might be ok. Don't we do something similar to throttle SQL stats collection. I'm imagining something where we could look at CPU, disk IOPS, compaction debt, etc and temporarily disable range merges if they are high. |
We don't. The only thing we do is throttle merges based on the
For SQL stats collection, we throttle based on CPU utilization on the current node here: cockroach/pkg/sql/rowexec/sampler.go Lines 276 to 316 in fe96523
The way this works is if the cpu usage is below 25%, we don't throttle at all. If it is between 25% and 75%, sleep for some fraction of
I'm still trying to understand how to think about these kinds of approaches, so if you two have references to prior art in this area, I'd find that very helpful. There's also discussion from @sumeerbhola about other areas where we could think along these lines in #57247 and #57248. The challenges to this kind of throttling behavior seem to be that 1) we need cluster-wide information to make good decisions, otherwise we may not properly account for unbalanced load (one way or the other), 2) we need to determine which signals are appropriate to look at and which ones aren't (the classic admission control problem), and 3) it seems quite easy to accidentally design a throttling heuristic that results in range merges never running under certain workload characteristics. We may want some form of probabilistic throttling to avoid that kind of pathology. An alternative approach I've been thinking of along the lines of probabilistic throttling is to keep track on the leaseholder of the first time a range was run through the |
Informs cockroachdb#63009. Informs cockroachdb#64056. In cockroachdb#63009/cockroachdb#64056, we saw that this test could flake with a nil pointer panic. I don't know quite what's going on here, but when working on a patch for cockroachdb#62700, I managed to hit this panic reliably by accidentally breaking all range merges. After a bit of debugging, it became clear that we were always hitting a panic in the `reset` stage of `TestMergeQueue/sticky-bit` because the previous subtest, `TestMergeQueue/non-collocated`, was moving the RHS range to a different node, failing to merge the two range, and failing itself. This soft failure was being drowned out by the hard failure in the next subtest. This commit replaces the crash with a failure that looks something like the following when range merges are completely disabled: ``` --- FAIL: TestMergeQueue (0.34s) test_log_scope.go:73: test logs captured to: /var/folders/8k/436yf8s97cl_27vlh270yb8c0000gp/T/logTestMergeQueue627909827 test_log_scope.go:74: use -show-logs to present logs inline --- FAIL: TestMergeQueue/both-empty (0.00s) client_merge_test.go:4183: ranges unexpectedly unmerged expected startKey /Table/Max, but got "\xfa\x00\x00" --- FAIL: TestMergeQueue/lhs-undersize (0.00s) client_merge_test.go:4192: ranges unexpectedly unmerged expected startKey /Table/Max, but got "\xfa\x00\x00" --- FAIL: TestMergeQueue/combined-threshold (0.00s) client_merge_test.go:4214: ranges unexpectedly unmerged expected startKey /Table/Max, but got "\xfa\x00\x00" --- FAIL: TestMergeQueue/non-collocated (0.03s) client_merge_test.go:4236: replica doesn't exist --- FAIL: TestMergeQueue/sticky-bit (0.00s) client_merge_test.go:4243: right-hand side range not found --- FAIL: TestMergeQueue/sticky-bit-expiration (0.00s) client_merge_test.go:4268: right-hand side range not found ``` I expect that under stress on master, we will see the `TestMergeQueue/non-collocated` subtest fail.
Closes cockroachdb#62700. Re-addresses cockroachdb#41317. This commit reworks how queries-per-second measurements are used when determining whether to merge two ranges together. At a high-level, the change moves from a scheme where the QPS over the last second on the LHS and RHS ranges are combined and compared against a threshold (half the load-based split threshold) to a scheme where the maximum QPS measured over the past 5 minutes (configurable) on the LHS and RHS ranges are combined and compared against said threshold. The commit makes this change to avoid thrashing and to avoid overreacting to temporary fluctuations in load. These overreactions lead to general instability in clusters, as we saw in cockroachdb#41317. Worse, the overreactions compound and can lead to cluster-wide meltdowns where a transient slowdown can trigger a wave of range merges, which can slow the cluster down further, which can lead to more merges, etc. This is what we saw in cockroachdb#62700. This behavior is bad on small clusters and it is even worse on large ones, where range merges don't just interrupt traffic, but also result in a centralization of load in a previously well-distributed dataset, undoing all of the hard work of load-based splitting and rebalancing and creating serious hotspots. The commit improves this situation by introducing a form of memory into the load-based split `Decider`. This is the object which was previously only responsible for measuring queries-per-second on a range and triggering the process of finding a load-based split point. The object is now given an additional role of taking the second-long QPS samples that it measures and aggregating them together to track the maximum historical QPS over a configurable retention period. This maximum QPS measurement can be used to prevent load-based splits from being merged away until the resulting ranges have consistently remained below a certain QPS threshold for a sufficiently long period of time. The `mergeQueue` is taught how to use this new source of information. It is also taught that it should be conservative about imprecision in this QPS tracking, opting to skip a merge rather than perform one when the maximum QPS measurement has not been tracked for long enough. This means that range merges will typically no longer fire within 5 minutes of a lease transfer. This seems fine, as there are almost never situations where a range merge is desperately needed and we should risk making a bad decision in order to perform one. I've measured this change on the `clearrange` roachtest that we made heavy use of in cockroachdb#62700. As expected, it has the same effect as bumping up the `kv.range_split.by_load_merge_delay` high enough such that ranges never merge on the active table. Here's a screenshot of a recent run. We still see a period of increased tail latency and reduced throughput, which has a strong correlation with Pebble compactions. However, we no longer see the subsequent cluster outage that used to follow, where ranges on the active table would begin to merge and throughput would fall to 0 and struggle to recover, bottoming out repeatedly. <todo insert images> Release note (performance improvement): Range merges are no longer triggered if a range has seen significant load over the previous 5 minutes, instead of only considering the last second. This improves stability, as load-based splits will no longer rapidly disappear during transient throughput dips.
Informs cockroachdb#62700. This commit bumps the default value for the `kv.range_merge.queue_interval` cluster setting from 1s to 5s. This setting serves as a per-store rate limit on the number of range merges that will be initiated. We've seen in a few issues like cockroachdb#62700 that excessive range merge traffic can cause instability in a cluster. There's very little reason to be aggressive about range merging, as range merges are rarely needed with any urgency. However, there are good reasons to be conservative about them. This change can also be justified as a (late) reaction to the increased max range size from 64MB to 512MB. A range merge may need to rebalance replicas in a range, so its cost can be a function of the sizes of ranges. This means that if range merges are now more expensive, we should be running them less frequently.
I want to clarify one thing: does the Regarding the rate limiting that we do for SQL stats and your concerns about "need cluster-wide information to make good decisions";ending up with a design that accidentally doesn't merge; probabilistic throttling: these are all good points and I don't have any clear answers from prior art I have encountered:
|
64199: kv: avoid panic on TestMergeQueue/non-collocated failure r=nvanbenschoten a=nvanbenschoten Informs #63009. Informs #64056. In #63009/#64056, we saw that this test could flake with a nil pointer panic. I don't know quite what's going on here, but when working on a patch for #62700, I managed to hit this panic reliably by accidentally breaking all range merges. After a bit of debugging, it became clear that we were always hitting a panic in the `reset` stage of `TestMergeQueue/sticky-bit` because the previous subtest, `TestMergeQueue/non-collocated`, was moving the RHS range to a different node, failing to merge the two range, and failing itself. This soft failure was being drowned out by the hard failure in the next subtest. This commit replaces the crash with a failure that looks something like the following when range merges are completely disabled: ``` --- FAIL: TestMergeQueue (0.34s) test_log_scope.go:73: test logs captured to: /var/folders/8k/436yf8s97cl_27vlh270yb8c0000gp/T/logTestMergeQueue627909827 test_log_scope.go:74: use -show-logs to present logs inline --- FAIL: TestMergeQueue/both-empty (0.00s) client_merge_test.go:4183: ranges unexpectedly unmerged expected startKey /Table/Max, but got "\xfa\x00\x00" --- FAIL: TestMergeQueue/lhs-undersize (0.00s) client_merge_test.go:4192: ranges unexpectedly unmerged expected startKey /Table/Max, but got "\xfa\x00\x00" --- FAIL: TestMergeQueue/combined-threshold (0.00s) client_merge_test.go:4214: ranges unexpectedly unmerged expected startKey /Table/Max, but got "\xfa\x00\x00" --- FAIL: TestMergeQueue/non-collocated (0.03s) client_merge_test.go:4236: replica doesn't exist --- FAIL: TestMergeQueue/sticky-bit (0.00s) client_merge_test.go:4243: right-hand side range not found --- FAIL: TestMergeQueue/sticky-bit-expiration (0.00s) client_merge_test.go:4268: right-hand side range not found ``` I expect that under stress on master, we will see the `TestMergeQueue/non-collocated` subtest fail. The fact that `TestMergeQueue/non-collocated` is the test failing means that we may want to have @aayushshah15 take over this investigation, since he's made changes in that area recently. What do you two think? Co-authored-by: Nathan VanBenschoten <[email protected]>
64080: kvserver: improve error message for rare lease errors r=andreimatei a=andreimatei In some rare cases, the status of a lease in relation to a request/timestamp can't be determined. For the request's client this results in a NotLeaseholderError. This patch improves the message of that error. In particular, this test failure[1] seems to show that a node couldn't verify that an existing lease is expired because its liveness gossiped info was stale. This sounds interesting and if the test fails again this improved message should help. [1] #57932 (comment) Release note: None 64239: kv: bump kv.range_merge.queue_interval to 5s r=nvanbenschoten a=nvanbenschoten Informs #62700. This commit bumps the default value for the `kv.range_merge.queue_interval` cluster setting from 1s to 5s. This setting serves as a per-store rate limit on the frequency at which range merges will be initiated. We've seen in a few issues, including #62700, that excessive range merge traffic can cause instability in a cluster. There's very little reason to be aggressive about range merging, as range merges are rarely needed with any urgency. However, there are good reasons to be conservative about them. This change can also be justified as a (late) reaction to the increased max range size from 64MB to 512MB. A range merge may need to rebalance replicas in a range, so its cost can be a function of the sizes of ranges. This means that if range merges are now more expensive, we should be running them less frequently. 64260: roachpb: remove EndTxn's DeprecatedCanCommitAtHigherTimestamp field r=nvanbenschoten a=nvanbenschoten The field was replaced with the more general BatchRequest.CanForwardReadTimestamp in 972915d. This commit completes the migration to remove the old flag. We attempted this before, in 93d5eb9, but had to back that out in 4189938 because we still needed compatibility with v20.1 nodes at the time. That is no longer the case. Co-authored-by: Andrei Matei <[email protected]> Co-authored-by: Nathan VanBenschoten <[email protected]>
This is roughly correct, though the limit is actually per store, not per node.
I agree with this. I'm not confident that we would be able to construct an accurate cost-model of each admin command that fully accounts for all of its immediate costs and all of its externalities. So I think throttling here needs to be performed by whoever is deciding to issue the admin command.
This is an interesting point. This kind of approach becomes possible when we closely monitor the health and control the resources in a system. But requiring an operator to be in the loop becomes much less appealing for on-prem deployments.
Also interesting. Does the rate at which these centralized decisions get applied tend to scale linearly with available resources or are constant limits imposed? |
64201: kv: rationalize load-based range merging r=nvanbenschoten a=nvanbenschoten Closes #62700. Fully addresses #41317. This commit reworks how queries-per-second measurements are used when determining whether to merge two ranges together. At a high-level, the change moves from a scheme where the QPS over the last second on the LHS and RHS ranges are combined and compared against a threshold (half the load-based split threshold) to a scheme where the maximum QPS measured over the past 5 minutes (configurable) on the LHS and RHS ranges are combined and compared against said threshold. The commit makes this change to avoid thrashing and to avoid overreacting to temporary fluctuations in load. These overreactions lead to general instability in clusters, as we saw in #41317. Worse, the overreactions compound and can lead to cluster-wide meltdowns where a transient slowdown can trigger a wave of range merges, which can slow the cluster down further, which can lead to more merges, etc. This is what we saw in #62700. This behavior is bad on small clusters and it is even worse on large ones, where range merges don't just interrupt traffic, but also result in a centralization of load in a previously well-distributed dataset, undoing all of the hard work of load-based splitting and rebalancing and creating serious hotspots. The commit improves this situation by introducing a form of memory into the load-based split `Decider`. This is the object which was previously only responsible for measuring queries-per-second on a range and triggering the process of finding a load-based split point. The object is now given an additional role of taking the second-long QPS samples that it measures and aggregating them together to track the maximum historical QPS over a configurable retention period. This maximum QPS measurement can be used to prevent load-based splits from being merged away until the resulting ranges have consistently remained below a certain QPS threshold for a sufficiently long period of time. The `mergeQueue` is taught how to use this new source of information. It is also taught that it should be conservative about imprecision in this QPS tracking, opting to skip a merge rather than perform one when the maximum QPS measurement has not been tracked for long enough. This means that range merges will typically no longer fire within 5 minutes of a lease transfer. This seems fine, as there are almost never situations where a range merge is desperately needed and we should risk making a bad decision in order to perform one. I've measured this change on the `clearrange` roachtest that we made heavy use of in #62700. As expected, it has the same effect as bumping up the `kv.range_split.by_load_merge_delay` high enough such that ranges never merge on the active table. Here's a screenshot of a recent run. We still see a period of increased tail latency and reduced throughput, which has a strong correlation with Pebble compactions. However, we no longer see the subsequent cluster outage that used to follow, where ranges on the active table would begin to merge and throughput would fall to 0 and struggle to recover, bottoming out repeatedly. <img width="1323" alt="Screen Shot 2021-04-26 at 12 32 53 AM" src="https://user-images.githubusercontent.com/5438456/116037215-c8f66300-a635-11eb-8ff2-9e7db4baee8d.png"> <img width="986" alt="Screen Shot 2021-04-26 at 12 33 04 AM" src="https://user-images.githubusercontent.com/5438456/116037225-cc89ea00-a635-11eb-8f2c-a40b2b3e47a7.png"> Instead of what we originally saw, which looked like: <img width="1305" alt="Screen Shot 2021-03-27 at 10 52 18 PM" src="https://user-images.githubusercontent.com/5438456/112763884-53668b00-8fd4-11eb-9ebc-61d9494eca10.png"> Release note (performance improvement): Range merges are no longer triggered if a range has seen significant load over the previous 5 minutes, instead of only considering the last second. This improves stability, as load-based splits will no longer rapidly disappear during transient throughput dips. Release note (performance improvement): Range merges are no longer considered if a range has seen significant load over the previous 5 minutes, instead of being considered as long as a range has low load over the last second. This improves stability, as load-based splits will no longer rapidly disappear during transient throughput dips. cc @cockroachdb/kv Co-authored-by: Nathan VanBenschoten <[email protected]>
For something like limiting the number of ongoing merges, I've seen simple heuristics like no merges at mean cluster utilization > N%. And when < N%, no more than a mean of M ongoing merges/node. So yes, it does scale with deployed resources but not necessarily in a sophisticated way. Looks like we already have an M which is the number of stores per node. |
Informs cockroachdb#62700. This commit bumps the default value for the `kv.range_merge.queue_interval` cluster setting from 1s to 5s. This setting serves as a per-store rate limit on the number of range merges that will be initiated. We've seen in a few issues like cockroachdb#62700 that excessive range merge traffic can cause instability in a cluster. There's very little reason to be aggressive about range merging, as range merges are rarely needed with any urgency. However, there are good reasons to be conservative about them. This change can also be justified as a (late) reaction to the increased max range size from 64MB to 512MB. A range merge may need to rebalance replicas in a range, so its cost can be a function of the sizes of ranges. This means that if range merges are now more expensive, we should be running them less frequently.
Informs cockroachdb#63009. Informs cockroachdb#64056. In cockroachdb#63009/cockroachdb#64056, we saw that this test could flake with a nil pointer panic. I don't know quite what's going on here, but when working on a patch for cockroachdb#62700, I managed to hit this panic reliably by accidentally breaking all range merges. After a bit of debugging, it became clear that we were always hitting a panic in the `reset` stage of `TestMergeQueue/sticky-bit` because the previous subtest, `TestMergeQueue/non-collocated`, was moving the RHS range to a different node, failing to merge the two range, and failing itself. This soft failure was being drowned out by the hard failure in the next subtest. This commit replaces the crash with a failure that looks something like the following when range merges are completely disabled: ``` --- FAIL: TestMergeQueue (0.34s) test_log_scope.go:73: test logs captured to: /var/folders/8k/436yf8s97cl_27vlh270yb8c0000gp/T/logTestMergeQueue627909827 test_log_scope.go:74: use -show-logs to present logs inline --- FAIL: TestMergeQueue/both-empty (0.00s) client_merge_test.go:4183: ranges unexpectedly unmerged expected startKey /Table/Max, but got "\xfa\x00\x00" --- FAIL: TestMergeQueue/lhs-undersize (0.00s) client_merge_test.go:4192: ranges unexpectedly unmerged expected startKey /Table/Max, but got "\xfa\x00\x00" --- FAIL: TestMergeQueue/combined-threshold (0.00s) client_merge_test.go:4214: ranges unexpectedly unmerged expected startKey /Table/Max, but got "\xfa\x00\x00" --- FAIL: TestMergeQueue/non-collocated (0.03s) client_merge_test.go:4236: replica doesn't exist --- FAIL: TestMergeQueue/sticky-bit (0.00s) client_merge_test.go:4243: right-hand side range not found --- FAIL: TestMergeQueue/sticky-bit-expiration (0.00s) client_merge_test.go:4268: right-hand side range not found ``` I expect that under stress on master, we will see the `TestMergeQueue/non-collocated` subtest fail.
Closes cockroachdb#62700. Re-addresses cockroachdb#41317. This commit reworks how queries-per-second measurements are used when determining whether to merge two ranges together. At a high-level, the change moves from a scheme where the QPS over the last second on the LHS and RHS ranges are combined and compared against a threshold (half the load-based split threshold) to a scheme where the maximum QPS measured over the past 5 minutes (configurable) on the LHS and RHS ranges are combined and compared against said threshold. The commit makes this change to avoid thrashing and to avoid overreacting to temporary fluctuations in load. These overreactions lead to general instability in clusters, as we saw in cockroachdb#41317. Worse, the overreactions compound and can lead to cluster-wide meltdowns where a transient slowdown can trigger a wave of range merges, which can slow the cluster down further, which can lead to more merges, etc. This is what we saw in cockroachdb#62700. This behavior is bad on small clusters and it is even worse on large ones, where range merges don't just interrupt traffic, but also result in a centralization of load in a previously well-distributed dataset, undoing all of the hard work of load-based splitting and rebalancing and creating serious hotspots. The commit improves this situation by introducing a form of memory into the load-based split `Decider`. This is the object which was previously only responsible for measuring queries-per-second on a range and triggering the process of finding a load-based split point. The object is now given an additional role of taking the second-long QPS samples that it measures and aggregating them together to track the maximum historical QPS over a configurable retention period. This maximum QPS measurement can be used to prevent load-based splits from being merged away until the resulting ranges have consistently remained below a certain QPS threshold for a sufficiently long period of time. The `mergeQueue` is taught how to use this new source of information. It is also taught that it should be conservative about imprecision in this QPS tracking, opting to skip a merge rather than perform one when the maximum QPS measurement has not been tracked for long enough. This means that range merges will typically no longer fire within 5 minutes of a lease transfer. This seems fine, as there are almost never situations where a range merge is desperately needed and we should risk making a bad decision in order to perform one. I've measured this change on the `clearrange` roachtest that we made heavy use of in cockroachdb#62700. As expected, it has the same effect as bumping up the `kv.range_split.by_load_merge_delay` high enough such that ranges never merge on the active table. Here's a screenshot of a recent run. We still see a period of increased tail latency and reduced throughput, which has a strong correlation with Pebble compactions. However, we no longer see the subsequent cluster outage that used to follow, where ranges on the active table would begin to merge and throughput would fall to 0 and struggle to recover, bottoming out repeatedly. <todo insert images> Release note (performance improvement): Range merges are no longer considered if a range has seen significant load over the previous 5 minutes, instead of being considered as long as a range has low load over the last second. This improves stability, as load-based splits will no longer rapidly disappear during transient throughput dips.
Informs cockroachdb#63009. Informs cockroachdb#64056. In cockroachdb#63009/cockroachdb#64056, we saw that this test could flake with a nil pointer panic. I don't know quite what's going on here, but when working on a patch for cockroachdb#62700, I managed to hit this panic reliably by accidentally breaking all range merges. After a bit of debugging, it became clear that we were always hitting a panic in the `reset` stage of `TestMergeQueue/sticky-bit` because the previous subtest, `TestMergeQueue/non-collocated`, was moving the RHS range to a different node, failing to merge the two range, and failing itself. This soft failure was being drowned out by the hard failure in the next subtest. This commit replaces the crash with a failure that looks something like the following when range merges are completely disabled: ``` --- FAIL: TestMergeQueue (0.34s) test_log_scope.go:73: test logs captured to: /var/folders/8k/436yf8s97cl_27vlh270yb8c0000gp/T/logTestMergeQueue627909827 test_log_scope.go:74: use -show-logs to present logs inline --- FAIL: TestMergeQueue/both-empty (0.00s) client_merge_test.go:4183: ranges unexpectedly unmerged expected startKey /Table/Max, but got "\xfa\x00\x00" --- FAIL: TestMergeQueue/lhs-undersize (0.00s) client_merge_test.go:4192: ranges unexpectedly unmerged expected startKey /Table/Max, but got "\xfa\x00\x00" --- FAIL: TestMergeQueue/combined-threshold (0.00s) client_merge_test.go:4214: ranges unexpectedly unmerged expected startKey /Table/Max, but got "\xfa\x00\x00" --- FAIL: TestMergeQueue/non-collocated (0.03s) client_merge_test.go:4236: replica doesn't exist --- FAIL: TestMergeQueue/sticky-bit (0.00s) client_merge_test.go:4243: right-hand side range not found --- FAIL: TestMergeQueue/sticky-bit-expiration (0.00s) client_merge_test.go:4268: right-hand side range not found ``` I expect that under stress on master, we will see the `TestMergeQueue/non-collocated` subtest fail.
Closes cockroachdb#62700. Re-addresses cockroachdb#41317. This commit reworks how queries-per-second measurements are used when determining whether to merge two ranges together. At a high-level, the change moves from a scheme where the QPS over the last second on the LHS and RHS ranges are combined and compared against a threshold (half the load-based split threshold) to a scheme where the maximum QPS measured over the past 5 minutes (configurable) on the LHS and RHS ranges are combined and compared against said threshold. The commit makes this change to avoid thrashing and to avoid overreacting to temporary fluctuations in load. These overreactions lead to general instability in clusters, as we saw in cockroachdb#41317. Worse, the overreactions compound and can lead to cluster-wide meltdowns where a transient slowdown can trigger a wave of range merges, which can slow the cluster down further, which can lead to more merges, etc. This is what we saw in cockroachdb#62700. This behavior is bad on small clusters and it is even worse on large ones, where range merges don't just interrupt traffic, but also result in a centralization of load in a previously well-distributed dataset, undoing all of the hard work of load-based splitting and rebalancing and creating serious hotspots. The commit improves this situation by introducing a form of memory into the load-based split `Decider`. This is the object which was previously only responsible for measuring queries-per-second on a range and triggering the process of finding a load-based split point. The object is now given an additional role of taking the second-long QPS samples that it measures and aggregating them together to track the maximum historical QPS over a configurable retention period. This maximum QPS measurement can be used to prevent load-based splits from being merged away until the resulting ranges have consistently remained below a certain QPS threshold for a sufficiently long period of time. The `mergeQueue` is taught how to use this new source of information. It is also taught that it should be conservative about imprecision in this QPS tracking, opting to skip a merge rather than perform one when the maximum QPS measurement has not been tracked for long enough. This means that range merges will typically no longer fire within 5 minutes of a lease transfer. This seems fine, as there are almost never situations where a range merge is desperately needed and we should risk making a bad decision in order to perform one. I've measured this change on the `clearrange` roachtest that we made heavy use of in cockroachdb#62700. As expected, it has the same effect as bumping up the `kv.range_split.by_load_merge_delay` high enough such that ranges never merge on the active table. Here's a screenshot of a recent run. We still see a period of increased tail latency and reduced throughput, which has a strong correlation with Pebble compactions. However, we no longer see the subsequent cluster outage that used to follow, where ranges on the active table would begin to merge and throughput would fall to 0 and struggle to recover, bottoming out repeatedly. <todo insert images> Release note (performance improvement): Range merges are no longer considered if a range has seen significant load over the previous 5 minutes, instead of being considered as long as a range has low load over the last second. This improves stability, as load-based splits will no longer rapidly disappear during transient throughput dips.
This issue described the back half of #62672.
After waiting out the GC TLL of a
TRUNCATE
orDROP TABLE
, a series ofClearRange
requests are issued to clear out space. This can cause a dramatic drop in write throughput on other tables that takes 10s of minutes to recover from. In the words of the customer that helped us diagnose this: "it's a write availability time bomb."Reproduction
The easiest way to reproduce this is to use the
clearrange/checks=false
roachtest, with a slight adaptation to run load during the cleanup:This configures the test to import a 2 TB table, ramp write load up on a different table over 30 minutes, clear the 2 TB table, and then observe the impact on the foreground traffic. The effect is not pretty:
Other notes
In attempting to reproduce, I also ran this with a configuration that looked more similar to the relevant customer's cluster - 20 nodes, 32 vCPU per node, 2 SSDs per node, and a 4 TB table. This looked very similar:
note: this is a 10 minute time scale, not 30 minute like above
Epic: CRDB-2627
The text was updated successfully, but these errors were encountered: