storage: avoid excessively wide range tombstones during Raft snapshot reception #44048
Constrains the width of the range deletion tombstone to the span of keys actually present within the range. If the range has no kv-entries, the rangedel is skipped completely. Before this change, when receiving a snapshot, the generated file would have a range deletion tombstone spanning the entire range written to it, regardless of the actual keys contained in the range or whether the range was empty. This resulted in the creation of excessively wide tombstones, which had significant performance implications since wide tombstones impede compaction. Fixes cockroachdb#44048. Release note: None.
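To make the fix concrete, here is a minimal sketch of the behavior the commit message describes. The iterator and writer interfaces are hypothetical stand-ins, not the actual storage engine API; the real change lives in the ClearRange path of snapshot reception.

```go
package sketch

import "bytes"

// iter and sstWriter are simplified stand-ins for the engine iterator and
// SST writer used during snapshot reception; they are not the real APIs.
type iter interface {
	SeekGE(key []byte) bool // position at the first key >= key; false if none
	SeekLT(key []byte) bool // position at the last key < key; false if none
	Key() []byte
}

type sstWriter interface {
	ClearRange(start, end []byte) error // writes a range deletion over [start, end)
}

// clearExistingData writes a range deletion covering only the keys actually
// present in [start, end), and writes nothing at all if that span is empty.
func clearExistingData(it iter, w sstWriter, start, end []byte) error {
	if !it.SeekGE(start) || bytes.Compare(it.Key(), end) >= 0 {
		return nil // the replica's key span is empty: skip the rangedel
	}
	first := append([]byte(nil), it.Key()...)
	it.SeekLT(end) // lands on the last key present, which is >= first
	last := append([]byte(nil), it.Key()...)
	// ClearRange's end bound is exclusive, so extend the last key by a zero
	// byte to keep it covered.
	return w.ClearRange(first, append(last, 0))
}
```

This covers both halves of the fix: the empty-range case produces no tombstone at all, and otherwise the tombstone is bounded by the first and last keys actually present rather than by the range descriptor's bounds.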
44725: storage: constrained span of rangedel in ClearRange to keys in range r=nvanbenschoten a=OwenQian

Constrains the width of the range deletion tombstone to the span of keys actually present within the range. If the range has no kv-entries, the rangedel is skipped completely. Before this change, when receiving a snapshot, the generated file would have a range deletion tombstone spanning the entire range written to it, regardless of the actual keys contained in the range or whether the range was empty. This resulted in the creation of excessively wide tombstones, which had significant performance implications since wide tombstones impede compaction. Fixes #44048. Rebased off #45100. Release note: None.

45157: sql: add inverted indexes on arrays r=jordanlewis a=jordanlewis

Closes #43199. This commit adds inverted index support to arrays. Inverted index entries are created from arrays by simply encoding a key that contains the array element's table key encoding. Nulls are not indexed, since in SQL, ARRAY[1, NULL] @> ARRAY[NULL] returns false. For example, in a table t(int, int[]) with an inverted index with id 3 on the int[] column, the row (10, [1, NULL, 2]) produces 2 index keys:

```
/tableId/3/1/10
/tableId/3/2/10
```

This encoding scheme is much simpler than the one for JSON, since arrays don't have "paths": their elements are simply ordinary datums. Release note (sql change): The inverted index implementation now supports indexing array columns. This permits accelerating containment queries (@> and <@) on array columns by adding an index to them.

45642: ui: Set react component `key` prop to fix react errors r=nathanstilwell a=koorosh

Set react component `key` prop to fix react errors. Resolves: #45188

Co-authored-by: Owen Qian <[email protected]>
Co-authored-by: Jordan Lewis <[email protected]>
Co-authored-by: Andrii Vorobiov <[email protected]>
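For the array inverted index encoding described in the 45157 commit above, a toy illustration of the key generation follows. The key format and helper are made up for readability and are not CockroachDB's actual key encoding.

```go
package main

import "fmt"

// invertedArrayIndexKeys is a toy illustration of the encoding described
// above, not CockroachDB's actual key encoding: one index key per non-NULL
// array element, containing the element followed by the row's primary key.
func invertedArrayIndexKeys(tableID, indexID, pk int, elems []*int) []string {
	var keys []string
	for _, e := range elems {
		if e == nil {
			continue // NULLs are not indexed: ARRAY[1, NULL] @> ARRAY[NULL] is false
		}
		keys = append(keys, fmt.Sprintf("/%d/%d/%d/%d", tableID, indexID, *e, pk))
	}
	return keys
}

func main() {
	one, two := 1, 2
	// The row (10, [1, NULL, 2]) with index id 3 produces two index keys,
	// mirroring the /tableId/3/1/10 and /tableId/3/2/10 example above
	// (with 53 standing in for the table id).
	fmt.Println(invertedArrayIndexKeys(53, 3, 10, []*int{&one, nil, &two}))
	// Output: [/53/3/1/10 /53/3/2/10]
}
```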
I was thinking about this change over the weekend and I'm becoming concerned that the fix in #44725 is unsafe. Specifically, I'm concerned about a hazard where a follower replica applies an entry to its replicated state machine between the time that it receives a snapshot and the time that it applies that snapshot such that the snapshot's range deletion tombstone is no longer sufficiently wide to clear all old state. Consider the case where a range has a raft log that is appending keys to the end of the keyspace and removing keys from the beginning of the keyspace. Raft log index 5 writes key 5 and deletes key 4, index 6 writes key 6 and deletes key 5, index k writes key k and deletes key k-1, etc. Now imagine that a follower has applied up to log index k and is being sent a snapshot that includes the materialized state of index k+10 (i.e. the index of the snapshot is k+10). At the point that the follower begins receiving the snapshot, we have the following states:
As of #44725, the follower will scan its own keyspace upon receipt of the snapshot and use it to generate a range deletion tombstone. In this case, its tombstone will cover a single key: k. So the SST it ends up ingesting will look like:
So far, so good. However, at this point, I don't think we have any firm guarantee that the follower actually applies the snapshot before applying any new entries in its log. Of course, it's very unlikely that it will receive any new entries between its current applied index and the snapshot index because the leader wouldn't be sending the follower a snapshot if it could catch it up from its log, but I think it's possible to construct such situations. For instance, what if leadership is transferred and the new leader does happen to have index k+1 in its log? It should try to catch the follower up by appending k+1 to the follower's log. So in this case, we might run into a problem where the follower applies log index k+1 and then applies the snapshot at k+10. By the time it applies the snapshot, the snapshot's range deletion tombstone will no longer cover the entire range:
Now the follower is in an inconsistent state! So for this change to be safe, we must guarantee that the applied index of the follower stays constant between the time that it constructs its snapshot sst (using its own state to determine the range del boundaries) and the time that it applies that snapshot sst. I don't think we have that guarantee today. I put together a patch that tests for this here and haven't seen it fire yet while running unit tests, but that alone doesn't give me a ton of confidence. I'd like to get @bdarnell and @tbg's thoughts on this. Am I missing something? Do we have a ticking time bomb here?
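To make the hazard concrete, here is a small, self-contained toy simulation of the scenario above. It is plain Go over an integer key space, not CockroachDB code: the follower applies one more entry after the snapshot SST's tombstone bounds were captured, and a stale key survives ingestion.

```go
package main

import "fmt"

// engine is a toy replicated state machine: the set of live integer keys.
type engine map[int]bool

// applyEntry models raft entry i, which writes key i and deletes key i-1.
func applyEntry(e engine, i int) {
	e[i] = true
	delete(e, i-1)
}

// sst models the snapshot sstable: a range deletion over [delLo, delHi]
// plus the keys materialized by the snapshot.
type sst struct {
	delLo, delHi int
	keys         []int
}

// buildSnapshotSST captures the follower's *current* key span as the
// tombstone bounds (the #44725 behavior) plus the snapshot's keys.
func buildSnapshotSST(e engine, snapIndex int) sst {
	lo, hi := -1, -1
	for k := range e {
		if lo == -1 || k < lo {
			lo = k
		}
		if hi == -1 || k > hi {
			hi = k
		}
	}
	return sst{delLo: lo, delHi: hi, keys: []int{snapIndex}}
}

// ingest applies the range deletion and then adds the snapshot's keys.
func ingest(e engine, s sst) {
	for k := range e {
		if k >= s.delLo && k <= s.delHi {
			delete(e, k)
		}
	}
	for _, k := range s.keys {
		e[k] = true
	}
}

func main() {
	const k, snap = 100, 110
	e := engine{}
	for i := 1; i <= k; i++ {
		applyEntry(e, i) // follower has applied up to index k; only key k is live
	}
	s := buildSnapshotSST(e, snap) // tombstone covers only key k

	applyEntry(e, k+1) // the hazard: entry k+1 applies before the snapshot does

	ingest(e, s)
	fmt.Println(e) // map[101:true 110:true]: stale key k+1 survives ingestion
}
```

The correct post-snapshot state would contain only key 110; the leftover key 101 is exactly the inconsistency described above.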
I'm also not aware of such a guarantee. This does look like a potential issue.
My thinking is that we should never do anything with the range ending in `/Max`. Maybe the answer is to special case the empty range.
Huh? I haven't looked at this code in a long time, but I would expect appending Raft log entries to be mutually exclusive with snapshot application. Is that not true? Or perhaps it is true, but the follower is scanning its key space too early.
Receiving and applying snapshots have become spread out over time, and we generate this range tombstone fairly early in the process. It's not obvious to me that we're blocking everything else on the range for the entire duration (but we might be). Scanning the keyspace later in the process would be one solution, but I think it would require rewriting the sst files, adding IO pressure.
This does seem like a real issue, best illustrated by this code: `pkg/kv/kvserver/store_snapshot.go`, lines 838 to 844 at commit 37a1bd1.
A straightforward way to hit this today could be when a follower is catching up through a backlog of committed entries while receiving a snapshot. In that case, the applied index would go up rapidly (but the snapshot might still be necessary). I don't think we're particularly likely to hit this issue today, though leaving the bug in is certainly not an option. As Nathan suggested, we can use the quick fix of discarding the snapshot at apply time if we find that the applied index has changed (which in practice we expect to "never" be the case). Or we re-compute the constrained span under the right lock and use that as a (better) signal. Either way, a loud message plus Sentry telemetry is in order when it happens, as we'll want to know.

Quick fix aside, how should this work? I don't like seeing subtly wrong code plus a band-aid in the long term (and chances are that if we don't fix it soon, we never will). Can we leave the SSTs "open" so that we can add the tombstone "at the bottom" at apply time? I suppose the answer is "no" since the deletion tombstone needs to come first to avoid shadowing the rest of the SST (unless we add some kind of "TombstoneBehind" operation)? But also, we can't hope to fix up the existing SSTs at apply time if needed, since the new keys that pop up may overlap with data ingested in the SSTs, and so very fine-grained surgery would be needed.
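A minimal sketch of that quick fix, with made-up names rather than the actual Replica fields or snapshot plumbing: record the applied index observed when the SSTs and their tombstone bounds are built, and reject the snapshot at apply time if it has moved.

```go
package sketch

import "fmt"

// checkTombstoneBoundsStillValid is a hypothetical helper: boundsIndex is
// the applied index observed when the snapshot SSTs (and their range
// deletion bounds) were constructed, currentIndex is the replica's applied
// index at apply time, read under the appropriate lock.
func checkTombstoneBoundsStillValid(boundsIndex, currentIndex uint64) error {
	if currentIndex != boundsIndex {
		// Per the discussion, this should be loud (log + telemetry), since
		// in practice it is expected to "never" happen.
		return fmt.Errorf("applied index moved from %d to %d since snapshot "+
			"SST construction; tombstones may be too narrow, discarding snapshot",
			boundsIndex, currentIndex)
	}
	return nil
}
```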
Every record in an ingested sstable occurs at the same point in time, and there is special-case logic so that range tombstones are considered to happen "before" any point operation. While the point operations have to be added in order as the sstable is being built, the range tombstone can be added at a later point. So your suggestion to leave the sstable "open" until the snapshot is applied could be done without any change to sstable ingestion. I'm not pushing for this approach, merely indicating that it could be done.
Interesting, that's good to know. Seems like a sane way of doing things, but curious if there are other alternatives to consider.
I wonder if this is still a problem given that we prioritize compaction of range tombstones and perform compaction of L6 tables in order to drop range tombstones. @jbowens can you put some thinking into this? We may also want to perform some experiments to see if this remains a problem. If we ingest an sstable into L6 that has a wide range tombstone, how quickly will we compact it if there are other concurrent ingestions taking place?
I think today we'll rewrite these sstables to drop the wide range tombstones as soon as we don't have score-based compactions. This code will set the file's
In this case, the

I agree it would be worthwhile to do some experiments (sorry I didn't get to them before my storage team hiatus!). If we ever pursue cockroachdb/pebble#25, maybe we could incorporate clearing key spans as an optional part of the ingestion operation. We'd only need to write the range tombstones as a part of the WAL batch if they are indeed in-use. This would force the ingestions into L0 if the span is in-use anywhere within the LSM though.
Spoke with @jbowens and we're going to close this because we haven't seen the support ticket load we used to see for this issue since we changed the range tombstone compaction priority.
Going to reopen this issue, since it’s still somewhat relevant for wide range tombstones ingested into L6. We’re hoping to remove the range tombstones from ingested snapshot sstables as a part of the virtual sstable work by allowing for an Ingest operation that ‘excises’ existing data in a span through virtualizing overlapping sstables.
Previously, a file in L6 that contained range deletion(s) was assumed to contain range deletions due to open snapshots at the time it was written. Table stats, and in particular the RangeDeletionBytesEstimate value, were calculated assuming the range deletion dropped all keys within the sstable that fall within the range deletion's bounds. This is a decent heuristic for when snapshots were indeed open, preventing the elision of the range deletion.

However, in CockroachDB, KV snapshot reception writes range deletions to each of the ingested sstables (see cockroachdb/cockroach#44048). These range deletions delete nothing within the tables themselves. They exist only to clear out existing state below the table in the LSM. In these cases, the RangeDeletionBytesEstimate value would be a gross overestimate. Compacting the table would only drop the tombstones themselves, and none of the actual range data contained in point blocks.

This commit adjusts the logic for calculating RangeDeletionBytesEstimate to take the range deletion's sequence number into account. This small change will prevent sstables ingested through Cockroach snapshot reception from having a RangeDeletionBytesEstimate that incorrectly sums all data blocks. This has two consequences: a) in times of low compaction pressure, we will no longer rewrite the ingested sstables through elision-only compactions simply to remove the tombstones; b) when compacting from L5 into L6, we will no longer artificially deflate these sstables' sizes when calculating the min-overlapping-ratio heuristic. Together, these effects should reduce write amplification/bandwidth, especially in the presence of significant rebalancing.
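A simplified sketch of the distinction the commit describes, with made-up types rather than Pebble's actual table-stats code: a range deletion can only drop point data at strictly lower sequence numbers, and in an ingested sstable the tombstone and the table's own points share one sequence number, so the tombstone contributes nothing to the bytes-dropped estimate for that table.

```go
package sketch

// pointSpan is a simplified stand-in for a run of point data overlapped by a
// range deletion; seqNum is the sequence number of the newest key in it.
type pointSpan struct {
	seqNum uint64
	bytes  uint64
}

type rangeDel struct {
	seqNum uint64
}

// rangeDelBytesEstimate counts only data the tombstone can actually shadow:
// point data at strictly lower sequence numbers. For an sstable ingested
// during snapshot reception, every overlapping span within the same table
// has seqNum == t.seqNum, so the estimate for that table is zero.
func rangeDelBytesEstimate(t rangeDel, overlapping []pointSpan) uint64 {
	var est uint64
	for _, s := range overlapping {
		if s.seqNum < t.seqNum {
			est += s.bytes
		}
	}
	return est
}
```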
Going to close this out now that we have excises.
`kvBatchSnapshotStrategy` blindly adds a range tombstone to the sstables it generates during Raft snapshot reception for each of the 3 key spans for a range. For most ranges this is perfectly fine, but for the last range in the key space this ends up adding a range tombstone over `[<start>, /Max]`. This key range overlaps with any future ranges split off the end.

Why is this a problem? This wide range tombstone acts as a "block" in the RocksDB/Pebble LSM, preventing ingestion into a level. We see this in TPCC imports. At startup, the range ending in `/Max` is upreplicated from n1 to two other nodes. Those other nodes ingest an sstable with a range tombstone ending in `/Max`. Subsequently, this range is split many times for the import, and when the import tries to ingest sstables on these follower nodes, the ingestion hits L0 rather than L6. This in turn causes increased compaction pressure, allowing more sstables to build up in L0. The evidence so far is that a downward spiral results.

While `kvBatchSnapshotStrategy` appears to be the proximate culprit of such wide range tombstones, @nvanbenschoten speculates that `Replica.clearSubsumedReplicaDiskData` could have this same problem.

The suggestion is to introduce some additional checks to narrow the scope of the range tombstone. Specifically, `Store.receiveSnapshot` can create an iterator and `SeekLT` on the end key of the range. The last key in the range would then be used to lower the upper bound of the range tombstone, rather than blindly using the upper bound of the range.

Cc @nvanbenschoten, @OwenQian