storage: avoid excessively wide range tombstones during Raft snapshot reception #44048

Closed
petermattis opened this issue Jan 15, 2020 · 20 comments · Fixed by #44725
Assignees
Labels
A-storage Relating to our storage engine (Pebble) on-disk storage. C-performance Perf of queries or internals. Solution not expected to change functional behavior. T-storage Storage Team

Comments

@petermattis
Collaborator

petermattis commented Jan 15, 2020

kvBatchSnapshotStrategy blindly adds a range tombstone to the sstables it generates during Raft snapshot reception, one for each of the three key spans that make up a range. For most ranges this is perfectly fine, but for the last range in the key space it ends up adding a range tombstone spanning [<start>, /Max]. That key range overlaps with any future ranges split off the end.

Why is this a problem? The wide range tombstone acts as a "block" in the RocksDB/Pebble LSM, preventing ingestion into a low level. We see this in TPCC imports. At startup, the range ending in /Max is upreplicated from n1 to two other nodes. Those other nodes ingest an sstable with a range tombstone ending in /Max. Subsequently, this range is split many times during the import, and when the import tries to ingest sstables on these follower nodes, the ingestion hits L0 rather than L6. This in turn causes increased compaction pressure, allowing more sstables to build up in L0. The evidence so far suggests a downward spiral results.
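For context on why the overlap blocks low-level ingestion, here is a minimal sketch of how a target level is chosen for an ingested sstable; the function and parameter names are illustrative and this is not Pebble's actual code:

```go
// ingestTargetLevel sketches target-level selection for an ingested sstable:
// starting at the base level, push the file as low as possible until its key
// range overlaps existing data, because the ingested keys receive the newest
// sequence number and must sit above anything they shadow. A wide range
// tombstone in L6 that overlaps the file's bounds therefore forces the
// ingestion into a higher level (possibly L0).
func ingestTargetLevel(overlaps func(level int, start, end []byte) bool, baseLevel int, start, end []byte) int {
	const numLevels = 7
	target := 0
	for level := baseLevel; level < numLevels; level++ {
		if overlaps(level, start, end) {
			break
		}
		target = level
	}
	return target
}
```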

While kvBatchSnapshotStrategy appears to be the proximate culprit of such wide range tombstones, @nvanbenschoten speculates that Replica.clearSubsumedReplicaDiskData could have this same problem.

The suggestion is to introduce some additional checks to narrow the scope of the range tombstone. Specifically, Store.receiveSnapshot can create an iterator and SeekLT on the end key of the range. The last key actually present in the range can then be used to lower the upper bound of the range tombstone, rather than blindly using the end key of the range.
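A minimal sketch of that check, using Pebble's iterator API (the helper name and its integration into Store.receiveSnapshot are assumptions, and the snippet presumes the usual bytes and github.com/cockroachdb/pebble imports):

```go
// constrainRangeDelEnd returns a tighter exclusive upper bound for the range
// deletion tombstone: the immediate successor of the last key actually present
// in [start, end). If no such key exists, ok is false and the tombstone can be
// skipped entirely.
func constrainRangeDelEnd(it *pebble.Iterator, start, end []byte) (newEnd []byte, ok bool) {
	// SeekLT positions the iterator at the largest key strictly less than end.
	if !it.SeekLT(end) {
		return nil, false // engine has nothing below end
	}
	last := it.Key()
	if bytes.Compare(last, start) < 0 {
		return nil, false // no keys within [start, end)
	}
	// Append a zero byte to form the smallest key strictly greater than last,
	// and use it as the tombstone's end instead of the range's end key.
	newEnd = append(append([]byte(nil), last...), 0x00)
	return newEnd, true
}
```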

Cc @nvanbenschoten, @OwenQian

@petermattis petermattis added the A-kv-replication Relating to Raft, consensus, and coordination. label Jan 15, 2020
OwenQian pushed a commit to OwenQian/cockroach that referenced this issue Feb 5, 2020
Constrains the width of the range deletion tombstone to the span of keys
actually present within the range. If the range has no kv-entries, then skip the
rangedel completely.

Before this change, when receiving a snapshot, the original file
would have a range deletion tombstone that spanned the entire range written to
it regardless of the actual keys contained in the range or if the range was
empty. This resulted in the creation of excessively wide tombstones, which has
significant performance implications since the wide tombstones impede
compaction.

Fixes cockroachdb#44048.

Release note: None.
OwenQian pushed a commit to OwenQian/cockroach that referenced this issue Feb 6, 2020
Constrains the width of the range deletion tombstone to the span of keys actually present within the range.
OwenQian pushed a commit to OwenQian/cockroach that referenced this issue Feb 27, 2020
Constrains the width of the range deletion tombstone to the span of keys actually present within the range.

address review comments

Update documentation

Update TestStoreRangeMergeRaftSnapshot to constrain range of expected tombstone.

Constrain width of range deletion tombstone for subsumed ranges
OwenQian pushed a commit to OwenQian/cockroach that referenced this issue Feb 27, 2020
Constrains the width of the range deletion tombstone to the span of keys actually present within the range.

Update documentation

Constrain width of range deletion tombstone for subsumed ranges
OwenQian pushed a commit to OwenQian/cockroach that referenced this issue Mar 4, 2020
Constrains the width of the range deletion tombstone to the span of keys actually present within the range.

Update documentation

Constrain width of range deletion tombstone for subsumed ranges

Address nathan's reviews
OwenQian pushed a commit to OwenQian/cockroach that referenced this issue Mar 5, 2020
Constrains the width of the range deletion tombstone to the span of keys actually present within the range.

Update documentation.

Constrain width of range deletion tombstone for subsumed ranges.

Address nathan's reviews.

Address comments and fix pkg renaming issues with engine to storage.
craig bot pushed a commit that referenced this issue Mar 6, 2020
44725: storage: constrained span of rangedel in ClearRange to keys in range r=nvanbenschoten a=OwenQian

Constrains the width of the range deletion tombstone to the span of keys
actually present within the range. If the range has no kv-entries, then skip the
rangedel completely.

Before this change, when receiving a snapshot, the original file
would have a range deletion tombstone that spanned the entire range written to
it regardless of the actual keys contained in the range or if the range was
empty. This resulted in the creation of excessively wide tombstones, which has
significant performance implications since the wide tombstones impede
compaction.

Fixes #44048.

Rebased off #45100.

Release note: None.

45157: sql: add inverted indexes on arrays r=jordanlewis a=jordanlewis

Closes #43199.

This commit adds inverted index support to arrays. Inverted index
entries are created from arrays by simply encoding a key that contains
the array element's table key encoding. Nulls are not indexed, since in
SQL, ARRAY[1, NULL] @> ARRAY[NULL] returns false.

For example, in a table t(int, int[]) with an inverted index with id 3
on the int[] column the row (10, [1, NULL, 2]) produces 2 index keys:

```
/tableId/3/1/10
/tableId/3/2/10
```

This encoding scheme is much simpler than the one for JSON, since arrays
don't have "paths": their elements are simply ordinary datums.

Release note (sql change): The inverted index implementation now
supports indexing array columns. This permits accelerating containment
queries (@> and <@) on array columns by adding an index to them.

45642: ui: Set react component `key` prop to fix react errors r=nathanstilwell a=koorosh

Set react component `key` prop to fix react errors

Resolves: #45188

Co-authored-by: Owen Qian <[email protected]>
Co-authored-by: Jordan Lewis <[email protected]>
Co-authored-by: Andrii Vorobiov <[email protected]>
@craig craig bot closed this as completed in 37a1bd1 Mar 6, 2020
@nvanbenschoten
Member

I was thinking about this change over the weekend and I'm becoming concerned that the fix in #44725 is unsafe. Specifically, I'm concerned about a hazard where a follower replica applies an entry to its replicated state machine between the time that it receives a snapshot and the time that it applies that snapshot such that the snapshot's range deletion tombstone is no longer sufficiently wide to clear all old state.

Consider the case where a range has a raft log that is appending keys to the end of the keyspace and removing keys from the beginning of the keyspace. Raft log index 5 writes key 5 and deletes key 4, index 6 writes key 6 and deletes key 5, index k writes key k and deletes key k-1, etc. Now imagine that a follower has applied up to log index k and is being sent a snapshot that includes the materialized state of index k+10 (i.e. the index of the snapshot is k+10). At the point that the follower begins receiving the snapshot, we have the following states:

range desc:        [0, 9999)
leader keyspace:   [{k+10}]
follower keyspace: [{k}]
snapshot keys:     [{k+10}]

As of #44725, the follower will scan its own keyspace upon receipt of the snapshot and use it to generate a range deletion tombstone. In this case, its tombstone will cover a single key: k. So the SST it ends up ingesting will look like:

snapshot sst:
  put: k+10
  delete range: [k, k+1)

So far, so good.

However, at this point, I don't think we have any firm guarantee that the follower actually applies the snapshot before applying any new entries in its log. Of course, it's very unlikely that it will receive any new entries between its current applied index and the snapshot index, because the leader wouldn't be sending the follower a snapshot if it could catch it up from its log, but I think it's possible to construct such situations. For instance, what if leadership is transferred and the new leader does happen to have index k+1 in its log? It should try to catch the follower up by appending k+1 to the follower's log.

So in this case, we might run into a problem where the follower applies log index k+1 and then applies the snapshot at k+10. By the time it applies the snapshot, the snapshot's range deletion tombstone will no longer cover the entire range:

follower keyspace @ index k:    [{k}]
follower keyspace @ index k+1 after applying entry k+1: [{k+1}]
follower keyspace @ index k+10 after applying snapshot: [{k+1}, {k+10}]

Now the follower is in an inconsistent state!

So for this change to be safe, we must guarantee that the applied index of the follower stays constant between the time that it constructs its snapshot sst (using its own state to determine the range del boundaries) and the time that it applies that snapshot sst. I don't think we have that guarantee today. I put together a patch that tests for this here and haven't seen it fire yet while running unit tests, but that alone doesn't give me a ton of confidence.

I'd like to get @bdarnell and @tbg's thoughts on this. Am I missing something? Do we have a ticking time bomb here?

@nvanbenschoten nvanbenschoten reopened this Apr 6, 2020
@bdarnell
Contributor

bdarnell commented Apr 7, 2020

However, at this point, I don't think we have any firm guarantee that the follower actually applies the snapshot before applying any new entries in its log

I'm also not aware of such a guarantee. This does look like a potential issue.

At startup, the range ending in /Max is upreplicated from n1 to two other nodes.

My thinking is that we should never do anything with the range ending in /Max except to split it. And we pre-split ranges during import, so the problem here is about the original last range being up-replicated before the import splits, right?

Maybe the answer is to special case the empty /Max range and skip/shrink the range tombstone for that range, but still use tombstones out to the range boundary everywhere else.

@petermattis
Collaborator Author

However, at this point, I don't think we have any firm guarantee that the follower actually applies the snapshot before applying any new entries in its log.

Huh? I haven't looked at this code in a long time, but I would expect appending Raft log entries to be mutually exclusive with snapshot application. Is that not true? Or perhaps it is true, but the follower is scanning its key space too early.

@bdarnell
Contributor

bdarnell commented Apr 7, 2020

Receiving and applying snapshots has become spread out over time, and we generate this range tombstone fairly early in the process. It's not obvious to me that we're blocking everything else on the range for the entire duration (but we might be). Scanning the keyspace later in the process would be one solution, but I think it would require rewriting the sst files, adding additional IO pressure.

@tbg
Member

tbg commented Apr 8, 2020

This does seem like a real issue. Best illustrated by this code:

inSnap, err := ss.Receive(ctx, stream, *header)
if err != nil {
	return err
}
if err := s.processRaftSnapshotRequest(ctx, header, inSnap); err != nil {
	return sendSnapshotError(stream, errors.Wrap(err.GoError(), "failed to apply snapshot"))
}

ss.Receive() receives the snapshot, and also calls ConstrainToKeys. When Receive() returns, no locks whatsoever are held. Receiving the snapshot simply isn't the right time to look at the engine (it's external I/O - we don't hold locks then - don't think we ever did after code yellow?). It's applying the snapshot that has the proper synchronization (and "stops the world" as far as that replica is concerned).

A straightforward way to hit this today could be when a follower is catching up through a backlog of committed entries while receiving a snapshot. In that case, the applied index would go up rapidly (but the snapshot might still be necessary).
Similarly, I would be worried that this bug would stab us in the back in subtle ways the moment we started getting into asynchronous entry application (if we ever did).

I don't think we're particularly likely to hit this issue today, though leaving the bug in is certainly not an option. As Nathan suggested we can use the quick fix of discarding the snapshot at apply time if we find that the applied index has changed (which in practice we expect to "never" be the case). Or we re-compute the constrained span under the right lock and use that as a (better) signal. Either way, a loud message plus sentry telemetry is in order when it happens as we'll want to know.
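As a rough illustration of that quick fix, here is a hedged sketch; the types, field names, and error handling are hypothetical stand-ins (assuming a plain fmt import), not the real CockroachDB structs:

```go
// IncomingSnapshot and Replica are simplified, hypothetical versions of the
// real types, which carry far more state.
type IncomingSnapshot struct {
	// Applied index observed when the constrained rangedel was computed.
	AppliedIndexAtReceive uint64
	// ... ingested SST paths, range descriptor, etc.
}

type Replica struct {
	appliedIndex uint64
}

// applySnapshot discards the snapshot if the replica has applied any entry
// since the snapshot SSTs (and their constrained tombstones) were built,
// because the tombstones may no longer cover all local state. Raft will
// retry with a fresh snapshot.
func (r *Replica) applySnapshot(snap IncomingSnapshot) error {
	if r.appliedIndex != snap.AppliedIndexAtReceive {
		return fmt.Errorf("discarding snapshot: applied index moved from %d to %d since receipt",
			snap.AppliedIndexAtReceive, r.appliedIndex)
	}
	// ... ingest the snapshot SSTs here ...
	return nil
}
```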

Quick fix aside, how should this work? I don't like seeing subtly wrong code plus a band-aid in the long term (and chances are that if we don't fix it soon, we never will). Can we leave the SSTs "open" so that we can add the tombstone "at the bottom" at apply time? I suppose the answer is "no", since the deletion tombstone needs to come first to avoid shadowing the rest of the SST (unless we add some kind of "TombstoneBehind" operation)? But also, we can't hope to fix up the existing SSTs at apply time if needed, since the new keys that pop up may overlap with data ingested in the SSTs, and so very fine-grained surgery would be needed.

@petermattis
Collaborator Author

Quick fix aside, how should this work? I don't like seeing subtly wrong code plus a band-aid in the long term (and chances are that if we don't fix it soon, we never will). Can we leave the SSTs "open" so that we can add the tombstone "at the bottom" at apply time? I suppose the answer is "no", since the deletion tombstone needs to come first to avoid shadowing the rest of the SST (unless we add some kind of "TombstoneBehind" operation)? But also, we can't hope to fix up the existing SSTs at apply time if needed, since the new keys that pop up may overlap with data ingested in the SSTs, and so very fine-grained surgery would be needed.

Every record in an ingested sstable occurs at the same point in time, and there is special-case logic so that range tombstones are considered to happen "before" any point operation. The point operations have to be added in order while the sstable is being built, but the range tombstone can be added at a later point. So your suggestion to leave the sstable "open" until the snapshot is applied could be done without any change to sstable ingestion. I'm not pushing for this approach, merely indicating that it could be done.

@tbg
Member

tbg commented Apr 8, 2020

Interesting, that's good to know. Seems like a sane way of doing things, but curious if there are other alternatives to consider.

@nvanbenschoten nvanbenschoten added A-storage Relating to our storage engine (Pebble) on-disk storage. and removed A-kv-replication Relating to Raft, consensus, and coordination. labels Sep 10, 2020
@nvanbenschoten nvanbenschoten removed their assignment Sep 10, 2020
@petermattis petermattis added the C-performance Perf of queries or internals. Solution not expected to change functional behavior. label Sep 10, 2020
@petermattis
Collaborator Author

I wonder if this is still a problem given that we prioritize compaction of range tombstones and perform compaction of L6 tables in order to drop range tombstones. @jbowens can you put some thinking into this? We may also want to perform some experiments to see if this remains a problem. If we ingest an sstable into L6 that has a wide range tombstone, how quickly will we compact it if there are other concurrent ingestions taking place?

@jbowens
Collaborator

jbowens commented Jul 6, 2021

I wonder if this is still a problem given that we prioritize compaction of range tombstones and perform compaction of L6 tables in order to drop range tombstones. @jbowens can you put some thinking into this? We may also want to perform some experiments to see if this remains a problem. If we ingest an sstable into L6 that has a wide range tombstone, how quickly will we compact it if there are other concurrent ingestions taking place?

I think today we'll rewrite these sstables to drop the wide range tombstones as soon as there are no score-based compactions to run. This code will set the file's RangeDeletionsBytesEstimate to the sum of all the data block sizes, which will trigger an elision-only compaction when there aren't score-based compactions:

				// If the file is in the last level of the LSM, there is no
				// data beneath it. The fact that there is still a range
				// tombstone in a bottommost file suggests that an open
				// snapshot kept the tombstone around. Estimate disk usage
				// within the file itself.
				if level == numLevels-1 {
					size, err := r.EstimateDiskUsage(startUserKey, endUserKey)
					if err != nil {
						return err
					}
					stats.RangeDeletionsBytesEstimate += size
					return nil
				}

In this case, the RangeDeletionsBytesEstimate is very far off, because the tombstone ingested within the sstable won't drop anything at all. We don't account for the fact that range deletions within an ingested table don't drop data at their own sequence number, which means we do end up rewriting these sstables just to remove the wide range deletions.
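To spell out the sequence-number rule being described (names here are illustrative, not Pebble's):

```go
// rangeDelCanDropTableData reports whether a range deletion can shadow point
// keys within the same sstable: only if its seqnum is strictly greater than
// theirs. In an ingested snapshot sstable every entry shares one seqnum, so
// the tombstone drops nothing inside the table and should not inflate
// RangeDeletionsBytesEstimate.
func rangeDelCanDropTableData(rangeDelSeqNum, smallestPointSeqNum uint64) bool {
	return rangeDelSeqNum > smallestPointSeqNum
}
```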

I agree it would be worthwhile to do some experiments (sorry I didn't get to them before my storage team hiatus!).

If we ever pursue cockroachdb/pebble#25, maybe we could incorporate clearing key spans as an optional part of the ingestion operation. We'd only need to write the range tombstones as a part of the WAL batch if they are indeed in-use. This would force the ingestions into L0 if the span is in-use anywhere within the LSM though.

@mwang1026

Spoke with @jbowens and we're going to close this because we haven't seen the support ticket load we used to see for this issue since we changed the range tombstone compaction priority.

@jbowens
Collaborator

jbowens commented Nov 16, 2022

Going to reopen this issue, since it’s still somewhat relevant for wide range tombstones ingested into L6. We’re hoping to remove the range tombstones from ingested snapshot sstables as a part of the virtual sstable work by allowing for an Ingest operation that ‘excises’ existing data in a span through virtualizing overlapping sstables.

@jbowens jbowens reopened this Nov 16, 2022
jbowens added a commit to jbowens/pebble that referenced this issue Mar 22, 2023
Previously, a file in L6 that contained range deletion(s) was assumed to
contain range deletions due to open snapshots at the time it was written. Table
stats, and in particular the RangeDeletionBytesEstimate value, was calculated
assuming the range deletion dropped all keys within the sstable that fall
within the range deletion's bounds. This is a decent heuristic for when
snapshots were indeed open, preventing the elision of the range deletion.

However, in Cockroach KV snapshot reception writes range deletions to each of
the ingested sstables (see cockroachdb/cockroach#44048). These range deletions
delete nothing within the tables themselves. They exist only to clear out
existing state below the table in the LSM. In these cases, the
RangeDeletionBytesEstimate value would be a gross overestimate. Compacting the
table would only drop the tombstones themselves, and none of the actual range
data contained in point blocks.

This commit adjusts the logic for calculating RangeDeletionBytesEstimate to
take the range deletion sequence number into account. This small change will
prevent sstables ingested through Cockroach snapshot reception from having a
RangeDeletionBytesEstimate that incorrectly sums all data blocks. This has two
consequences:

a) In times of low compaction pressure, we will no longer rewrite the ingested
sstables through elision-only compactions simply to remove the tombstones.
b) When compacting from L5 into L6, we will no longer artificially deflate
these sstable's size when calculating the min-overlapping ratio heuristic.

Together, these effects should reduce write amplification/bandwidth, especially
in the presence of significant rebalancing.
jbowens added a commit to jbowens/pebble that referenced this issue Mar 28, 2023
jbowens added a commit to cockroachdb/pebble that referenced this issue Mar 28, 2023
@jbowens
Collaborator

jbowens commented Apr 12, 2024

Going to close this out now that we have excises.
