storage: avoid excessively wide range tombstones during Raft snapshot reception #44048

Closed
petermattis opened this issue Jan 15, 2020 · 20 comments · Fixed by #44725
Assignees
Labels
A-storage Relating to our storage engine (Pebble) on-disk storage. C-performance Perf of queries or internals. Solution not expected to change functional behavior. T-storage Storage Team

Comments

@petermattis
Collaborator

petermattis commented Jan 15, 2020

kvBatchSnapshotStrategy blindly adds a range tombstone to the sstables it generates during Raft snapshot reception, one for each of the three key spans that make up a range. For most ranges this is perfectly fine, but for the last range in the key space it ends up adding a range tombstone spanning [<start>, /Max]. That key range overlaps with any future ranges split off the end.

Why is this a problem? The wide range tombstone acts as a "block" in the RocksDB/Pebble LSM, preventing ingestion into a low level. We see this in TPCC imports. At startup, the range ending in /Max is upreplicated from n1 to two other nodes. Those other nodes ingest an sstable with a range tombstone ending in /Max. Subsequently, this range is split many times during the import, and when the import tries to ingest sstables on these follower nodes, the ingestion hits L0 rather than L6. This in turn causes increased compaction pressure, allowing more sstables to build up in L0. The evidence so far suggests a downward spiral results.
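For context on why the overlap blocks low-level ingestion, here is a minimal sketch of how a target level is chosen for an ingested sstable; the function and parameter names are illustrative and this is not Pebble's actual code:

```go
// ingestTargetLevel sketches target-level selection for an ingested sstable:
// starting at the base level, push the file as low as possible until its key
// range overlaps existing data, because the ingested keys receive the newest
// sequence number and must sit above anything they shadow. A wide range
// tombstone in L6 that overlaps the file's bounds therefore forces the
// ingestion into a higher level (possibly L0).
func ingestTargetLevel(overlaps func(level int, start, end []byte) bool, baseLevel int, start, end []byte) int {
	const numLevels = 7
	target := 0
	for level := baseLevel; level < numLevels; level++ {
		if overlaps(level, start, end) {
			break
		}
		target = level
	}
	return target
}
```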

While kvBatchSnapshotStrategy appears to be the proximate culprit of such wide range tombstones, @nvanbenschoten speculates that Replica.clearSubsumedReplicaDiskData could have this same problem.

The suggestion is to introduce some additional checks to narrow the scope of the range tombstone. Specifically, Store.receiveSnapshot can create an iterator and SeekLT on the end key of the range. The last key actually present in the range can then be used to lower the upper bound of the range tombstone, rather than blindly using the end key of the range.
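A minimal sketch of that check, using Pebble's iterator API (the helper name and its integration into Store.receiveSnapshot are assumptions, and the snippet presumes the usual bytes and github.com/cockroachdb/pebble imports):

```go
// constrainRangeDelEnd returns a tighter exclusive upper bound for the range
// deletion tombstone: the immediate successor of the last key actually present
// in [start, end). If no such key exists, ok is false and the tombstone can be
// skipped entirely.
func constrainRangeDelEnd(it *pebble.Iterator, start, end []byte) (newEnd []byte, ok bool) {
	// SeekLT positions the iterator at the largest key strictly less than end.
	if !it.SeekLT(end) {
		return nil, false // engine has nothing below end
	}
	last := it.Key()
	if bytes.Compare(last, start) < 0 {
		return nil, false // no keys within [start, end)
	}
	// Append a zero byte to form the smallest key strictly greater than last,
	// and use it as the tombstone's end instead of the range's end key.
	newEnd = append(append([]byte(nil), last...), 0x00)
	return newEnd, true
}
```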

Cc @nvanbenschoten, @OwenQian

@petermattis petermattis added the A-kv-replication Relating to Raft, consensus, and coordination. label Jan 15, 2020
OwenQian pushed a commit to OwenQian/cockroach that referenced this issue Feb 5, 2020
Constrains the width of the range deletion tombstone to the span of keys
actually present within the range. If the range has no kv-entries, then skip the
rangedel completely.

Before this change, when receiving a snapshot, the original file
would have a range deletion tombstone that spanned the entire range written to
it regardless of the actual keys contained in the range or if the range was
empty. This resulted in the creation of excessively wide tombstones, which has
significant performance implications since the wide tombstones impede
compaction.

Fixes cockroachdb#44048.

Release note: None.
OwenQian pushed a commit to OwenQian/cockroach that referenced this issue Feb 6, 2020
Constrains the width of the range deletion tombstone to the span of keys actually present within the range.
OwenQian pushed a commit to OwenQian/cockroach that referenced this issue Feb 27, 2020
Constrains the width of the range deletion tombstone to the span of keys actually present within the range.

address review comments

Update documentation

Update TestStoreRangeMergeRaftSnapshot to constrain range of expected tombstone.

Constrain width of range deletion tombstone for subsumed ranges
OwenQian pushed a commit to OwenQian/cockroach that referenced this issue Feb 27, 2020
Constrains the width of the range deletion tombstone to the span of keys actually present within the range.

Update documentation

Constrain width of range deletion tombstone for subsumed ranges
OwenQian pushed a commit to OwenQian/cockroach that referenced this issue Mar 4, 2020
Constrains the width of the range deletion tombstone to the span of keys actually present within the range.

Update documentation

Constrain width of range deletion tombstone for subsumed ranges

Address nathan's reviews
OwenQian pushed a commit to OwenQian/cockroach that referenced this issue Mar 5, 2020
Constrains the width of the range deletion tombstone to the span of keys actually present within the range.

Update documentation.

Constrain width of range deletion tombstone for subsumed ranges.

Address nathan's reviews.

Address comments and fix pkg renaming issues with engine to storage.
craig bot pushed a commit that referenced this issue Mar 6, 2020
44725: storage: constrained span of rangedel in ClearRange to keys in range r=nvanbenschoten a=OwenQian

Constrains the width of the range deletion tombstone to the span of keys
actually present within the range. If the range has no kv-entries, then skip the
rangedel completely.

Before this change, when receiving a snapshot, the original file
would have a range deletion tombstone that spanned the entire range written to
it regardless of the actual keys contained in the range or if the range was
empty. This resulted in the creation of excessively wide tombstones, which has
significant performance implications since the wide tombstones impede
compaction.

Fixes #44048.

Rebased off #45100.

Release note: None.

45157: sql: add inverted indexes on arrays r=jordanlewis a=jordanlewis

Closes #43199.

This commit adds inverted index support to arrays. Inverted index
entries are created from arrays by simply encoding a key that contains
the array element's table key encoding. Nulls are not indexed, since in
SQL, ARRAY[1, NULL] @> ARRAY[NULL] returns false.

For example, in a table t(int, int[]) with an inverted index with id 3
on the int[] column the row (10, [1, NULL, 2]) produces 2 index keys:

```
/tableId/3/1/10
/tableId/3/2/10
```

This encoding scheme is much simpler than the one for JSON, since arrays
don't have "paths": their elements are simply ordinary datums.

Release note (sql change): The inverted index implementation now
supports indexing array columns. This permits accelerating containment
queries (@> and <@) on array columns by adding an index to them.

45642: ui: Set react component `key` prop to fix react errors r=nathanstilwell a=koorosh

Set react component `key` prop to fix react errors

Resolves: #45188

Co-authored-by: Owen Qian <[email protected]>
Co-authored-by: Jordan Lewis <[email protected]>
Co-authored-by: Andrii Vorobiov <[email protected]>
@craig craig bot closed this as completed in 37a1bd1 Mar 6, 2020
@nvanbenschoten
Member

I was thinking about this change over the weekend and I'm becoming concerned that the fix in #44725 is unsafe. Specifically, I'm concerned about a hazard where a follower replica applies an entry to its replicated state machine between the time that it receives a snapshot and the time that it applies that snapshot such that the snapshot's range deletion tombstone is no longer sufficiently wide to clear all old state.

Consider the case where a range has a raft log that is appending keys to the end of the keyspace and removing keys from the beginning of the keyspace. Raft log index 5 writes key 5 and deletes key 4, index 6 writes key 6 and deletes key 5, index k writes key k and deletes key k-1, etc. Now imagine that a follower has applied up to log index k and is being sent a snapshot that includes the materialized state of index k+10 (i.e. the index of the snapshot is k+10). At the point that the follower begins receiving the snapshot, we have the following states:

range desc:        [0, 9999)
leader keyspace:   [{k+10}]
follower keyspace: [{k}]
snapshot keys:     [{k+10}]

As of #44725, the follower will scan its own keyspace upon receipt of the snapshot and use it to generate a range deletion tombstone. In this case, its tombstone will cover a single key: k. So the SST it ends up ingesting will look like:

snapshot sst:
  put: k+10
  delete range: [k, k+1)

So far, so good.

However, at this point, I don't think we have any firm guarantee that the follower actually applies the snapshot before applying any new entries in its log. Of course, it's very unlikely that it will receive any new entries between its current applied index and the snapshot index, because the leader wouldn't be sending the follower a snapshot if it could catch it up from its log, but I think it's possible to construct such situations. For instance, what if leadership is transferred and the new leader does happen to have index k+1 in its log? It should try to catch the follower up by appending k+1 to the follower's log.

So in this case, we might run into a problem where the follower applies log index k+1 and then applies the snapshot at k+10. By the time it applies the snapshot, the snapshot's range deletion tombstone will no longer cover the entire range:

follower keyspace @ index k:    [{k}]
follower keyspace @ index k+1 after applying entry k+1: [{k+1}]
follower keyspace @ index k+10 after applying snapshot: [{k+1}, {k+10}]

Now the follower is in an inconsistent state!

So for this change to be safe, we must guarantee that the applied index of the follower stays constant between the time that it constructs its snapshot sst (using its own state to determine the range del boundaries) and the time that it applies that snapshot sst. I don't think we have that guarantee today. I put together a patch that tests for this here and haven't seen it fire yet while running unit tests, but that alone doesn't give me a ton of confidence.

I'd like to get @bdarnell and @tbg's thoughts on this. Am I missing something? Do we have a ticking time bomb here?

@nvanbenschoten nvanbenschoten reopened this Apr 6, 2020
@bdarnell
Contributor

bdarnell commented Apr 7, 2020

However, at this point, I don't think we have any firm guarantee that the follower actually applies the snapshot before applying any new entries in its log

I'm also not aware of such a guarantee. This does look like a potential issue.

At startup, the range ending in /Max is upreplicated from n1 to two other nodes.

My thinking is that we should never do anything with the range ending in /Max except to split it. And we pre-split ranges during import, so the problem here is about the original last range being up-replicated before the import splits, right?

Maybe the answer is to special case the empty /Max range and skip/shrink the range tombstone for that range, but still use tombstones out to the range boundary everywhere else.

@petermattis
Collaborator Author

However, at this point, I don't think we have any firm guarantee that the follower actually applies the snapshot before applying any new entries in its log.

Huh? I haven't looked at this code in a long time, but I would expect appending Raft log entries to be mutually exclusive with snapshot application. Is that not true? Or perhaps it is true, but the follower is scanning its key space too early.

@bdarnell
Contributor

bdarnell commented Apr 7, 2020

Receiving and applying snapshots has become spread out over time, and we generate this range tombstone fairly early in the process. It's not obvious to me that we're blocking everything else on the range for the entire duration (but we might be). Scanning the keyspace later in the process would be one solution, but I think it would require rewriting the sst files, adding additional IO pressure.

@tbg
Member

tbg commented Apr 8, 2020

This does seem like a real issue. Best illustrated by this code:

inSnap, err := ss.Receive(ctx, stream, *header)
if err != nil {
	return err
}
if err := s.processRaftSnapshotRequest(ctx, header, inSnap); err != nil {
	return sendSnapshotError(stream, errors.Wrap(err.GoError(), "failed to apply snapshot"))
}

ss.Receive() receives the snapshot, and also calls ConstrainToKeys. When Receive() returns, no locks whatsoever are held. Receiving the snapshot simply isn't the right time to look at the engine (it's external I/O - we don't hold locks then - don't think we ever did after code yellow?). It's applying the snapshot that has the proper synchronization (and "stops the world" as far as that replica is concerned).

A straightforward way to hit this today could be when a follower is catching up through a backlog of committed entries while receiving a snapshot. In that case, the applied index would go up rapidly (but the snapshot might still be necessary).
Similarly, I would be worried that this bug would stab us in the back in subtle ways the moment we started getting into asynchronous entry application (if we ever did).

I don't think we're particularly likely to hit this issue today, though leaving the bug in is certainly not an option. As Nathan suggested we can use the quick fix of discarding the snapshot at apply time if we find that the applied index has changed (which in practice we expect to "never" be the case). Or we re-compute the constrained span under the right lock and use that as a (better) signal. Either way, a loud message plus sentry telemetry is in order when it happens as we'll want to know.
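As a rough illustration of that quick fix, here is a hedged sketch; the types, field names, and error handling are hypothetical stand-ins (assuming a plain fmt import), not the real CockroachDB structs:

```go
// IncomingSnapshot and Replica are simplified, hypothetical versions of the
// real types, which carry far more state.
type IncomingSnapshot struct {
	// Applied index observed when the constrained rangedel was computed.
	AppliedIndexAtReceive uint64
	// ... ingested SST paths, range descriptor, etc.
}

type Replica struct {
	appliedIndex uint64
}

// applySnapshot discards the snapshot if the replica has applied any entry
// since the snapshot SSTs (and their constrained tombstones) were built,
// because the tombstones may no longer cover all local state. Raft will
// retry with a fresh snapshot.
func (r *Replica) applySnapshot(snap IncomingSnapshot) error {
	if r.appliedIndex != snap.AppliedIndexAtReceive {
		return fmt.Errorf("discarding snapshot: applied index moved from %d to %d since receipt",
			snap.AppliedIndexAtReceive, r.appliedIndex)
	}
	// ... ingest the snapshot SSTs here ...
	return nil
}
```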

Quick fix aside, how should this work? I don't like seeing subtly wrong code plus a band-aid in the long term (and chances are that if we don't fix it soon, we never will). Can we leave the SSTs "open" so that we can add the tombstone "at the bottom" at apply time? I suppose the answer is "no", since the deletion tombstone needs to come first to avoid shadowing the rest of the SST (unless we add some kind of "TombstoneBehind" operation)? But also, we can't hope to fix up the existing SSTs at apply time if needed, since the new keys that pop up may overlap with data ingested in the SSTs, and so very fine-grained surgery would be needed.

@petermattis
Collaborator Author

Quick fix aside, how should this work? I don't like seeing subtly wrong code plus a band-aid in the long term (and chances are that if we don't fix it soon, we never will). Can we leave the SSTs "open" so that we can add the tombstone "at the bottom" at apply time? I suppose the answer is "no", since the deletion tombstone needs to come first to avoid shadowing the rest of the SST (unless we add some kind of "TombstoneBehind" operation)? But also, we can't hope to fix up the existing SSTs at apply time if needed, since the new keys that pop up may overlap with data ingested in the SSTs, and so very fine-grained surgery would be needed.

Every record in an ingested sstable occurs at the same point in time, and there is special-case logic so that range tombstones are considered to happen "before" any point operation. The point operations have to be added in order while the sstable is being built, but the range tombstone can be added at a later point. So your suggestion to leave the sstable "open" until the snapshot is applied could be done without any change to sstable ingestion. I'm not pushing for this approach, merely indicating that it could be done.

@tbg
Member

tbg commented Apr 8, 2020

Interesting, that's good to know. Seems like a sane way of doing things, but curious if there are other alternatives to consider.

@nvanbenschoten nvanbenschoten added A-storage Relating to our storage engine (Pebble) on-disk storage. and removed A-kv-replication Relating to Raft, consensus, and coordination. labels Sep 10, 2020
@nvanbenschoten nvanbenschoten removed their assignment Sep 10, 2020
@petermattis petermattis added the C-performance Perf of queries or internals. Solution not expected to change functional behavior. label Sep 10, 2020
@petermattis
Collaborator Author

I wonder if this is still a problem given that we prioritize compaction of range tombstones and perform compaction of L6 tables in order to drop range tombstones. @jbowens can you put some thinking into this? We may also want to perform some experiments to see if this remains a problem. If we ingest an sstable into L6 that has a wide range tombstone, how quickly will we compact it if there are other concurrent ingestions taking place?

@jbowens
Collaborator

jbowens commented Jul 6, 2021

I wonder if this is still a problem given that we prioritize compaction of range tombstones and perform compaction of L6 tables in order to drop range tombstones. @jbowens can you put some thinking into this? We may also want to perform some experiments to see if this remains a problem. If we ingest an sstable into L6 that has a wide range tombstone, how quickly will we compact it if there are other concurrent ingestions taking place?

I think today we'll rewrite these sstables to drop the wide range tombstones as soon as there are no score-based compactions to run. This code will set the file's RangeDeletionsBytesEstimate to the sum of all the data block sizes, which will trigger an elision-only compaction when there aren't score-based compactions:

				// If the file is in the last level of the LSM, there is no
				// data beneath it. The fact that there is still a range
				// tombstone in a bottommost file suggests that an open
				// snapshot kept the tombstone around. Estimate disk usage
				// within the file itself.
				if level == numLevels-1 {
					size, err := r.EstimateDiskUsage(startUserKey, endUserKey)
					if err != nil {
						return err
					}
					stats.RangeDeletionsBytesEstimate += size
					return nil
				}

In this case, the RangeDeletionsBytesEstimate is very far off, because the tombstone ingested within the sstable won't drop anything at all. We don't account for the fact that range deletions within an ingested table don't drop data at their own sequence number, which means we do end up rewriting these sstables just to remove the wide range deletions.
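To spell out the sequence-number rule being described (names here are illustrative, not Pebble's):

```go
// rangeDelCanDropTableData reports whether a range deletion can shadow point
// keys within the same sstable: only if its seqnum is strictly greater than
// theirs. In an ingested snapshot sstable every entry shares one seqnum, so
// the tombstone drops nothing inside the table and should not inflate
// RangeDeletionsBytesEstimate.
func rangeDelCanDropTableData(rangeDelSeqNum, smallestPointSeqNum uint64) bool {
	return rangeDelSeqNum > smallestPointSeqNum
}
```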

I agree it would be worthwhile to do some experiments (sorry I didn't get to them before my storage team hiatus!).

If we ever pursue cockroachdb/pebble#25, maybe we could incorporate clearing key spans as an optional part of the ingestion operation. We'd only need to write the range tombstones as a part of the WAL batch if they are indeed in-use. This would force the ingestions into L0 if the span is in-use anywhere within the LSM though.

@mwang1026

Spoke with @jbowens and we're going to close this because we haven't seen the support ticket load we used to see for this issue since we changed the range tombstone compaction priority.

@jbowens
Collaborator

jbowens commented Nov 16, 2022

Going to reopen this issue, since it’s still somewhat relevant for wide range tombstones ingested into L6. We’re hoping to remove the range tombstones from ingested snapshot sstables as a part of the virtual sstable work by allowing for an Ingest operation that ‘excises’ existing data in a span through virtualizing overlapping sstables.

@jbowens jbowens reopened this Nov 16, 2022
jbowens added a commit to jbowens/pebble that referenced this issue Mar 22, 2023
Previously, a file in L6 that contained range deletion(s) was assumed to
contain range deletions due to open snapshots at the time it was written. Table
stats, and in particular the RangeDeletionBytesEstimate value, was calculated
assuming the range deletion dropped all keys within the sstable that fall
within the range deletion's bounds. This is a decent heuristic for when
snapshots were indeed open, preventing the elision of the range deletion.

However, in Cockroach KV snapshot reception writes range deletions to each of
the ingested sstables (see cockroachdb/cockroach#44048). These range deletions
delete nothing within the tables themselves. They exist only to clear out
existing state below the table in the LSM. In these cases, the
RangeDeletionBytesEstimate value would be a gross overestimate. Compacting the
table would only drop the tombstones themselves, and none of the actual range
data contained in point blocks.

This commit adjusts the logic for calculating RangeDeletionBytesEstimate to
take the range deletion sequence number into account. This small change will
prevent sstables ingested through Cockroach snapshot reception from having a
RangeDeletionBytesEstimate that incorrectly sums all data blocks. This has two
consequences:

a) In times of low compaction pressure, we will no longer rewrite the ingested
sstables through elision-only compactions simply to remove the tombstones.
b) When compacting from L5 into L6, we will no longer artificially deflate
these sstable's size when calculating the min-overlapping ratio heuristic.

Together, these effects should reduce write amplification/bandwidth, especially
in the presence of significant rebalancing.
jbowens added a commit to jbowens/pebble that referenced this issue Mar 28, 2023
jbowens added a commit to cockroachdb/pebble that referenced this issue Mar 28, 2023
@jbowens
Collaborator

jbowens commented Apr 12, 2024

Going to close this out now that we have excises.
