storage: snapshots ingested into L0 during decommission #80589
Comments
The first step in this investigation is to establish a reproduction of the issue. One suggestion from @tbg is to plumb through a statistic from Pebble that records which level an SST was ingested into. That would make it easy to track when an SST is ingested into L0 and yell loudly about it. Then any simple up-replication or decommission workload can be run with this patch to establish how common this problem is.
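For concreteness, here is a rough sketch of what that instrumentation could look like from outside Pebble, using the EventListener ingest hook rather than plumbing a new statistic. Field names follow recent Pebble versions and the exact wiring into `pebble.Options` varies across releases, so treat this as an assumption rather than the patch actually used here:

```go
package storage

import (
	"log"

	"github.com/cockroachdb/pebble"
)

// ingestLevelListener returns an EventListener that logs loudly whenever an
// ingested sstable lands in L0. Field names (TableIngested, Tables, Level,
// Size) follow recent Pebble versions; treat the exact shapes as assumptions.
// Wire the returned listener into pebble.Options.EventListener when opening
// the store.
func ingestLevelListener() pebble.EventListener {
	return pebble.EventListener{
		TableIngested: func(info pebble.TableIngestInfo) {
			for _, t := range info.Tables {
				if t.Level == 0 {
					log.Printf("ingest job %d: sstable of %d bytes landed in L0", info.JobID, t.Size)
				}
			}
		},
	}
}
```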
Reiterating what @sumeerbhola said in the storage weekly: with current Pebble, it's my expectation that these files would be ingested into L0. During ingest, there are two kinds of overlap that Pebble needs to consider: data overlap and boundary overlap. Data overlap exists when a key that already exists within the LSM must be shadowed by the ingested sstable. Pebble doesn't perform an exact check for this overlap, but it will perform a seek into overlapping sstables to check for the happy case of zero LSM keys within the bounds of the ingested sstable, which, if confirmed, guarantees zero data overlap. Boundary overlap exists when the bounds of the ingested sstable overlap with the bounds of an existing sstable. When deciding the ingest target level, Pebble has two constraints imposed by these kinds of overlap:

1. The ingested sstable must be placed at a level above any level with which it has data overlap.
2. The ingested sstable cannot be placed in a level (L1-L6) where it has boundary overlap with an existing sstable.
The first constraint preserves the LSM sequence number invariant, ensuring newer keys are higher within the LSM than older keys at the same user key. The second constraint preserves the non-overlapping ordering of files within a level, so that a read always needs to consult at most one file per level from L1 through L6. Even if there is zero data overlap, boundary overlap at every level L1-L6 seems extremely likely, which would force the sstable into L0. For a level to have no boundary overlap, the existing sstable boundaries and the range boundaries would have had to coincide. I think there are quite a few avenues for experimentation once we get a reproduction.
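To make the two constraints concrete, here is an illustrative Go sketch of the target-level selection they imply. It is not Pebble's actual implementation; dataOverlap and boundaryOverlap stand in for the checks described above:

```go
// pickIngestLevel sketches the target-level choice implied by the two
// constraints above. dataOverlap and boundaryOverlap are assumed helpers
// that compare the ingested sstable's bounds against the files in a level.
func pickIngestLevel(baseLevel int, dataOverlap, boundaryOverlap func(level int) bool) int {
	target := 0 // fall back to L0 if no deeper level is eligible
	for level := baseLevel; level <= 6; level++ {
		if dataOverlap(level) {
			// Constraint 1: the ingested keys carry the newest sequence
			// numbers, so they must land above any level whose keys they shadow.
			break
		}
		if !boundaryOverlap(level) {
			// Constraint 2: in L1-L6 file bounds may not overlap, so only
			// levels with no boundary overlap are candidates.
			target = level
		}
	}
	return target
}
```

With snapshot sstables spanning whole ranges, boundaryOverlap is almost always true at every level L1-L6, so target stays at 0 even when there is no data overlap, which matches the behavior reported in this issue.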
Repro for cockroachdb#80589. 100% of all snaps go to L0, see log.txt. Release note: None
Well, that almost reproed itself: #80611. 100% of all upreplication snaps in the simplest possible experiment go to L0:
It might not be fair to call it "100%"; rather, "most but not always all of the bytes in any given snapshot go to L0".
This makes a lot of sense. What kinds of avenues did you have in mind?

The most straightforward approach (conceptually) seems to be constraining all SSTs (or all above L6) to the range boundaries of single ranges. This has some parallels to the idea of "guards" in the PebblesDB paper. I don't know whether this is workable in practice, as it seems like it would lead to very small SSTs in the top few non-overlapping levels of the LSM (L1-L3). PebblesDB also doesn't seem to enforce these guards in L0, which would be a problem. (A toy sketch of the guard idea follows below.)

We could also explore compaction strategies that would make compacting such an SST cheaper. Even if an SST with no data overlap ends up in L0 due to boundary overlap, it could quickly be compacted if we could cheaply split the SSTs below it in half to remove the boundary overlap. Do we already have a way to cheaply split an SST into two files? Is this an operation Pebble ever employs during compaction?
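As a toy illustration of the guard idea (assumed names, not PebblesDB's or Pebble's code), a compaction output splitter could cut files at range boundaries so that file bounds and range bounds coincide:

```go
package main

import (
	"bytes"
	"fmt"
)

// guardedSplitter sketches the "guards" idea: decide where to cut sstable
// output so that file boundaries line up with range boundaries. Purely
// illustrative; guard keys here stand in for range start keys.
type guardedSplitter struct {
	guards [][]byte // sorted range-boundary keys
	next   int      // index of the next guard not yet crossed
}

// shouldSplit reports whether the current output sstable should be finished
// before writing key, i.e. whether key crosses the next guard.
func (s *guardedSplitter) shouldSplit(key []byte) bool {
	split := false
	for s.next < len(s.guards) && bytes.Compare(key, s.guards[s.next]) >= 0 {
		split = true
		s.next++
	}
	return split
}

func main() {
	s := &guardedSplitter{guards: [][]byte{[]byte("b"), []byte("m")}}
	for _, k := range []string{"a", "c", "d", "n", "z"} {
		if s.shouldSplit([]byte(k)) {
			fmt.Println("-- cut sstable before", k)
		}
		fmt.Println("write", k)
	}
}
```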
This issue reminds me of #44048, which was eventually closed with no action taken.
We don't currently, but the disaggregated storage design describes a 'virtual sstable' that separates the logical boundaries of a metadata entry within a level from the physical sstable boundaries. Once we have these intermediary 'virtual sstables', we could write a single manifest entry that splits the sstable's metadata within a level into two without recompacting the sstable. An ingest could include this logical split in the same manifest entry that adds the sstable to the LSM. I think these virtual sstables are our best bet for dealing with this problem, but they require large, invasive changes to Pebble.

I think this problem is also somewhat similar to the problem of a sequential import writing into the middle of the keyspace. cockroachdb/pebble#1671 is somewhat relevant, but unfortunately would probably be less useful for snapshot ingestion, where ingested sstables are more randomly distributed across the keyspace.
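As a rough illustration of the idea (illustrative names only, not the actual design or Pebble's FileMetadata), a virtual split would rewrite only level metadata, never the physical file:

```go
// levelFile is an illustrative stand-in for a manifest entry in a level; the
// real Pebble FileMetadata is richer. A "virtual" entry narrows the logical
// key bounds while still pointing at the same physical sstable on disk.
type levelFile struct {
	backingFileNum uint64 // physical sstable backing this entry
	smallest       []byte // logical bounds of the entry within the level
	largest        []byte
}

// splitAround sketches how a single manifest edit could replace one entry
// with two virtual entries on either side of an ingested sstable's bounds
// [lo, hi], leaving the physical file untouched. Key comparison, emptiness
// checks, and seqnum bookkeeping are deliberately elided.
func splitAround(f levelFile, lo, hi []byte) (left, right levelFile) {
	left = levelFile{backingFileNum: f.backingFileNum, smallest: f.smallest, largest: lo}
	right = levelFile{backingFileNum: f.backingFileNum, smallest: hi, largest: f.largest}
	return left, right
}
```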
There is some nuance in determining when an ingested file overlaps with one or more files in a level, as file boundary overlap _and_ data overlap are considered. Pad out the existing documentation, providing examples to demonstrate how boundary and data overlap are used to pick a level for ingestion. Rework the existing data-driven test cases, splitting up and documenting the various scenarios. Motivated by cockroachdb/cockroach#80589.
@tbg I don't quite understand this repro, since it doesn't seem to be doing any work other than starting a cluster. Is this just up-replication of tiny ranges?
Self-assigning to investigate in more detail, based on repros that will be provided by @tbg.
Ran the repro from https://github.com/tbg/cockroach/blob/25c16a9a07b53ddeefc43cfe4b4ea06a004c2c89/repro.sh with additional instrumentation (and with a 5-node cluster instead of 7), and the large file in each snapshot always gets ingested into L6. The typical output looks like the following, where the "details" are sorted in decreasing file size and b is bytes, il is ingested-level, dol is highest-data-overlap-level (7 means there was no data overlap), and bl is Lbase.
I'm going to try with …
Recommissioned the 2 nodes after ~18 min. Almost the whole LSM had been compacted away, so the highest-data-overlap-level is usually 7, and so is the ingested-level.
I think I managed to reproduce this, and even (accidentally) with the default snapshot rate limits. The custom chart below shows that when we decommission three nodes in the cluster, we get a not quite inverted, but sort of unhappy, LSM, and both the p50 and p99 latencies go up roughly 5x (it's hard to see this from the screenshot, but that's roughly what it is). The workload here is kv0 with a block size of 1000 (note: one workload instance per node) on this cluster (3000 IOPS, 125 MB/s EBS). I think what's important is that I let the workload build up organically to ~28GiB of data. It would be worth verifying this experiment with #81039 to see a large spike in ingestions into L0 during the decommissioning process.
I recommissioned the nodes and, a few hours later, decommissioned them again, with similar results. (I forgot …) So this does seem like a robust way to reproduce the problem. I'm going to leave it at that and return to looking into the raft recv queue, but I'm happy to help others reproduce this issue.
I don't think there is anything left to do here. cockroachdb/pebble#1683 is the Pebble issue for virtual sstables, which should significantly reduce this occurrence. And we could also do #80607 to be certain that snapshots won't overload the LSM.
In a recent customer issue we observed that, during a decommission, when we were streaming snapshots to a new node, 90% of snapshots were ingested into L0. The expectation is that Pebble should be capable of ingesting all incoming snapshots into a lower level. The goal of this ticket is to investigate why Pebble behaves this way and to come up with a proposal for improving the snapshot ingestion path.
Jira issue: CRDB-15336