
storage: snapshots ingested into L0 during decommission #80589

Closed
lunevalex opened this issue Apr 26, 2022 · 14 comments
Labels
C-investigation Further steps needed to qualify. C-label will change. T-storage Storage Team

Comments

@lunevalex
Collaborator

lunevalex commented Apr 26, 2022

In a recent customer issue we observed that, during a decommission, when we were streaming snapshots to a new node, 90% of snapshots were ingested into L0. The expectation is that Pebble should be capable of ingesting all incoming snapshots into a lower level. The goal of this ticket is to investigate why Pebble behaves this way and to come up with a proposal for improving the snapshot ingestion path.

Jira issue: CRDB-15336

@lunevalex lunevalex added the C-investigation Further steps needed to qualify. C-label will change. label Apr 26, 2022
@blathers-crl blathers-crl bot added T-storage Storage Team T-kv KV Team labels Apr 26, 2022
@lunevalex
Collaborator Author

The first step in this investigation is to establish a reproduction of the issue. One suggestion from @tbg is to plumb through a statistic from Pebble that records the level at which each SST was ingested. This would let us easily track when an SST is ingested into L0 and yell loudly about it. Then any simple up-replication or decommission workload can be run with this patch to establish how common this problem is.
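
For illustration, here is a rough sketch of the kind of plumbing being suggested. The `IngestLevelStats` type and `warnIfIngestedIntoL0` helper are hypothetical, not actual Pebble or CockroachDB APIs; the real work would be to surface an equivalent statistic from Pebble's ingest path.

```go
// Hypothetical sketch: ingestion reports how many bytes landed in each LSM
// level so that the caller can log loudly when a snapshot ends up in L0.
package main

import "log"

// IngestLevelStats records, per level, how many ingested bytes landed there.
// Index 0 is L0, index 6 is L6.
type IngestLevelStats struct {
	BytesByLevel [7]uint64
}

// warnIfIngestedIntoL0 logs when more than half of an ingested snapshot's
// bytes ended up in L0.
func warnIfIngestedIntoL0(rangeID int64, stats IngestLevelStats) {
	var total uint64
	for _, b := range stats.BytesByLevel {
		total += b
	}
	if total == 0 {
		return
	}
	if l0 := stats.BytesByLevel[0]; float64(l0) > 0.5*float64(total) {
		log.Printf("r%d: %d of %d ingested bytes went to L0", rangeID, l0, total)
	}
}

func main() {
	// Example values only: most of this snapshot's bytes landed in L0.
	warnIfIngestedIntoL0(35, IngestLevelStats{BytesByLevel: [7]uint64{6611, 0, 0, 0, 0, 0, 1711}})
}
```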

@jbowens
Collaborator

jbowens commented Apr 26, 2022

Reiterating what @sumeerbhola said in the storage weekly: with current Pebble, it's my expectation that these files would be ingested into L0.

During an ingest, there are two kinds of overlap that Pebble needs to consider: data overlap and boundary overlap. Data overlap exists when a key that already exists within the LSM must be shadowed by the ingested sstable. Pebble doesn't perform an exact check for this overlap, but it will seek into overlapping sstables to check for the happy case of zero LSM keys within the bounds of the ingested sstable, which, if confirmed, guarantees zero data overlap. Boundary overlap exists when the bounds of the ingested sstable overlap with the bounds of an existing sstable.

When deciding the ingest target level, Pebble has two constraints imposed by these kinds of overlap:

  1. Pebble cannot ingest a sstable into or below a level L1-L6 with data overlap.
  2. Pebble cannot ingest a sstable into a level L1-L6 with boundary overlap.

The first constraint preserves the LSM sequence number invariant, ensuring newer keys are higher within the LSM than older keys at the same user key. The second constraint preserves the non-overlapping ordering of files within a level, so that a read always needs to consult at most one file per level 1 through 6.

Even if the node contains zero data overlap, boundary overlap at every level L1...L6 seems extremely likely, which would force the sstable into L0. For a level to not have boundary overlap, the sstable boundaries and the range boundaries would have had to coincide.
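
To make the two constraints concrete, here is a minimal, self-contained sketch of the level-selection logic they imply. This is an illustration only, not Pebble's actual ingestTargetLevel code; keyRange, perLevel, and hasDataOverlap are simplified stand-ins.

```go
// Illustrative sketch of the two ingest constraints; not Pebble's code.
package main

import "fmt"

type keyRange struct{ start, end string } // inclusive user-key bounds

func (a keyRange) overlaps(b keyRange) bool {
	return a.start <= b.end && b.start <= a.end
}

// ingestTargetLevel returns the deepest admissible level for an ingested
// sstable. perLevel[i] holds the file bounds currently in level i+1
// (L1..L6); hasDataOverlap stands in for Pebble's seek-based check for
// existing keys under the ingested sstable's bounds.
func ingestTargetLevel(ingest keyRange, perLevel [6][]keyRange, hasDataOverlap func(level int, r keyRange) bool) int {
	target := 0 // fall back to L0 if no deeper level is admissible
	for level := 1; level <= 6; level++ {
		if hasDataOverlap(level, ingest) {
			// Constraint 1: the ingested sstable must stay above any data it
			// shadows, so neither this level nor anything below it is usable.
			break
		}
		boundary := false
		for _, f := range perLevel[level-1] {
			if f.overlaps(ingest) {
				boundary = true // Constraint 2: files within L1-L6 must not overlap.
				break
			}
		}
		if !boundary {
			target = level // deeper candidate found; keep descending
		}
	}
	return target
}

func main() {
	// Every level has a wide file straddling the ingested bounds, so boundary
	// overlap forces the sstable into L0 even with zero data overlap.
	var levels [6][]keyRange
	for i := range levels {
		levels[i] = []keyRange{{"a", "z"}}
	}
	noDataOverlap := func(int, keyRange) bool { return false }
	fmt.Println(ingestTargetLevel(keyRange{"m", "n"}, levels, noDataOverlap)) // prints 0
}
```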

I think there are quite a few avenues for experimentation once we get a reproduction.

tbg added a commit to tbg/cockroach that referenced this issue Apr 27, 2022
Repro for cockroachdb#80589.

100% of all snaps go to L0, see log.txt.

Release note: None
@tbg
Member

tbg commented Apr 27, 2022

Well that almost reproed itself: #80611

100% of all upreplication snaps in the simplest possible experiment go to L0:

W220427 05:34:25.889367 1090 kv/kvserver/pkg/kv/kvserver/replica_raftstorage.go:952 ⋮ [n2,s2,r35/3:{-}] 93 XXX ingestion into L0: {Bytes:8322 ApproxIngestedIntoL0Bytes:6611}

It might not be fair to call it "100%"; rather, most but not always all of the bytes in any given snapshot go to L0.

@nvanbenschoten
Member

Even if the node contains zero data overlap, boundary overlap at every level L1...L6, seems extremely likely, which would force the sstable into L0. For a level to not have boundary overlap, the sstable boundaries and range boundaries would've had to happen to coincide.

I think there are quite a few avenues for experimentation once we get a reproduction.

This makes a lot of sense. What kinds of avenues did you have in mind?

The most straightforward approach (conceptually) seems to be constraining all (or all above L6) SSTs to the range boundaries of single ranges. This has some parallels to the idea of "guards" in the PebblesDB paper. I don't know whether this is workable in practice, as it seems like it would lead to very small SSTs in the top few non-overlapping levels of the LSM (L1-L3). PebblesDB also doesn't seem to enforce these guards in L0, which would be a problem.

We could also explore compaction strategies that would make compacting such an SST cheaper. Even if an SST with no data overlap ends up in L0 due to boundary overlap, it could quickly be compacted if we could cheaply split the SSTs below it in half to remove the boundary overlap. Do we already have a way to cheaply split an SST into two files? Is this an operation Pebble ever employs during compaction?

@petermattis
Collaborator

This issue reminds me of #44048 which was eventually closed with no action taken.

@jbowens
Collaborator

jbowens commented Apr 27, 2022

Do we already have a way to cheaply split an SST into two files?

We don't currently, but the disaggregated storage design describes a 'virtual sstable' that separates the logical boundaries of a metadata entry within a level from the physical sstable boundaries. Once we have these intermediary 'virtual sstables' we could write a single manifest entry that splits the sstable's metadata within a level into two without recompacting the sstable. An ingest could include this logical split in the same manifest entry that adds the sstable into the LSM.
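
To make that concrete, a rough sketch under assumed, simplified types; fileMetadata and splitAround below are illustrative, not Pebble's manifest code.

```go
// Sketch of the 'virtual sstable' idea: two logical metadata entries in a
// level point at the same physical file with narrower bounds, so an existing
// file can be "split" around an ingested sstable without rewriting it.
package main

import "fmt"

type fileMetadata struct {
	physicalFileNum   int    // backing sstable on disk
	smallest, largest string // logical bounds within the level
	virtual           bool
}

// splitAround returns two virtual entries that carve the ingested sstable's
// bounds out of an existing file's bounds, leaving the physical file
// untouched. A real implementation would tighten the new bounds to keys
// actually present in the file and handle sizes, seqnums, etc.
func splitAround(f fileMetadata, ingestSmallest, ingestLargest string) (left, right fileMetadata) {
	left = fileMetadata{physicalFileNum: f.physicalFileNum, smallest: f.smallest, largest: ingestSmallest, virtual: true}
	right = fileMetadata{physicalFileNum: f.physicalFileNum, smallest: ingestLargest, largest: f.largest, virtual: true}
	return left, right
}

func main() {
	existing := fileMetadata{physicalFileNum: 7, smallest: "a", largest: "z"}
	l, r := splitAround(existing, "m", "n")
	fmt.Printf("left=%+v\nright=%+v\n", l, r)
	// An ingest could now add the ingested sstable between l and r in the
	// same manifest edit, without recompacting file 7.
}
```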

I think these virtual sstables are our best bet for dealing with this problem, but they require large, invasive changes to Pebble.

I think this problem is also somewhat similar to the problem of a sequential import writing into the middle of the keyspace. cockroachdb/pebble#1671 is somewhat relevant, but unfortunately would probably be less useful for snapshot ingestion where ingested sstables are more randomly distributed across the keyspace.

nicktrav added a commit to nicktrav/pebble that referenced this issue Apr 28, 2022
There is some nuance in determining when an ingested file overlaps with
one or more files in a level, as file boundary overlap _and_ data
overlap are considered.

Pad out the existing documentation, providing examples to demonstrate
how boundary and data overlap are used to pick a level for ingestion.

Rework the existing data-driven test cases, splitting up and documenting
the various scenarios.

Motivated by cockroachdb/cockroach#80589.
@sumeerbhola
Collaborator

Well that almost reproed itself: #80611

@tbg I don't quite understand this repro, since it doesn't seem to be doing any work other than starting a cluster. Is this just up-replication of tiny ranges?
I'd like to investigate further by drilling down into where the global-keyspace ssts would land if data overlap were the only consideration, and I need a realistic repro for that, with a reasonable volume of data in each range that is experiencing rebalancing. Perhaps something like kv0 with some tricks to force it to do some rebalancing. Ideally not an unrealistically high rebalancing volume either, since that may cause the same range to be frequently added back to the same node (which I suspect will not be the usual case). Can you recommend something, or hack something up for me to use? Is the import phase of TPCC expected to generate a fair amount of rebalancing?

@sumeerbhola sumeerbhola self-assigned this Apr 28, 2022
@sumeerbhola
Collaborator

Self-assigning to investigate in more detail, based on repros that will be provided by @tbg.

nicktrav added a commit to cockroachdb/pebble that referenced this issue Apr 28, 2022
There is some nuance in determining when an ingested file overlaps with
one or more files in a level, as file boundary overlap _and_ data
overlap are considered.

Pad out the existing documentation, providing examples to demonstrate
how boundary and data overlap are used to pick a level for ingestion.

Rework the existing data-driven test cases, splitting up and documenting
the various scenarios.

Motivated by cockroachdb/cockroach#80589.
@sumeerbhola
Collaborator

Ran the repro from https://github.com/tbg/cockroach/blob/25c16a9a07b53ddeefc43cfe4b4ea06a004c2c89/repro.sh with additional instrumentation (and with a 5-node cluster instead of 7), and the large file in each snapshot always gets ingested into L6. The typical output looks like the following, where the "details" are sorted in decreasing file size and b is bytes, il is ingested-level, dol is highest-data-overlap-level (7 means there was no data overlap), and bl is Lbase.

Ingest: bytes:170 MiB, l0:6.7 KiB, details: (b:170 MiB il:6 dol:7 bl:3) (b:1.9 KiB il:0 dol:4 bl:3) (b:1.4 KiB il:0 dol:4 bl:3) (b:1.2 KiB il:0 dol:3 bl:3) (b:1.1 KiB il:0 dol:4 bl:3) (b:1.1 KiB il:0 dol:4 bl:3) 

I'm going to try with workload init tpcc to see if that results in a messier LSM (I am assuming init is doing normal writes while import is ingesting sstables).

@sumeerbhola
Collaborator

sumeerbhola commented Apr 29, 2022

workload init tpcc shows more promising results. More than 50% of snapshots show the largest sst being ingested into L6, but it is not close to 100% like before. Examples follow. The highest-data-overlap-level is always 7, so the reason we are ingesting into a level higher in the LSM than L6 is file (boundary) overlap.

Ingest: bytes:32 MiB, l0:6.7 KiB, details: (b:32 MiB il:4 dol:7 bl:4) (b:1.8 KiB il:0 dol:7 bl:4) (b:1.3 KiB il:0 dol:7 bl:4) (b:1.2 KiB il:0 dol:5 bl:4) (b:1.2 KiB il:0 dol:7 bl:4) (b:1.1 KiB il:0 dol:7 bl:4) 
Ingest: bytes:48 MiB, l0:48 MiB, details: (b:48 MiB il:0 dol:7 bl:4) (b:1.9 KiB il:4 dol:7 bl:4) (b:1.4 KiB il:4 dol:7 bl:4) (b:1.2 KiB il:4 dol:5 bl:4) (b:1.1 KiB il:4 dol:7 bl:4) (b:1.1 KiB il:4 dol:7 bl:4) 
Ingest: bytes:50 MiB, l0:0 B, details: (b:50 MiB il:5 dol:7 bl:4) (b:1.5 KiB il:4 dol:7 bl:4) (b:1.4 KiB il:4 dol:7 bl:4) (b:1.2 KiB il:4 dol:5 bl:4) (b:1.1 KiB il:4 dol:7 bl:4) (b:1.1 KiB il:4 dol:7 bl:4)
Ingest: bytes:51 MiB, l0:51 MiB, details: (b:51 MiB il:0 dol:7 bl:4) (b:1.5 KiB il:0 dol:7 bl:4) (b:1.4 KiB il:0 dol:7 bl:4) (b:1.2 KiB il:0 dol:5 bl:4) (b:1.1 KiB il:0 dol:7 bl:4) (b:1.1 KiB il:0 dol:7 bl:4) 
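
As an aside, a hypothetical helper (not part of this thread, CockroachDB, or Pebble) for tallying where the largest file of each snapshot landed, assuming log lines in the ad-hoc format above; the regular expression only looks at the first entry in "details:", i.e. the largest file.

```go
// Hypothetical analysis helper: count, per ingested level, how often the
// largest file in each "Ingest:" log line landed there. Reads log lines on
// stdin in the ad-hoc format shown above.
package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
	"strconv"
)

// Matches the first detail tuple, e.g. "(b:32 MiB il:4 dol:7 bl:4)",
// capturing the ingested level (il) of the largest file.
var firstDetail = regexp.MustCompile(`details: \(b:[^)]*?il:(\d+)`)

func main() {
	counts := make(map[int]int) // ingested level -> number of snapshots
	scanner := bufio.NewScanner(os.Stdin)
	for scanner.Scan() {
		m := firstDetail.FindStringSubmatch(scanner.Text())
		if m == nil {
			continue
		}
		level, _ := strconv.Atoi(m[1])
		counts[level]++
	}
	for level := 0; level <= 6; level++ {
		if counts[level] > 0 {
			fmt.Printf("largest sst ingested into L%d: %d snapshots\n", level, counts[level])
		}
	}
}
```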

@sumeerbhola
Collaborator

Recommissioned the 2 nodes after ~18min. Almost the whole LSM had been compacted away, so the highest-data-overlap-level is usually 7, and so is the ingested-level.

@tbg
Member

tbg commented May 6, 2022

I think I managed to reproduce this, and even (accidentally) with the default snapshot rate limits. The custom chart below shows that when we decommission three nodes in the cluster, we get a - not quite inverted, but sort of unhappy - LSM, and both the p50 and p99 latencies go up roughly 5x (it's hard to see this from the screenshot, but that's roughly what it is).

The workload here is kv0 with a block size of 10000 bytes (note: one workload instance per node):

roachprod ssh tobias-lsm -- sudo systemd-run --unit kv-rand -- ./cockroach workload run kv --read-percent 0 --concurrency 10 --min-block-bytes 10000 --max-block-bytes 10000 --tolerate-errors

on this cluster

roachprod create -n 10 --os-volume-size 100 --clouds aws --local-ssd=false tobias-lsm

(3000 IOPS, 125 MB/s EBS).

I think what's important is that I let the workload build up organically to ~28 GiB of live_bytes before starting the decommission. This gives a full LSM where presumably lots of snapshots have to go to L0.

It would be worth verifying this experiment with #81039 to see a large spike in ingestions into L0 during the decommissioning process.

(screenshot: custom chart showing the LSM shape and p50/p99 latencies during the decommission)

@tbg
Member

tbg commented May 6, 2022

I recommissioned the nodes and, a few hours later, decommissioned them again, with similar results. (I forgot --wait=none, so the nodes were fully decommissioned and are no longer charted; that's why we only see replica growth in the graphs.)

So this does seem like a robust way to reproduce the problem. I'm going to leave it at that and return to looking into the raft recv queue, but I'm happy to help others reproduce this issue.

(screenshot: charts showing replica growth during the repeated decommission)

@sumeerbhola
Collaborator

I don't think there is anything left to do here. cockroachdb/pebble#1683 is the Pebble issue for virtual ssts, which should significantly reduce this occurrence. And we could also do #80607 to be certain that snapshots won't overload the LSM.
