
storage: snapshots ingested into L0 during decommission #80589

Closed
lunevalex opened this issue Apr 26, 2022 · 14 comments
Labels
C-investigation Further steps needed to qualify. C-label will change. T-storage Storage Team

Comments

@lunevalex
Collaborator

lunevalex commented Apr 26, 2022

In a recent customer issue we observed that, during a decommission, when we were streaming snapshots to a new node, 90% of snapshots were ingested into L0. The expectation is that Pebble should be capable of ingesting all incoming snapshots into a lower level. The goal of this ticket is to investigate why Pebble behaves this way and to come up with a proposal for improving the snapshot ingestion path.

Jira issue: CRDB-15336

@lunevalex lunevalex added the C-investigation Further steps needed to qualify. C-label will change. label Apr 26, 2022
@blathers-crl blathers-crl bot added T-storage Storage Team T-kv KV Team labels Apr 26, 2022
@lunevalex
Collaborator Author

The first step in this investigation is to establish a reproduction of the issue. One suggestion from @tbg is to plumb through a statistic from Pebble that records the level at which each SST was ingested. This would let us easily track when an SST is ingested into L0 and yell loudly about it. Then any simple up-replication or decommission workload can be run with this patch to establish how common this problem is.
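
For illustration, here is a rough sketch of the kind of plumbing being suggested. The `IngestLevelStats` type and `warnIfIngestedIntoL0` helper are hypothetical, not actual Pebble or CockroachDB APIs; the real work would be to surface an equivalent statistic from Pebble's ingest path.

```go
// Hypothetical sketch: ingestion reports how many bytes landed in each LSM
// level so that the caller can log loudly when a snapshot ends up in L0.
package main

import "log"

// IngestLevelStats records, per level, how many ingested bytes landed there.
// Index 0 is L0, index 6 is L6.
type IngestLevelStats struct {
	BytesByLevel [7]uint64
}

// warnIfIngestedIntoL0 logs when more than half of an ingested snapshot's
// bytes ended up in L0.
func warnIfIngestedIntoL0(rangeID int64, stats IngestLevelStats) {
	var total uint64
	for _, b := range stats.BytesByLevel {
		total += b
	}
	if total == 0 {
		return
	}
	if l0 := stats.BytesByLevel[0]; float64(l0) > 0.5*float64(total) {
		log.Printf("r%d: %d of %d ingested bytes went to L0", rangeID, l0, total)
	}
}

func main() {
	// Example values only: most of this snapshot's bytes landed in L0.
	warnIfIngestedIntoL0(35, IngestLevelStats{BytesByLevel: [7]uint64{6611, 0, 0, 0, 0, 0, 1711}})
}
```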

@jbowens
Collaborator

jbowens commented Apr 26, 2022

Reiterating what @sumeerbhola said in the storage weekly: with current Pebble, it's my expectation that these files would be ingested into L0.

During an ingest, there are two kinds of overlap that Pebble needs to consider: data overlap and boundary overlap. Data overlap exists when a key that already exists within the LSM must be shadowed by the ingested sstable. Pebble doesn't perform an exact check for this overlap, but it will seek into overlapping sstables to check for the happy case of zero LSM keys within the bounds of the ingested sstable, which, if confirmed, guarantees zero data overlap. Boundary overlap exists when the bounds of the ingested sstable overlap with the bounds of an existing sstable.

When deciding the ingest target level, Pebble has two constraints imposed by these kinds of overlap:

  1. Pebble cannot ingest a sstable into or below a level L1-L6 with data overlap.
  2. Pebble cannot ingest a sstable into a level L1-L6 with boundary overlap.

The first constraint preserves the LSM sequence number invariant, ensuring newer keys are higher within the LSM than older keys at the same user key. The second constraint preserves the non-overlapping ordering of files within a level, so that a read always needs to consult at most one file per level 1 through 6.

Even if the node contains zero data overlap, boundary overlap at every level L1...L6 seems extremely likely, which would force the sstable into L0. For a level to not have boundary overlap, the sstable boundaries and the range boundaries would have had to coincide.
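
To make the two constraints concrete, here is a minimal, self-contained sketch of the level-selection logic they imply. This is an illustration only, not Pebble's actual ingestTargetLevel code; keyRange, perLevel, and hasDataOverlap are simplified stand-ins.

```go
// Illustrative sketch of the two ingest constraints; not Pebble's code.
package main

import "fmt"

type keyRange struct{ start, end string } // inclusive user-key bounds

func (a keyRange) overlaps(b keyRange) bool {
	return a.start <= b.end && b.start <= a.end
}

// ingestTargetLevel returns the deepest admissible level for an ingested
// sstable. perLevel[i] holds the file bounds currently in level i+1
// (L1..L6); hasDataOverlap stands in for Pebble's seek-based check for
// existing keys under the ingested sstable's bounds.
func ingestTargetLevel(ingest keyRange, perLevel [6][]keyRange, hasDataOverlap func(level int, r keyRange) bool) int {
	target := 0 // fall back to L0 if no deeper level is admissible
	for level := 1; level <= 6; level++ {
		if hasDataOverlap(level, ingest) {
			// Constraint 1: the ingested sstable must stay above any data it
			// shadows, so neither this level nor anything below it is usable.
			break
		}
		boundary := false
		for _, f := range perLevel[level-1] {
			if f.overlaps(ingest) {
				boundary = true // Constraint 2: files within L1-L6 must not overlap.
				break
			}
		}
		if !boundary {
			target = level // deeper candidate found; keep descending
		}
	}
	return target
}

func main() {
	// Every level has a wide file straddling the ingested bounds, so boundary
	// overlap forces the sstable into L0 even with zero data overlap.
	var levels [6][]keyRange
	for i := range levels {
		levels[i] = []keyRange{{"a", "z"}}
	}
	noDataOverlap := func(int, keyRange) bool { return false }
	fmt.Println(ingestTargetLevel(keyRange{"m", "n"}, levels, noDataOverlap)) // prints 0
}
```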

I think there are quite a few avenues for experimentation once we get a reproduction.

tbg added a commit to tbg/cockroach that referenced this issue Apr 27, 2022
Repro for cockroachdb#80589.

100% of all snaps go to L0, see log.txt.

Release note: None
@tbg
Member

tbg commented Apr 27, 2022

Well that almost reproed itself: #80611

100% of all upreplication snaps in the simplest possible experiment go to L0:

W220427 05:34:25.889367 1090 kv/kvserver/pkg/kv/kvserver/replica_raftstorage.go:952 ⋮ [n2,s2,r35/3:{-}] 93 XXX ingestion into L0: {Bytes:8322 ApproxIngestedIntoL0Bytes:6611}

It might not be fair to call it "100%"; rather, most but not always all of the bytes in any given snapshot go to L0.

@nvanbenschoten
Member

Even if the node contains zero data overlap, boundary overlap at every level L1...L6, seems extremely likely, which would force the sstable into L0. For a level to not have boundary overlap, the sstable boundaries and range boundaries would've had to happen to coincide.

I think there are quite a few avenues for experimentation once we get a reproduction.

This makes a lot of sense. What kinds of avenues did you have in mind?

The most straightforward approach (conceptually) seems to be constraining all (or all above L6) SSTs to the range boundaries of single ranges. This has some parallels to the idea of "guards" in the PebblesDB paper. I don't know whether this is workable in practice, as it seems like it would lead to very small SSTs in the top few non-overlapping levels of the LSM (L1-L3). PebblesDB also doesn't seem to enforce these guards in L0, which would be a problem.

We could also explore compaction strategies that would make compacting such an SST cheaper. Even if an SST with no data overlap ends up in L0 due to boundary overlap, it could quickly be compacted if we could cheaply split the SSTs below it in half to remove the boundary overlap. Do we already have a way to cheaply split an SST into two files? Is this an operation Pebble ever employs during compaction?

@petermattis
Collaborator

This issue reminds me of #44048 which was eventually closed with no action taken.

@jbowens
Collaborator

jbowens commented Apr 27, 2022

Do we already have a way to cheaply split an SST into two files?

We don't currently, but the disaggregated storage design describes a 'virtual sstable' that separates the logical boundaries of a metadata entry within a level from the physical sstable boundaries. Once we have these intermediary 'virtual sstables' we could write a single manifest entry that splits the sstable's metadata within a level into two without recompacting the sstable. An ingest could include this logical split in the same manifest entry that adds the sstable into the LSM.
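
To make that concrete, a rough sketch under assumed, simplified types; fileMetadata and splitAround below are illustrative, not Pebble's manifest code.

```go
// Sketch of the 'virtual sstable' idea: two logical metadata entries in a
// level point at the same physical file with narrower bounds, so an existing
// file can be "split" around an ingested sstable without rewriting it.
package main

import "fmt"

type fileMetadata struct {
	physicalFileNum   int    // backing sstable on disk
	smallest, largest string // logical bounds within the level
	virtual           bool
}

// splitAround returns two virtual entries that carve the ingested sstable's
// bounds out of an existing file's bounds, leaving the physical file
// untouched. A real implementation would tighten the new bounds to keys
// actually present in the file and handle sizes, seqnums, etc.
func splitAround(f fileMetadata, ingestSmallest, ingestLargest string) (left, right fileMetadata) {
	left = fileMetadata{physicalFileNum: f.physicalFileNum, smallest: f.smallest, largest: ingestSmallest, virtual: true}
	right = fileMetadata{physicalFileNum: f.physicalFileNum, smallest: ingestLargest, largest: f.largest, virtual: true}
	return left, right
}

func main() {
	existing := fileMetadata{physicalFileNum: 7, smallest: "a", largest: "z"}
	l, r := splitAround(existing, "m", "n")
	fmt.Printf("left=%+v\nright=%+v\n", l, r)
	// An ingest could now add the ingested sstable between l and r in the
	// same manifest edit, without recompacting file 7.
}
```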

I think these virtual sstables are our best bet for dealing with this problem, but they require large, invasive changes to Pebble.

I think this problem is also somewhat similar to the problem of a sequential import writing into the middle of the keyspace. cockroachdb/pebble#1671 is somewhat relevant, but unfortunately would probably be less useful for snapshot ingestion where ingested sstables are more randomly distributed across the keyspace.

nicktrav added a commit to nicktrav/pebble that referenced this issue Apr 28, 2022
There is some nuance in determining when an ingested file overlaps with
one or more files in a level, as file boundary overlap _and_ data
overlap are considered.

Pad out the existing documentation, providing examples to demonstrate
how boundary and data overlap are used to pick a level for ingestion.

Rework the existing data-driven test cases, splitting up and documenting
the various scenarios.

Motivated by cockroachdb/cockroach#80589.
@sumeerbhola
Collaborator

Well that almost reproed itself: #80611

@tbg I don't quite understand this repro, since it doesn't seem to be doing any work other than starting a cluster. Is this just up-replication of tiny ranges?
I'd like to investigate further by drilling down into where the global-keyspace ssts would land if data overlap were the only consideration, and I need a realistic repro for that, with a reasonable volume of data in each range that is experiencing rebalancing. Perhaps something like kv0 with some tricks to force it to do some rebalancing. Ideally not an unrealistically high rebalancing volume either, since that may cause the same range to be frequently added back to the same node (which I suspect will not be the usual case). Can you recommend something, or hack something up for me to use? Is the import phase of TPCC expected to generate a fair amount of rebalancing?

@sumeerbhola sumeerbhola self-assigned this Apr 28, 2022
@sumeerbhola
Collaborator

Self-assigning to investigate in more detail, based on repros that will be provided by @tbg.

nicktrav added a commit to cockroachdb/pebble that referenced this issue Apr 28, 2022
There is some nuance in determining when an ingested file overlaps with
one or more files in a level, as file boundary overlap _and_ data
overlap are considered.

Pad out the existing documentation, providing examples to demonstrate
how boundary and data overlap are used to pick a level for ingestion.

Rework the existing data-driven test cases, splitting up and documenting
the various scenarios.

Motivated by cockroachdb/cockroach#80589.
@sumeerbhola
Collaborator

Ran the repro from https://github.com/tbg/cockroach/blob/25c16a9a07b53ddeefc43cfe4b4ea06a004c2c89/repro.sh with additional instrumentation (and with a 5-node cluster instead of 7), and the large file in each snapshot always gets ingested into L6. The typical output looks like the following, where the "details" are sorted in decreasing file size and b is bytes, il is ingested-level, dol is highest-data-overlap-level (7 means there was no data overlap), and bl is Lbase.

Ingest: bytes:170 MiB, l0:6.7 KiB, details: (b:170 MiB il:6 dol:7 bl:3) (b:1.9 KiB il:0 dol:4 bl:3) (b:1.4 KiB il:0 dol:4 bl:3) (b:1.2 KiB il:0 dol:3 bl:3) (b:1.1 KiB il:0 dol:4 bl:3) (b:1.1 KiB il:0 dol:4 bl:3) 

I'm going to try with workload init tpcc to see if that results in a messier LSM (I am assuming init is doing normal writes while import is ingesting sstables).

@sumeerbhola
Collaborator

sumeerbhola commented Apr 29, 2022

workload init tpcc shows more promising results. More than 50% of snapshots show the largest sst being ingested into L6, but it is not close to 100% like before. Examples follow. The highest-data-overlap-level is always 7, so the reason we are ingesting into a level higher in the LSM than L6 is file (boundary) overlap.

Ingest: bytes:32 MiB, l0:6.7 KiB, details: (b:32 MiB il:4 dol:7 bl:4) (b:1.8 KiB il:0 dol:7 bl:4) (b:1.3 KiB il:0 dol:7 bl:4) (b:1.2 KiB il:0 dol:5 bl:4) (b:1.2 KiB il:0 dol:7 bl:4) (b:1.1 KiB il:0 dol:7 bl:4) 
Ingest: bytes:48 MiB, l0:48 MiB, details: (b:48 MiB il:0 dol:7 bl:4) (b:1.9 KiB il:4 dol:7 bl:4) (b:1.4 KiB il:4 dol:7 bl:4) (b:1.2 KiB il:4 dol:5 bl:4) (b:1.1 KiB il:4 dol:7 bl:4) (b:1.1 KiB il:4 dol:7 bl:4) 
Ingest: bytes:50 MiB, l0:0 B, details: (b:50 MiB il:5 dol:7 bl:4) (b:1.5 KiB il:4 dol:7 bl:4) (b:1.4 KiB il:4 dol:7 bl:4) (b:1.2 KiB il:4 dol:5 bl:4) (b:1.1 KiB il:4 dol:7 bl:4) (b:1.1 KiB il:4 dol:7 bl:4)
Ingest: bytes:51 MiB, l0:51 MiB, details: (b:51 MiB il:0 dol:7 bl:4) (b:1.5 KiB il:0 dol:7 bl:4) (b:1.4 KiB il:0 dol:7 bl:4) (b:1.2 KiB il:0 dol:5 bl:4) (b:1.1 KiB il:0 dol:7 bl:4) (b:1.1 KiB il:0 dol:7 bl:4) 
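
As an aside, a hypothetical helper (not part of this thread, CockroachDB, or Pebble) for tallying where the largest file of each snapshot landed, assuming log lines in the ad-hoc format above; the regular expression only looks at the first entry in "details:", i.e. the largest file.

```go
// Hypothetical analysis helper: count, per ingested level, how often the
// largest file in each "Ingest:" log line landed there. Reads log lines on
// stdin in the ad-hoc format shown above.
package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
	"strconv"
)

// Matches the first detail tuple, e.g. "(b:32 MiB il:4 dol:7 bl:4)",
// capturing the ingested level (il) of the largest file.
var firstDetail = regexp.MustCompile(`details: \(b:[^)]*?il:(\d+)`)

func main() {
	counts := make(map[int]int) // ingested level -> number of snapshots
	scanner := bufio.NewScanner(os.Stdin)
	for scanner.Scan() {
		m := firstDetail.FindStringSubmatch(scanner.Text())
		if m == nil {
			continue
		}
		level, _ := strconv.Atoi(m[1])
		counts[level]++
	}
	for level := 0; level <= 6; level++ {
		if counts[level] > 0 {
			fmt.Printf("largest sst ingested into L%d: %d snapshots\n", level, counts[level])
		}
	}
}
```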

@sumeerbhola
Collaborator

Recommissioned the 2 nodes after ~18min. Almost the whole LSM had been compacted away, so the highest-data-overlap-level is usually 7, and so is the ingested-level.

@tbg
Member

tbg commented May 6, 2022

I think I managed to reproduce this, and even (accidentally) with the default snapshot rate limits. The custom chart below shows that when we decommission three nodes in the cluster, we get a - not quite inverted, but sort of unhappy - LSM, and both the p50 and p99 latencies go up roughly 5x (it's hard to see this from the screenshot, but that's roughly what it is).

The workload here is kv0 with a block size of 10000 bytes (note: one workload instance per node):

roachprod ssh tobias-lsm -- sudo systemd-run --unit kv-rand -- ./cockroach workload run kv --read-percent 0 --concurrency 10 --min-block-bytes 10000 --max-block-bytes 10000 --tolerate-errors

on this cluster

roachprod create -n 10 --os-volume-size 100 --clouds aws --local-ssd=false tobias-lsm

(3000 IOPS, 125 MB/s EBS).

I think what's important is that I let the workload build up organically to ~28 GiB of live_bytes before starting the decommission. This gives a full LSM where presumably lots of snapshots have to go to L0.

It would be worth verifying this experiment with #81039 to see a large spike in ingestions into L0 during the decommissioning process.

(screenshot: custom chart showing the LSM shape and p50/p99 latencies during the decommission)

@tbg
Member

tbg commented May 6, 2022

I recommissioned the nodes and, a few hours later, decommissioned them again, with similar results. (I forgot --wait=none, so the nodes were fully decommissioned and are no longer charted; that's why we only see replica growth in the graphs.)

So this does seem like a robust way to reproduce the problem. I'm going to leave it at that and return to looking into the raft recv queue, but I'm happy to help others reproduce this issue.

(screenshot: charts showing replica growth during the repeated decommission)

@sumeerbhola
Collaborator

I don't think there is anything left to do here. cockroachdb/pebble#1683 is the Pebble issue for virtual ssts, which should significantly reduce this occurrence. And we could also do #80607 to be certain that snapshots won't overload the LSM.
