storage: slow SST `NextKey()` scans during restore #88329

erikgrinaker · 2022-09-21T07:52:24Z

Backup restoration uses an SST iterator that multiplexes several input SSTs and scans across them using `NextKey()`. `pebbleIterator` first tries to step once; if that does not land on a new key, it seeks:

cockroach/pkg/storage/pebble_iterator.go

Lines 487 to 492 in 4cf84e0

    
    // If the Next() call above didn't move to a different key, seek to it.
    if p.UnsafeKey().Key.Equal(p.keyBuf) {
        // This is equivalent to:
        // p.iter.SeekGE(EncodeKey(MVCCKey{p.UnsafeKey().Key.Next(), hlc.Timestamp{}}))
        seekKey := append(p.keyBuf, 0, 0)
        p.iter.SeekGE(seekKey)

However, this has been seen to be very expensive. Pebble itself has an optimization to step rather than seek if the target is nearby, but no such optimization exists for the SST iterator. Furthermore, if the SSTs are exhausted, the SST reader will attempt to reseek the SST, which is very expensive.
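To make the step-versus-seek trade-off concrete, here is a minimal, self-contained sketch. It uses a toy iterator over a sorted slice instead of Pebble's real iterator: `toyIter`, `entry`, `scan`, `maxSteps`, and the step/seek counters are all hypothetical names invented for illustration and are not Pebble or CockroachDB APIs. The sketch only shows the general idea of trying a few cheap steps before falling back to a seek; it does not reproduce the actual fix.

```go
package main

import (
	"fmt"
	"sort"
)

// entry is a toy stand-in for an MVCC key: a user key plus a version.
type entry struct {
	key     string
	version int
}

// toyIter walks entries sorted by key, then by descending version.
// It is a hypothetical illustration, not Pebble's iterator.
type toyIter struct {
	entries []entry
	pos     int
	steps   int // number of next() calls made
	seeks   int // number of seekAfter() calls made
}

func (it *toyIter) valid() bool { return it.pos < len(it.entries) }

func (it *toyIter) next() { it.pos++; it.steps++ }

// seekAfter positions the iterator at the first entry whose key is
// strictly greater than key, using binary search.
func (it *toyIter) seekAfter(key string) {
	it.seeks++
	it.pos = sort.Search(len(it.entries), func(i int) bool {
		return it.entries[i].key > key
	})
}

// nextKey advances to the first version of the next distinct key. It tries
// up to maxSteps cheap steps and only seeks if it is still on the same key.
func (it *toyIter) nextKey(maxSteps int) {
	if !it.valid() {
		return
	}
	cur := it.entries[it.pos].key
	for i := 0; i < maxSteps; i++ {
		it.next()
		if !it.valid() || it.entries[it.pos].key != cur {
			return
		}
	}
	it.seekAfter(cur)
}

// scan visits each distinct key once and reports how many steps and seeks
// the iterator needed.
func scan(entries []entry, maxSteps int) (keys []string, steps, seeks int) {
	it := &toyIter{entries: entries}
	for it.valid() {
		keys = append(keys, it.entries[it.pos].key)
		it.nextKey(maxSteps)
	}
	return keys, it.steps, it.seeks
}

func main() {
	entries := []entry{
		{"a", 3}, {"a", 2}, {"a", 1},
		{"b", 1},
		{"c", 2}, {"c", 1},
	}
	for _, maxSteps := range []int{1, 3} {
		keys, steps, seeks := scan(entries, maxSteps)
		fmt.Println("maxSteps:", maxSteps, "keys:", keys, "steps:", steps, "seeks:", seeks)
	}
}
```

With `maxSteps` set to 1 the toy behaves like the step-once-then-seek logic quoted above. Larger values trade a few extra steps for fewer seeks, which only pays off when a seek costs much more than a step, as reported here for the SST iterator.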

We need to improve performance here.

Jira issue: CRDB-19772

Epic CRDB-20465


erikgrinaker · 2022-09-21T07:52:38Z

Fix in progress here: cockroachdb/pebble#1960

blathers-crl · 2022-09-21T07:52:44Z

Hi @erikgrinaker, please add branch-* labels to identify which branch(es) this release-blocker affects.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

```
63300403 db: enable TrySeekUsingNext after Next in external iters
fa910870 db: add ExternalIter_NonOverlapping_SeekNextScan benchmark
1d444f36 sstable: include more blocks' stats in BlockBytes
d8f4eb38 docs: fix date for struct zeroing annotation
c04d1287 metamorphic: always synchronize Clone on underlying Batch
81a4342c docs: add benchmark annotation for cockroachdb#1822
```

Addresses cockroachdb#88329.

Release note: None

88291: roachtest: use default value of max_refresh_spans_bytes for tpch_concurrency r=yuzefovich a=yuzefovich

This commit makes it so that we use the default value of `kv.transaction.max_refresh_spans_bytes` cluster setting in the tpch_concurrency roachtest. The idea is that we should be testing what we ship, and we do understand why the increase of the default for that setting in 22.2 led to regression on this roachtest. Many improvements have been made to get that back, so we now are on par with 22.1, and the corresponding issue has been closed. As a result, one test config is now removed. I decided to keep the "no streamer" config as it still seems useful, at least for 23.1 release cycle.

Related to #81451.

Release note: None

88349: nightlies: fix cloud unit test nightly script r=rhu713 a=adityamaru

Release note: None

88354: vendor: bump Pebble to 63300403d537 r=nicktrav a=jbowens

```
63300403 db: enable TrySeekUsingNext after Next in external iters
fa910870 db: add ExternalIter_NonOverlapping_SeekNextScan benchmark
1d444f36 sstable: include more blocks' stats in BlockBytes
d8f4eb38 docs: fix date for struct zeroing annotation
c04d1287 metamorphic: always synchronize Clone on underlying Batch
81a4342c docs: add benchmark annotation for #1822
```

Addresses #88329.

Release note: None

88357: bazel,ci: find `compare_test` binary under `bazel-bin` r=healthy-pod a=rickystewart

Since the Go 1.19 upgrade this has been broken as `realpath` has been getting the `-test.timeout` argument and been getting confused. Also since Go 1.19 it is much easier to find this binary which is right under the normal `bazel-bin`.

Release note: None

88359: sql: fix beautiful diagram which gofmt messed up r=Xiang-Gu a=ajwerner

This works. Don't ask me why.

Release note: None

Co-authored-by: Yahor Yuzefovich
Co-authored-by: adityamaru
Co-authored-by: Jackson Owens
Co-authored-by: Ricky Stewart
Co-authored-by: Andrew Werner

```
d8728d2a db: fix RangeKeyChanged and -WithLimit interaction
986f0c8d sstable: fix interaction between bpf and monotonic bounds optimization
c13723fd db: enable TrySeekUsingNext after Next in external iters
d0971a91 db: add ExternalIter_NonOverlapping_SeekNextScan benchmark
5269d612 sstable: include more blocks' stats in BlockBytes
3b11f3dd db: expand iter_histories test coverage
5580541b db: refactor TestRangeKeys into TestIterHistories
```

Close cockroachdb#88329. Close cockroachdb#88296.

Release note: None

jbowens · 2022-09-26T14:09:31Z

Resolved by #88354 (master) and #88584 (22.2).

This setting was previously disabled because of timeouts being observed when restoring our TPCCInc fixtures. The cause of those timeouts has been identified as cockroachdb#88329, making it safe to re-enable merging of spans during restore. This setting prevents restore from over-splitting and leaving the cluster with a merge hangover post restore.

Informs: cockroachdb#86470

Release note (sql change): Sets `backup.restore_span.target_size` to default to 384 MiB so that restore merges spans up to that size when reading from the backup before actually ingesting data. This should reduce the number of ranges created during restore and thereby reduce the merging of ranges that needs to occur post restore.
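As a rough illustration of what merging spans up to a target size can look like, the sketch below greedily coalesces adjacent spans while the combined size stays within a limit. This is an assumption-laden toy and not the backupccl implementation: the `span` type, the `mergeSpans` function, and the example sizes are hypothetical; only the 384 MiB target comes from the release note above.

```go
package main

import "fmt"

// span is a hypothetical stand-in for a restore span: a key range plus the
// approximate number of bytes of backup data it covers.
type span struct {
	start, end string
	bytes      int64
}

// mergeSpans greedily coalesces adjacent spans while the merged size stays
// at or below targetSize. The input is assumed to be sorted by start key.
// Spans that are not contiguous with the previous one are never merged.
func mergeSpans(spans []span, targetSize int64) []span {
	var out []span
	for _, s := range spans {
		n := len(out)
		if n > 0 && out[n-1].end == s.start && out[n-1].bytes+s.bytes <= targetSize {
			// Extend the previous span instead of starting a new one.
			out[n-1].end = s.end
			out[n-1].bytes += s.bytes
			continue
		}
		out = append(out, s)
	}
	return out
}

func main() {
	const mib = 1 << 20
	// Example sizes are made up for illustration.
	spans := []span{
		{"a", "b", 200 * mib},
		{"b", "c", 100 * mib},
		{"c", "d", 150 * mib},
		{"d", "e", 300 * mib},
	}
	for _, s := range mergeSpans(spans, 384*mib) {
		fmt.Printf("[%s, %s) %d MiB\n", s.start, s.end, s.bytes/mib)
	}
}
```

Fewer, larger spans mean fewer ranges are created up front, which is the effect the release note describes; the trade-off is that each merged span covers more data per unit of restore work.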

87449: workload,ttl: add TTL workload for benchmarking time to finish r=rafiss a=ecwall

fixes #88172

Measures time row-level TTL job takes to run on a table:
1) Drop TTL table IF EXISTS.
2) Create a table without TTL.
3) Insert initialRowCount number of rows.
4) Gets number of rows that should expire.
5) Wait for table ranges to stabilize after scattering.
6) Enable TTL on table.
7) Poll table until TTL job is complete.

Note: Ops is a no-op and no histograms are used. Benchmarking is done inside Hooks and details are logged.

Adds useDistSQL field to TTL job progress protobuf for visibility into which version was run during cluster upgrades.

Release justification: Added TTL workload.

Release note: None

89317: sql,tree: improve function resolution efficiency r=ajwerner a=ajwerner

#### sql: prevent allocations by avoiding some name pointers

We don't need pointers for these names. They generally won't escape.

#### sql,tree: change SearchPath to avoid allocations

The closure-oriented interface was forcing the closures and the variables they referenced to escape to the heap. This change, while not beautiful, ends up being much more efficient.

```
name                                         old time/op    new time/op    delta
SQL/MultinodeCockroach/Upsert/count=1000-16  20.4ms ±11%    18.9ms ± 8%    -7.47%  (p=0.000 n=20+19)

name                                         old alloc/op   new alloc/op   delta
SQL/MultinodeCockroach/Upsert/count=1000-16  10.1MB ±29%    9.8MB ±29%     ~       (p=0.231 n=20+20)

name                                         old allocs/op  new allocs/op  delta
SQL/MultinodeCockroach/Upsert/count=1000-16  56.3k ± 7%     50.2k ±10%     -10.81%  (p=0.000 n=19+19)
```

Release note: None

89333: backupccl: enable `restore_span.target_size` r=dt,stevendanna a=adityamaru

This setting was previously disabled because of timeouts being observed when restoring our TPCCInc fixtures. The cause of those timeouts has been identified as #88329, making it safe to re-enable merging of spans during restore. This setting prevents restore from over-splitting and leaving the cluster with a merge hangover post restore.

Informs: #86470

Release note (sql change): Sets `backup.restore_span.target_size` to default to 384 MiB so that restore merges spans up to that size when reading from the backup before actually ingesting data. This should reduce the number of ranges created during restore and thereby reduce the merging of ranges that needs to occur post restore.

Co-authored-by: Evan Wall
Co-authored-by: Andrew Werner
Co-authored-by: adityamaru


jbowens · 2022-10-17T14:16:28Z

exalate reopened


nicktrav · 2023-01-30T22:34:58Z

We're going to re-run the benchmarks to confirm that recent fixes made an impact here.

erikgrinaker added the C-performance, A-storage, and T-storage labels Sep 21, 2022

erikgrinaker assigned jbowens Sep 21, 2022

erikgrinaker added the GA-blocker label Sep 21, 2022

erikgrinaker added the branch-release-22.2 label Sep 21, 2022

This was referenced Sep 21, 2022

db: enable TrySeekUsingNext after Next in external iters cockroachdb/pebble#1960

Merged

crl-release-22.2: db: enable TrySeekUsingNext after Next in external iters cockroachdb/pebble#1969

Merged

jbowens mentioned this issue Sep 21, 2022

vendor: bump Pebble to 63300403d537 #88354

Merged

jbowens mentioned this issue Sep 23, 2022

release-22.2: vendor: bump Pebble to d8728d2a354d #88584

Merged

jbowens closed this as completed Sep 26, 2022

nicktrav added the sync-me label Sep 26, 2022

adityamaru mentioned this issue Oct 4, 2022

backupccl: enable restore_span.target_size #89333

Merged

blathers-crl bot mentioned this issue Oct 4, 2022

release-22.2: backupccl: enable restore_span.target_size #89351

Merged

exalate-issue-sync bot reopened this Oct 12, 2022

exalate-issue-sync bot removed the sync-me label Oct 12, 2022

jbowens closed this as completed Oct 17, 2022

blathers-crl bot mentioned this issue Oct 24, 2022

release-22.2.0: release-22.2: backupccl: enable restore_span.target_size #90573

Closed

blathers-crl bot mentioned this issue Oct 25, 2022

release-22.2.0: backupccl: enable restore_span.target_size #90630

Closed

nicktrav added the sync-me label May 31, 2023
