Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage: slow SST NextKey() scans during restore #88329

Closed
erikgrinaker opened this issue Sep 21, 2022 · 5 comments
Closed

storage: slow SST NextKey() scans during restore #88329

erikgrinaker opened this issue Sep 21, 2022 · 5 comments
Assignees
Labels
A-storage Relating to our storage engine (Pebble) on-disk storage. branch-release-22.2 Used to mark GA and release blockers, technical advisories, and bugs for 22.2 C-performance Perf of queries or internals. Solution not expected to change functional behavior. GA-blocker sync-me T-storage Storage Team

Comments

@erikgrinaker
Copy link
Contributor

erikgrinaker commented Sep 21, 2022

Backup restoration uses an SST iterator that multiplexes several input SSTs, and then scans across them using NextKey(). pebbleIterator will try to step once, and if it's not on a new key then it will seek:

// If the Next() call above didn't move to a different key, seek to it.
if p.UnsafeKey().Key.Equal(p.keyBuf) {
// This is equivalent to:
// p.iter.SeekGE(EncodeKey(MVCCKey{p.UnsafeKey().Key.Next(), hlc.Timestamp{}}))
seekKey := append(p.keyBuf, 0, 0)
p.iter.SeekGE(seekKey)

However, this has been seen to be very expensive. Pebble itself has an optimization to step rather than seek if the target is nearby, but no such optimization exists for the SST iterator. Furthermore, if the SSTs are exhausted, the SST reader will attempt to reseek the SST, which is very expensive.

Screenshot 2022-09-07 at 18 56 47

We need to improve performance here.

Jira issue: CRDB-19772

Epic CRDB-20465

@erikgrinaker erikgrinaker added C-performance Perf of queries or internals. Solution not expected to change functional behavior. A-storage Relating to our storage engine (Pebble) on-disk storage. T-storage Storage Team labels Sep 21, 2022
@erikgrinaker
Copy link
Contributor Author

Fix in progress here: cockroachdb/pebble#1960

@blathers-crl
Copy link

blathers-crl bot commented Sep 21, 2022

Hi @erikgrinaker, please add branch-* labels to identify which branch(es) this release-blocker affects.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@erikgrinaker erikgrinaker added the branch-release-22.2 Used to mark GA and release blockers, technical advisories, and bugs for 22.2 label Sep 21, 2022
jbowens added a commit to jbowens/cockroach that referenced this issue Sep 21, 2022
```
63300403 db: enable TrySeekUsingNext after Next in external iters
fa910870 db: add ExternalIter_NonOverlapping_SeekNextScan benchmark
1d444f36 sstable: include more blocks' stats in BlockBytes
d8f4eb38 docs: fix date for struct zeroing annotation
c04d1287 metamorphic: always synchronize Clone on underlying Batch
81a4342c docs: add benchmark annotation for cockroachdb#1822
```

Addresses cockroachdb#88329.

Release note: None
craig bot pushed a commit that referenced this issue Sep 21, 2022
88291: roachtest: use default value of max_refresh_spans_bytes for tpch_concurrency r=yuzefovich a=yuzefovich

This commit makes it so that we use the default value of `kv.transaction.max_refresh_spans_bytes` cluster setting in the tpch_concurrency roachtest. The idea is that we should be testing what we ship, and we do understand why the increase of the default for that setting in 22.2 led to regression on this roachtest. Many improvements have been made to get that back, so we now are on par with 22.1, and the corresponding issue has been closed. As a result, one test config is now removed. I decided to keep the "no streamer" config as it still seems useful, at least for 23.1 release cycle.

Related to #81451.

Release note: None

88349: nightlies: fix cloud unit test nightly script r=rhu713 a=adityamaru

Release note: None

88354: vendor: bump Pebble to 63300403d537 r=nicktrav a=jbowens

```
63300403 db: enable TrySeekUsingNext after Next in external iters
fa910870 db: add ExternalIter_NonOverlapping_SeekNextScan benchmark
1d444f36 sstable: include more blocks' stats in BlockBytes
d8f4eb38 docs: fix date for struct zeroing annotation
c04d1287 metamorphic: always synchronize Clone on underlying Batch
81a4342c docs: add benchmark annotation for #1822
```

Addresses #88329.

Release note: None

88357: bazel,ci: find `compare_test` binary under `bazel-bin` r=healthy-pod a=rickystewart

Since the Go 1.19 upgrade this has been broken as `realpath` has been getting the `-test.timeout` argument and been getting confused. Also since Go 1.19 it is must easier to find this binary which is right under the normal `bazel-bin`.

Release note: None

88359: sql: fix beautiful diagram which gofmt messed up r=Xiang-Gu a=ajwerner

This works. Don't ask me why.

Release note: None

Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: adityamaru <[email protected]>
Co-authored-by: Jackson Owens <[email protected]>
Co-authored-by: Ricky Stewart <[email protected]>
Co-authored-by: Andrew Werner <[email protected]>
jbowens added a commit to jbowens/cockroach that referenced this issue Sep 23, 2022
```
d8728d2a db: fix RangeKeyChanged and -WithLimit interaction
986f0c8d sstable: fix interaction between bpf and monotonic bounds optimization
c13723fd db: enable TrySeekUsingNext after Next in external iters
d0971a91 db: add ExternalIter_NonOverlapping_SeekNextScan benchmark
5269d612 sstable: include more blocks' stats in BlockBytes
3b11f3dd db: expand iter_histories test coverage
5580541b db: refactor TestRangeKeys into TestIterHistories
```

Close cockroachdb#88329.
Close cockroachdb#88296.

Release note: None
@jbowens
Copy link
Collaborator

jbowens commented Sep 26, 2022

Resolved by #88354 (master) and #88584 (22.2).

@jbowens jbowens closed this as completed Sep 26, 2022
adityamaru added a commit to adityamaru/cockroach that referenced this issue Oct 4, 2022
This setting was previously disabled because of timeouts being
observed when restoring our TPCCInc fixtures. The cause of those
timeouts has been identified as
cockroachdb#88329 making it safe
to re-enable merging of spans during restore. This settings prevents
restore from over-splitting and leaving the cluster with a merge hangover
post restore.

Informs: cockroachdb#86470

Release note (sql change): Sets `backup.restore_span.target_size`
to default to 384 MiB so that restore merges upto that size of spans
when reading from the backup before actually ingesting data. This should
reduce the number of ranges created during restore and thereby reduce
the merging of ranges that needs to occur post restore.
craig bot pushed a commit that referenced this issue Oct 4, 2022
87449: workload,ttl: add TTL workload for benchmarking time to finish r=rafiss a=ecwall

fixes #88172

Measures time row-level TTL job takes to run on a table:
1) Drop TTL table IF EXISTS.
2) Create a table without TTL.
3) Insert initialRowCount number of rows.
4) Gets number of rows that should expire.
5) Wait for table ranges to stabilize after scattering.
6) Enable TTL on table.
7) Poll table until TTL job is complete.
Note: Ops is a no-op and no histograms are used.
Benchmarking is done inside Hooks and details are logged.

Adds useDistSQL field to TTL job progress protobuf for
visibility into which version was run during cluster
upgrades.

Release justification: Added TTL workload.

Release note: None

89317: sql,tree: improve function resolution efficiency r=ajwerner a=ajwerner

#### sql: prevent allocations by avoiding some name pointers

We don't need pointers for these names. They generally won't escape.

#### sql,tree: change SearchPath to avoid allocations
The closure-oriented interface was forcing the closures and the variables they
referenced to escape to the heap. This change, while not beautiful, ends up being
much more efficient.

```
name                                         old time/op    new time/op    delta
SQL/MultinodeCockroach/Upsert/count=1000-16    20.4ms ±11%    18.9ms ± 8%   -7.47%  (p=0.000 n=20+19)

name                                         old alloc/op   new alloc/op   delta
SQL/MultinodeCockroach/Upsert/count=1000-16    10.1MB ±29%     9.8MB ±29%     ~     (p=0.231 n=20+20)

name                                         old allocs/op  new allocs/op  delta
SQL/MultinodeCockroach/Upsert/count=1000-16     56.3k ± 7%     50.2k ±10%  -10.81%  (p=0.000 n=19+19)
```

Release note: None

89333: backupccl: enable `restore_span.target_size` r=dt,stevendanna a=adityamaru

This setting was previously disabled because of timeouts being observed when restoring our TPCCInc fixtures. The cause of those timeouts has been identified as
#88329 making it safe to re-enable merging of spans during restore. This settings prevents restore from over-splitting and leaving the cluster with a merge hangover post restore.

Informs: #86470

Release note (sql change): Sets `backup.restore_span.target_size` to default to 384 MiB so that restore merges upto that size of spans when reading from the backup before actually ingesting data. This should reduce the number of ranges created during restore and thereby reduce the merging of ranges that needs to occur post restore.

Co-authored-by: Evan Wall <[email protected]>
Co-authored-by: Andrew Werner <[email protected]>
Co-authored-by: adityamaru <[email protected]>
blathers-crl bot pushed a commit that referenced this issue Oct 4, 2022
This setting was previously disabled because of timeouts being
observed when restoring our TPCCInc fixtures. The cause of those
timeouts has been identified as
#88329 making it safe
to re-enable merging of spans during restore. This settings prevents
restore from over-splitting and leaving the cluster with a merge hangover
post restore.

Informs: #86470

Release note (sql change): Sets `backup.restore_span.target_size`
to default to 384 MiB so that restore merges upto that size of spans
when reading from the backup before actually ingesting data. This should
reduce the number of ranges created during restore and thereby reduce
the merging of ranges that needs to occur post restore.
@exalate-issue-sync exalate-issue-sync bot reopened this Oct 12, 2022
@jbowens
Copy link
Collaborator

jbowens commented Oct 17, 2022

exalate reopened

@jbowens jbowens closed this as completed Oct 17, 2022
blathers-crl bot pushed a commit that referenced this issue Oct 24, 2022
This setting was previously disabled because of timeouts being
observed when restoring our TPCCInc fixtures. The cause of those
timeouts has been identified as
#88329 making it safe
to re-enable merging of spans during restore. This settings prevents
restore from over-splitting and leaving the cluster with a merge hangover
post restore.

Informs: #86470

Release note (sql change): Sets `backup.restore_span.target_size`
to default to 384 MiB so that restore merges upto that size of spans
when reading from the backup before actually ingesting data. This should
reduce the number of ranges created during restore and thereby reduce
the merging of ranges that needs to occur post restore.
blathers-crl bot pushed a commit that referenced this issue Oct 25, 2022
This setting was previously disabled because of timeouts being
observed when restoring our TPCCInc fixtures. The cause of those
timeouts has been identified as
#88329 making it safe
to re-enable merging of spans during restore. This settings prevents
restore from over-splitting and leaving the cluster with a merge hangover
post restore.

Informs: #86470

Release note (sql change): Sets `backup.restore_span.target_size`
to default to 384 MiB so that restore merges upto that size of spans
when reading from the backup before actually ingesting data. This should
reduce the number of ranges created during restore and thereby reduce
the merging of ranges that needs to occur post restore.
@nicktrav
Copy link
Collaborator

We're going to re-run the benchmarks to confirm that recent fixes made an impact here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-storage Relating to our storage engine (Pebble) on-disk storage. branch-release-22.2 Used to mark GA and release blockers, technical advisories, and bugs for 22.2 C-performance Perf of queries or internals. Solution not expected to change functional behavior. GA-blocker sync-me T-storage Storage Team
Projects
None yet
Development

No branches or pull requests

3 participants