Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ccl/backupccl: add new split and scatter processor that generates import spans #94805

Merged
merged 1 commit into from
Jan 24, 2023

Conversation

rhu713
Copy link
Contributor

@rhu713 rhu713 commented Jan 5, 2023

Previously, restore creates all of its import spans all at once and stores them
in memory. OOMs caused by the size of these import spans on restore of large
backups with many incremental chains has been the cause of many escalations.
This patch modifies import span creation so that import spans are generated one
at a time. This span generator then used in the split and scatter processor to
generate the import spans that are used in the rest of restore instead of
having the spans specified in the processor's spec. A future patch will add
memory monitoring to the import span generation to further safeguard against
OOMs in restore.

This patch also changes the import span generation algorithm. The cluster
setting bulkio.restore.use_simple_import_spans is introduced in this patch,
which, if set to true, will revert the algorithm back to makeSimpleImportSpans.

Release note: None

@rhu713 rhu713 requested review from a team as code owners January 5, 2023 23:06
@rhu713 rhu713 requested review from benbardin and mgartner and removed request for a team January 5, 2023 23:06
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@rhu713 rhu713 requested review from dt and adityamaru and removed request for a team, benbardin and mgartner January 5, 2023 23:06
@rhu713 rhu713 force-pushed the slim-manifest-restore branch from 905db07 to 97a092f Compare January 6, 2023 17:26
Copy link
Member

@dt dt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good!

After a first pass, I'm not feeling too scared about backports here, since it seems like the new processor is mostly just moving from restore_job and the old proc and making it look like a proc.

The new covering generation in makeStreamingImportSpans looks like it has more net-new/different functionality, vs just being moved to run in a different place so that's where I think we want the most scrutiny when we think of a backport. I mentioned an idea below to allow toggling back and forth. I think I could also maybe have benefited from a little more commentary, either inline or maybe just a little prose block comment up front about the new approach, how it assembles a span, what an "interesting" point is? Or maybe I'm just being dense

pkg/ccl/backupccl/restore_span_covering.go Outdated Show resolved Hide resolved
pkg/ccl/backupccl/restore_span_covering.go Outdated Show resolved Hide resolved
highWaterMark int64
ceiling int64
res roachpb.RowCount
importSpans map[int64]roachpb.Span
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

at first I got a little confused thinking this was all the import spans, and thus would oom, and seeing progress reports have their key compared to its content made me think "oh yeah, it must be holding all the import spans, that is scary."

but then I think I realized on closer reading that it is only "in-flight" import spans: we generate some by pulling from the channel until our ceiling is high enough to match any progress we've heard, then we compare that our ith span's key matches the reported complete i'th span's reported key, and then we empty out that map entry, so the map should only have entires for unreported spans below the max-reported span? is that right?

if so a comment, and maybe naming this something like inflightImportSpans or something, might make this more obvious?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that's right, once all spans up to index N are completed, then the highWaterMark will be moved up to N and everything below N will be removed from both maps.

I've done the rename and added a comment describing this behavior.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@rhu713 rhu713 force-pushed the slim-manifest-restore branch 3 times, most recently from 54bf2d3 to 0d0a96a Compare January 11, 2023 22:22
Copy link
Contributor Author

@rhu713 rhu713 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR! I've added a toggle back to makeSimpleImportSpans and added additional commentary about the new import span generation.

pkg/ccl/backupccl/restore_span_covering.go Outdated Show resolved Hide resolved
pkg/ccl/backupccl/restore_span_covering.go Outdated Show resolved Hide resolved
highWaterMark int64
ceiling int64
res roachpb.RowCount
importSpans map[int64]roachpb.Span
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that's right, once all spans up to index N are completed, then the highWaterMark will be moved up to N and everything below N will be removed from both maps.

I've done the rename and added a comment describing this behavior.

@rhu713 rhu713 force-pushed the slim-manifest-restore branch 4 times, most recently from e104f97 to 35ebc76 Compare January 23, 2023 16:55
@rhu713
Copy link
Contributor Author

rhu713 commented Jan 23, 2023

bors r+

@craig
Copy link
Contributor

craig bot commented Jan 23, 2023

Build failed:

…ort spans

Previously, restore creates all of its import spans all at once and stores them
in memory. OOMs caused by the size of these import spans on restore of large
backups with many incremental chains has been the cause of many escalations.
This patch modifies import span creation so that import spans are generated one
at a time. This span generator then used in the split and scatter processor to
generate the import spans that are used in the rest of restore instead of
having the spans specified in the processor's spec. A future patch will add
memory monitoring to the import span generation to further safeguard against
OOMs in restore.

This patch also changes the import span generation algorithm. The cluster
setting `bulkio.restore.use_simple_import_spans` is introduced in this patch,
which, if set to true, will revert the algorithm back to makeSimpleImportSpans.

Release note: None
@rhu713 rhu713 force-pushed the slim-manifest-restore branch from 35ebc76 to f3e0327 Compare January 24, 2023 14:18
@rhu713
Copy link
Contributor Author

rhu713 commented Jan 24, 2023

bors r+

@craig
Copy link
Contributor

craig bot commented Jan 24, 2023

Build succeeded:

@craig craig bot merged commit 1f2ead2 into cockroachdb:master Jan 24, 2023
@blathers-crl
Copy link

blathers-crl bot commented Jan 24, 2023

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from f3e0327 to blathers/backport-release-22.2-94805: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.2.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@yuzefovich
Copy link
Member

This has a warning

INFO: From Generating into bazel-out/darwin_arm64-opt-exec-2B5CBBC6/bin/pkg/sql/execinfrapb/execinfrapb_go_proto_/github.com/cockroachdb/cockroach/pkg/sql/execinfrapb:
WARNING: field GenerativeSplitAndScatterSpec.HighWater is a non-nullable bytes type, nullable=false has no effect

@rhu713 rhu713 deleted the slim-manifest-restore branch January 25, 2023 15:15
@healthy-pod
Copy link
Contributor

@rhu713 / @dt This PR (as far as I can tell) added ~4000+ tests to pkg/ccl/backupccl. Can you explain to a backupccl noob if this is expected and how did the change lead to that?

@dt
Copy link
Member

dt commented Feb 1, 2023

@healthy-podTestRestoreEntryCover has a few nested loops over different testing configurations like such as testing with 1, 2, 4, or 10 backups, each with 1, 2, 5, etc spans of data, etc. At the center of these loops, once we've picked the value for each parameter, we have a unique combination of parameters to test, and call t.Run("backups=x,spans=y,etc"... to test that configuration. If we look at the lengths of each of those loops we get 8*7*8*2*5=4480. This PR includes the addition of for _, simpleImportSpans := range []bool{true, false} { loop in the there, to test each of those configuration in one of two sub-configurations, meaning there are now 4480*2 distinct test cases.

So it isn't so much that someone wrote 4000 tests, but that we had 4000 test cases and we now test each case in two different configurations. That said, it does appear that this test got much more than twice (50-60x) slower.

@healthy-pod
Copy link
Contributor

Makes sense thanks for explaining

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants