
release-22.2: move alloc heavy Files field from manifest to SST, use slim manifest in backup restore #97210

Conversation

@rhu713 rhu713 (Contributor) commented Feb 15, 2023

Backport:

Please see individual PRs for details.

/cc @cockroachdb/release

Release justification: high priority need for the memory reduction in backup and restore

adityamaru and others added 6 commits February 15, 2023 15:29
Repeated fields in a backup's manifest do not scale well as the amount
of backed-up data and the length of the incremental backup chain grow.
This has been a known problem, and it has motivated us to incrementally
move all repeated fields out of the manifest and into standalone
metadata SST files. The advantage of this is that during incremental
backups or restores we do not need to perform large allocations when
unmarshalling the manifest, but can instead stream results from the
relevant SST as and when we need them. In support issues such as
cockroachdb#93272 we have seen this unmarshalling step result in OOMs,
thereby preventing further incremental backups or making the backups
unrestorable. The effort to move all backup and restore metadata into
SSTs and rely on streaming reads and writes is ongoing but outside the
scope of this patch.

This patch is meant to be a targeted fix with an eye for backports.
Past experimentation has shown us that the `Files` repeated field in
the manifest is the largest cause of bloated, unmarshalable manifests.
This change teaches backup to continue writing a manifest file, but a
slimmer one with the `Files` field nil'ed out. The values in the
`Files` field are instead written to an SST file that sits alongside
the `SLIM_BACKUP_MANIFEST`. To maintain mixed-version compatibility
with nodes that rely on a regular manifest, we continue to write a
`BACKUP_MANIFEST` alongside its slim version. On the read path, we add
an optimization that first reads the slim manifest, if present. This
way we avoid unmarshalling the alloc-heavy `Files` field, and instead
teach all the places in the code that need the `Files` to reach out to
the metadata SST and read the values one by one. To support both the
slim and not-so-slim manifests, we introduce an interface that iterates
over the Files depending on the manifest passed to it.
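A minimal sketch of that iterator shape, assuming simplified types (the
names below are illustrative, not the identifiers in the actual patch):

```go
package main

import "fmt"

// BackupFile stands in for an entry of the manifest's Files field; the
// real proto message also carries spans and entry counts.
type BackupFile struct {
	Path string
}

// FileIterator abstracts over where the Files live: an in-memory slice
// (regular manifest) or an external metadata SST (slim manifest).
type FileIterator interface {
	Next() (BackupFile, bool)
	Close()
}

// sliceFileIterator serves Files already unmarshalled from a regular
// manifest. An SST-backed implementation would instead advance an SST
// reader and decode one entry per Next call.
type sliceFileIterator struct {
	files []BackupFile
	idx   int
}

func (it *sliceFileIterator) Next() (BackupFile, bool) {
	if it.idx >= len(it.files) {
		return BackupFile{}, false
	}
	f := it.files[it.idx]
	it.idx++
	return f, true
}

func (it *sliceFileIterator) Close() {}

func main() {
	// Callers iterate the same way regardless of which manifest was
	// read, so only the constructor needs to know the backing store.
	var it FileIterator = &sliceFileIterator{
		files: []BackupFile{{Path: "data/1.sst"}, {Path: "data/2.sst"}},
	}
	defer it.Close()
	for f, ok := it.Next(); ok; f, ok = it.Next() {
		fmt.Println(f.Path)
	}
}
```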

To reiterate, this work is a subset of the improvements we will get
from moving all repeated fields to SSTs and is expected to be superseded
by those efforts when they come to fruition.

Fixes: cockroachdb#93272

Release note (performance improvement): long chains of incremental
backups, and restores of such chains, will now allocate less memory
during the unmarshalling of metadata.
…ort spans

Previously, restore created all of its import spans at once and stored
them in memory. OOMs caused by the size of these import spans during
restores of large backups with many incremental chains have been the
cause of many escalations. This patch modifies import span creation so
that import spans are generated one at a time. This span generator is
then used in the split and scatter processor to generate the import
spans used in the rest of restore, instead of having the spans
specified in the processor's spec. A future patch will add memory
monitoring to the import span generation to further safeguard against
OOMs in restore.
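A rough sketch of the generator shape, assuming a callback-based
design (the type and function names are invented for illustration, not
the ones in the patch):

```go
package main

import "fmt"

// importSpan stands in for restore's import span type (illustrative).
type importSpan struct {
	startKey, endKey string
}

// generateImportSpans streams spans to a callback one at a time rather
// than materializing the whole slice up front, so memory use no longer
// grows with the number of spans in a long incremental chain.
func generateImportSpans(boundaries []string, emit func(importSpan) error) error {
	for i := 0; i+1 < len(boundaries); i++ {
		span := importSpan{startKey: boundaries[i], endKey: boundaries[i+1]}
		if err := emit(span); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	boundaries := []string{"/Table/106", "/Table/107", "/Table/108"}
	_ = generateImportSpans(boundaries, func(s importSpan) error {
		fmt.Printf("import span [%s, %s)\n", s.startKey, s.endKey)
		return nil
	})
}
```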

This patch also changes the import span generation algorithm. The
cluster setting `bulkio.restore.use_simple_import_spans` is introduced
in this patch; if set to true, it reverts the algorithm to
`makeSimpleImportSpans`.

Release note: None
…essor

The generative split and scatter processor is currently causing tests
to fail under race because many goroutines operate on the same
splitAndScatterer, which cannot be used concurrently because the
underlying key rewriter is not safe for concurrent use. Modify the
processor so that every worker that uses the splitAndScatterer now uses
its own instance, as sketched below.
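A minimal sketch of the per-worker pattern, with simplified stand-in
types (keyRewriter and splitAndScatterer here are illustrative, not the
real implementations):

```go
package main

import (
	"fmt"
	"sync"
)

// keyRewriter stands in for restore's key rewriter, which is not safe
// for concurrent use (simplified, illustrative type).
type keyRewriter struct{ prefix string }

func (k *keyRewriter) rewrite(key string) string { return k.prefix + key }

// splitAndScatterer wraps the rewriter; sharing one instance across
// worker goroutines is what tripped the race detector.
type splitAndScatterer struct{ kr *keyRewriter }

func (s *splitAndScatterer) process(key string) string { return s.kr.rewrite(key) }

func main() {
	work := []string{"a", "b", "c", "d"}
	var wg sync.WaitGroup
	for w := 0; w < 2; w++ {
		// Each worker constructs its own scatterer (and therefore its
		// own rewriter) instead of sharing a single instance.
		s := &splitAndScatterer{kr: &keyRewriter{prefix: "/Tenant/1"}}
		wg.Add(1)
		go func(s *splitAndScatterer, start int) {
			defer wg.Done()
			for i := start; i < len(work); i += 2 {
				fmt.Println(s.process(work[i]))
			}
		}(s, w)
	}
	wg.Wait()
}
```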

Fixes: cockroachdb#95808

Release note: None
…cessor

Add the rest of the missing context cancel checks in restore's
generativeSplitAndScatterProcessor. Add a red/green test to show that
runGenerativeSplitAndScatter is interrupted if its supplied context is
canceled.
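A hedged sketch of what such a red/green test can look like (the
processor stand-in and test name are invented for illustration; the
real test exercises runGenerativeSplitAndScatter itself):

```go
package backupccl_test

import (
	"context"
	"testing"
	"time"
)

// runProcessor stands in for runGenerativeSplitAndScatter: it streams
// entries downstream and must return once its context is canceled.
func runProcessor(ctx context.Context, out chan<- int) error {
	for i := 0; ; i++ {
		select {
		case out <- i:
		case <-ctx.Done():
			return ctx.Err() // the cancellation check under test
		}
	}
}

// TestProcessorRespectsCancel is "green" with the check above in place
// and "red" (it times out) if the check is removed.
func TestProcessorRespectsCancel(t *testing.T) {
	ctx, cancel := context.WithCancel(context.Background())
	errCh := make(chan error, 1)
	// The output channel has no reader, simulating downstream
	// processors that have already exited.
	go func() { errCh <- runProcessor(ctx, make(chan int)) }()
	cancel()
	select {
	case err := <-errCh:
		if err != context.Canceled {
			t.Fatalf("expected context.Canceled, got %v", err)
		}
	case <-time.After(5 * time.Second):
		t.Fatal("processor hung after context cancellation")
	}
}
```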

Fixes: cockroachdb#95257

Release note: None
In cockroachdb#95257 we saw a restore grind to a halt two hours into a
five-hour roachtest. The stacks indicated that we may have seen a
context cancellation that was not being respected by the goroutine
running `generateAndSendImportSpans`. This resulted in the
`generative_split_and_scatter_processor` getting stuck writing to a
channel nobody was reading from
(https://github.com/cockroachdb/cockroach/blob/master/pkg/ccl/backupccl/restore_span_covering.go#L516),
since the other goroutines in the processor had seen the ctx
cancellation and exited. A side effect of the generative processor not
shutting down was that the downstream restore data processors would
also hang on their call to `input.Next()`, as they would not receive a
row or a meta from the generative processor signalling them to shut
down. This fix adds a ctx cancellation check to the goroutine described
above, thereby allowing a graceful teardown of the flow.
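A minimal sketch of the pattern behind the fix (names simplified and
illustrative): a bare `ch <- s` can block forever once the readers have
exited, so the send is raced against ctx cancellation instead.

```go
package main

import (
	"context"
	"fmt"
)

// sendSpan sketches the guarded send: it returns as soon as the
// context is canceled rather than blocking on a channel with no
// remaining readers.
func sendSpan(ctx context.Context, ch chan<- string, s string) error {
	select {
	case ch <- s:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	cancel() // the downstream processors have already torn down
	err := sendSpan(ctx, make(chan string), "span-1")
	fmt.Println(err) // context.Canceled
}
```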

This fix also adds the JobID to the generative processor spec so that
logs on remote nodes are correctly tagged with the JobID, making for
easier debugging.

Informs: cockroachdb#95257

Release note (bug fix): fixes a bug where a restore flow could
hang indefinitely in the face of a context cancellation, manifesting
as a stuck restore job.
@blathers-crl (bot) commented Feb 15, 2023

Thanks for opening a backport.

Please check the backport criteria before merging:

  • Patches should only be created for serious issues or test-only changes.
  • Patches should not break backwards-compatibility.
  • Patches should change as little code as possible.
  • Patches should not change on-disk formats or node communication protocols.
  • Patches should not add new functionality.
  • Patches must not add, edit, or otherwise modify cluster versions; or add version gates.
If some of the basic criteria cannot be satisfied, ensure that the exceptional criteria are satisfied within.
  • There is a high priority need for the functionality that cannot wait until the next release and is difficult to address in another way.
  • The new functionality is additive-only and only runs for clusters which have specifically “opted in” to it (e.g. by a cluster setting).
  • New code is protected by a conditional check that is trivial to verify and ensures that it only runs for opt-in clusters.
  • The PM and TL on the team that owns the changed code have signed off that the change obeys the above rules.

Add a brief release justification to the body of your PR to justify this backport.

Some other things to consider:

  • What did we do to ensure that a user that doesn’t know & care about this backport, has no idea that it happened?
  • Will this work in a cluster of mixed patch versions? Did we test that?
  • If a user upgrades a patch version, uses this feature, and then downgrades, what happens?

@cockroach-teamcity (Member)

This change is Reviewable

@rhu713 rhu713 force-pushed the backport22.2-93997-94805-96313-96529-96302-96245 branch from 70f2e81 to 63c6057 on February 15, 2023 21:57
@rhu713 rhu713 requested a review from adityamaru February 15, 2023 21:57
@rhu713 rhu713 marked this pull request as ready for review February 15, 2023 22:03
@rhu713 rhu713 requested review from a team as code owners February 15, 2023 22:03
@rhu713 rhu713 requested review from mgartner and removed request for a team and mgartner February 15, 2023 22:03
@rhu713 rhu713 force-pushed the backport22.2-93997-94805-96313-96529-96302-96245 branch 4 times, most recently from e1c15b8 to a054962 on February 15, 2023 23:34
…t to SST

As part of an effort to make backup manifests scale better for larger
clusters, this patch moves descriptors and descriptor changes from the
manifest to an external SST. This avoids the need to allocate enough
memory to hold every descriptor and descriptor revision for every layer
of a backup during a backup or restore job. This patch also changes the
access pattern for descriptors and descriptor changes to use iterators,
so that they can be read in a streaming manner from the external SST.
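A minimal sketch of that streaming access pattern, assuming a
slice-backed stand-in for the SST reader (all names are illustrative):

```go
package main

import "fmt"

// descIterator sketches the iterator over descriptors stored in the
// external SST. In the real patch it wraps an SST reader positioned on
// the descriptor key range; a slice keeps this sketch self-contained.
type descIterator struct {
	descs []string
	idx   int
}

func (it *descIterator) Valid() bool   { return it.idx < len(it.descs) }
func (it *descIterator) Value() string { return it.descs[it.idx] }
func (it *descIterator) Next()         { it.idx++ }

func main() {
	it := &descIterator{descs: []string{"defaultdb", "movr", "movr.users"}}
	// Callers consume descriptors one at a time rather than holding
	// every revision from every backup layer in memory at once.
	for ; it.Valid(); it.Next() {
		fmt.Println(it.Value())
	}
}
```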

Release note: None
@rhu713 rhu713 force-pushed the backport22.2-93997-94805-96313-96529-96302-96245 branch from a054962 to e12b1f1 on February 16, 2023 00:10
@adityamaru adityamaru requested a review from dt February 16, 2023 14:28
@adityamaru (Contributor)

Thanks for doing this; the list of commits you have here covers all the ones I could think of as well. It looks like the backport was clean and there is nothing new except for the cluster setting to default external manifest SSTs to off?

Given the size of this backport I'm wondering if we should:

  • Run the 200inc restore on 22.2 as well
  • Run a mixed version database/table backup+restore
  • Hopefully our nightlies will shake out other things

The next release, 22.2.6, is scheduled for Mar 7, 2023, so hopefully we can get all these smoke tests done before then.

@adityamaru (Contributor)

I think it's better to get this in for some bake time, unless @dt has any comments. We can follow it up with a mixed-version test and the big 200-layer test.
