release-22.2: move alloc heavy Files field from manifest to SST, use slim manifest in backup restore #97210
Conversation
Repeated fields in a backup's manifest do not scale well as the amount of backed-up data and the length of the incremental backup chain grow. This has been a known problem, and has motivated us to incrementally move all repeated fields out of the manifest and into standalone metadata SST files. The advantage is that during incremental backups or restores we do not need to perform large allocations when unmarshalling the manifest, and can instead stream results from the relevant SST as and when we need them. In support issues such as cockroachdb#93272 we have seen this unmarshalling step result in OOMs, preventing further incremental backups or making the backups unrestorable.

The effort to move all of backup and restore's metadata into SSTs, relying on streaming reads and writes throughout, is ongoing but outside the scope of this patch. This patch is meant to be a targeted fix with an eye toward backports. Past experimentation has shown us that the `Files` repeated field in the manifest is the largest cause of bloated, unmarshalable manifests.

This change teaches backup to continue writing a manifest file, but a slimmer one with the `Files` field nil'ed out. The values in the `Files` field are instead written to an SST file that sits alongside the `SLIM_BACKUP_MANIFEST`. To maintain mixed-version compatibility with nodes that rely on a regular manifest, we continue to write a `BACKUP_MANIFEST` alongside its slim version.

On the read path, we add an optimization that reads the slim manifest first if one is present. This way we avoid unmarshalling the alloc-heavy `Files` field, and instead teach all the places in the code that need the `Files` to reach out to the metadata SST and read the values one by one. To support both the slim and not-so-slim manifests, we introduce an interface that iterates over the `Files` depending on the manifest passed to it.

To reiterate, this work is a subset of the improvements we will get from moving all repeated fields to SSTs, and is expected to be superseded by those efforts when they come to fruition.

Fixes: cockroachdb#93272

Release note (performance improvement): Long chains of incremental backups, and restores of such chains, will now allocate less memory during the unmarshalling of metadata.
Release note: None
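For illustration, here is a minimal Go sketch of the iterator pattern described above, assuming hypothetical names (`FileIterator`, `BackupFile`, `inMemoryFileIterator`) rather than the actual backupccl types:

```go
package main

import "fmt"

// BackupFile stands in for one entry of the manifest's repeated Files field.
type BackupFile struct {
	Path string
	Span string
}

// FileIterator abstracts over where the Files live: inlined in a legacy
// manifest, or streamed one by one from the external metadata SST that
// accompanies a slim manifest.
type FileIterator interface {
	Next() (BackupFile, bool) // second value is false once exhausted
	Close()
}

// inMemoryFileIterator serves Files already unmarshalled from a legacy
// manifest; an SST-backed implementation would satisfy the same interface
// by wrapping an SST reader and decoding rows lazily.
type inMemoryFileIterator struct {
	files []BackupFile
	idx   int
}

func (it *inMemoryFileIterator) Next() (BackupFile, bool) {
	if it.idx >= len(it.files) {
		return BackupFile{}, false
	}
	f := it.files[it.idx]
	it.idx++
	return f, true
}

func (it *inMemoryFileIterator) Close() {}

func main() {
	var it FileIterator = &inMemoryFileIterator{files: []BackupFile{
		{Path: "data/968afa.sst", Span: "/Table/106/1"},
	}}
	defer it.Close()
	for f, ok := it.Next(); ok; f, ok = it.Next() {
		fmt.Println(f.Path, f.Span) // callers never hold the whole slice
	}
}
```

Because call sites program against the interface, they stay agnostic to whether they were handed a slim or a regular manifest.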
…ort spans Previously, restore created all of its import spans at once and stored them in memory. OOMs caused by the size of these import spans, on restores of large backups with many incremental chains, have been the cause of many escalations. This patch modifies import span creation so that import spans are generated one at a time (see the sketch below). This span generator is then used in the split and scatter processor to produce the import spans consumed by the rest of restore, instead of having the spans specified in the processor's spec. A future patch will add memory monitoring to import span generation to further safeguard against OOMs in restore. This patch also changes the import span generation algorithm; the cluster setting `bulkio.restore.use_simple_import_spans` is introduced, which, if set to true, reverts the algorithm back to makeSimpleImportSpans. Release note: None
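As a rough sketch of the one-at-a-time generation described above (the names `importSpan` and `generateImportSpans` are illustrative, not the processor's real code), a generator can stream spans over a channel so that only one span is materialized at a time:

```go
package main

import "fmt"

type importSpan struct{ start, end string }

// generateImportSpans streams one span at a time to out, so peak memory is
// bounded by a single span rather than the restore's entire span set.
func generateImportSpans(n int, out chan<- importSpan) {
	defer close(out)
	for i := 0; i < n; i++ {
		out <- importSpan{
			start: fmt.Sprintf("/Table/%d", i),
			end:   fmt.Sprintf("/Table/%d", i+1),
		}
	}
}

func main() {
	out := make(chan importSpan)
	go generateImportSpans(3, out)
	for sp := range out {
		fmt.Println(sp.start, sp.end) // consumer handles spans as they arrive
	}
}
```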
…essor The generative split and scatter processor is currently causing tests to fail under race, because many goroutines operate on the same splitAndScatterer, which cannot be used concurrently since the underlying key rewriter is not safe for concurrent use. Modify the processor so that every worker that uses the splitAndScatterer now uses its own instance. Fixes: cockroachdb#95808 Release note: None
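A minimal sketch of the per-worker fix, with a stand-in `keyRewriter` type that is not the real one: each goroutine constructs its own instance of the non-thread-safe state instead of sharing a single one.

```go
package main

import (
	"fmt"
	"sync"
)

// keyRewriter stands in for the stateful, non-thread-safe component inside
// the real splitAndScatterer.
type keyRewriter struct{ buf []byte }

func (r *keyRewriter) rewrite(key string) string {
	r.buf = append(r.buf[:0], key...) // reuses internal state: unsafe to share
	return "rewritten/" + string(r.buf)
}

func main() {
	keys := []string{"a", "b", "c", "d"}
	var wg sync.WaitGroup
	for w := 0; w < 2; w++ {
		wg.Add(1)
		go func(worker int) {
			defer wg.Done()
			rw := &keyRewriter{} // one instance per worker, never shared
			for _, k := range keys {
				fmt.Printf("worker %d: %s\n", worker, rw.rewrite(k))
			}
		}(w)
	}
	wg.Wait()
}
```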
…cessor Add the rest of the missing context cancel checks in restore's generativeSplitAndScatterProcessor. Add a red/green test to show that runGenerativeSplitAndScatter is interrupted if its supplied context is canceled. Fixes: cockroachdb#95257 Release note: None
In cockroachdb#95257 we saw a restore grind to a halt two hours into a five-hour roachtest. The stacks indicated that we may have seen a context cancellation that was not being respected by the goroutine running `generateAndSendImportSpans`. This resulted in the `generative_split_and_scatter_processor` getting stuck writing to a channel nobody was reading from (https://github.com/cockroachdb/cockroach/blob/master/pkg/ccl/backupccl/restore_span_covering.go#L516), since the other goroutines in the processor had seen the ctx cancellation and exited. A side effect of the generative processor not shutting down was that the downstream restore data processors would also hang on their calls to `input.Next()`, as they would never receive a row or a meta from the generative processor signalling them to shut down.

This fix adds a ctx cancellation check to the goroutine described above, thereby allowing a graceful teardown of the flow. It also adds the JobID to the generative processor spec so that logs on remote nodes are correctly tagged with the JobID, making for easier debugging.

Informs: cockroachdb#95257

Release note (bug fix): Fixed a bug where a restore flow could hang indefinitely in the face of a context cancellation, manifesting as a stuck restore job.
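The shape of the fix, sketched with generic names (nothing here is the actual processor code): the sending goroutine selects on `ctx.Done()` alongside the channel send, so cancellation unblocks it even when every reader has already exited.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

func producer(ctx context.Context, out chan<- int) error {
	for i := 0; ; i++ {
		select {
		case out <- i: // a bare `out <- i` here could block forever
		case <-ctx.Done():
			return ctx.Err() // graceful teardown instead of a hung flow
		}
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	out := make(chan int) // unbuffered: producer blocks once readers stop
	errCh := make(chan error, 1)
	go func() { errCh <- producer(ctx, out) }()

	fmt.Println(<-out) // read one value, then abandon the channel
	cancel()           // consumer is done; cancel rather than strand the producer

	select {
	case err := <-errCh:
		fmt.Println("producer exited:", err)
	case <-time.After(time.Second):
		fmt.Println("producer hung") // what happens without the ctx.Done case
	}
}
```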
Thanks for opening a backport. Please check the backport criteria before merging:
If some of the basic criteria cannot be satisfied, ensure that the exceptional criteria are satisfied within.
Add a brief release justification to the body of your PR to justify this backport. Some other things to consider:
force-pushed from 70f2e81 to 63c6057
force-pushed from e1c15b8 to a054962
…t to SST As part of an effort to make backup manifests scale better for larger clusters, this patch moves descriptors and descriptor changes from the manifest to an external SST. This avoids the need to allocate enough memory to hold every descriptor and descriptor revision for every layer of a backup during a backup or restore job. This patch also changes the access pattern for descriptors and descriptor changes to use iterators, so that they can be read in a streaming manner from the external SST. Release note: None
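To illustrate the streaming access pattern (using made-up types, not the real descriptor protos), a consumer can resolve which revision of each descriptor was live as of a target time while scanning revisions in order, without ever holding them all in memory:

```go
package main

import "fmt"

type descRevision struct {
	id   int
	time int // revision timestamp, simplified to an int
	name string
}

// asOf streams revisions (assumed sorted by id, then time) and emits the
// newest revision at or before target for each ID, keeping O(1) state.
func asOf(revs []descRevision, target int, emit func(descRevision)) {
	var cur *descRevision
	for i := range revs {
		r := revs[i]
		if cur != nil && r.id != cur.id {
			emit(*cur) // finished this ID; flush its winning revision
			cur = nil
		}
		if r.time <= target {
			c := r
			cur = &c
		}
	}
	if cur != nil {
		emit(*cur)
	}
}

func main() {
	revs := []descRevision{
		{1, 10, "t1@10"}, {1, 20, "t1@20"}, {2, 5, "t2@5"}, {2, 30, "t2@30"},
	}
	asOf(revs, 25, func(d descRevision) { fmt.Println(d.id, d.name) })
	// prints: 1 t1@20, then 2 t2@5
}
```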
force-pushed from a054962 to e12b1f1
Thanks for doing this. The list of commits you have here covers all the ones I could think of as well. It looks like the backport was clean, and there is nothing new except for the cluster setting that defaults the external manifest SSTs to off? Given the size of this backport, I'm wondering if we should:
The next release, 22.2.6, is scheduled for Mar 7, 2023, so hopefully we can get all these smoke tests done before then.
I think it's better to get this in for some bake time, unless @dt has any comments. We can follow it up with a mixed-version test and the big 200-layer test.
Backport:
Please see individual PRs for details.
/cc @cockroachdb/release
Release justification: high priority need for the memory reduction in backup and restore