backupccl: fix key rewriter race in generative split and scatter processor #96313

Merged 1 commit on Feb 1, 2023

@rhu713 (Contributor) commented Jan 31, 2023

The generative split and scatter processor currently causes tests to fail under race because many goroutines share the same splitAndScatterer. A splitAndScatterer is not safe for concurrent use because its underlying key rewriter is not. Modify the processor so that every worker that uses a splitAndScatterer gets its own instance.

Fixes: #95808

Release note: None
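
The per-worker pattern described above can be sketched in isolation. This is a minimal illustration, not the code from this PR: `keyRewriter`, `rewrite`, and `rewriteAll` are hypothetical names, and the scratch buffer is only an assumed reason why a rewriter might be unsafe to share between goroutines.

```go
package main

import (
	"fmt"
	"sync"
)

// keyRewriter is a hypothetical stand-in for a rewriter that reuses an
// internal scratch buffer, which makes one instance unsafe to share
// between goroutines.
type keyRewriter struct {
	prefix  string
	scratch []byte
}

// rewrite returns key with the prefix prepended. It writes into r.scratch,
// so two goroutines calling it on the same instance would race.
func (r *keyRewriter) rewrite(key string) string {
	r.scratch = append(r.scratch[:0], r.prefix...)
	r.scratch = append(r.scratch, key...)
	return string(r.scratch)
}

// rewriteAll fans keys out to numWorkers goroutines (numWorkers must be > 0).
// Each worker gets its own keyRewriter, so no scratch buffer is shared.
func rewriteAll(keys []string, numWorkers int) []string {
	rewriters := make([]*keyRewriter, numWorkers)
	for i := range rewriters {
		rewriters[i] = &keyRewriter{prefix: "new/"}
	}

	out := make([]string, len(keys))
	var wg sync.WaitGroup
	for w := 0; w < numWorkers; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			r := rewriters[w]
			// Worker w handles indexes w, w+numWorkers, ... so each
			// element of out is written by exactly one goroutine.
			for i := w; i < len(keys); i += numWorkers {
				out[i] = r.rewrite(keys[i])
			}
		}(w)
	}
	wg.Wait()
	return out
}

func main() {
	fmt.Println(rewriteAll([]string{"a", "b", "c", "d", "e"}, 2))
}
```

Sharing a single `keyRewriter` across the workers in this sketch would be the kind of access that `go test -race` reports.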

@rhu713 rhu713 requested a review from adityamaru January 31, 2023 22:14
@rhu713 rhu713 requested a review from a team as a code owner January 31, 2023 22:14
@blathers-crl bot commented Jan 31, 2023

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.


return nil, err
}
numSplitWorkers := int(spec.NumNodes)
numSecondSplitWorkers := 2 * numSplitWorkers

Contributor

can we add a comment here about why we chose 2 * numSplitWorkers? I realize this is probably not your doing but a high level reason explaining the factor of 2 would be useful.

Contributor Author

done, I can't really find out why exactly the factor of 2 was chosen but I edited the TODO to perhaps tune it in the future.

@@ -342,8 +360,8 @@ func runGenerativeSplitAndScatter(

// TODO(pbardea): This tries to cover for a bad scatter by having 2 * the

Contributor

can we move this TODO with the 2* and think about whether it is still applicable? If we want to punt the thinking for later can we add your or my name in the TODO 😋

Contributor Author

I moved the TODO to where the factor of 2 is and put my name on it.

// scatterers contain the splitAndScatterers for the first group of split and
// scatter workers. Each worker needs its own scatterer as one cannot be used
// concurrently.
scatterers []splitAndScatterer

Contributor

What do you think about calling this one chunkSplitAndScatterer and in the comment indicating that these are used to split and scatter importSpanChunks?

Then the one below can be called chunkEntrySplitAndScatterer or entrySplitAndScatterer? It doesn't perform scatters so maybe we mention that in the comment and describe what it does.

Contributor Author

done, renamed and added the comments.

flowCtx: flowCtx,
spec: spec,
output: output,
scatterers: scatterers[:numSplitWorkers],

Contributor

sanity checking that this creates a copy of the scatterers slice and doesn't just copy the pointer

Contributor Author

these do just copy the pointer but I changed the code to use two slices just so that we are not confused in the future.

doneScatterCh chan<- entryNode,
) error {
log.Infof(ctx, "Running generative split and scatter with %d total spans, %d chunk size, %d nodes",
spec.NumEntries, spec.ChunkSize, spec.NumNodes)
g := ctxgroup.WithContext(ctx)

- splitWorkers := int(spec.NumNodes)
+ splitWorkers := len(scatterers)

Contributor

nit: maybe we call this splitAndScatterWorkers since they do both?

Contributor Author

done

@@ -291,6 +308,7 @@ func runGenerativeSplitAndScatter(
importSpanChunksCh := make(chan scatteredChunk, splitWorkers*2)

Contributor

I should've done this in your previous PR sorry, but do you think you could add a line or two above each goroutine in this method describing what each of them does?

Contributor Author

done, added comments.

@adityamaru adityamaru self-requested a review January 31, 2023 23:32
@rhu713 rhu713 force-pushed the split-scatter-race branch 3 times, most recently from cb7fb4a to 21dde30 Compare February 1, 2023 17:06
@adityamaru (Contributor) commented Feb 1, 2023

you need a rebase to get past the failing DD tests, I ran into the same thing this morning. Since you're going to push anyways can I also bug you to change the genSpan signature to take in a channel instead of capturing importSpansCh? A couple of folks have been confused by the channel reassignment - https://github.com/cockroachdb/cockroach/pull/96302/files/1d882cd1fff94e8b3a0276169a14801f3ea94adb#r1093591682. I'm happy to send a separate patch for that as well, lemme know!
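
The signature change suggested here, passing the channel in rather than capturing it, might look roughly like the sketch below. It is only an illustration of the general idea: `genValues` and `sum` are hypothetical functions, not the real `genSpan`, whose actual signature is not shown in this conversation.

```go
package main

import "fmt"

// genValues sends n integers on out and then closes it. The channel is an
// explicit parameter, so the function does not depend on a variable in the
// caller's scope that might be reassigned later.
func genValues(n int, out chan<- int) {
	defer close(out)
	for i := 0; i < n; i++ {
		out <- i
	}
}

// sum drains in until it is closed and returns the total.
func sum(in <-chan int) int {
	total := 0
	for v := range in {
		total += v
	}
	return total
}

func main() {
	ch := make(chan int, 4)
	go genValues(5, ch)
	fmt.Println(sum(ch)) // 0+1+2+3+4 = 10
}
```

With the channel in the signature, a reader can see which channel the producer writes to at the call site, without tracing assignments to a captured variable.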

@rhu713 rhu713 force-pushed the split-scatter-race branch from 21dde30 to 100b2b9 Compare February 1, 2023 19:08
@rhu713 rhu713 requested a review from a team as a code owner February 1, 2023 19:08
@rhu713 (Contributor, Author) commented Feb 1, 2023

bors r+

@craig craig bot merged commit 7ac7861 into cockroachdb:master Feb 1, 2023
@craig bot (Contributor) commented Feb 1, 2023

Build succeeded.

Successfully merging this pull request may close these issues.

ccl/backupccl: TestBackupRestoreMultiNodeLocal failed