c2c: improve initial scan performance #97592

Closed
msbutler opened this issue Feb 23, 2023 · 2 comments
Labels
A-disaster-recovery C-investigation Further steps needed to qualify. C-label will change. T-disaster-recovery

Comments

@msbutler
Collaborator

msbutler commented Feb 23, 2023

Occasionally during an initial scan, a node on the destination cluster can fall behind on work, leading to poor initial scan performance. This issue tracks the investigation into why this occurs.

This can be observed in the metrics from the upcoming c2c/tpcc/warehouses=1000/duration=60/cutover=30 roachtest: node 6 (red in the right-hand graphs) on the destination side fell far behind the other nodes during the first 10 minutes of ingestion, and then ended up with the bulk of the work!
[image: roachtest metrics graphs]

Specifically, between 14:30 and 14:40 the lagging node still ingested SSTs via replication, but not much through the stream ingestion processor. For some reason, after 14:40, node 6 picked up its work and began directly ingesting the bulk of the SSTs.

Jira issue: CRDB-24777

Epic CRDB-19402

@msbutler msbutler added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). C-investigation Further steps needed to qualify. C-label will change. T-disaster-recovery labels Feb 23, 2023
@blathers-crl

blathers-crl bot commented Feb 23, 2023

cc @cockroachdb/disaster-recovery

@exalate-issue-sync exalate-issue-sync bot removed the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Mar 2, 2023
@exalate-issue-sync exalate-issue-sync bot changed the title from "c2c: investigate lagging and uneven work distribution on nodes during initial scan" to "c2c: improve initial scan performance" Mar 2, 2023
lidorcarmel added a commit to lidorcarmel/cockroach that referenced this issue May 17, 2023
Before this change, c2c buffered an SST in memory and flushed it to KV. If the
SST spanned multiple ranges (likely), the addSST failed, and the SST was then
split and rewritten to KV. We see about 20-30 retries per addSST happening
in a roachtest.

The code to split these SSTs on range boundaries already exists, but it is
disabled without a RangeCache. This PR passes a RangeCache to the SST batcher
to enable flushing on range boundaries.

Epic: CRDB-24777
Informs: cockroachdb#97592

Release note: None
blathers-crl bot pushed a commit that referenced this issue Jun 5, 2023
@msbutler msbutler removed the C-test-failure Broken test (automatically or manually discovered). label Jun 14, 2023
@stevendanna
Collaborator

There is always more to do here. But I'm going to close this for now as @adityamaru made nice improvements here and we can open new issues for specific additional improvements.
