crosscluster/logical: prevent data looping via OriginID session variable #126404

Merged

@msbutler msbutler commented Jun 28, 2024

Previously, LDR prevented data looping by spinning up rangefeeds with
filtering. This big hammer also kept LDR-replicated data out of
destination-side changefeeds.

This patch replaces that data loop prevention strategy by 1) binding an
OriginID of 1 to the MVCCValueHeader of each replicated KV during ingestion; 2)
filtering out these KVs by their OriginID value when they appear as LDR
source-side rangefeed events.

To implement 1), the Internal Executor in the LDR row processor now sets the
OriginIDForLogicalDataReplication session variable to 1, which has the effect
of binding OriginID=1 to each batch request header created by the
InternalExecutor's write queries. The request header value will be plumbed to
each KV's Value header in the kv layer.
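
A minimal sketch of the plumbing described above, for readers unfamiliar with
the path from session variable to value header. Every type and function here
(sessionData, batchHeader, storedKV, headerFromSession, applyBatch) is a
hypothetical stand-in and not the real CockroachDB code; only the field name
OriginIDForLogicalDataReplication is taken from the description.

```go
package main

import "fmt"

// Every type here is a hypothetical stand-in for illustration; none of them
// is CockroachDB's real struct.

// sessionData models the session variable set by the LDR row processor.
type sessionData struct {
	OriginIDForLogicalDataReplication uint32
}

// batchHeader models the header attached to each KV batch request.
type batchHeader struct {
	OriginID uint32
}

// storedKV models a written KV together with the origin ID in its value header.
type storedKV struct {
	Key, Value string
	OriginID   uint32
}

// headerFromSession copies the session's origin ID into a batch header.
func headerFromSession(sd sessionData) batchHeader {
	return batchHeader{OriginID: sd.OriginIDForLogicalDataReplication}
}

// applyBatch stamps every write in the batch with the header's origin ID.
func applyBatch(h batchHeader, writes [][2]string) []storedKV {
	out := make([]storedKV, 0, len(writes))
	for _, w := range writes {
		out = append(out, storedKV{Key: w[0], Value: w[1], OriginID: h.OriginID})
	}
	return out
}

func main() {
	// The LDR row processor's session carries origin ID 1.
	ldr := sessionData{OriginIDForLogicalDataReplication: 1}
	// An ordinary client session leaves the variable at its zero value.
	local := sessionData{}

	writes := [][2]string{{"k1", "v1"}, {"k2", "v2"}}
	fmt.Println(applyBatch(headerFromSession(ldr), writes))
	fmt.Println(applyBatch(headerFromSession(local), writes))
}
```

The point of the sketch is that the writer never tags individual KVs itself:
it sets one session-level value, and each layer copies it downward.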

To implement 2), source-side rangefeeds are now initialized with the
WithEmitMatchingOriginIDs option, causing rangefeeds to emit only local writes,
i.e. those with OriginID=0.
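
As a rough illustration of the filtering effect (not of the option's actual
signature, which is not shown in this PR description), the sketch below models
a rangefeed that drops any event whose origin ID is not in an allowed set. The
event type and filterByOriginID helper are hypothetical.

```go
package main

import "fmt"

// event is a hypothetical stand-in for a rangefeed value event.
type event struct {
	Key      string
	OriginID uint32
}

// filterByOriginID keeps only the events whose origin ID is in allowed.
func filterByOriginID(events []event, allowed ...uint32) []event {
	keep := make(map[uint32]bool, len(allowed))
	for _, id := range allowed {
		keep[id] = true
	}
	var out []event
	for _, ev := range events {
		if keep[ev.OriginID] {
			out = append(out, ev)
		}
	}
	return out
}

func main() {
	events := []event{
		{Key: "local-write", OriginID: 0},      // written by a local client
		{Key: "replicated-write", OriginID: 1}, // ingested by LDR
	}
	// Allowing only origin ID 0 means the replicated write is never sent
	// back to the cluster it came from, which breaks the loop.
	for _, ev := range filterByOriginID(events, 0) {
		fmt.Println(ev.Key)
	}
}
```

Because the filter applies only to the LDR source-side rangefeed, other
consumers such as changefeeds can still see the replicated writes.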

Note that a similar ingestion-side plumbing strategy will be used for
OriginTimestamp, even though each ingested row may have a different timestamp.
We can still bind the OriginTimestamp to the Internal Executor session before
each query because 1) each IE query creates a new session; 2) we do not plan to
use multi-row insert statements during LDR ingestion via SQL.
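
The reasoning can be sketched as follows, assuming one session per query and
one row per query. The row, session, and writtenKV types and the execSingleRow
and ingest helpers are hypothetical; the OriginTimestamp plumbing is described
here as planned work, so this is not a model of existing code.

```go
package main

import "fmt"

// row is a hypothetical replicated row with the timestamp it had at its origin.
type row struct {
	Key             string
	OriginTimestamp int64
}

// session is a hypothetical per-query session holding one bound timestamp.
type session struct {
	originTimestamp int64
}

// writtenKV is a hypothetical record of what the KV layer stored.
type writtenKV struct {
	Key             string
	OriginTimestamp int64
}

// execSingleRow models one internal-executor query: it creates a fresh
// session, binds the row's origin timestamp to it, and writes that one row.
func execSingleRow(r row) writtenKV {
	s := session{originTimestamp: r.OriginTimestamp}
	return writtenKV{Key: r.Key, OriginTimestamp: s.originTimestamp}
}

// ingest issues one single-row query per row, so no session is ever shared
// between rows with different timestamps.
func ingest(rows []row) []writtenKV {
	out := make([]writtenKV, 0, len(rows))
	for _, r := range rows {
		out = append(out, execSingleRow(r))
	}
	return out
}

func main() {
	rows := []row{{Key: "a", OriginTimestamp: 100}, {Key: "b", OriginTimestamp: 250}}
	for _, kv := range ingest(rows) {
		fmt.Println(kv.Key, kv.OriginTimestamp)
	}
}
```

A multi-row insert would break this model, since one session-level value
could not represent several different row timestamps.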

Fixes #126253

Release note: none

@msbutler msbutler self-assigned this Jun 28, 2024
@msbutler msbutler force-pushed the butler-prevent-data-looping-origin-id branch 3 times, most recently from 4055a38 to 92e03ff Compare July 1, 2024 14:06
@msbutler msbutler changed the title Butler prevent data looping origin crosscluster/logical: prevent data looping via OriginID session variable Jul 1, 2024
@msbutler msbutler marked this pull request as ready for review July 1, 2024 14:15
@msbutler msbutler requested a review from a team July 1, 2024 14:15
@msbutler msbutler requested review from a team as code owners July 1, 2024 14:15
@msbutler msbutler requested review from dt, stevendanna and a team and removed request for a team July 1, 2024 14:15

msbutler commented Jul 1, 2024

First patch in separate PR.


@dt dt left a comment

you'll wanna rebase

@msbutler msbutler force-pushed the butler-prevent-data-looping-origin-id branch from 92e03ff to c7b720c Compare July 1, 2024 18:34

blathers-crl bot commented Jul 1, 2024

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

msbutler commented Jul 1, 2024

TFTR!

bors r=dt

@craig craig bot merged commit 053bc17 into cockroachdb:master Jul 1, 2024
22 checks passed
msbutler added a commit to msbutler/cockroach that referenced this pull request Jul 4, 2024
126404: crosscluster/logical: prevent data looping via OriginID session variable r=dt a=msbutler

Co-authored-by: Michael Butler
Development

Successfully merging this pull request may close these issues.

streamingccl/logical: prevent data looping with OriginID MVCCValueHeader filtering