changefeedccl: Backfill checkpointing appears to be broken #96959

Closed
miretskiy opened this issue Feb 10, 2023 · 1 comment
Labels
A-cdc: Change Data Capture
C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
O-postmortem: Originated from a Postmortem action item.
O-support: Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs
T-cdc

Comments


miretskiy commented Feb 10, 2023

A customer reports that a reasonably sized table (40k ranges) cannot complete backfill;
toward the end, the changefeed restarts, and no backfill checkpoints appear to be written.

It appears that backfill checkpointing functionality regressed at some point without
tests picking up on that.

Jira issue: CRDB-24433

Epic CRDB-11783

miretskiy added the C-bug, O-support, A-cdc, T-cdc, and backport-22.1.x labels Feb 10, 2023

blathers-crl bot commented Feb 10, 2023

cc @cockroachdb/cdc

miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Feb 12, 2023
A change made more than two years ago
(cockroachdb#71848)
that added support for checkpointing during backfill after schema change
inadvertently broke initial scan checkpointing functionality.

Exacerbating the problem, the existing test
`TestChangefeedBackfillCheckpoint` continued to pass.
It passed because the test was looking
for a checkpoint whose timestamp matched the backfill timestamp.
The bug involved incorrect initialization and use of a 0 timestamp.
It just so happens that after the initial scan completes, the
rangefeed starts, and the very first thing it does is
generate a 0 timestamp checkpoint. So the test was
observing this event and continued to pass.
This PR does not have a dedicated test because the existing
tests work fine, provided we ignore the 0 timestamp checkpoint,
which is what this PR does in addition to addressing
the root cause of the bug.
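
To make the failure mode concrete, here is a minimal sketch of how a timestamp-matching check can be satisfied by a 0 timestamp checkpoint, and how skipping such checkpoints closes the gap. The `Timestamp` and `Checkpoint` types and both function names are hypothetical stand-ins for illustration, not the types or helpers used in `changefeedccl` or its tests.

```go
package main

// Timestamp is a simplified, hypothetical stand-in for an HLC timestamp.
// Its zero value plays the role of the "0 timestamp" described above.
type Timestamp struct {
	WallTime int64
	Logical  int32
}

// IsEmpty reports whether ts is the zero timestamp.
func (ts Timestamp) IsEmpty() bool { return ts == Timestamp{} }

// Checkpoint is a hypothetical record of changefeed progress.
type Checkpoint struct {
	Timestamp Timestamp
}

// naiveMatch mimics a check that only compares timestamps. If want was
// never initialized (it is zero), any 0 timestamp checkpoint satisfies it.
func naiveMatch(observed []Checkpoint, want Timestamp) bool {
	for _, cp := range observed {
		if cp.Timestamp == want {
			return true
		}
	}
	return false
}

// strictMatch ignores 0 timestamp checkpoints before comparing, so a
// checkpoint emitted at rangefeed startup cannot mask a missing one.
func strictMatch(observed []Checkpoint, want Timestamp) bool {
	for _, cp := range observed {
		if cp.Timestamp.IsEmpty() {
			continue // skip the 0 timestamp checkpoint
		}
		if cp.Timestamp == want {
			return true
		}
	}
	return false
}
```

With only a 0 timestamp checkpoint observed and an uninitialized expected timestamp, `naiveMatch` reports success while `strictMatch` does not; once a checkpoint carrying the real backfill timestamp appears, `strictMatch` accepts it.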

Informs cockroachdb#96959

Release note (enterprise change): Fix a bug in changefeeds where
long-running initial scans fail to generate checkpoints.
Failure to generate checkpoints is particularly bad if the
changefeed restarts for whatever reason. Without checkpoints,
the changefeed restarts from the beginning, and in the worst
case, when exporting large tables, the changefeed
initial scan may have a hard time completing.
craig bot pushed a commit that referenced this issue Feb 13, 2023
96995: changefeedccl: Fix initial scan checkpointing r=miretskiy a=miretskiy

A change made more than two years ago
(#71848) that added support for checkpointing during backfill after schema change inadvertently broke initial scan checkpointing functionality.

Exacerbating the problem, the existing test
`TestChangefeedBackfillCheckpoint` continued to pass. It passed because the test was looking for a checkpoint whose timestamp matched the backfill timestamp. The bug involved incorrect initialization and use of a 0 timestamp. It just so happens that after the initial scan completes, the rangefeed starts, and the very first thing it does is generate a 0 timestamp checkpoint. So the test was observing this event and continued to pass.
This PR does not have a dedicated test because the existing tests work fine, provided we ignore the 0 timestamp checkpoint, which is what this PR does in addition to addressing the root cause of the bug.

Informs #96959

Release note (enterprise change): Fix a bug in changefeeds where long-running initial scans fail to generate checkpoints. Failure to generate checkpoints is particularly bad if the changefeed restarts for whatever reason. Without checkpoints, the changefeed restarts from the beginning, and in the worst case, when exporting large tables, the changefeed initial scan may have a hard time completing.
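
As a rough illustration of why missing checkpoints hurt on restart, the sketch below computes which spans still need scanning given the spans recorded as done in the last checkpoint. Spans are modeled as plain strings and `remainingSpans` is a hypothetical helper; the actual changefeed tracks key spans with its own data structures, which this does not attempt to reproduce.

```go
package main

// remainingSpans returns the spans that still need to be scanned after a
// restart, given every span in the table and the spans recorded as
// completed in the last checkpoint. Hypothetical helper for illustration.
func remainingSpans(all []string, checkpointed []string) []string {
	// Index the completed spans for constant-time lookup.
	done := make(map[string]struct{}, len(checkpointed))
	for _, sp := range checkpointed {
		done[sp] = struct{}{}
	}
	// Keep only the spans that the checkpoint does not cover.
	var todo []string
	for _, sp := range all {
		if _, ok := done[sp]; !ok {
			todo = append(todo, sp)
		}
	}
	return todo
}
```

Under these assumptions, a restart with an empty checkpoint returns every span, which corresponds to restarting the initial scan from the beginning, while a restart with a populated checkpoint returns only the unfinished remainder.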

97037: acceptance: skip TestDockerCLI test_demo_partitioning.tcl only r=tbg a=herkolategan

Renamed `test_demo_partitioning.tcl` to `test_demo_partitioning.tcl.disabled` which will cause TestDockerCLI to skip the test file.

Refs: #96797

Reason: flaky test

Epic: None

Release note: None

Co-authored-by: Yevgeniy Miretskiy <[email protected]>
Co-authored-by: Herko Lategan <[email protected]>
blathers-crl bot pushed a commit that referenced this issue Feb 13, 2023
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Feb 22, 2023
osmes added the O-postmortem label Feb 22, 2023