changefeedccl: flakey TestChangefeedRestartDuringBackfill #75080

Closed
Tracked by #75639
irfansharif opened this issue Jan 18, 2022 · 4 comments · Fixed by #106433
Labels
C-test-failure Broken test (automatically or manually discovered). T-cdc

Comments

@irfansharif
Contributor

irfansharif commented Jan 18, 2022

Describe the problem

I've seen TestChangefeedRestartDuringBackfill flake recently (CI failure here). It's possibly fallout from #73876. This issue tracks the investigation.

To Reproduce

dev test pkg/ccl/changefeedccl -f=TestChangefeedRestartDuringBackfill -v --timeout 10s --ignore-cache --show-logs

Jira issue: CRDB-12455

@irfansharif irfansharif added the C-test-failure Broken test (automatically or manually discovered). label Jan 18, 2022
irfansharif added a commit to irfansharif/cockroach that referenced this issue Jan 19, 2022
under span configs. This test flakes pretty reliably after span configs
were enabled (cockroachdb#73876). Investigating this further is being tracked in
\cockroachdb#75080; let's have this test use the old subsystem for now (only down in
KV; we've narrowed down the failure to having something to do with
concurrent range splits, within the tenant keyspace, while a changefeed
is declared).

Release note: None
@irfansharif
Contributor Author

#75146 is a temporary bandaid for CI. Going through the bors queue over the last day, it looks like this test has been a frequent cause of failed builds. This is pretty easy to repro, especially with the following patch:

@@ -481,10 +485,10 @@ func withKnobsFn(fn updateKnobsFn) feedTestOption {
 var _ = withKnobsFn(nil /* fn */)

 func newTestOptions() feedTestOptions {
        // percentTenant is the percentange of tests that will be run against
        // a SQL-node in a multi-tenant server. 1 for all tests to be run on a
        // tenant.
-       const percentTenant = 0.25
+       const percentTenant = 1
        return feedTestOptions{
                useTenant: rand.Float32() < percentTenant,
        }

What I'm seeing happen is this test getting wedged here, reading from the changefeed:

assertPayloads(t, foo, []string{

The wedging only happens when there's a lot of concurrent split activity within the tenant's keyspace. The changefeed declared earlier in the test is over a specific table that forms a small portion of a larger, single tenant range. With span configs, there's concurrent split queue activity splitting that range along the tenant's own table boundaries. Something is racy between the changefeed's internals and the range split activity, which then causes the changefeed to get wedged. I'm not familiar with the internals here, but I'm going to start digging.

@miretskiy, at first glance, does any of this ring a bell? Could you point me to the code that's responsible for the underlying rangefeed being robust to range splits? Has anything changed here recently? Is there anything special about declaring changefeeds over secondary tenants? Do we have tests that stress changefeed + concurrent split behavior?

To repro:

# Apply the patch above. With #75146 this will take 20s, each run taking <2s. 
# But by splitting in the tenant range, one run or another is bound to get wedged.
dev test pkg/ccl/changefeedccl -f=TestChangefeedRestartDuringBackfill -v --count 10 --timeout 30s --ignore-cache --show-logs

@miretskiy miretskiy self-assigned this Jan 19, 2022
craig bot pushed a commit that referenced this issue Jan 19, 2022
74863: import: check readability earlier r=benbardin a=benbardin

Release note (sql change): Import now checks readability earlier for multiple files, to fail sooner if e.g. permissions are invalid.

74914: opt,tree: fix bugs with Next(), Prev(), and histogram calculation for DTimeTZ r=rytaft a=rytaft

**sql/sem/tree: fix Next() and Prev() for DTimeTZ**

Prior to this commit, the `DTimeTZ` functions `Next()` and `Prev()`
could skip over valid values according to the ordering of `DTimeTZ`
values in an index (which matches the ordering defined by the
`TimeTZ` functions `After()` and `Before()`).

This commit fixes these functions so that `Next()` now returns the smallest
valid `DTimeTZ` that is greater than the receiver, and `Prev()` returns
the largest valid `DTimeTZ` that is less than the receiver. This is
an important invariant that the optimizer relies on when building index
constraints.

Fixes #74912

Release note (bug fix): Fixed a bug that could occur when a `TIMETZ`
column was indexed, and a query predicate constrained that column using
a `<` or `>` operator with a `timetz` constant. If the column contained values
with time zones that did not match the time zone of the `timetz` constant,
it was possible that not all matching values could be returned by the
query. Specifically, the results may not have included values within one
microsecond of the predicate's absolute time. This bug was introduced
when the timetz datatype was first added in 20.1. It exists on all
versions of 20.1, 20.2, 21.1, and 21.2 prior to this patch.

**opt: fix bug in histogram calculation for TimeTZ**

This commit fixes a bug in the histogram estimation code for `TimeTZ`
that made the faulty assumption that `TimeTZ` values are ordered by
`TimeOfDay`. This is incorrect since it does not take the `OffsetSecs`
into account. As a result, it was possible to estimate that the size
of a histogram bucket was negative, which caused problems in the
statistics estimation code. This commit fixes the problem by taking
into account both `TimeOfDay` and `OffsetSecs` when estimating the size of
a bucket in a `TimeTZ` histogram.

Fixes #74667

Release note (bug fix): Fixed an internal error, "estimated row count must
be non-zero", that could occur during planning for queries over a table
with a `TimeTZ` column. This error was due to a faulty assumption in the
statistics estimation code about ordering of `TimeTZ` values, which has now
been fixed. The error could occur when `TimeTZ` values used in the query had
a different time zone offset than the `TimeTZ` values stored in the table.

75112: sql: fix casts between REG* types r=mgartner a=mgartner

The newly introduced `castMap` does not contain entries for casts
between all combinations of REG* types, which is consistent with
Postgres, but inconsistent with behavior in versions up to 21.2 where
these casts are allowed.

The `castMap` changes result in more than just backward incompatibility.
We allow branches of CASE statements to be equivalent types (i.e., types
in the same family), like `REGCLASS` and `REGTYPE`, and we automatically
add casts to a query plan to support this. However, because these casts
don't exist in the `castMap`, internal errors are raised when we try to
fetch the volatility of the cast while building logical properties.

According to Postgres's type conversion rules for CASE, we should only
allow branches to be different types if they can be implicitly cast to
the first non-NULL branch. Implicit casts between REG* types are not
allowed, so CASE expressions with branches of different REG* types
should result in a user error like `CASE/WHEN could not convert type
regclass to regtype`. However, this is a much larger project and the
change will not be fully backward compatible. This work is tracked by
issue #75103.

For now, this commit adds casts between REG* types to the `castMap` to
maintain backward compatibility and prevent an internal error.

There is no release note because this bug does not exist in any
releases.

Fixes #74784

Release note: None

75119: sql: deflake TestPerfLogging r=rytaft a=rytaft

This commit deflakes `TestPerfLogging` by ensuring that test cases
that should not produce log entries do not match with unrelated log
entries and thus cause the test to fail. This is ensured by making
the regex more precise for the specific test case.

Fixes #74811

Release note: None

75146: backupccl: "skip" TestChangefeedRestartDuringBackfill.. r=irfansharif a=irfansharif

under span configs. This test flakes pretty reliably after span configs
were enabled (#73876). Investigating this further is being tracked in
\#75080; let's have this test use the old subsystem for now (only down in
KV; we've narrowed down the failure to having something to do with
concurrent range splits, within the tenant keyspace, while a changefeed
is declared).

Release note: None

Co-authored-by: Ben Bardin <[email protected]>
Co-authored-by: Rebecca Taft <[email protected]>
Co-authored-by: Marcus Gartner <[email protected]>
Co-authored-by: irfan sharif <[email protected]>
gtr pushed a commit to gtr/cockroach that referenced this issue Jan 24, 2022
under span configs. This test flakes pretty reliably after span configs
were enabled (cockroachdb#73876). Investigating this further is being tracked in
\cockroachdb#75080; let's have this test use the old subsystem for now (only down in
KV; we've narrowed down the failure to having something to do with
concurrent range splits, within the tenant keyspace, while a changefeed
is declared).

Release note: None
irfansharif added a commit to irfansharif/cockroach that referenced this issue Jan 24, 2022
Refs: cockroachdb#75080

Reason: flaky test

Generated by bin/skip-test.

Release justification: non-production code changes

Release note: None
craig bot pushed a commit that referenced this issue Jan 24, 2022
75469: ccl/changefeedccl: skip TestChangefeedRestartDuringBackfill r=irfansharif a=irfansharif

Refs: #75080

Reason: flaky test

Generated by bin/skip-test.

Release justification: non-production code changes

Release note: None

Co-authored-by: irfan sharif <[email protected]>
@blathers-crl blathers-crl bot added the T-cdc label Jan 31, 2022
@blathers-crl

blathers-crl bot commented Jan 31, 2022

cc @cockroachdb/cdc

@irfansharif
Contributor Author

PS: I'm not actively investigating this. Letting it sit on the CDC board to remind ourselves to address it during stability at the latest.

miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Jul 7, 2023
Fixes cockroachdb#75080
Removed the `TestChangefeedRestartDuringBackfill` test, which was previously
skipped and remained skipped for a very long time.

The reason for removal is that this test is exceedingly brittle
and very stale. Furthermore, restart during backfill is already
tested extensively by non-flaky tests
that verify restart and checkpoint functionality
(`TestChangefeedCheckpointSchemaChange`,
`TestChangefeedBackfillCheckpoint`,
`TestCoreChangefeedBackfillScanCheckpoint`).

Release note: None
craig bot pushed a commit that referenced this issue Jul 10, 2023
106433: changefeedccl: Remove stale test r=miretskiy a=miretskiy

Fixes #75080
Removed the `TestChangefeedRestartDuringBackfill` test, which was previously skipped and remained skipped for a very long time.

The reason for removal is that this test is exceedingly brittle and very stale. Furthermore, restart during backfill is already tested extensively by non-flaky tests that verify restart and checkpoint functionality
(`TestChangefeedCheckpointSchemaChange`,
`TestChangefeedBackfillCheckpoint`,
`TestCoreChangefeedBackfillScanCheckpoint`).

Release note: None

Co-authored-by: Yevgeniy Miretskiy <[email protected]>