ccl: TestTenantLogic_create_index timed out #88202

yuzefovich · 2022-09-20T01:12:57Z

On the staging build here:

=== RUN   TestTenantLogic_create_index/resume-with-diff-tenant-resume-spans
    logic.go:2819: let $schema_changer_state = on
    logic.go:3069: 
        
        /home/roach/.cache/bazel/_bazel_roach/c5a4e7d36696d9cd970af2045211a7df/sandbox/processwrapper-sandbox/7835/execroot/com_github_cockroachdb_cockroach/bazel-out/k8-fastbuild/bin/pkg/ccl/logictestccl/tests/3node-tenant/3node-tenant_test_/3node-tenant_test.runfiles/com_github_cockroachdb_cockroach/pkg/sql/logictest/testdata/logic_test/create_index:264: 
        expected success, but found
        (XXUUU) setting updated but timed out waiting to read new value
        set_cluster_setting.go:572: in waitForSettingUpdate()
    logic.go:2015: 
         pq: setting updated but timed out waiting to read new value
[00:06:01] --- done: /home/roach/.cache/bazel/_bazel_roach/c5a4e7d36696d9cd970af2045211a7df/sandbox/processwrapper-sandbox/7835/execroot/com_github_cockroachdb_cockroach/bazel-out/k8-fastbuild/bin/pkg/ccl/logictestccl/tests/3node-tenant/3node-tenant_test_/3node-tenant_test.runfiles/com_github_cockroachdb_cockroach/pkg/sql/logictest/testdata/logic_test/create_index with config 3node-tenant: 51 tests, 2 failures
panic: test timed out after 59m55s

Seems related to #87201, cc @ajwerner

Jira issue: CRDB-19728

The text was updated successfully, but these errors were encountered:

ajwerner · 2022-09-20T04:38:35Z

There's something fishy about this one. I say that because we log the side effect with an event logger before we wait for the new setting value. We don't see a log event for that reset in the test logs.

ajwerner · 2022-09-20T05:02:42Z

I rescind my analysis. We attach the logging to a different transaction than we use to write the cluster setting (for better or for worse, and most likely, for worse).

ajwerner · 2022-09-20T13:28:18Z

The timeout here is because the pausepoint was still set when we went to do post-test dropping of the database.

blathers-crl · 2022-09-22T04:17:17Z

Hi @ajwerner, please add branch-* labels to identify which branch(es) this release-blocker affects.

_{🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.}

ajwerner · 2022-09-22T04:18:18Z

There's something very fishy going on here. I've added more logging, and it appears that the rangefeed is just missing the deletion event straight. It looks like there's a catch-up scan that somehow misses it or something. I'm honestly not sure. The repro is pretty slow. I'll spend more time digging tomorrow, but it's worrying.

ajwerner · 2022-09-22T13:01:20Z

I think this might be related to logictest-use-mvcc-range-tombstones-for-point-deletes. @erikgrinaker does that make sense to you? I notice that in all the cases, it seems like the rangefeed does not see the delete operation and it seems like this metamorphic variable is true. At first I thought there was no way this could be it because we'd see it failing all the time, but then I looked more closely, and I noticed that it'll only be possible if the transaction leaves an intent to be resolved. That will only happen if we miss the 1PC optimization. In general, the transaction for set cluster setting will be 1PC, so I think that's what makes this rare.

Right now I'm headed to the office and have it running with that disabled, but I'm realizing as I type this that I could just as well disable 1PC and see whether it fails readily.

erikgrinaker · 2022-09-22T13:27:05Z

It looks like there's a catch-up scan that somehow misses it or something.

I think this checks out. This is the only externally visible behavior that will differ with logictest-use-mvcc-range-tombstones-for-point-deletes. We still emit real-time point delete rangefeed events, since the tombstone replacement happens after the write (or after intent resolution), but the catchup scan won't see the point tombstones because they no longer exist, and range tombstones are ignored.

Is that the only issue here? If so, I can add a testing knob which replaces rangefeed events for point-sized range tombstones with regular point tombstone events. Alternatively, we can skip the relevant tests when that parameter is enabled.

ajwerner · 2022-09-22T13:35:09Z

It will mean that any attempt to reset a cluster setting might fail in the rare case that a catch-up scan occurs.

This actually explains some other failures we saw related to enabling settings.

Another option is to somehow make the settingwatcher observe the range deletion tombstones. I suspect that the spanconfig watcher will also be sad about missing entries.

erikgrinaker · 2022-09-22T13:45:57Z

Ok, I'll look into it, but may not get around to it until tomorrow.

ajwerner · 2022-09-22T14:03:49Z

Honestly, I'm inclined to do something hacky here like avoid this tombstone on system tables in the resolve intent code.

erikgrinaker · 2022-09-22T14:16:55Z

That's fine too.

yuzefovich added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Sep 20, 2022

blathers-crl bot added the T-sql-schema-deprecated Use T-sql-foundations instead label Sep 20, 2022

ajwerner self-assigned this Sep 20, 2022

ajwerner added the GA-blocker label Sep 22, 2022

ajwerner added the branch-release-22.2 Used to mark GA and release blockers, technical advisories, and bugs for 22.2 label Sep 22, 2022

erikgrinaker assigned erikgrinaker and unassigned ajwerner Sep 22, 2022

erikgrinaker removed the GA-blocker label Sep 22, 2022

ajwerner mentioned this issue Sep 22, 2022

kvserver/batcheval: ignore UseRangeTombstonesForPointDeletes knob sometimes #88468

Merged

craig bot closed this as completed in #88468 Sep 22, 2022

craig bot closed this as completed in d390148 Sep 22, 2022

blathers-crl bot mentioned this issue Sep 22, 2022

release-22.2: kvserver/batcheval: ignore UseRangeTombstonesForPointDeletes knob sometimes #88500

Merged

exalate-issue-sync bot added T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) and removed T-sql-schema-deprecated Use T-sql-foundations instead labels May 10, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ccl: TestTenantLogic_create_index timed out #88202

ccl: TestTenantLogic_create_index timed out #88202

yuzefovich commented Sep 20, 2022 •

edited by cockroach-jira-scripts

Loading

ajwerner commented Sep 20, 2022

ajwerner commented Sep 20, 2022

ajwerner commented Sep 20, 2022

blathers-crl bot commented Sep 22, 2022

ajwerner commented Sep 22, 2022

ajwerner commented Sep 22, 2022

erikgrinaker commented Sep 22, 2022

ajwerner commented Sep 22, 2022

erikgrinaker commented Sep 22, 2022

ajwerner commented Sep 22, 2022

erikgrinaker commented Sep 22, 2022

ccl: TestTenantLogic_create_index timed out #88202

ccl: TestTenantLogic_create_index timed out #88202

Comments

yuzefovich commented Sep 20, 2022 • edited by cockroach-jira-scripts Loading

ajwerner commented Sep 20, 2022

ajwerner commented Sep 20, 2022

ajwerner commented Sep 20, 2022

blathers-crl bot commented Sep 22, 2022

ajwerner commented Sep 22, 2022

ajwerner commented Sep 22, 2022

erikgrinaker commented Sep 22, 2022

ajwerner commented Sep 22, 2022

erikgrinaker commented Sep 22, 2022

ajwerner commented Sep 22, 2022

erikgrinaker commented Sep 22, 2022

yuzefovich commented Sep 20, 2022 •

edited by cockroach-jira-scripts

Loading