cli/democluster: TestTransientClusterMultitenant failed #96162
Comments
Full error trace below. This is telling me that the execution of cluster upgrades when a tenant is first booted up is currently brittle - apparently it's possible for the SQL liveness session to expire before or during the migration. What is a good answer here? Make the upgrade jobs retry after refreshing the SQL liveness session? cc @ajwerner if you have ideas. I think @healthy-pod you'll find this interesting too.
Yeah, I think in some sense we're missing logic to treat this as a retry. I don't think the session actually died. I think that it was so overloaded that the SQL query thought the session died. I'm trying to repro and understand. The job itself did not fail -- and if it had, it would be retried. |
There are some weird things going on here. For one, I see both |
I'd like to see the details. |
This bors failure is another failure mode of the same test: https://teamcity.cockroachdb.com/buildConfiguration/Cockroach_BazelEssentialCi/8508170?showRootCauses=false&expandBuildChangesSection=true&expandBuildProblemsSection=true&expandBuildTestsSection=true I'm bisecting. |
The bisect points conclusively at 3ff9bc9. |
Cool - I know @stevendanna is investigating something about that already. |
Some clues about things being weird:
Notice that tenant nsql2 seems to think that its own rpc address is n3's RPC address? |
I've hit this twice on an unrelated PR. Going to skip |
Informs cockroachdb#96162 Release note: None Epic: none
96360: multitenant: skip TestTransientClusterMultiTenant r=knz a=msbutler Informs #96162 Release note: None Epic: none Co-authored-by: Michael Butler <[email protected]>
I found the likely cause of the flake: a cancellable context was passed to the |
Epic: CRDB-18499
cli/democluster.TestTransientClusterMultitenant failed with artifacts on master @ 69dd453d0e61e258f402c5751de310405743cd18:
Parameters:
TAGS=bazel,gss
Jira issue: CRDB-23974