roachtest: c2c/shutdown/dest/coordinator failed - failed ingestion job resume after node failure #103008

Closed
cockroach-teamcity opened this issue May 10, 2023 · 4 comments
Labels: branch-master, C-test-failure, O-roachtest, O-robot, T-disaster-recovery, X-noreuse

@cockroach-teamcity
Member

cockroach-teamcity commented May 10, 2023

roachtest.c2c/shutdown/dest/coordinator failed with artifacts on master @ 245c3b9db56567d9820babe460c5ed8a3786890e:

test artifacts and logs in: /artifacts/c2c/shutdown/dest/coordinator/run_1
(sql_runner.go:218).Scan: error scanning '&{0xc0062706c0 <nil>}': pq: system-jobs-scan: rpc error: code = Unavailable desc = connection error: desc = "transport: error while dialing: connection interrupted (did the remote node shut down or are there networking issues?)"
(1) pq: system-jobs-scan: rpc error: code = Unavailable desc = connection error: desc = "transport: error while dialing: connection interrupted (did the remote node shut down or are there networking issues?)"
Error types: (1) *pq.Error
(monitor.go:127).Wait: monitor failure: monitor task failed: context canceled while waiting for job to finish: context canceled
(test_runner.go:1089).func1: 1 dead node(s) detected

Parameters: ROACHTEST_cloud=gce, ROACHTEST_cpu=8, ROACHTEST_encrypted=false, ROACHTEST_fs=ext4, ROACHTEST_localSSD=true, ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/disaster-recovery

This test on roachdash | Improve this report!

Jira issue: CRDB-27786

@cockroach-teamcity added the branch-master, C-test-failure, O-roachtest, O-robot, release-blocker, and T-disaster-recovery labels on May 10, 2023
@cockroach-teamcity added this to the 23.1 milestone on May 10, 2023
@msbutler self-assigned this on May 10, 2023
@msbutler removed the release-blocker label on May 10, 2023
@msbutler
Collaborator

this infra flake will be fixed by #102382

@cockroach-teamcity
Member Author

roachtest.c2c/shutdown/dest/coordinator failed with artifacts on master @ 107ad579b1ff8311ac4f4df4783985010a0322f9:

test artifacts and logs in: /artifacts/c2c/shutdown/dest/coordinator/run_1
(monitor.go:127).Wait: monitor failure: monitor task failed: unexpectedly found job 865597372800925699 in state pause-requested
(test_runner.go:1108).func1: 1 dead node(s) detected

Parameters: ROACHTEST_cloud=gce, ROACHTEST_cpu=8, ROACHTEST_encrypted=false, ROACHTEST_fs=ext4, ROACHTEST_localSSD=true, ROACHTEST_ssd=0

Same failure on other branches

This test on roachdash | Improve this report!

@msbutler
Collaborator

This latest failure seems legit. Here's the timeline:

  1. We issue a cutover command
  2. We kill the destination job coordinator (node 6)
  3. Node 5 attempts to take over the job lease and resume the job, but fails to do so because:
 producer job is not active and in status STREAM_INACTIVE) but is being paused›
  4. After this error, I see some very strange log lines showing the ingestion job stepping through the succeeded state while its status is still pause-requested:
I230516 09:43:20.725988 41597 jobs/registry.go:1592 ⋮ [T1,n1] 630  STREAM INGESTION job 865597372800925699: stepping through state succeeded
E230516 09:43:20.735804 41597 jobs/adopt.go:479 ⋮ [T1,n1] 631  job 865597372800925699: adoption completed with error job 865597372800925699: could not mark as succeeded: job 865597372800925699: job with status pause-requested cannot be marked as succeeded

@msbutler added the X-noreuse label on May 16, 2023
@msbutler changed the title from "roachtest: c2c/shutdown/dest/coordinator failed" to "roachtest: c2c/shutdown/dest/coordinator failed - failed ingestion job resume after node failure" on May 16, 2023
@stevendanna
Collaborator

We believe this one is fixed.
