roachtest: restore/tpce/400GB/aws/nodes=8/cpus=8 failed #111159

cockroach-teamcity · 2023-09-23T08:42:10Z

roachtest.restore/tpce/400GB/aws/nodes=8/cpus=8 failed with artifacts on master @ b1a641f1d27653a4e93154926be57d0cae864bd6:

(test_runner.go:1141).runTest: test timed out (1h0m0s)
(monitor.go:153).Wait: monitor failure: unexpected node event: n5: cockroach process died (exit code 137)
test artifacts and logs in: /artifacts/restore/tpce/400GB/aws/nodes=8/cpus=8/run_1

Parameters: ROACHTEST_arch=amd64 , ROACHTEST_cloud=aws , ROACHTEST_cpu=8 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

Grafana is not yet available for aws clusters

/cc @cockroachdb/disaster-recovery _{This test on roachdash | Improve this report!

Jira issue: CRDB-31794}

The text was updated successfully, but these errors were encountered:

adityamaru · 2023-09-25T17:48:15Z

restore in question, looks like it started at 0736 and saw an update at 0744 but then there was no job update:

902378867917422593      RESTORE RESTORE FROM 's3://cockroach-fixtures-us-east-2/backups/tpc-e/customers=25000/v22.2.0/inc-count=48/2022/12/20-225543.90?AUTH=implicit' AS OF SYSTEM TIME '2022-12-21T02:00:00Z'               root    {110,106,122,154,143,136,156,119,123,111,125,105,118,107,153,115,157,150,140,109,148,146,152,129,117,113,112,145,134,141,137,151,139,133,135,132,147,114,149,126,130,116,120,155,142,127,128,131,108,124,144,138,121} running NULL    2023-09-23 07:36:02.996412+00   2023-09-23 07:36:17.682181+00   NULL    2023-09-23 07:44:30.039126+00   0.16437001526355743   NULL            1       6628408645837087790     2023-09-23 07:36:17.682221+00   2023-09-23 07:36:47.682221+00   1       NULL    NULL

The test timed out at 0833. Time to look at the stacks.

adityamaru · 2023-09-25T18:48:19Z

Okay so one thing to note it looks like this test used makeSimpleImportSpans:

goroutine 19272 [chan send, 61 minutes]:
github.com/cockroachdb/cockroach/pkg/ccl/backupccl.generateAndSendImportSpans({0x79097e0?, 0xc0092b09b0?}, {0xc0055ed000?, 0xc016eda780?, 0x630d4c7?}, {0xc00e686500?, 0xc000000000?, 0x0?}, 0x0?, 0xc014d25980, ...)
        github.com/cockroachdb/cockroach/pkg/ccl/backupccl/restore_span_covering.go:522 +0x1268
github.com/cockroachdb/cockroach/pkg/ccl/backupccl.runGenerativeSplitAndScatter.func1({0x79097e0, 0xc0092b09b0})
        github.com/cockroachdb/cockroach/pkg/ccl/backupccl/generative_split_and_scatter_processor.go:309 +0x32e
github.com/cockroachdb/cockroach/pkg/util/ctxgroup.Group.GoCtx.func1()
        github.com/cockroachdb/cockroach/pkg/util/ctxgroup/ctxgroup.go:168 +0x25
golang.org/x/sync/errgroup.(*Group).Go.func1()
        golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:75 +0x64
created by golang.org/x/sync/errgroup.(*Group).Go
        golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:72 +0xa5

L522 is

cockroach/pkg/ccl/backupccl/restore_span_covering.go

Line 522 in be12ddb

spanCh <- sp

msbutler · 2023-09-25T18:51:08Z

oh lemme revert #109943

adityamaru · 2023-09-25T18:52:57Z

I think its still useful to understand the stuckage here, considering we have in the past asked users to switch to makeSimpleImportSpans during escalations.

adityamaru · 2023-09-25T19:28:58Z

yeah I think we're missing a context cancellation check there. The test suggests that the channel that sp is being sent to is no longer being read from because that goroutine has exited -

cockroach/pkg/ccl/backupccl/generative_split_and_scatter_processor.go

Line 331 in 7ea6764

for entry := range restoreSpanEntriesCh {

. Sending a patch, and I think we shouldn't revert your change this was a legitimate shutdown bug. Why the processors were shutdown I'm not sure but it could've been a blip causing the job to replan and retry.

msbutler · 2023-09-25T19:44:01Z

nice find!!!

msbutler · 2023-09-25T19:46:49Z

Perhaps I'll revert my change after branch cut.

…111232 110967: asim: enable random zone config event generation r=kvoli a=wenyihu6 Previously, zone config event generation used hardcoded span configurations. This limits our ability to test the allocator more thoroughly. To improve this, this patch enables random span configs to be generated and applied as part of the simulation. These configurations are generated by randomly selecting the primary region, region V.S. zone survival goal, and leaseholder preference. ``` The following command is now supported: "rand_events" [cycle_via_random_survival_goals] ``` Part of: #106192 Release Note: none Epic: none 111192: bulk: allow caller-configurable priority in SSTBatcher r=adityamaru a=stevendanna This adds the ability for some callers to use a higher admission priority in SSTBatcher. This is helpful for streaming where we want to run at a priority that isn't subject to the elastic admission regime. Epic: none Release note: None 111206: kv: fix (store|node) not found err checking r=erikgrinaker a=kvoli `StoreNotFoundError` and `NodeNotFoundError` errors were moved to the `kvpb` pkg in #110374. As part of the move, `crdb_internal` functions which checked if the error were `DescNotFoundError` were also updated so that node/store not found errors would be recognized e.g. ``` errors.Is(kvpb.NewNodeNotFoundError(nodeID), &kvpb.DescNotFoundError{}) ``` This didn't work, because the error doesn't match the reference error variable being given. It does match the type. Update these error assertions to use `HasType` instead. Resolves: #111084 Epic: none Release note: None 111214: release: fix roachtest artifacts name r=srosenberg a=rail This fixes the roachtest artifacts directory name. Epic: none Release note: None 111217: cloud/azure: Fix azure schemes r=benbardin a=benbardin Part of: https://cockroachlabs.atlassian.net/browse/CRDB-31120 Release note (bug fix): Fixes azure schemes in storage, kms and external conns. 111223: storage: populate timestamp field in lock table values r=nvanbenschoten a=nvanbenschoten Informs #109645. This commit starts writing the `Timestamp` field in lock table `MVCCMetadata` values for shared and exclusive locks. This mirrors the behavior of intent locks. This is not strictly needed, as the timestamp is always equal to `Txn.WriteTimestamp`, but it is cheap to do and helps unify some stats logic, which uses this field to compute "lock age". Maybe we'll get rid of this for all lock strengths one day... Release note: None 111230: authors: add xhesikam to authors r=xhesikam a=xhesikam Release note: None Epic: None 111231: backupccl: add missing ctx cancel check r=msbutler a=adityamaru In #111159 we deduced from the stacks a situation in which the goroutine draining `spanCh` had exited due to a context cancelation but the writer was not listening for a ctx cancelation. This manifests as a stuck restore when using the non-default make simple import spans implementation. Fixes: #111159 Release note: None 111232: kv: deflake TestLeaseExpirationBasedDrainTransferWithExtension r=nvanbenschoten a=nvanbenschoten Informs #110715, which will be fixed by a non-clean backport (see 42e45b4) of this commit. This commit deflakes `TestLeaseExpirationBasedDrainTransferWithExtension` by disabling the node suspect timer in leaseTransferTest tests. These tests manually control the clock and have a habit of inducing destabilizing clock jumps. In this case, if n2 looked at liveness immediately after one of these manual clock jumps, it would mark n1 as suspect and refuse to transfer it the lease for the 30 second `server.time_after_store_suspect`, which is longer than the 5 second `server.shutdown.lease_transfer_wait`. This would cause the test to fail. Before this patch, the test would fail under stress race in about 8 minutes. Since the patch, it hasn't failed in over 30 minutes. Release note: None Co-authored-by: wenyihu6 <[email protected]> Co-authored-by: Steven Danna <[email protected]> Co-authored-by: Austen McClernon <[email protected]> Co-authored-by: Rail Aliiev <[email protected]> Co-authored-by: Ben Bardin <[email protected]> Co-authored-by: Nathan VanBenschoten <[email protected]> Co-authored-by: xhesikam <[email protected]> Co-authored-by: adityamaru <[email protected]>

In #111159 we deduced from the stacks a situation in which the goroutine draining `spanCh` had exited due to a context cancelation but the writer was not listening for a ctx cancelation. This manifests as a stuck restore when using the non-default make simple import spans implementation. Fixes: #111159 Release note: None

pav-kv · 2023-09-28T14:05:13Z

The graphs for this test look similar to #111160, with one node spinning at 100% CPU and having high storage latencies.

Both were run at similar times, so maybe both experienced some kind of infra flake. In this run, n1 is an outlier the same way as n5 is an outlier in #111160.

@msbutler @rhu713 @adityamaru Is this what you normally see for these tests? I'm running a "happy run" right now, and not observing any node overloads while the restore is going.

110377: pkg/server: add an extra server startup guardrail r=knz a=healthy-pod This code change prevents a SQL server from starting if its minimum supported binary version is greater than the tenant active version. Release note: None Epic: CRDB-26691 111143: cli: add --experimental-secondary-cache flag r=RaduBerinde a=itsbilal This change adds a new `--experimental-secondary-cache` flag for use with `--experimental-shared-storage` to enable the use of a secondary cache to speed up reads of objects in shared storage. This option sets the max cache size for each store using disaggregated storage; for per-store granularity, pebble options can also be used. Epic: none Release note: None 111173: sql: add some tests for lookup join when eq cols are key r=yuzefovich a=yuzefovich These tests are "regression tests" for #108489 if it were to be implemented. Epic: None Release note: None 111325: roachtest: add test to ruby-pg blocklist r=fqazi a=andyyang890 Fixes #111116 Release note: None 111384: process-bep-file: create binary for fetching test results from EngFlow r=healthy-pod a=rickystewart I'll extend this to additionally allow posting GitHub issues. Part of: DEVINF-871 Epic: CRDB-8308 Release note: None 111433: kv: add shared, replicated, and shared-replicated locks to TestEvaluateBatch r=nvanbenschoten a=nvanbenschoten Informs #91545. Informs #100193. This commit extends TestEvaluateBatch to include shared, replicated, and shared-replicated lock acquisition using Get, Scan, and ReverseScan requests. Release note: None 111434: roachtest: automatically profile CPU in restore tests r=msbutler a=pavelkalinnikov See https://www.cockroachlabs.com/docs/v23.1/automatic-cpu-profiler. NB: the `server.cpu_profile.enabled` setting is not used because it was recently removed in #107717. It should be used if this change is backported. Touches #111160, #111159 Epic: none Release note: none Co-authored-by: healthy-pod <[email protected]> Co-authored-by: Bilal Akhtar <[email protected]> Co-authored-by: Yahor Yuzefovich <[email protected]> Co-authored-by: Andy Yang <[email protected]> Co-authored-by: Ricky Stewart <[email protected]> Co-authored-by: Nathan VanBenschoten <[email protected]> Co-authored-by: Pavel Kalinnikov <[email protected]>

In cockroachdb#111159 we deduced from the stacks a situation in which the goroutine draining `spanCh` had exited due to a context cancelation but the writer was not listening for a ctx cancelation. This manifests as a stuck restore when using the non-default make simple import spans implementation. Fixes: cockroachdb#111159 Release note: None

cockroach-teamcity added this to the 23.2 milestone Sep 23, 2023

adityamaru self-assigned this Sep 25, 2023

adityamaru mentioned this issue Sep 25, 2023

backupccl: add missing ctx cancel check #111231

Merged

craig bot closed this as completed in 6fabaf8 Sep 25, 2023

blathers-crl bot mentioned this issue Sep 25, 2023

release-23.1: backupccl: add missing ctx cancel check #111245

Closed

pav-kv mentioned this issue Sep 28, 2023

roachtest: automatically profile CPU in restore tests #111434

Merged

pav-kv mentioned this issue Sep 29, 2023

release-23.1: roachtest: automatically profile CPU in restore tests #111493

Merged

msbutler mentioned this issue Feb 9, 2024

backupccl: add missing ctx cancel check #119014

Merged

github-project-automation bot added this to Disaster Recovery Backlog Aug 28, 2024

github-project-automation bot moved this to Done in Disaster Recovery Backlog Aug 28, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

roachtest: restore/tpce/400GB/aws/nodes=8/cpus=8 failed #111159

roachtest: restore/tpce/400GB/aws/nodes=8/cpus=8 failed #111159

cockroach-teamcity commented Sep 23, 2023 •

edited by cockroach-jira-scripts

Loading

adityamaru commented Sep 25, 2023

adityamaru commented Sep 25, 2023

msbutler commented Sep 25, 2023 •

edited

Loading

adityamaru commented Sep 25, 2023 •

edited

Loading

adityamaru commented Sep 25, 2023

msbutler commented Sep 25, 2023

msbutler commented Sep 25, 2023

pav-kv commented Sep 28, 2023 •

edited

Loading

roachtest: restore/tpce/400GB/aws/nodes=8/cpus=8 failed #111159

roachtest: restore/tpce/400GB/aws/nodes=8/cpus=8 failed #111159

Comments

cockroach-teamcity commented Sep 23, 2023 • edited by cockroach-jira-scripts Loading

adityamaru commented Sep 25, 2023

adityamaru commented Sep 25, 2023

msbutler commented Sep 25, 2023 • edited Loading

adityamaru commented Sep 25, 2023 • edited Loading

adityamaru commented Sep 25, 2023

msbutler commented Sep 25, 2023

msbutler commented Sep 25, 2023

pav-kv commented Sep 28, 2023 • edited Loading

cockroach-teamcity commented Sep 23, 2023 •

edited by cockroach-jira-scripts

Loading

msbutler commented Sep 25, 2023 •

edited

Loading

adityamaru commented Sep 25, 2023 •

edited

Loading

pav-kv commented Sep 28, 2023 •

edited

Loading