roachtest: admission-control/elastic-backup failed #105955

Closed
cockroach-teamcity opened this issue Jul 1, 2023 · 5 comments
Assignees: irfansharif
Labels: branch-master (Failures and bugs on the master branch), C-test-failure (Broken test, automatically or manually discovered), O-roachtest, O-robot (Originated from a bot), T-storage (Storage Team), X-infra-flake (the automatically generated issue was closed due to an infrastructure problem, not a product issue)
Milestone: 23.2

Comments

@cockroach-teamcity
Member

cockroach-teamcity commented Jul 1, 2023

roachtest.admission-control/elastic-backup failed with artifacts on master @ df82b5306e945d396269f28874d98fa567f4ac32:

(monitor.go:137).Wait: monitor failure: monitor task failed: output in run_055414.054144921_n4_cockroach-workload-r: ./cockroach workload run tpcc --warehouses=1000 --histograms=perf/stats.json  --ramp=5m0s --duration=1h30m0s --prometheus-port=2112 --pprofport=33333  {pgurl:1-3} returned: COMMAND_PROBLEM: exit status 1
test artifacts and logs in: /artifacts/admission-control/elastic-backup/run_1

Parameters: ROACHTEST_arch=amd64, ROACHTEST_cloud=gce, ROACHTEST_cpu=8, ROACHTEST_encrypted=false, ROACHTEST_fs=ext4, ROACHTEST_localSSD=true, ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/kv-triage

This test on roachdash | Improve this report!

Jira issue: CRDB-29283

@cockroach-teamcity added the branch-master, C-test-failure, O-roachtest, O-robot, release-blocker (Indicates a release-blocker; use with a branch-release-2x.x label to denote which branch is blocked), and T-kv (KV Team) labels on Jul 1, 2023
@cockroach-teamcity added this to the 23.2 milestone on Jul 1, 2023
@irfansharif self-assigned this on Jul 4, 2023
@irfansharif
Contributor

Workload generator tripped with:

Error: error in stockLevel: ERROR: inbox communication error: rpc error: code = Canceled desc = context canceled (SQLSTATE 58C01)

@irfansharif
Contributor

irfansharif commented Jul 4, 2023

This is unrelated to the roachtest failure itself, but I'm seeing the following in the logs:

۩ cat logs/3.unredacted/cockroach.log | grep "observed raft log position less than per-stream lower bound" | head -n 10
E230701 07:05:49.168518 265 kv/kvserver/kvflowcontrol/kvflowtokentracker/tracker.go:95 ⋮ [-] 4421  observed raft log position less than per-stream lower bound (log-position=0/0 <= log-position=7/252461)
E230701 07:05:49.171144 265 kv/kvserver/kvflowcontrol/kvflowtokentracker/tracker.go:95 ⋮ [-] 4434  observed raft log position less than per-stream lower bound (log-position=0/0 <= log-position=6/23546)
E230701 07:05:49.171165 265 kv/kvserver/kvflowcontrol/kvflowtokentracker/tracker.go:95 ⋮ [-] 4435  observed raft log position less than per-stream lower bound (log-position=0/0 <= log-position=6/23546)
E230701 07:05:49.171178 265 kv/kvserver/kvflowcontrol/kvflowtokentracker/tracker.go:95 ⋮ [-] 4436  observed raft log position less than per-stream lower bound (log-position=0/0 <= log-position=6/23546)
E230701 07:05:49.171190 265 kv/kvserver/kvflowcontrol/kvflowtokentracker/tracker.go:95 ⋮ [-] 4437  observed raft log position less than per-stream lower bound (log-position=0/0 <= log-position=6/23546)
E230701 07:05:49.171201 265 kv/kvserver/kvflowcontrol/kvflowtokentracker/tracker.go:95 ⋮ [-] 4438  observed raft log position less than per-stream lower bound (log-position=0/0 <= log-position=6/23546)
E230701 07:05:49.171215 265 kv/kvserver/kvflowcontrol/kvflowtokentracker/tracker.go:95 ⋮ [-] 4439  observed raft log position less than per-stream lower bound (log-position=0/0 <= log-position=6/23546)
E230701 07:05:49.171363 265 kv/kvserver/kvflowcontrol/kvflowtokentracker/tracker.go:95 ⋮ [-] 4440  observed raft log position less than per-stream lower bound (log-position=0/0 <= log-position=7/20430)
E230701 07:05:49.171381 265 kv/kvserver/kvflowcontrol/kvflowtokentracker/tracker.go:95 ⋮ [-] 4441  observed raft log position less than per-stream lower bound (log-position=0/0 <= log-position=7/20430)
E230701 07:05:49.172008 265 kv/kvserver/kvflowcontrol/kvflowtokentracker/tracker.go:95 ⋮ [-] 4445  observed raft log position less than per-stream lower bound (log-position=0/0 <= log-position=6/33144)

This is an indication of a kvflowcontrol integration bug; filed #106123. Here's the check that emits the error:

if !(dt.lowerBound.Less(pos)) {
    // We're trying to track a token deduction at a position less than the
    // stream's lower-bound. Shout loudly but ultimately no-op. This
    // regression indicates buggy usage since:
    // - token deductions are done so with monotonically increasing log
    //   positions (see Handle.DeductTokensFor);
    // - the monotonically increasing log positions for token deductions
    //   also extends to positions at which streams are connected,
    //   which typically happen when (a) members are added to the raft
    //   group, (b) previously crashed follower nodes restart, (c)
    //   replicas are unpaused, or (d) caught back up via snapshots (see
    //   Handle.ConnectStream).
    // - token returns upto some log position don't precede deductions at
    //   lower log positions (see Handle.ReturnTokensUpto);
    logFn := log.Errorf
    if buildutil.CrdbTestBuild {
        logFn = log.Fatalf
    }
    logFn(ctx, "observed raft log position less than per-stream lower bound (%s <= %s)",
        pos, dt.lowerBound)
    return false
}
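
To make the failure mode concrete: raft log positions compare by term first, then index, so the zero-value position 0/0 is below any connected stream's lower bound and will always trip this guard. A minimal, self-contained sketch with simplified types (not the actual kvflowcontrolpb definitions, just an illustration of the comparison the log message reflects):

package main

import "fmt"

// logPosition is a simplified stand-in for a raft log position (term/index),
// compared lexicographically: first by term, then by index.
type logPosition struct {
    term, index uint64
}

func (p logPosition) Less(o logPosition) bool {
    if p.term != o.term {
        return p.term < o.term
    }
    return p.index < o.index
}

func (p logPosition) String() string {
    return fmt.Sprintf("log-position=%d/%d", p.term, p.index)
}

func main() {
    lowerBound := logPosition{term: 7, index: 252461} // per-stream lower bound from the first log line above
    var pos logPosition                               // zero value, i.e. 0/0, as seen in the logs

    // Mirrors the guard above: complain when the position being tracked does
    // not exceed the stream's lower bound.
    if !lowerBound.Less(pos) {
        fmt.Printf("observed raft log position less than per-stream lower bound (%s <= %s)\n", pos, lowerBound)
    }
}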

As an aside, I thought roachtests would have the crdb_test build tag enabled, which would have turned these into fatals. Is that not happening?

// CrdbTestBuild is a flag that is set to true if the binary was compiled
// with the 'crdb_test' build tag (which is the case for all test targets). This
// flag can be used to enable expensive checks, test randomizations, or other
// metamorphic-style perturbations that will not affect test results but will
// exercise different parts of the code.
const CrdbTestBuild = false
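
For reference, the constant is toggled via Go build constraints: one file compiled without the crdb_test tag, one with it. A minimal sketch of the pattern (file names and comments are illustrative, not necessarily what's in the repo):

// crdb_test_off.go (illustrative file name)
//go:build !crdb_test

package buildutil

// CrdbTestBuild is false in ordinary builds.
const CrdbTestBuild = false

// crdb_test_on.go (illustrative file name)
//go:build crdb_test

package buildutil

// CrdbTestBuild is true when compiled with -tags crdb_test, as is the case
// for test targets.
const CrdbTestBuild = true

So if the roachtest cluster binary is built without crdb_test, logFn stays log.Errorf and the check above only logs instead of fatalling.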

@irfansharif
Contributor

Metrics from this run: https://grafana.testeng.crdb.io/d/ngnKs_j7x/admission-control?orgId=1&from=1688190294256&to=1688195718471&var-cluster=teamcity-10750279-1688190283-03-n4cpu8&var-instances=All.

Another aside: I wonder what these mean. Why are we not ticking the raft group for 16s?

10971:I230701 07:05:49.168118 265 kv/kvserver/store_raft.go:670 ⋮ [T1,n3,s3,r44/2:‹/Table/4{2-3}›,raft] 4418  raft ready handling: 16.03s [append=0.00s, apply=0.00s, non-blocking-sync=16.03s, other=0.00s], wrote [append-batch=51 B, ]; node might be overloaded
12040:I230701 07:05:49.198847 284 kv/kvserver/store_raft.go:670 ⋮ [T1,n3,s3,r141/2:‹/Table/108/1/7{62/7/…-77/9/…}›,raft] 4872  raft ready handling: 16.06s [append=0.00s, apply=0.00s, non-blocking-sync=16.06s, other=0.00s], wrote [append-batch=52 B, ]; node might be overloaded
12048:I230701 07:05:49.199127 305 kv/kvserver/store_raft.go:670 ⋮ [T1,n3,s3,r82/2:‹/Table/10{6/1-7/1}›,raft] 4876  raft ready handling: 16.06s [append=0.00s, apply=0.00s, non-blocking-sync=16.06s, other=0.00s], wrote [append-batch=52 B, ]; node might be overloaded
12057:I230701 07:05:49.199435 309 kv/kvserver/store_raft.go:670 ⋮ [T1,n3,s3,r79/2:‹/Table/11{1/1-2/1}›,raft] 4883  raft ready handling: 16.06s [append=0.00s, apply=0.00s, non-blocking-sync=16.06s, other=0.00s], wrote [append-batch=51 B, ]; node might be overloaded
12064:I230701 07:05:49.199679 320 kv/kvserver/store_raft.go:670 ⋮ [T1,n3,s3,r163/2:‹/Table/109/1/{559/"\…-666/"\…}›,raft] 4888  raft ready handling: 16.06s [append=0.00s, apply=0.00s, non-blocking-sync=16.06s, other=0.00s], wrote [append-batch=52 B, ]; node might be overloaded
12072:I230701 07:05:49.199909 261 kv/kvserver/store_raft.go:670 ⋮ [T1,n3,s3,r148/2:‹/Table/110/1/{396/3/…-666/7/…}›,raft] 4892  raft ready handling: 16.06s [append=0.00s, apply=0.00s, non-blocking-sync=16.06s, other=0.00s], wrote [append-batch=52 B, ]; node might be overloaded
12098:I230701 07:05:49.200875 308 kv/kvserver/store_raft.go:670 ⋮ [T1,n3,s3,r127/2:‹/Table/109/1/{698/"\…-796/"\…}›,raft] 4902  raft ready handling: 16.06s [append=0.00s, apply=0.00s, non-blocking-sync=16.06s, other=0.00s], wrote [append-batch=393 B, append-ent=328 B (1), ]; node might be overloaded
12109:I230701 07:05:49.201199 288 kv/kvserver/store_raft.go:670 ⋮ [T1,n3,s3,r313/2:‹/Table/114/1/5{39/9/…-51/9/…}›,raft] 4907  raft ready handling: 16.06s [append=0.00s, apply=0.00s, non-blocking-sync=16.06s, other=0.00s], wrote [append-batch=3.8 KiB, append-ent=3.7 KiB (1), ]; node might be overloaded
12120:I230701 07:05:49.201418 282 kv/kvserver/store_raft.go:670 ⋮ [T1,n3,s3,r73/2:‹/Table/10{7/1-8/1}›,raft] 4914  raft ready handling: 16.06s [append=0.00s, apply=0.00s, non-blocking-sync=16.06s, other=0.00s], wrote [append-batch=814 B, append-ent=749 B (1), ]; node might be overloaded
12132:I230701 07:05:49.201633 273 kv/kvserver/store_raft.go:670 ⋮ [T1,n3,s3,r108/2:‹/Table/108/1/{5/1/23…-20/3/1…}›,raft] 4920  raft ready handling: 16.06s [append=0.00s, apply=0.00s, non-blocking-sync=16.06s, other=0.00s], wrote [append-batch=868 B, append-ent=804 B (1), ]; node might be overloaded
12136:I230701 07:05:49.201768 306 kv/kvserver/store_raft.go:670 ⋮ [T1,n3,s3,r107/2:‹/Table/113/1/3{35/95…-43/81…}›,raft] 4922  raft ready handling: 16.06s [append=0.00s, apply=0.00s, non-blocking-sync=16.06s, other=0.00s], wrote [append-batch=5.1 KiB, append-ent=5.1 KiB (1), ]; node might be overloaded
12138:I230701 07:05:49.201930 268 kv/kvserver/store_raft.go:670 ⋮ [T1,n3,s3,r120/2:‹/Table/114/1/3{37/2/…-49/1/…}›,raft] 4924  raft ready handling: 16.06s [append=0.00s, apply=0.00s, non-blocking-sync=16.06s, other=0.00s], wrote [append-batch=2.3 KiB, append-ent=2.2 KiB (1), ]; node might be overloaded

@irfansharif
Contributor

irfansharif commented Jul 4, 2023

There's evidence of slow disks.

I230701 07:05:38.319562 19360381 3@storage/pebble.go:1114 ⋮ [n3,s3,pebble] 59426  disk slowness detected: syncdata on file 024372.log has been ongoing for 5.3s
I230701 07:05:38.523759 19360478 3@storage/pebble.go:1114 ⋮ [n3,s3,pebble] 59427  disk slowness detected: syncdata on file 024378.sst has been ongoing for 5.5s
I230701 07:05:40.319617 19360718 3@storage/pebble.go:1114 ⋮ [n3,s3,pebble] 59429  disk slowness detected: syncdata on file 024372.log has been ongoing for 7.3s
I230701 07:05:40.524417 19360639 3@storage/pebble.go:1114 ⋮ [n3,s3,pebble] 59430  disk slowness detected: syncdata on file 024378.sst has been ongoing for 7.5s
I230701 07:05:49.044503 19361573 3@storage/pebble.go:1114 ⋮ [n3,s3,pebble] 59431  disk slowness detected: create on file 024379.sst has been ongoing for 9.6s
I230701 07:05:49.056791 19361572 3@storage/pebble.go:1114 ⋮ [n3,s3,pebble] 59432  disk slowness detected: remove on file 020891.sst has been ongoing for 9.6s
W230701 07:05:38.515107 346 kv/kvserver/liveness/liveness.go:861 ⋮ [T1,n3,liveness-hb] 3171  slow heartbeat took 3.00021964s; err=disk write failed while updating node liveness: interrupted during singleflight ‹engine sync:0›: context deadline exceeded
W230701 07:05:38.515192 346 kv/kvserver/liveness/liveness.go:763 ⋮ [T1,n3,liveness-hb] 3172  failed node liveness heartbeat: operation ‹"node liveness heartbeat"› timed out after 3s (given timeout 3s): disk write failed while updating node liveness: interrupted during singleflight ‹engine sync:0›: context deadline exceeded
...
W230701 07:05:49.040956 346 kv/kvserver/liveness/liveness.go:861 ⋮ [T1,n3,liveness-hb] 3186  slow heartbeat took 10.525714696s; err=disk write failed while updating node liveness: interrupted during singleflight ‹engine sync:0›: context deadline exceeded
W230701 07:05:49.041033 346 kv/kvserver/liveness/liveness.go:763 ⋮ [T1,n3,liveness-hb] 3187  failed node liveness heartbeat: operation ‹"node liveness heartbeat"› timed out after 10.526s (given timeout 3s): disk write failed while updating node liveness: interrupted during singleflight ‹engine sync:0›: context deadline exceeded

We see this in the metrics too:

[metrics screenshots]

@irfansharif
Contributor

irfansharif commented Jul 4, 2023

The same thing was seen in #105833. This failure is not AC/KV related, so I'm going to put it on the @cockroachdb/storage investigations board (but I'll close this issue).

@blathers-crl (bot) added the T-storage (Storage Team) label on Jul 4, 2023
@irfansharif added the X-infra-flake label and removed the release-blocker label on Jul 4, 2023
@exalate-issue-sync (bot) removed the T-kv (KV Team) label on Jul 4, 2023
@jbowens moved this to Done in [Deprecated] Storage on Jun 4, 2024