Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: tpcc/headroom/n4cpu16 failed #108173

Closed
cockroach-teamcity opened this issue Aug 4, 2023 · 5 comments
Closed

roachtest: tpcc/headroom/n4cpu16 failed #108173

cockroach-teamcity opened this issue Aug 4, 2023 · 5 comments
Labels
branch-release-22.2 Used to mark GA and release blockers, technical advisories, and bugs for 22.2 C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. T-testeng TestEng Team
Milestone

Comments

@cockroach-teamcity
Copy link
Member

cockroach-teamcity commented Aug 4, 2023

roachtest.tpcc/headroom/n4cpu16 failed with artifacts on release-22.2 @ c9cdbe4417ec387382cf6c0edafee51b3d58e278:

test artifacts and logs in: /artifacts/tpcc/headroom/n4cpu16/run_1
(monitor.go:127).Wait: monitor failure: monitor task failed: output in run_150105.178428446_n4_cockroach-workload-r: ./cockroach workload run tpcc --warehouses=875 --histograms=perf/stats.json  --ramp=5m0s --duration=2h0m0s --prometheus-port=0 --pprofport=33333  {pgurl:1-3} returned: COMMAND_PROBLEM: ssh verbose log retained in ssh_150105.867149457_n4_cockroach-workload-r.log: exit status 1

Parameters: ROACHTEST_arch=amd64 , ROACHTEST_cloud=gce , ROACHTEST_cpu=16 , ROACHTEST_encrypted=true , ROACHTEST_fs=zfs , ROACHTEST_localSSD=true , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/test-eng

This test on roachdash | Improve this report!

Jira issue: CRDB-30345

@cockroach-teamcity cockroach-teamcity added branch-release-22.2 Used to mark GA and release blockers, technical advisories, and bugs for 22.2 C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Aug 4, 2023
@cockroach-teamcity cockroach-teamcity added this to the 22.2 milestone Aug 4, 2023
@blathers-crl blathers-crl bot added the T-testeng TestEng Team label Aug 4, 2023
@srosenberg
Copy link
Member

srosenberg commented Aug 4, 2023

On n2,

I230804 15:10:43.419718 1030882 kv/kvserver/replica_command.go:441 ⋮ [n2,split,s2,r186/3:‹/Table/108/1/{96/4/2…-126/9/…}›] 292  initiating a split of this range at key /Table/108/1/‹111›/‹6›/‹2901› [r323] (‹512 MiB above threshold size 512 MiB›)‹›

W230804 15:30:43.316684 383 kv/kvserver/replica_raft.go:1381 ⋮ [n2,s2,r186/3:‹/Table/108/1/{96/4/2…-111/6/…}›,raft] 385  have been waiting 15.20s for slow proposal Put [/Table/108/1/‹105›/‹7›/‹2209›/‹0›,/Min), [txn: 6d089584]

I230804 15:30:43.317110 5042128 kv/kvserver/replica_circuit_breaker.go:148 ⋮ [n2,s2,r186/3:‹/Table/108/1/{96/4/2…-111/6/…}›] 386  breaker: tripped with error: replica unavailable: (n2,s2):3 unable to serve request to r186:‹/Table/108/1/{96/4/2108-111/6/2901}› [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=17, sticky=1691161305.178242646,0]: closed timestamp: 1691163025.140137429,0 (2023-08-04 15:30:25); raft status: {"id":"3","term":7,"vote":"3","commit":19402,"lead":"3","raftState":"StateLeader","applied":19402,"progress":{"2":{"match":19402,"next":19444,"state":"StateReplicate"},"3":{"match":19443,"next":19444,"state":"StateReplicate"},"1":{"match":19402,"next":19444,"state":"StateReplicate"}},"leadtransferee":"0"}: have been waiting 15.20s for slow proposal Put [/Table/108/1/‹105›/‹7›/‹2209›/‹0›,/Min), [txn: 6d089584]

On n1,

I230804 15:30:46.595676 394 kv/kvserver/store_raft.go:651 ⋮ [n1,s1,r186/1:‹/Table/108/1/{96/4/2…-111/6/…}›,raft] 1888  raft ready handling: 18.36s [append=0.00s, apply=0.00s, commit-batch-sync=18.36s, other=0.00s], wrote 840 B sync [append-ent=775 B (1), ]; node might be overloaded

It's unclear what triggered unavailability; potentially related to #105970

@srosenberg
Copy link
Member

We can rule out cpu and mem,

cpu_mem_grafana

fsync appears to be the cause of the overload: Grafana

@srosenberg srosenberg removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Aug 4, 2023
@renatolabs
Copy link
Contributor

Saving artifacts and debug.zip before they are deleted from TC:

debug.zip
artifacts.zip

@renatolabs
Copy link
Contributor

Around 15:30:43, n1 started seeing lots of errors acquiring latches:

W230804 15:30:43.239967 8046714 kv/kvserver/spanlatch/manager.go:558 ⋮ [n1,s1,r70/1:‹/Table/10{7/1-8}›] 749 have been waiting 15s to acquire ‹read› latch ‹/Table/107/1/105/7/0@0,0›, held by ‹write› latch ‹/Table/107/1/105/7/[email protected],0›

This is also when fsync latencies increased on that node (see Grafana link above). According to that same graph, fsync latency on n1 eventually comes back to normal, but the cluster continues to observe lots of errors (see n1 logs). The workload dies shortly after.

I'm unsure whether there's something more to be done here. Would be nice to get some @cockroachdb/replication eyes on the issue.

@renatolabs
Copy link
Contributor

I'm closing this due to inactivity. This issue:

  • has an unclear failure mode (see discussion above)
  • is over a month old
  • happened only once on release-22.2 and we haven't seen other failures since then

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch-release-22.2 Used to mark GA and release blockers, technical advisories, and bugs for 22.2 C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. T-testeng TestEng Team
Projects
No open projects
Status: Done
Development

No branches or pull requests

3 participants