Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: cdc/scan/initial/nodes=5/cpu=16/rows=1G/ranges=100k/protocol=rangefeed/format=json/sink=null failed #108243

Closed
cockroach-teamcity opened this issue Aug 5, 2023 · 4 comments
Assignees
Labels
A-cdc Change Data Capture branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. T-cdc X-unactionable This was closed because it was unactionable.
Milestone

Comments

@cockroach-teamcity
Copy link
Member

cockroach-teamcity commented Aug 5, 2023

roachtest.cdc/scan/initial/nodes=5/cpu=16/rows=1G/ranges=100k/protocol=rangefeed/format=json/sink=null failed with artifacts on master @ 695d4ec85157a4da0c97237e6f16bf919eebf6c5:

(test_runner.go:1075).runTest: test timed out (2h0m0s)
(monitor.go:137).Wait: monitor failure: monitor task failed: dial tcp 34.23.84.165:26257: connect: connection refused
test artifacts and logs in: /artifacts/cdc/scan/initial/nodes=5/cpu=16/rows=1G/ranges=100k/protocol=rangefeed/format=json/sink=null/run_1

Parameters: ROACHTEST_arch=amd64 , ROACHTEST_cloud=gce , ROACHTEST_cpu=16 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

/cc @cockroachdb/cdc

This test on roachdash | Improve this report!

Jira issue: CRDB-30379

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-cdc labels Aug 5, 2023
@cockroach-teamcity cockroach-teamcity added this to the 23.2 milestone Aug 5, 2023
@blathers-crl blathers-crl bot added the A-cdc Change Data Capture label Aug 5, 2023
@erikgrinaker erikgrinaker removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Aug 7, 2023
@erikgrinaker
Copy link
Contributor

I can have a look, since I added this test last week.

@erikgrinaker erikgrinaker self-assigned this Aug 7, 2023
@erikgrinaker
Copy link
Contributor

The test ended up timing out after 2 hours. Normally, the scan takes about 25 minutes, plus 30 minutes for setup.

@erikgrinaker
Copy link
Contributor

erikgrinaker commented Aug 7, 2023

Seeing some node liveness failures and gRPC disconnects in the middle of the initial scan. I'd expect it to retry though, but it appears like it stalled the scan. Might be a similar problem as #108157. Don't have time to dig in further right now, will pick this up later.

n4 failures:

W230805 12:34:15.775547 6411030 kv/kvserver/closedts/sidetransport/receiver.go:140 ⋮ [n4] 131980  closed timestamps side-transport connection dropped from node: 2 (grpc: ‹context canceled› [code 1/Canceled])
W230805 12:34:16.434247 4765502 2@rpc/clock_offset.go:291 ⋮ [T1,n4,rnode=6,raddr=‹10.142.0.224:26257›,class=default,rpc] 131981  latency jump (prev avg 0.37ms, current 112.97ms)
W230805 12:34:17.652496 514 kv/kvserver/liveness/liveness.go:774 ⋮ [T1,n4,liveness-hb] 131982  slow heartbeat took 3.000866446s; err=result is ambiguous: context done during DistSender.Send: ba: ‹ConditionalPut [/System/NodeLiveness/4,/Min), EndTxn(commit modified-span (node-liveness)) [/System/NodeLiveness/4], [txn: 9481bd0b], [can-forward-ts]› RPC error: grpc: ‹context deadline exceeded› [code 4/DeadlineExceeded]
W230805 12:34:17.652653 514 kv/kvserver/liveness/liveness.go:690 ⋮ [T1,n4,liveness-hb] 131983  failed node liveness heartbeat: operation ‹"node liveness heartbeat"› timed out after 3.001s (given timeout 3s): result is ambiguous: context done during DistSender.Send: ba: ‹ConditionalPut [/System/NodeLiveness/4,/Min), EndTxn(commit modified-span (node-liveness)) [/System/NodeLiveness/4], [txn: 9481bd0b], [can-forward-ts]› RPC error: grpc: ‹context deadline exceeded› [code 4/DeadlineExceeded]
W230805 12:34:17.652653 514 kv/kvserver/liveness/liveness.go:690 ⋮ [T1,n4,liveness-hb] 131983 +
W230805 12:34:17.652653 514 kv/kvserver/liveness/liveness.go:690 ⋮ [T1,n4,liveness-hb] 131983 +An inability to maintain liveness will prevent a node from participating in a
W230805 12:34:17.652653 514 kv/kvserver/liveness/liveness.go:690 ⋮ [T1,n4,liveness-hb] 131983 +cluster. If this problem persists, it may be a sign of resource starvation or
W230805 12:34:17.652653 514 kv/kvserver/liveness/liveness.go:690 ⋮ [T1,n4,liveness-hb] 131983 +of network connectivity problems. For help troubleshooting, visit:
W230805 12:34:17.652653 514 kv/kvserver/liveness/liveness.go:690 ⋮ [T1,n4,liveness-hb] 131983 +
W230805 12:34:17.652653 514 kv/kvserver/liveness/liveness.go:690 ⋮ [T1,n4,liveness-hb] 131983 +    https://www.cockroachlabs.com/docs/stable/cluster-setup-troubleshooting.html#node-liveness-issues
W230805 12:34:18.035940 6409948 kv/kvserver/raft_transport.go:996 ⋮ [T1,n4] 131984  while processing outgoing Raft queue to node 2: recv msg error: grpc: ‹grpc: the client connection is closing› [code 1/Canceled]:

n2 failures:

W230805 12:34:16.263173 4066306 2@rpc/clock_offset.go:291 ⋮ [T1,n2,rnode=6,raddr=‹10.142.0.224:26257›,class=default,rpc] 117229  latency jump (prev avg 0.84ms, current 2566.59ms)
W230805 12:34:16.265301 5636390 kv/kvserver/raft_transport.go:996 ⋮ [T1,n2] 117230  while processing outgoing Raft queue to node 4: recv msg error: grpc: ‹error reading from server: EOF› [code 14/Unavailable]:
W230805 12:34:16.265682 5834 kv/kvserver/raft_transport.go:996 ⋮ [T1,n2] 117231  while processing outgoing Raft queue to node 5: recv msg error: grpc: ‹error reading from server: EOF› [code 14/Unavailable]:
W230805 12:34:16.266643 2344 kv/kvserver/raft_transport.go:996 ⋮ [T1,n2] 117232  while processing outgoing Raft queue to node 4: recv msg error: grpc: ‹error reading from server: EOF› [code 14/Unavailable]:
W230805 12:34:17.105781 2843 kv/kvserver/raft_transport.go:996 ⋮ [T1,n2] 117233  while processing outgoing Raft queue to node 5: send msg error: ‹EOF›:
E230805 12:34:17.115462 2844 2@rpc/peer.go:590 ⋮ [T1,n2,rnode=5,raddr=‹10.142.0.201:26257›,class=default,rpc] 117234  disconnected (was healthy for 1h8m38.34s): grpc: ‹error reading from server: EOF› [code 14/Unavailable]
W230805 12:34:17.118093 2842 kv/kvserver/store_raft.go:330 ⋮ [T1,n2] 117235  raft receive queue for r84924 is full
E230805 12:34:17.119108 2206 2@rpc/peer.go:590 ⋮ [T1,n2,rnode=3,raddr=‹10.142.0.229:26257›,class=system,rpc] 117236  disconnected (was healthy for 1h8m40.607s): grpc: ‹error reading from server: EOF› [code 14/Unavailable]
W230805 12:34:17.119298 2732 kv/kvserver/raft_transport.go:996 ⋮ [T1,n2] 117237  while processing outgoing Raft queue to node 3: recv msg error: grpc: ‹error reading from server: EOF› [code 14/Unavailable]:
E230805 12:34:17.119483 2224 2@rpc/peer.go:590 ⋮ [T1,n2,rnode=4,raddr=‹10.142.0.105:26257›,class=system,rpc] 117238  disconnected (was healthy for 18m55.361s): grpc: ‹error reading from server: EOF› [code 14/Unavailable]
E230805 12:34:17.119690 293 2@rpc/peer.go:590 ⋮ [T1,n2,rnode=1,raddr=‹10.142.0.226:26257›,class=system,rpc] 117239  disconnected (was healthy for 1h8m44.675s): grpc: ‹error reading from server: EOF› [code 14/Unavailable]
W230805 12:34:17.119838 1639 kv/kvserver/raft_transport.go:996 ⋮ [T1,n2] 117240  while processing outgoing Raft queue to node 1: recv msg error: grpc: ‹error reading from server: EOF› [code 14/Unavailable]:
E230805 12:34:17.120050 2722 2@rpc/peer.go:590 ⋮ [T1,n2,rnode=5,raddr=‹10.142.0.201:26257›,class=system,rpc] 117241  disconnected (was healthy for 1h8m39.826s): grpc: ‹error reading from server: EOF› [code 14/Unavailable]
W230805 12:34:17.121937 2519 kv/kvserver/closedts/sidetransport/sender.go:838 ⋮ [T1,n2,ctstream=4] 117242  failed to send closed timestamp message 17604 to n4: send msg error: ‹EOF›
W230805 12:34:17.145581 3020 kv/kvserver/closedts/sidetransport/sender.go:838 ⋮ [T1,n2,ctstream=5] 117243  failed to send closed timestamp message 17604 to n5: send msg error: ‹EOF›
W230805 12:34:17.145652 2518 kv/kvserver/closedts/sidetransport/sender.go:838 ⋮ [T1,n2,ctstream=3] 117244  failed to send closed timestamp message 17604 to n3: send msg error: ‹EOF›
W230805 12:34:17.145689 4055 kv/kvserver/closedts/sidetransport/sender.go:838 ⋮ [T1,n2,ctstream=1] 117245  failed to send closed timestamp message 17604 to n1: send msg error: ‹EOF›
E230805 12:34:17.177633 2144 2@rpc/peer.go:590 ⋮ [T1,n2,rnode=4,raddr=‹10.142.0.105:26257›,class=default,rpc] 117246  disconnected (was healthy for 1h8m40.716s): grpc: ‹error reading from server: EOF› [code 14/Unavailable]
W230805 12:34:17.957559 495 kv/kvserver/liveness/liveness.go:774 ⋮ [T1,n2,liveness-hb] 117247  slow heartbeat took 3.392284254s; err=disk write failed while updating node liveness: interrupted during singleflight ‹engine sync:0›: context deadline exceeded
W230805 12:34:17.957678 495 kv/kvserver/liveness/liveness.go:690 ⋮ [T1,n2,liveness-hb] 117248  failed node liveness heartbeat: operation ‹"node liveness heartbeat"› timed out after 3.392s (given timeout 3s): disk write failed while updating node liveness: interrupted during singleflight ‹engine sync:0›: context deadline exceeded
W230805 12:34:17.957678 495 kv/kvserver/liveness/liveness.go:690 ⋮ [T1,n2,liveness-hb] 117248 +
W230805 12:34:17.957678 495 kv/kvserver/liveness/liveness.go:690 ⋮ [T1,n2,liveness-hb] 117248 +An inability to maintain liveness will prevent a node from participating in a
W230805 12:34:17.957678 495 kv/kvserver/liveness/liveness.go:690 ⋮ [T1,n2,liveness-hb] 117248 +cluster. If this problem persists, it may be a sign of resource starvation or
W230805 12:34:17.957678 495 kv/kvserver/liveness/liveness.go:690 ⋮ [T1,n2,liveness-hb] 117248 +of network connectivity problems. For help troubleshooting, visit:
W230805 12:34:17.957678 495 kv/kvserver/liveness/liveness.go:690 ⋮ [T1,n2,liveness-hb] 117248 +
W230805 12:34:17.957678 495 kv/kvserver/liveness/liveness.go:690 ⋮ [T1,n2,liveness-hb] 117248 +    https://www.cockroachlabs.com/docs/stable/cluster-setup-troubleshooting.html#node-liveness-issues
I230805 12:34:17.977364 6944649 rpc/heartbeat.go:174 ⋮ [-] 117249  failing ping request from node n1
I230805 12:34:17.981647 6944651 rpc/heartbeat.go:174 ⋮ [-] 117250  failing ping request from node n4
E230805 12:34:17.991912 26 2@rpc/peer.go:590 ⋮ [T1,n2,rnode=?,raddr=‹34.138.219.11:26257›,class=system,rpc] 117251  disconnected (was healthy for 1h8m45.548s): grpc: ‹error reading from server: EOF› [code 14/Unavailable]
I230805 12:34:17.992311 6944969 rpc/heartbeat.go:174 ⋮ [-] 117252  failing ping request from node n3
W230805 12:34:17.994759 2312 kv/kvserver/raft_transport.go:996 ⋮ [T1,n2] 117253  while processing outgoing Raft queue to node 3: send msg error: ‹EOF›:

@erikgrinaker
Copy link
Contributor

Artifacts are gone, closing as inactionable.

@erikgrinaker erikgrinaker added the X-unactionable This was closed because it was unactionable. label Sep 11, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-cdc Change Data Capture branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. T-cdc X-unactionable This was closed because it was unactionable.
Projects
None yet
Development

No branches or pull requests

2 participants