
roachtest: tpcc-nowait/isolation-level=snapshot/nodes=3/w=1 failed #136277

Closed
cockroach-teamcity opened this issue Nov 27, 2024 · 2 comments
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. T-testeng TestEng Team X-infra-flake the automatically generated issue was closed due to an infrastructure problem not a product issue

Comments

@cockroach-teamcity
Member

cockroach-teamcity commented Nov 27, 2024

roachtest.tpcc-nowait/isolation-level=snapshot/nodes=3/w=1 failed with artifacts on master @ 97965d4a2a614f2ac7fc9b10e6b5f4a92ed1d502:

(monitor.go:149).Wait: monitor failure: full command output in run_074711.652901426_n4_cockroach-workload-r.log: COMMAND_PROBLEM: exit status 1
test artifacts and logs in: /artifacts/tpcc-nowait/isolation-level=snapshot/nodes=3/w=1/cpu_arch=arm64/run_1

Parameters:

  • arch=arm64
  • cloud=azure
  • coverageBuild=false
  • cpu=16
  • encrypted=false
  • metamorphicLeases=default
  • runtimeAssertionsBuild=false
  • ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

Grafana is not yet available for azure clusters

/cc @cockroachdb/test-eng

This test on roachdash

Jira issue: CRDB-44951

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-testeng TestEng Team labels Nov 27, 2024
@DarrylWong
Contributor

Error is:

error in newOrder: ERROR: result is ambiguous: error=ba: QueryIntent [/Table/108/1/0/5/483/0], [txn: c34da9ae], [protect-ambiguous-replay] RPC error: grpc: error reading from server: read tcp 10.1.0.171:35562->10.1.0.170:26257: use of closed network connection [code 14/Unavailable] [exhausted] (last error: failed to send RPC: sending to all replicas failed; last error: ba: QueryIntent [/Table/108/1/0/5/483/0], [txn: c34da9ae], [protect-ambiguous-replay] RPC error: grpc: error reading from server: read tcp 10.1.0.171:35562->10.1.0.170:26257: use of closed network connection [code 14/Unavailable]) (SQLSTATE 40003)
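For context on the SQLSTATE in the error above: 40001 (`serialization_failure`) is safe to retry blindly, but 40003 (`statement_completion_unknown`) means the transaction's outcome is ambiguous, so a client must verify whether it committed before retrying. A minimal sketch of that classification, with illustrative names that are not from the cockroach or workload codebase:

```python
# Sketch: how a workload client might classify transaction errors by
# SQLSTATE. The sets and function names here are illustrative only.

RETRYABLE = {"40001"}  # serialization_failure: definitely did not commit
AMBIGUOUS = {"40003"}  # statement_completion_unknown: outcome unknown

def classify(sqlstate: str) -> str:
    """Return how a client should treat a failed transaction."""
    if sqlstate in RETRYABLE:
        return "retry"   # safe to re-run the transaction as-is
    if sqlstate in AMBIGUOUS:
        return "verify"  # may have committed; check state before retrying
    return "fail"        # surface as a real error
```

The error in this issue carries SQLSTATE 40003, so blindly retrying `newOrder` would risk a duplicate order; the ambiguity itself stems from the RPC failing mid-flight, as the analysis below shows.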

Looks like an infra flake? Node 3 struggles to stay connected to the cluster:

W241127 08:00:38.184735 546 kv/kvserver/liveness/liveness.go:667 ⋮ [T1,Vsystem,n3,liveness-hb] 1724 +
W241127 08:00:38.184735 546 kv/kvserver/liveness/liveness.go:667 ⋮ [T1,Vsystem,n3,liveness-hb] 1724 +An inability to maintain liveness will prevent a node from participating in a
W241127 08:00:38.184735 546 kv/kvserver/liveness/liveness.go:667 ⋮ [T1,Vsystem,n3,liveness-hb] 1724 +cluster. If this problem persists, it may be a sign of resource starvation or
W241127 08:00:38.184735 546 kv/kvserver/liveness/liveness.go:667 ⋮ [T1,Vsystem,n3,liveness-hb] 1724 +of network connectivity problems. For help troubleshooting, visit:
W241127 08:00:38.184735 546 kv/kvserver/liveness/liveness.go:667 ⋮ [T1,Vsystem,n3,liveness-hb] 1724 +
W241127 08:00:38.184735 546 kv/kvserver/liveness/liveness.go:667 ⋮ [T1,Vsystem,n3,liveness-hb] 1724 +    https://www.cockroachlabs.com/docs/stable/cluster-setup-troubleshooting.html#node-liveness-issues

Likewise, the following logs are seen on nodes 1 and 2.

I241127 08:00:30.977983 2990 kv/kvserver/closedts/sidetransport/sender.go:803 ⋮ [T1,Vsystem,n1,ctstream=3] 1275  side-transport failed to connect to n3: failed to connect to n3 at ‹10.1.0.170:26257›: grpc: ‹rpc error: code = Unavailable desc 

@DarrylWong DarrylWong removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Dec 3, 2024
@srosenberg
Member

Yeah, it looks like the network dropped out around 07:55 on n3. From the dmesg log:

[Wed Nov 27 07:55:49 2024] NETDEV WATCHDOG: enP23746s1 (mlx5_core): transmit queue 0 timed out
[Wed Nov 27 07:55:49 2024] WARNING: CPU: 7 PID: 0 at net/sched/sch_generic.c:525 dev_watchdog+0x28c/0x2ac
[Wed Nov 27 07:55:49 2024] Modules linked in: xt_tcpudp xt_owner xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat nf_tables libcrc32c nfnetlink nvme_fabrics mlx5_ib ib_uverbs ib_core mlx5_core tls mlxfw psample udf crc_itu_t binfmt_misc nls_iso8859_1 joydev aes_ce_blk serio_raw hid_generic aes_ce_cipher crct10dif_ce polyval_ce hyperv_drm polyval_generic drm_kms_helper ghash_ce syscopyarea sysfillrect sm4 sha2_ce sha256_arm64 hid_hyperv sysimgblt sha1_ce hyperv_keyboard hid drm_shmem_helper hv_netvsc dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua sch_fq_codel efi_pstore drm ip_tables x_tables autofs4
[Wed Nov 27 07:55:49 2024] CPU: 7 PID: 0 Comm: swapper/7 Not tainted 6.2.0-1018-azure #18~22.04.1-Ubuntu
[Wed Nov 27 07:55:49 2024] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 09/28/2024
[Wed Nov 27 07:55:49 2024] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[Wed Nov 27 07:55:49 2024] pc : dev_watchdog+0x28c/0x2ac
[Wed Nov 27 07:55:49 2024] lr : dev_watchdog+0x28c/0x2ac
[Wed Nov 27 07:55:49 2024] sp : ffff80000803bde0
[Wed Nov 27 07:55:49 2024] x29: ffff80000803bde0 x28: ffffa9ef72157000 x27: ffff80000803bee0
[Wed Nov 27 07:55:49 2024] x26: ffffa9ef718b1008 x25: ffffa9ef725a3000 x24: 0000000000000000
[Wed Nov 27 07:55:49 2024] x23: ffffa9ef72157000 x22: 0000000000000000 x21: ffff00011e468000
[Wed Nov 27 07:55:49 2024] x20: ffff00011e4684c8 x19: ffffa9ef72576f40 x18: 0000000000000000
[Wed Nov 27 07:55:49 2024] x17: 2064656d69742030 x16: 2065756575712074 x15: 696d736e61727420
[Wed Nov 27 07:55:49 2024] x14: 3a2965726f635f35 x13: 74756f2064656d69 x12: 7420302065756575
[Wed Nov 27 07:55:49 2024] x11: 712074696d736e61 x10: 7274203a2965726f x9 : ffffa9ef6fd66fdc
[Wed Nov 27 07:55:49 2024] x8 : 34373332506e6520 x7 : 0000000000000001 x6 : 0000000000000001
[Wed Nov 27 07:55:49 2024] x5 : 0000000000000000 x4 : 0000000000000040 x3 : 0000000000000001
[Wed Nov 27 07:55:49 2024] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff000100533180
[Wed Nov 27 07:55:49 2024] Call trace:
[Wed Nov 27 07:55:49 2024]  dev_watchdog+0x28c/0x2ac
[Wed Nov 27 07:55:49 2024]  call_timer_fn+0x3c/0x17c
[Wed Nov 27 07:55:49 2024]  __run_timers.part.0+0x318/0x3e0
[Wed Nov 27 07:55:49 2024]  run_timer_softirq+0x44/0x80
[Wed Nov 27 07:55:49 2024]  __do_softirq+0x134/0x3d8
[Wed Nov 27 07:55:49 2024]  ____do_softirq+0x18/0x24
[Wed Nov 27 07:55:49 2024]  call_on_irq_stack+0x24/0x30
[Wed Nov 27 07:55:49 2024]  do_softirq_own_stack+0x24/0x3c
[Wed Nov 27 07:55:49 2024]  __irq_exit_rcu+0x118/0x160
[Wed Nov 27 07:55:49 2024]  irq_exit_rcu+0x18/0x24
[Wed Nov 27 07:55:49 2024]  el1_interrupt+0x4c/0xb0
[Wed Nov 27 07:55:49 2024]  el1h_64_irq_handler+0x18/0x2c
[Wed Nov 27 07:55:49 2024]  el1h_64_irq+0x78/0x7c
[Wed Nov 27 07:55:49 2024]  arch_cpu_idle+0x18/0x4c
[Wed Nov 27 07:55:49 2024]  default_idle_call+0x50/0x114
[Wed Nov 27 07:55:49 2024]  cpuidle_idle_call+0x174/0x1e0
[Wed Nov 27 07:55:49 2024]  do_idle+0xb8/0x110
[Wed Nov 27 07:55:49 2024]  cpu_startup_entry+0x2c/0x34
[Wed Nov 27 07:55:49 2024]  secondary_start_kernel+0xf0/0x154
[Wed Nov 27 07:55:49 2024]  __secondary_switched+0xb0/0xb4
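The diagnostic step here, correlating the NIC watchdog timeout in dmesg with the time the node lost liveness, can be sketched as a small parser. This is an illustrative helper, not an existing roachtest utility:

```python
import re
from datetime import datetime

# Sketch: scan dmesg output for NETDEV WATCHDOG transmit-queue timeouts,
# the kernel's signal that the VM's network device stalled. Matching these
# timestamps against the test's liveness failures supports the infra-flake
# conclusion.
WATCHDOG = re.compile(r"^\[(?P<ts>[^\]]+)\] NETDEV WATCHDOG: (?P<dev>\S+)")

def find_nic_stalls(dmesg_text):
    """Return (timestamp, device) pairs for NETDEV WATCHDOG events."""
    stalls = []
    for line in dmesg_text.splitlines():
        m = WATCHDOG.match(line)
        if m:
            ts = datetime.strptime(m["ts"], "%a %b %d %H:%M:%S %Y")
            stalls.append((ts, m["dev"]))
    return stalls

sample = ("[Wed Nov 27 07:55:49 2024] NETDEV WATCHDOG: enP23746s1 "
          "(mlx5_core): transmit queue 0 timed out")
print(find_nic_stalls(sample))
```

Run against n3's dmesg above, this reports a stall at 07:55:49, a few minutes before the liveness warnings at 08:00:38, which is consistent with the network dropping out first.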

@srosenberg srosenberg added the X-infra-flake the automatically generated issue was closed due to an infrastructure problem not a product issue label Dec 4, 2024