Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: transfer-leases/signal failed #101624

Closed
cockroach-teamcity opened this issue Apr 17, 2023 · 21 comments
Closed

roachtest: transfer-leases/signal failed #101624

cockroach-teamcity opened this issue Apr 17, 2023 · 21 comments
Assignees
Labels
branch-release-22.2 Used to mark GA and release blockers, technical advisories, and bugs for 22.2 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot.
Milestone

Comments

@cockroach-teamcity
Copy link
Member

cockroach-teamcity commented Apr 17, 2023

roachtest.transfer-leases/signal failed with artifacts on release-22.2 @ c2975c776afe8ed8b2571f0820cf9f3cb5ef22d0:

test artifacts and logs in: /artifacts/transfer-leases/signal/run_1
(quit.go:72).Fatalf: (1) ranges with no lease outside of node 2: []string{"20"}

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/kv-triage

This test on roachdash | Improve this report!

Jira issue: CRDB-27069

Epic CRDB-27234

@cockroach-teamcity cockroach-teamcity added branch-release-22.2 Used to mark GA and release blockers, technical advisories, and bugs for 22.2 C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Apr 17, 2023
@cockroach-teamcity cockroach-teamcity added this to the 22.2 milestone Apr 17, 2023
@blathers-crl blathers-crl bot added the T-kv KV Team label Apr 17, 2023
@erikgrinaker
Copy link
Contributor

erikgrinaker commented Apr 18, 2023

The lease was transferred away, but it wasn't applied on the new nodes before the check completed:

07:26:47.225509 205 kv/kvserver/replica_proposal.go:268 ⋮ [n3,s3,r20/3:‹/Table/{19-20}›,raft] 1778  new range lease repl=(n3,s3):3 seq=8 start=1681716401.612232842,0 exp=1681716410.612134910,0 pro=1681716401.612134910,0 following repl=(n2,s2):2 seq=7 start=1681716388.435400619,0 epo=2 pro=1681716388.477693251,0

07:26:39 cluster.go:677: test status: stopping nodes :2
07:26:43 quit.go:247: retrieving ranges for node 1
07:26:43 cluster.go:2068: > curl -s http://10.142.0.87:26258/_status/ranges/1
07:26:45 quit.go:247: retrieving ranges for node 3
07:26:45 cluster.go:2068: > curl -s http://10.142.0.87:26258/_status/ranges/3
07:26:47 test_impl.go:339: test failure #1: full stack retained in failure_1.log: (quit.go:72).Fatalf: (1) ranges with no lease outside of node 2: []string{"20"}

Here, start=1681716401.612232842,0 is 07:26:41, so n2 did successfully transfer the lease, but because n1 and n3 hadn't applied the new lease yet they (and the test) thought n2 still had the lease.

This test seems inherently racy, in that we have to wait for one of the other nodes to apply the lease, but not wait long enough that the old leases expire and the remaining nodes reacquire them on their own. One way to do this would be to change the test to set a very long lease expiration time (e.g. 10 minutes), and then poll the other nodes for e.g. 1 minute to see if they applied the lease.

@erikgrinaker erikgrinaker added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-testing Testing tools and infrastructure and removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Apr 18, 2023
@cockroach-teamcity
Copy link
Member Author

roachtest.transfer-leases/signal failed with artifacts on release-22.2 @ 9145535e8f6e57d09e3688d6b95fbdceecc47194:

test artifacts and logs in: /artifacts/transfer-leases/signal/run_1
(quit.go:72).Fatalf: (1) ranges with no lease outside of node 1: []string{"36"}

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

This test on roachdash | Improve this report!

@cockroach-teamcity
Copy link
Member Author

roachtest.transfer-leases/signal failed with artifacts on release-22.2 @ 737f3bb5dbadc17ed5e924afaef3ec06e45bc017:

test artifacts and logs in: /artifacts/transfer-leases/signal/run_1
(cluster.go:1946).Stop: cluster.StopE: parallel execution failure: context deadline exceeded

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

This test on roachdash | Improve this report!

@cockroach-teamcity
Copy link
Member Author

roachtest.transfer-leases/signal failed with artifacts on release-22.2 @ afe2d46600257f6c022ce89168bebdc0023c8215:

test artifacts and logs in: /artifacts/transfer-leases/signal/run_1
(cluster.go:1946).Stop: cluster.StopE: parallel execution failure: context deadline exceeded

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

This test on roachdash | Improve this report!

@cockroach-teamcity
Copy link
Member Author

roachtest.transfer-leases/signal failed with artifacts on release-22.2 @ cfc200070bbf9fc4b6bc6c6556b5fd56f48db381:

test artifacts and logs in: /artifacts/transfer-leases/signal/run_1
(cluster.go:1946).Stop: cluster.StopE: parallel execution failure: context deadline exceeded

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

This test on roachdash | Improve this report!

@cockroach-teamcity
Copy link
Member Author

roachtest.transfer-leases/signal failed with artifacts on release-22.2 @ eb2447deef5f6c1807bd51c84943cb47341e9046:

test artifacts and logs in: /artifacts/transfer-leases/signal/run_1
(cluster.go:1946).Stop: cluster.StopE: parallel execution failure: context deadline exceeded

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

This test on roachdash | Improve this report!

@cockroach-teamcity
Copy link
Member Author

roachtest.transfer-leases/signal failed with artifacts on release-22.2 @ 8a2050b00bac7e247df26c63aa0c8c1ca4fd7996:

test artifacts and logs in: /artifacts/transfer-leases/signal/run_1
(cluster.go:1947).Stop: cluster.StopE: parallel execution failure: context deadline exceeded

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

This test on roachdash | Improve this report!

@cockroach-teamcity
Copy link
Member Author

roachtest.transfer-leases/signal failed with artifacts on release-22.2 @ 8a2050b00bac7e247df26c63aa0c8c1ca4fd7996:

test artifacts and logs in: /artifacts/transfer-leases/signal/run_1
(cluster.go:1947).Stop: cluster.StopE: parallel execution failure: context deadline exceeded

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

This test on roachdash | Improve this report!

@cockroach-teamcity
Copy link
Member Author

roachtest.transfer-leases/signal failed with artifacts on release-22.2 @ 8a2050b00bac7e247df26c63aa0c8c1ca4fd7996:

test artifacts and logs in: /artifacts/transfer-leases/signal/run_1
(cluster.go:1947).Stop: cluster.StopE: parallel execution failure: context deadline exceeded

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

This test on roachdash | Improve this report!

@cockroach-teamcity
Copy link
Member Author

roachtest.transfer-leases/signal failed with artifacts on release-22.2 @ 33b991317b3f3590581424d85aca1c939d21a797:

test artifacts and logs in: /artifacts/transfer-leases/signal/run_1
(cluster.go:1947).Stop: cluster.StopE: parallel execution failure: context deadline exceeded

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

This test on roachdash | Improve this report!

@cockroach-teamcity
Copy link
Member Author

roachtest.transfer-leases/signal failed with artifacts on release-22.2 @ 8055184fc991424d1d0eefd7ccf703948b8de3ee:

test artifacts and logs in: /artifacts/transfer-leases/signal/run_1
(cluster.go:2139).Stop: cluster.StopE: parallel execution failure: context deadline exceeded

Parameters: ROACHTEST_arch=amd64 , ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

This test on roachdash | Improve this report!

@cockroach-teamcity
Copy link
Member Author

roachtest.transfer-leases/signal failed with artifacts on release-22.2 @ 2124c6b9adc61030bedc4a6a67c025566f989201:

test artifacts and logs in: /artifacts/transfer-leases/signal/run_1
(cluster.go:2139).Stop: cluster.StopE: parallel execution failure: context deadline exceeded

Parameters: ROACHTEST_arch=amd64 , ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

This test on roachdash | Improve this report!

@cockroach-teamcity
Copy link
Member Author

roachtest.transfer-leases/signal failed with artifacts on release-22.2 @ e36a88e4bd22dae1108e1276d6a4cdc680bc6a1b:

test artifacts and logs in: /artifacts/transfer-leases/signal/run_1
(cluster.go:2139).Stop: cluster.StopE: parallel execution failure: context deadline exceeded

Parameters: ROACHTEST_arch=amd64 , ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

This test on roachdash | Improve this report!

@cockroach-teamcity
Copy link
Member Author

roachtest.transfer-leases/signal failed with artifacts on release-22.2 @ 20d9e47509e8136096d133031bbd5f110962ba6a:

test artifacts and logs in: /artifacts/transfer-leases/signal/run_1
(cluster.go:2139).Stop: cluster.StopE: parallel execution failure: context deadline exceeded

Parameters: ROACHTEST_arch=amd64 , ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

This test on roachdash | Improve this report!

@cockroach-teamcity
Copy link
Member Author

roachtest.transfer-leases/signal failed with artifacts on release-22.2 @ 9a48060e50978c78e264fe96bab563ec6e20e3ae:

test artifacts and logs in: /artifacts/transfer-leases/signal/run_1
(cluster.go:2139).Stop: cluster.StopE: parallel execution failure: context deadline exceeded

Parameters: ROACHTEST_arch=amd64 , ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

This test on roachdash | Improve this report!

@erikgrinaker
Copy link
Contributor

erikgrinaker commented Jul 6, 2023

We see that we're not transferring leases because both followers are in StateProbe. We also see that n2 was initially in StateReplicate, but transitioned to StateProbe between 13:35:28 and 13:35:32:

13:35:28.792732 23045 13@kv/kvserver/allocator/allocatorimpl/allocator.go:2323 ⋮ [n3,replicate,s3,r32/3:‹/{NamespaceTab…-Table/32}›] 781
  not considering [n1, s1] as a potential candidate for a lease transfer because the replica may be waiting for a snapshot (‹replica in StateProbe›):
  ‹{"id":"3","term":13,"vote":"3","commit":57,"lead":"3","raftState":"StateLeader","applied":57,
  "progress":{"1":{"match":57,"next":58,"state":"StateProbe"},"2":{"match":57,"next":58,"state":"StateReplicate"},"3":{"match":57,"next":58,"state":"StateReplicate"}},"leadtransferee":"0"}›

I230706 13:35:32.408283 31159 13@kv/kvserver/allocator/allocatorimpl/allocator.go:2323 ⋮ [drain,n3,s3,r32/3:‹/{NamespaceTab…-Table/32}›] 1877
  not considering [n2, s2] as a potential candidate for a lease transfer because the replica may be waiting for a snapshot (‹replica in StateProbe›):
  ‹{"id":"3","term":13,"vote":"3","commit":57,"lead":"3","raftState":"StateLeader","applied":57,
  "progress":{"1":{"match":57,"next":58,"state":"StateProbe"},"2":{"match":57,"next":58,"state":"StateProbe"},"3":{"match":57,"next":58,"state":"StateReplicate"}},"leadtransferee":"0"}›

Looking at the Raft logs for r32 on n3, we see n3 sending a Raft heartbeat at 13:35:30. The send fails to n2 because the node was in fact down, transitioning it to StateProbe. However, we also see n1 responding to the heartbeat, but we aren't sending it a follow-up MsgApp probe that would have transitioned it back to a StateReplicate. Why not?

I230706 13:35:30.725763 356 kv/kvserver/raft.go:185 ⋮ [n3,s3,r32/3:‹/{NamespaceTab…-Table/32}›,raft] 87629
  raft ready (must-sync=false)
I230706 13:35:30.725763 356 kv/kvserver/raft.go:185 ⋮ [n3,s3,r32/3:‹/{NamespaceTab…-Table/32}›,raft] 87629
  +‹  Outgoing Message[0]: 3->1 MsgHeartbeat Term:13 Log:0/0 Commit:57›
I230706 13:35:30.725763 356 kv/kvserver/raft.go:185 ⋮ [n3,s3,r32/3:‹/{NamespaceTab…-Table/32}›,raft] 87629
  +‹  Outgoing Message[1]: 3->2 MsgHeartbeat Term:13 Log:0/0 Commit:57›
I230706 13:35:30.726521 356 kv/kvserver/replica_raft.go:1461 ⋮ [n3,s3,r32/3:‹/{NamespaceTab…-Table/32}›,raft] 87757
  coalescing beat: {RangeID:32 FromReplicaID:3 ToReplicaID:1 Term:13 Commit:57 Quiesce:false LaggingFollowersOnQuiesce:[] LaggingFollowersOnQuiesceAccurate:false}
I230706 13:35:30.726545 356 kv/kvserver/replica_raft.go:1461 ⋮ [n3,s3,r32/3:‹/{NamespaceTab…-Table/32}›,raft] 87758
  coalescing beat: {RangeID:32 FromReplicaID:3 ToReplicaID:2 Term:13 Commit:57 Quiesce:false LaggingFollowersOnQuiesce:[] LaggingFollowersOnQuiesceAccurate:false}

I230706 13:35:30.910504 392 kv/kvserver/store_raft.go:354 ⋮ [n3,s3,r32/3:‹/{NamespaceTab…-Table/32}›,raft] 88567
  incoming raft message:
I230706 13:35:30.910504 392 kv/kvserver/store_raft.go:354 ⋮ [n3,s3,r32/3:‹/{NamespaceTab…-Table/32}›,raft] 88567
  +‹1->3 MsgHeartbeatResp Term:13 Log:0/0›
I230706 13:35:30.926289 405 go.etcd.io/etcd/raft/v3/raft.go:1361 ⋮ [n3,s3,r32/3:‹/{NamespaceTab…-Table/32}›] 88943
  3 failed to send message to 2 because it is unreachable [‹StateProbe match=57 next=58›]

Well, turns out etcd/raft doesn't even try if it knows the follower is caught up:

https://github.com/cockroachdb/etcd/blob/88bbb1d5f3a3cf060ffe813cb045e5c84be64755/raft/raft.go#L1315-L1317

As usual, @tbg is way ahead of me, and fixed this upstream in etcd-io/raft#52. That change wasn't backported to 22.2.

Going to verify with a local cherry-pick.

@blathers-crl
Copy link

blathers-crl bot commented Jul 6, 2023

cc @cockroachdb/replication

@erikgrinaker erikgrinaker removed the A-testing Testing tools and infrastructure label Jul 6, 2023
@erikgrinaker
Copy link
Contributor

erikgrinaker commented Jul 6, 2023

Going to verify with a local cherry-pick.

Can confirm that 30/30 runs passed with etcd-io/raft#52. Before, it would typically fail on the first or second run.

Will prepare a backport.

@cockroach-teamcity
Copy link
Member Author

roachtest.transfer-leases/signal failed with artifacts on release-22.2 @ f0bda42400b5bcd7a2ccad3611144793a00f18a8:

test artifacts and logs in: /artifacts/transfer-leases/signal/run_1
(cluster.go:2139).Stop: cluster.StopE: parallel execution failure: context deadline exceeded

Parameters: ROACHTEST_arch=amd64 , ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

This test on roachdash | Improve this report!

@cockroach-teamcity
Copy link
Member Author

roachtest.transfer-leases/signal failed with artifacts on release-22.2 @ e6e90f623bef2f35a04f1404bb8b4483ecf5729f:

test artifacts and logs in: /artifacts/transfer-leases/signal/run_1
(cluster.go:2139).Stop: cluster.StopE: parallel execution failure: context deadline exceeded

Parameters: ROACHTEST_arch=amd64 , ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

This test on roachdash | Improve this report!

@erikgrinaker
Copy link
Contributor

Resolved by #106805.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch-release-22.2 Used to mark GA and release blockers, technical advisories, and bugs for 22.2 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot.
Projects
None yet
Development

No branches or pull requests

3 participants