roachtest: replicagc-changed-peers/restart=false failed #66944
This also failed on master, and the failure there is linked to #63999, which this build already had the backport of (#63999). Looking at the logs here, we see the same problem - n4 to n6 are throttled throughout the test. But this makes sense because everyone seems to lose liveness and never get it back:
Something pretty bad is going on here. We need to understand what.
I looked at the stack traces hoping this was the deadlock we fixed the other day, but the stacks are... very boring. Nothing going on really.
The liveness range is healthy (two out of three replicas remaining, the down one being n3, which is by design in this test).
Looking at a repro now. We lost liveness for 5 minutes; the liveness heartbeats fail (on all nodes) with
This eventually resolves. The error comes from here: cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go, lines 2114 to 2123 in 11fd6b1
This basically has to be some kind of cache invalidation problem? When we see a SendError, we eject the range descriptor, and on a retry we shouldn't be seeing n3 in it. I checked the range desc and it's on n4,n5,n6 (with both lease and raft leadership on n5). Why are we continuing to try to touch n3?
Going to take this as an opportunity to test out #65844. Started another repro with that cherry-picked and called in the test at hand. That should tell us what goes on inside of DistSender for these liveness attempts.
@lunevalex I took the release-blocker label off of this since we don't think that this was introduced recently. Please comment if you disagree.
roachtest.replicagc-changed-peers/restart=false failed with artifacts on release-21.1 @ c4d0e7baee3925541eed599ae771abb95c97732b:
Reproduce
To reproduce, try:
# From https://go.crdb.dev/p/roachstress, perhaps edited lightly.
caffeinate ./roachstress.sh replicagc-changed-peers/restart=false
Same failure on other branches
Seems to be fixed. I think it must've been the broken roachtest env var passing that we had for a while.
roachtest.replicagc-changed-peers/restart=false failed with artifacts on release-21.1 @ ede1628bd625a07b3f966a36ea841009803fc8a9:
Reproduce
To reproduce, try:
# From https://go.crdb.dev/p/roachstress, perhaps edited lightly.
caffeinate ./roachstress.sh replicagc-changed-peers/restart=false