storage: panic: tocommit(28) is out of range [lastIndex(0)] in StartTestCluster (#20764)
Comments
Hmm. Coalesced heartbeats sound like a reasonable suspicion. Another shot in the dark is that it has something to do with the way that the test servers are started on port 0 and get assigned their real port after starting. Could that be confusing node IDs somehow? The one unusual thing about this test is the very short ScanInterval. This must be making the replication queue race with something. I would say it's racing with another ChangeReplicas, but I think that would make more noise in the logs.
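For reference, a minimal sketch of how a test would dial the scan interval way down when starting a cluster. The testcluster/base package names and fields here are recalled from the CockroachDB test harness of that era and should be treated as assumptions rather than the failing test's actual code:

```go
package storage_test

import (
	"context"
	"testing"
	"time"

	"github.com/cockroachdb/cockroach/pkg/base"
	"github.com/cockroachdb/cockroach/pkg/testutils/testcluster"
)

func TestShortScanInterval(t *testing.T) {
	// Hypothetical setup: a three-node cluster whose replica scanner runs
	// almost continuously, so the replicate queue kicks in immediately and
	// races with whatever else is touching the range.
	tc := testcluster.StartTestCluster(t, 3, base.TestClusterArgs{
		ServerArgs: base.TestServerArgs{
			// The very short ScanInterval mentioned above.
			ScanInterval: time.Millisecond,
		},
	})
	defer tc.Stopper().Stop(context.Background())

	// ... exercise the cluster here ...
}
```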
How does the automatic stress of new tests work? I can't find where it happens. Is there something different about that environment compared to our usual one?
To dispel any doubt, this now also happened in one of our other tests: #21045. That test sets up a mixed cluster, but I'm unconvinced that that is relevant.
I got a repro of this on a GCE worker while trying to bisect TestAdminAPITableStats (it seems to suffer from a number of different flakes, although after looking through the history it appears it was recently re-enabled). Full log: https://gist.github.com/mrtracy/d677275be57c2c5b2764dbc6b9e17791
Another gist with a different, interesting failure: a segfault in raft. https://gist.github.com/mrtracy/c1d1e3ebcf70c84d0656080f62fd4d74
We've seen the nil pointer panic before, most recently in #20629. It may or may not be related to the […] To decipher the stack trace a bit, when you see […]
I can't confirm this, but #21771 fixed some instances of […]
I was never able to reproduce this, but optimistically closing as fixed by #22518.
This happened in #20625. Our CI harness tries to figure out new/changed tests and stress-tests them for five minutes. This resulted in the following crash, which I haven't been able to reproduce in >2h of stress runs, though the code that caused it was essentially […]

What I see in the logs is a bit confusing. The range in question is r3, and it has the following activity: […] What strikes me as odd (though maybe I'm confusing what's which) is that replicaID=2 should belong to node 3. Yet, in its failing message, n2 claims to have a replica of r3 with replicaID=2 but no data ([n2,s2,r3/2:{-}])? This makes me suspect something about coalesced heartbeats (since the stack below has handleHeartbeat), but it could be a red herring. Any ideas, @bdarnell?
Full log: https://gist.github.com/tschottdorf/3960b61bc7bc6f67c713d61107db13b8
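For context on the panic in the title: in etcd/raft, a node panics when it is told to advance its committed index past the end of its own log, which is exactly what a data-less replica would hit on receiving the leader's commit index, e.g. via a (coalesced) heartbeat. Below is a self-contained sketch of that invariant, modeled after raft's raftLog.commitTo rather than copied from the actual etcd/raft code; it lines up with the observation above that n2 reports a replica of r3 with no data.

```go
package main

import "fmt"

// raftLogState models only the two fields relevant to this panic: the
// highest index known to be committed, and the last index present in
// the local log.
type raftLogState struct {
	committed uint64
	lastIndex uint64
}

// commitTo mirrors the invariant etcd/raft enforces in raftLog.commitTo:
// a node must never be asked to commit an entry it does not have.
func (l *raftLogState) commitTo(tocommit uint64) {
	if l.committed < tocommit {
		if l.lastIndex < tocommit {
			// This is the shape of the panic in the issue title: the
			// leader's commit index (28) is ahead of this replica's
			// empty log (lastIndex 0).
			panic(fmt.Sprintf("tocommit(%d) is out of range [lastIndex(%d)]", tocommit, l.lastIndex))
		}
		l.committed = tocommit
	}
}

func main() {
	// A brand-new replica with no data, as in [n2,s2,r3/2:{-}].
	l := &raftLogState{}
	// A heartbeat carrying the leader's commit index of 28 trips the check.
	l.commitTo(28)
}
```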