-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
roachtest: election-after-restart failed [skipped] #35047
Comments
Attached the artifacts. A cursory glance shows that we're taking a long time to acquire a lease. This doesn't quite come out at the top of my priority queue right now, so I'm going to defer. |
SHA: https://github.com/cockroachdb/cockroach/commits/d0a93d286ee42ceb94f986b6a06a1afd52cf914e Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1149020&tab=buildLog
|
SHA: https://github.com/cockroachdb/cockroach/commits/a119a3a158725c9e3f9b8084d9398601c0e67007 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1170795&tab=buildLog
|
SHA: https://github.com/cockroachdb/cockroach/commits/57e825a7940495b67e0cc7213a5fabc24e12be0e Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1176948&tab=buildLog
|
SHA: https://github.com/cockroachdb/cockroach/commits/a512e390f7f2f2629f3f811bab5866c46e3e5713 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1183678&tab=buildLog
|
SHA: https://github.com/cockroachdb/cockroach/commits/9399d559ae196e5cf2ad122195048ff9115ab56a Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1194326&tab=buildLog
|
Taken from the last run, presumably the others are similar: https://gist.github.com/tbg/895c05d4b6672e382bc8a09cf2c15999 Looks like most of the gaps (not all) have the following pattern
We are presumably dropping the pending lease request (why exactly? No leaseholder known or an election fight?) and because it's a special request it has no replay protection and we "have" to wait for the caller. @bdarnell I wonder what a good way to fix this without additional hacks could look like. The real solution is to handle
|
ErrProposalDropped is returned when there is no known raft leader. When that happens, it should be followed by a reasonNewLeader reproposal as soon as the election completes, which avoids waiting out reasonTicks. One scenario in which you do have to wait for ticks is if n1 proposes a new lease while (n1 believes) n2 is leader (it must have been true at some point for n1 to believe this). The proposal is forwarded to n2. If n2 has lost leadership at this point, it will drop the proposal (and now we're too far removed from the original proposer to do anything with the error). I think that's probably what's happening here although I don't know why this appears to have become more common recently. I don't immediately see a better/quicker solution than #21849. (which involves introducing a new return path so we can tell the original proposer that its forwarded proposal has been dropped. |
SHA: https://github.com/cockroachdb/cockroach/commits/1cbf3680129e47bd310640bf32b665662f30faa9 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1217781&tab=buildLog
|
SHA: https://github.com/cockroachdb/cockroach/commits/76e8e78d2eca7c81ce7f76aa0504a0cf4b91e27f Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1231790&tab=buildLog
|
SHA: https://github.com/cockroachdb/cockroach/commits/134478e4dde16919eb86efb81fb22d8cce8a9701 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1234680&tab=buildLog
|
SHA: https://github.com/cockroachdb/cockroach/commits/509c5b130fb1ad0042beb74e083817aa68e4fc92 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1237068&tab=buildLog
|
SHA: https://github.com/cockroachdb/cockroach/commits/d3f704f839ccaef7f10c3af48c78a26d390ae1dc Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1241436&tab=buildLog
|
SHA: https://github.com/cockroachdb/cockroach/commits/9938cb1a2cca4c0350244f76845f0c61391d44a7 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1249130&tab=buildLog
|
SHA: https://github.com/cockroachdb/cockroach/commits/c0d8e9d838fca9f79bc10d9fb43eeeaa502fdd91 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1255139&tab=buildLog
|
Looking back at the recent reports of this (the ones that still have their artifacts), there are a lot of ADD_REPLICAs, snapshots, and snapshot failures because the follower nodes haven't yet caught up on the splits from before the restart. It looks like for whatever reason the initial ranges aren't getting fully replicated before the ALTER TABLE SPLIT starts, so when we reach the restart we have a bunch of incompletely-replicated ranges, and followers who haven't even caught up on splits, so they'll reject incoming snapshots. I'm not 100% sure how this leads to the observed behavior, but it could fairly easily explain a lot of slow starts. Since this isn't core to what the test is trying to prove, I think adding a sleep before the restart to let all the replication and split processing catch up would be an easy way to deflake the test. I'm not sure whether there would still be stuff to do (like the ErrProposalDropped issue discussed above) after that. |
Logs suggest that we're seeing failures when nodes aren't caught up on all the split processing before the restart, so give this a chance to finish. Updates cockroachdb#35047 Release note: None
36984: storage: Add comments about leases and liveness r=tbg a=bdarnell I had to study this code for #35986, so I took the opportunity to update the documentation while it's fresh in my mind. Release note: None 37100: roachtest: Hopefully deflake election-after-restart r=tbg a=bdarnell Logs suggest that we're seeing failures when nodes aren't caught up on all the split processing before the restart, so give this a chance to finish. Updates #35047 Release note: None Co-authored-by: Ben Darnell <[email protected]>
SHA: https://github.com/cockroachdb/cockroach/commits/99306ec3e9fcbba01c05431cbf496e8b5b8954b4 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1260033&tab=buildLog
|
^- the roachtest version here seems to have had #37100: Back to square one. |
Actually this is the SHA that matters: bc40e38 has that PR too, though. |
Symptoms look similar to before, with each slow range having a failed lease request. This time it's a mix of reasonNewLeaderOrConfigChange and reasonTicks. I think before we were seeing mostly reasonTicks, but I'm not sure.
The same split-related messages i saw before #37100 are there.
I still see a lot of ADD_REPLICA messages interleaved with the "initiating a split" messages and continuing afterwards. It looks like the initial ranges aren't getting fully upreplicated before the splits, and then maybe the serialization of snapshots is enough to keep it from catching up before the restart? Adding another sleep before the split might fix things, although it's just adding another hack. |
The sleep added in the previous PR didn't eliminate the flakiness, because it appears that the initial ranges weren't fully replicated. Add another sleep before the splits in another attempt to get everything into a stable configuration before restart. Updates cockroachdb#35047 Release note: None
37212: roachtest: Another attempt at deflaking election-after-restart r=tbg a=bdarnell The sleep added in the previous PR didn't eliminate the flakiness, because it appears that the initial ranges weren't fully replicated. Add another sleep before the splits in another attempt to get everything into a stable configuration before restart. Updates #35047 Release note: None Co-authored-by: Ben Darnell <[email protected]>
SHA: https://github.com/cockroachdb/cockroach/commits/24feca7a4106f08c73534e16ebb79d949a479f35 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1268176&tab=buildLog
|
Sigh. That failure includes my most recent change and has exactly the same behavior. Looks like we're going to have to actually understand this instead of just hacking around it with sleeps. |
SHA: https://github.com/cockroachdb/cockroach/commits/23155799e92e54915ae66259d06a630e981afbeb Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1277061&tab=buildLog
|
SHA: https://github.com/cockroachdb/cockroach/commits/45e15e05abff25e099ca59f4c5cb40a6cf695e6d Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1285294&tab=buildLog
|
SHA: https://github.com/cockroachdb/cockroach/commits/c25518b4e9a723d8de0dba30a95ce0ade7963aed Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1286188&tab=buildLog
|
SHA: https://github.com/cockroachdb/cockroach/commits/dcd4cc5e37ebbcebbbf0f01670811b176a58bf89 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1288281&tab=buildLog
|
SHA: https://github.com/cockroachdb/cockroach/commits/ba5c092a726134b73e789c2047f7ec151be7c1a1 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1288263&tab=buildLog
|
See cockroachdb#35047 Release note: None
37507: roachtest: Skip flaky election-after-restart r=tbg a=bdarnell See #35047 Release note: None Co-authored-by: Ben Darnell <[email protected]>
The sleep added in the previous PR didn't eliminate the flakiness, because it appears that the initial ranges weren't fully replicated. Add another sleep before the splits in another attempt to get everything into a stable configuration before restart. Updates cockroachdb#35047 Release note: None
I investigated this: I can't repro the issue after many runs. I also see that the failure was for I think that roachtests should not run under |
SHA: https://github.com/cockroachdb/cockroach/commits/bd80a74f882a583d6bb2a04dfdb57b49254bc7ba
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1143393&tab=buildLog
The text was updated successfully, but these errors were encountered: