When the leader’s raft.LogStore hangs indefinitely, the network will hang indefinitely instead of re-electing #503
Hey @stapelberg thanks for filing this and doing some great detective work so far! I've not looked at the code to verify all of this, but I think this is an issue as you described: there are no timeouts around disk writes, and those happen synchronously in the leader loop, so if the disk IO blocks forever, the leader will currently be stuck forever. That's not a disk failure mode I've come across before!
The reason the leader doesn't stand down, I think, is that heart-beating is done by the replication goroutine for each follower, which is separate from the leader loop. So when the leader loop is stuck on IO, the replication routines are still running. They won't be replicating anything, since no new logs are being written to the leader's LogStore, but because they are still running they'll keep sending heartbeats, and so followers have no reason to think the leader is unhealthy and hold an election.
The question is what the correct fix is. It's a little nuanced, because it's impossible to tell the difference between a sudden spike in work that backs up write buffers on the disk and causes a few long writes, and this mode where the disk is jammed and will never recover. If we picked some timeout after which we consider the disk write to have failed, that would become an error here (Lines 1190 to 1199 in 44124c2), which would cause the leader to step down.
That would fix your specific failure case here, but I wonder if we need to think carefully about how we'd set that timeout. I could see issues where a spike in writes on an overloaded disk causes constant leader flapping and basically takes the cluster down, rather than just being slow until the spike is over 🤔. Maybe setting it long enough (like a few minutes) would be OK, but that would be a long time to recover in your case. Maybe we'd need something more sophisticated. If anyone has thoughts or insights please share!
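To make the shape of that fix concrete, here is a rough sketch — not the actual hashicorp/raft code; the wrapper type, error, and timeout field are invented for illustration — of a LogStore wrapper that turns a write exceeding a deadline into an error, which the leader loop would then see as a failed dispatch:

```go
package main

import (
	"errors"
	"time"

	"github.com/hashicorp/raft"
)

// ErrWriteTimeout is returned when the underlying store does not
// acknowledge a write within the configured deadline.
var ErrWriteTimeout = errors.New("log store write timed out")

// timeoutLogStore wraps a raft.LogStore and bounds how long callers
// wait for StoreLogs. Note that the goroutine doing the real write may
// still be stuck forever on a wedged disk; the wrapper only unblocks
// the caller so the error can propagate and the leader can step down.
type timeoutLogStore struct {
	raft.LogStore
	writeTimeout time.Duration
}

func (s *timeoutLogStore) StoreLogs(logs []*raft.Log) error {
	done := make(chan error, 1)
	go func() { done <- s.LogStore.StoreLogs(logs) }()
	select {
	case err := <-done:
		return err
	case <-time.After(s.writeTimeout):
		return ErrWriteTimeout
	}
}
```

Choosing writeTimeout is exactly the hard part discussed above: too short and an overloaded-but-healthy disk causes leader flapping; too long and a genuinely wedged disk keeps the cluster stuck for that entire window.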
Thanks for taking a look!
Yep, I was surprised, too, but it was happening repeatedly on that machine — probably 5 or 6 times before I had enough and migrated my stuff to a new machine. Only a hardware reset would help, and any I/O would hang indefinitely.
Yes, that is consistent with what I’m thinking.
I think my preferred way to fix this would be to couple leader healthiness to disk throughput: if disk throughput drops to 0 while writes are failing to complete, the leader should consider itself unhealthy. I don't know if this is something you want to have in hashicorp/raft itself. You could also consider making healthiness externally influenceable if you want to leave the specific logic to users. Of course, providing a robust default behavior would still be key.
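As a rough illustration of the "externally influenceable healthiness" idea — this is a sketch, not an existing hashicorp/raft feature; the probe path, intervals, and function name are invented — an application could probe its own data disk and ask the current leader to hand off leadership when the probe stalls:

```go
// diskHealthWatcher probes the data volume and, if a small write+sync
// stalls while this node is leader, asks raft to transfer leadership.
// Probe path and timings are placeholders.
func diskHealthWatcher(r *raft.Raft, probePath string) {
	for range time.Tick(10 * time.Second) {
		if r.State() != raft.Leader {
			continue
		}
		ok := make(chan struct{})
		go func() {
			// A tiny write+sync as a liveness probe of the volume.
			if f, err := os.CreateTemp(probePath, "probe"); err == nil {
				f.Write([]byte("ok"))
				f.Sync()
				f.Close()
				os.Remove(f.Name())
			}
			close(ok)
		}()
		select {
		case <-ok:
			// Disk responded in time; nothing to do.
		case <-time.After(5 * time.Second):
			// Disk looks wedged: try to move leadership elsewhere.
			r.LeadershipTransfer()
		}
	}
}
```

One caveat: if the leader loop itself is already blocked on disk IO, a leadership transfer may not go through either, so an external health check like this is more of a mitigation than a complete fix.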
We ran into this same behavior recently. The Azure disk backing the Consul leader began to hang (IO utilization shot to 100% but actual throughput and operations dropped to almost nothing), causing updates to the KV store to time out (…).
I recently encountered an issue in my production deployment of https://robustirc.net/, where the network was not making any Raft progress anymore.
It turns out that one of my servers has an issue with its (local NVMe) storage, which manifests as I/O hanging indefinitely. There are no read or write errors; any disk access just hangs.
When that server is the Raft leader at the time the issue occurs, the entire Raft network will just hang indefinitely. By this I mean the leader will still participate in the Raft protocol (last-contact times do update on the Raft followers), but applying new messages to Raft will not work (it will time out) and, crucially, the Raft network never even starts electing a new leader.
This was a surprising failure mode to me, and I wonder if that’s intentional (out of scope for Raft) or an issue with the current implementation?
To reproduce, I cloned the Raft example https://github.com/yongman/leto and modified it like so:
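Roughly, the modification makes the example's log store hang forever once SIGUSR1 is received. The sketch below illustrates the idea — it is not the exact diff from my tree, and the wrapper type is illustrative only:

```go
package main

import (
	"os"
	"os/signal"
	"sync/atomic"
	"syscall"

	"github.com/hashicorp/raft"
)

// hangingLogStore wraps a real raft.LogStore and, once SIGUSR1 is
// received, makes every write block forever to simulate a disk that
// hangs without ever returning errors.
type hangingLogStore struct {
	raft.LogStore
	hung atomic.Bool
}

func newHangingLogStore(inner raft.LogStore) *hangingLogStore {
	s := &hangingLogStore{LogStore: inner}
	ch := make(chan os.Signal, 1)
	signal.Notify(ch, syscall.SIGUSR1)
	go func() {
		<-ch
		s.hung.Store(true)
	}()
	return s
}

func (s *hangingLogStore) maybeHang() {
	if s.hung.Load() {
		select {} // block forever, like the wedged NVMe device
	}
}

func (s *hangingLogStore) StoreLog(l *raft.Log) error {
	s.maybeHang()
	return s.LogStore.StoreLog(l)
}

func (s *hangingLogStore) StoreLogs(logs []*raft.Log) error {
	s.maybeHang()
	return s.LogStore.StoreLogs(logs)
}
```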
Then, after bringing up 3 nodes as described in the README, I send SIGUSR1 to the leader node, and now all commands hang, but no re-election happens.
The same is reproducible in https://robustirc.net/ with the robustirc-localnet command, but might be a bit more elaborate to test than the more self-contained leto example.
Given that it happens in two different projects built on top of hashicorp/raft, I don’t think it’s a bug with my code itself, but maybe both projects are using hashicorp/raft slightly wrong?
Any recommendations for how to handle this failure mode?