distsql: avoid dead nodes #13655
Comments
This might have been resolved by #13822.
Looks like this has been fixed by #13822. I was able to reproduce this at will and am unable to do so now.
This might still be an issue. One schema change I see didn't terminate.
Closing this issue because the schema change did terminate.
One node on a cluster (node 1) panicked due to a bug in distsql.
But even after that panic, distsql on node 2 still tries to talk to node 1.
distsql ought to pick a node where the lease is.
Gossip on node 2 is not seeing node 1.
I think the "too many Close() calls" panic is being fixed by #13570, which avoids a connection timeout that (erroneously) resulted in this panic (and it'll also fix the panic itself). Streams used to connect only when the first results were being produced (which, I guess, for schema changes might be way late and beyond the timeout). Now we'll establish the stream early. Vivek, you implied something about distsql connecting to the wrong node after that panic, but I didn't understand. What's that about?
In the cluster, the distsql gateway running on node 2 keeps trying to reach node 1, which has died with a panic. My current theory is that distsql uses the range descriptor cache to make this decision, and when node 1 dies, the node-down event seen through liveness gossip doesn't trigger a cleanup of the range descriptor cache. So we keep retrying the same node 1.
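To make the theory above concrete, here is a minimal Go sketch of a cache that maps ranges to nodes and drops stale entries when a node-down event arrives. Everything in it (`leaseholderCache`, `EvictNode`, the `NodeID` and `RangeID` types) is a hypothetical stand-in for illustration, not the actual range descriptor cache or gossip API in CockroachDB.

```go
package main

import (
	"fmt"
	"sync"
)

// NodeID and RangeID are hypothetical identifiers used only in this sketch.
type NodeID int
type RangeID int

// leaseholderCache is a toy stand-in for a cache that remembers which node
// to contact for each range. It is NOT the real range descriptor cache.
type leaseholderCache struct {
	mu      sync.Mutex
	byRange map[RangeID]NodeID
}

func newLeaseholderCache() *leaseholderCache {
	return &leaseholderCache{byRange: make(map[RangeID]NodeID)}
}

// Put records that range r should be served by node n.
func (c *leaseholderCache) Put(r RangeID, n NodeID) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.byRange[r] = n
}

// Lookup returns the cached node for range r, if any.
func (c *leaseholderCache) Lookup(r RangeID) (NodeID, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	n, ok := c.byRange[r]
	return n, ok
}

// EvictNode drops every entry that points at node n and returns how many
// entries were removed. In this sketch it would be called from a (assumed)
// node-down liveness callback.
func (c *leaseholderCache) EvictNode(n NodeID) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	evicted := 0
	for r, cached := range c.byRange {
		if cached == n {
			delete(c.byRange, r) // deleting during range is safe in Go
			evicted++
		}
	}
	return evicted
}

func main() {
	cache := newLeaseholderCache()
	cache.Put(1, 1)
	cache.Put(2, 1)
	cache.Put(3, 2)

	// Simulate the node-down event for node 1.
	evicted := cache.EvictNode(1)
	_, stillCached := cache.Lookup(1)
	fmt.Println("evicted:", evicted, "range 1 still cached:", stillCached)
}
```

Without the `EvictNode` call, a lookup for range 1 would keep returning node 1, which matches the retry loop described in the comment above.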
Index backfill is now working fine, but I haven't tested what happens when a node fails. I'm pretty sure it will still fail. Adding Andrei to this bug because I don't intend to work on distsql rerouting in the event of node failure in the immediate future. Please ping the bug if anyone starts working on this.
@andreimatei is this something we care about finishing in 1.0? If so, we should move the milestone over so that we don't let it slip through the cracks.
This issue tracks implementing a mechanism for detecting and avoiding dead/unhealthy nodes during planning.
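One possible shape for such a mechanism, sketched in Go under stated assumptions: the planner consults a health check before assigning work to a node and falls back to another replica, or finally to the gateway, when the preferred node looks dead. The names `pickNode` and `healthChecker` are hypothetical and do not come from the distsql code; how health is actually determined (liveness records, connection state, and so on) is left open here.

```go
package main

import "fmt"

// NodeID is a hypothetical node identifier used only in this sketch.
type NodeID int

// healthChecker reports whether a node is believed to be reachable.
// How that is decided is an assumption left to the caller.
type healthChecker func(NodeID) bool

// pickNode chooses where to plan work for a range:
//  1. the preferred node (e.g. the presumed leaseholder) if it is healthy,
//  2. otherwise the first healthy node among the other replicas,
//  3. otherwise the gateway node itself.
func pickNode(preferred NodeID, replicas []NodeID, gateway NodeID, healthy healthChecker) NodeID {
	if healthy(preferred) {
		return preferred
	}
	for _, r := range replicas {
		if r != preferred && healthy(r) {
			return r
		}
	}
	return gateway
}

func main() {
	// Node 1 has died; nodes 2 and 3 are up.
	dead := map[NodeID]bool{1: true}
	healthy := func(n NodeID) bool { return !dead[n] }

	chosen := pickNode(1, []NodeID{1, 3}, 2, healthy)
	fmt.Println("planning on node", chosen)
}
```

Falling back to the gateway as the last resort keeps the plan valid even when no replica passes the health check, at the cost of losing locality for that range.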
I'm seeing an index backfill not terminating because it keeps hitting the same error. I haven't had time to debug this issue, but it's certainly reproducible and definitely a bug introduced when index backfilling was moved to the distsql framework. I see some gRPC connection failures, which could be the reason for this.