Swarm is unresponsive after startup in 24% of cases #136
Comments
I also have this problem, with Swarm 3.4.0 (strategy: ring) and libcluster 3.2.0 (strategy: kubernetes.dns). The following is the state of Swarm.Tracker on 4 replicas:
Any feedback would be appreciated!
From a quick look at the code, it seems there is no deadlock resolution when 3 or more nodes start syncing concurrently and, in addition, you happen to get unlucky with the random pick of the syncing node. There is some logic that handles the case of 2 nodes syncing with each other: swarm/lib/swarm/tracker/tracker.ex, line 378 in 4aee63d
Basically this works okay:
But this doesn't:
Not an expert on this topic, so forgive me if I'm completely off the mark, but here's my quick idea to fix it: after receiving a :sync message, respond with a new {:waiting_for, node} message (only if actually waiting for some other node). Only one of the nodes in the cycle needs to cancel, so we can just pick the first one (based on node order).
Instead of retrying with a different node, I suppose one could also pick the "best" node of the three (based on clock and then node ordering) and let it resolve the pending requests. I'll attempt to do a PR for that when I have some more spare time. A quick workaround for Kubernetes-based clusters would be a StatefulSet (with a proper readiness probe), since that would guarantee that nodes start one by one. EDIT: Now that I think of it, this would only work with a simple 3-node sync cycle. It's going to be a little more complicated if it were to handle bigger cycles.
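To make the cycle idea above concrete, here is a minimal sketch of how a cycle could be found and a single node chosen to cancel. It is not Swarm code: the module name `SyncCycleSketch`, both functions, and the `waiting_for` map (which a node would have to assemble from the proposed {:waiting_for, node} replies) are assumptions made for illustration. It follows the "waiting for" chain from a starting node, so it is not limited to 3-node cycles.

```elixir
defmodule SyncCycleSketch do
  @moduledoc """
  Hypothetical helper, not part of Swarm. `waiting_for` maps each syncing
  node to the node it is waiting on, e.g. %{a: :b, b: :c, c: :a}.
  """

  # Follows the chain starting at `start`. Returns {:cycle, members} if the
  # chain loops back on itself, or :no_cycle if it ends at a node that is
  # not waiting on anyone.
  def find_cycle(waiting_for, start), do: walk(waiting_for, start, [])

  defp walk(waiting_for, current, path) do
    cond do
      current in path ->
        # `path` is most-recent-first; keep only the nodes from the first
        # visit of `current` onwards, which are the members of the cycle.
        members = path |> Enum.reverse() |> Enum.drop_while(&(&1 != current))
        {:cycle, members}

      not Map.has_key?(waiting_for, current) ->
        :no_cycle

      true ->
        walk(waiting_for, Map.fetch!(waiting_for, current), [current | path])
    end
  end

  # Only one member has to cancel its pending sync. Every node that sees the
  # same cycle picks the same member: the smallest one by term ordering.
  def node_to_cancel(waiting_for, start) do
    case find_cycle(waiting_for, start) do
      {:cycle, members} -> {:cancel, Enum.min(members)}
      :no_cycle -> :no_cycle
    end
  end
end
```

With `%{a: :b, b: :c, c: :a}` every node that walks the chain agrees on the same member to cancel, which is the property the "pick the first one (based on node order)" rule relies on. How the chosen node then retries or resolves the pending requests is left open here, as it is in the comment.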
hi, we just hit this too. due to go live this weekend, eek!
Hello Paul,
I have an Erlang cluster of 3 nodes.
A few seconds after startup my code calls GenServer.call({:via, :swarm, "echo-be"}, ... where the "echo-be" process does not exist yet.
In 24% of new node startups, this leads to GenServer.call hanging forever (actually inside Swarm.whereis_name, called internally), and Swarm never becomes functional on this node.
I'm using Swarm version 3.4.0.
Attached is a full :erlang.dbg trace of the Swarm.Tracker process where the hang happens: repro.log
I would appreciate it if you investigate this issue and come up with a fix or a workaround. We were very close to adopting Swarm before this issue was discovered.
Please let me know if you need any additional information/traces. I can reliably repro this.
Thank you, Dmitry.
P.S. The 24% number was derived from 910 test runs, of which only 691 were successful (no hang); the remaining 219 hung, and 219/910 is about 24%.
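Because the hang described above happens while the name is being resolved, a caller that wants to at least detect the condition could run the lookup in a separate process and give up after a deadline. The sketch below is only an illustration under that assumption: `BoundedCall` and `run/2` are hypothetical names, and it uses the standard `Task.async`, `Task.yield` and `Task.shutdown` functions. It does not unblock Swarm.Tracker; it only keeps the calling process from waiting forever.

```elixir
defmodule BoundedCall do
  # Hypothetical helper, not part of Swarm: runs `fun` in a linked task and
  # gives up after `timeout_ms` milliseconds, killing the task.
  def run(fun, timeout_ms) when is_function(fun, 0) do
    task = Task.async(fun)

    # Task.yield returns nil on timeout; Task.shutdown then kills the task and
    # returns a reply only if one arrived in the meantime.
    case Task.yield(task, timeout_ms) || Task.shutdown(task, :brutal_kill) do
      {:ok, result} -> {:ok, result}
      # Only seen if the caller traps exits; otherwise a crash in the linked
      # task also terminates the caller.
      {:exit, reason} -> {:error, reason}
      nil -> {:error, :timeout}
    end
  end
end

# Possible usage with the lookup from the report (assumes Swarm is running):
#
#     BoundedCall.run(fn -> Swarm.whereis_name("echo-be") end, 5_000)
```

A result of `{:error, :timeout}` would indicate that the node is in the stuck state, at which point restarting the node is one option to consider.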