
Swarm is unresponsive after startup in 24% of cases #136

Open
DmitryKakurin opened this issue Nov 23, 2019 · 3 comments

Comments


DmitryKakurin commented Nov 23, 2019

Hello Paul,
I have an Erlang cluster of 3 nodes.
A few seconds after startup, my code calls GenServer.call({:via, :swarm, "echo-be"}, ... where the "echo-be" process does not exist yet.
In 24% of new node startups, this leads to GenServer.call hanging forever (actually inside Swarm.whereis_name, which is called internally), and Swarm never becomes functional on that node.
I'm using Swarm version 3.4.0.

Attached is a full :erlang.dbg trace of the Swarm.Tracker process where the hang happens:
repro.log

I would appreciate it if you could investigate this issue and come up with a fix or a workaround. We were very close to adopting Swarm before this issue was discovered.

Please let me know if you need any additional information/traces. I can reliably repro this.

Thank you, Dmitry.

P.S. The 24% number was derived from 910 test runs, where only 691 were successful (no hang).


suexcxine commented Mar 7, 2020

I also have this problem, with Swarm 3.4.0 (strategy: ring) and libcluster 3.2.0 (strategy: kubernetes.dns).

The following is the Swarm.Tracker state of 4 replicas.
It looks like
10.1.0.171 is waiting for 10.1.0.173,
10.1.0.173 is waiting for 10.1.0.172, and
10.1.0.172 is waiting for 10.1.0.171.
Maybe a recursive deadlock?

{:syncing,
 %Swarm.Tracker.TrackerState{
   clock: {1, 0},
   nodes: [:"[email protected]", :"[email protected]", :"[email protected]"],
   pending_sync_reqs: [#PID<11439.1343.0>, #PID<11437.1339.0>],
   self: :"[email protected]",
   strategy: #<Ring[:"[email protected]", :"[email protected]", :"[email protected]", :"[email protected]"]>,
   sync_node: :"[email protected]",
   sync_ref: #Reference<11224.2913252267.1040711683.223001>
 }}
{:syncing,
 %Swarm.Tracker.TrackerState{
   clock: {1, 0},
   nodes: [:"[email protected]", :"[email protected]", :"[email protected]"],
   pending_sync_reqs: [#PID<11438.1339.0>],
   self: :"[email protected]",
   strategy: #<Ring[:"[email protected]", :"[email protected]", :"[email protected]", :"[email protected]"]>,
   sync_node: :"[email protected]",
   sync_ref: #Reference<11224.2390910966.1309409281.72932>
 }}
{:syncing,
 %Swarm.Tracker.TrackerState{
   clock: {1, 0},
   nodes: [:"[email protected]", :"[email protected]"],
   pending_sync_reqs: [],
   self: :"[email protected]",
   strategy: #<Ring[:"[email protected]", :"[email protected]", :"[email protected]"]>,
   sync_node: :"[email protected]",
   sync_ref: #Reference<11224.3895545559.1309147137.190750>
 }}
{:syncing,
 %Swarm.Tracker.TrackerState{
   clock: {1, 0},
   nodes: [:"[email protected]", :"[email protected]", :"[email protected]"],
   pending_sync_reqs: [#PID<11438.1341.0>],
   self: :"[email protected]",
   strategy: #<Ring[:"[email protected]", :"[email protected]", :"[email protected]", :"[email protected]"]>,
   sync_node: :"[email protected]",
   sync_ref: #Reference<11224.763090284.1309409282.67102>
 }}
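The sync_node fields in the dumps above form a wait-for graph. As an illustration, following each node's sync target quickly reveals the loop (sketched in Python, since the check itself is language-agnostic; the graph edges are copied from the dumps above):

```python
# Detect a cycle in the "who is waiting for whom" graph built from the
# sync_node field of each tracker state dump above.
waiting_for = {
    "10.1.0.171": "10.1.0.173",
    "10.1.0.173": "10.1.0.172",
    "10.1.0.172": "10.1.0.171",
    "10.1.0.170": "10.1.0.171",  # 4th replica, blocked on a node inside the cycle
}

def find_cycle(graph, start):
    """Follow edges from `start`; return the cycle if a node repeats, else None."""
    seen, node = [], start
    while node in graph:
        if node in seen:
            return seen[seen.index(node):]  # just the cycle itself
        seen.append(node)
        node = graph[node]
    return None

print(find_cycle(waiting_for, "10.1.0.171"))
# -> ['10.1.0.171', '10.1.0.173', '10.1.0.172']
```

Note that 10.1.0.170 is not part of the cycle itself, but it is blocked on 10.1.0.171, so it hangs too.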

Any feedback would be appreciated!
Thank you


eplightning commented May 1, 2020

From a quick look at the code, it seems there's no deadlock resolution when 3 or more nodes start syncing concurrently and, in addition, you happen to get unlucky with the random pick of the sync node.

There's some logic that handles the case of 2 nodes syncing with each other:

# We're trying to sync with another node while it is trying to sync with us, deterministically

Basically this works okay:

n1 picks n2
n2 picks n1
n3 picks n1 or n2

But this doesn't:

n1 picks n2
n2 picks n3
n3 picks n1

I'm not an expert on this topic, so forgive me if I'm completely off the mark, but here's my quick idea for a fix:

After receiving a :sync message, respond with a new {:waiting_for, node} message (only if actually waiting for some other node).
The waiting node receives that message, checks whether a deadlock has occurred (by inspecting its own pending_sync_reqs list), and if so cancels the sync.
After cancelling the sync, it can retry with a different node.

Since only one of them needs to cancel, we can just pick the first one (based on node order).

n1 picks n2
n2 picks n3
n3 picks n1

n3 sends {:waiting_for, n1} to n2
n2 receives {:waiting_for, n1} | n1 is in pending_sync_reqs, but self() > n1, so the message is ignored

n1 sends {:waiting_for, n2} to n3
n3 receives {:waiting_for, n2} | n2 is in pending_sync_reqs, but self() > n2, so the message is ignored

n2 sends {:waiting_for, n3} to n1
n1 receives {:waiting_for, n3} | n3 is in pending_sync_reqs and self() < n3
n1:
  1. cancels its sync to n2 (probably needs to notify n2, so yet another message)
  2. picks a different node than n2 and tries syncing again
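The walkthrough above can be simulated outside of Swarm. Here is a minimal sketch in Python (the who_cancels helper and the rule it encodes follow my hypothetical proposal above, not anything Swarm implements today):

```python
# Minimal model of the proposed tie-break: each node in a sync cycle learns,
# via a hypothetical {:waiting_for, w} reply from its sync target, which node
# the target is waiting on. A node cancels only if that node w is in its own
# pending_sync_reqs AND orders before itself, so exactly one node backs off.
def who_cancels(picks):
    """picks maps each node to the node it chose to sync with."""
    # pending_sync_reqs: who has sent a :sync request to each node
    pending = {n: [m for m, t in picks.items() if t == n] for n in picks}
    cancels = []
    for node, target in picks.items():
        w = picks[target]  # node receives {:waiting_for, w} from its target
        if w in pending[node] and node < w:
            cancels.append(node)
    return cancels

# The 3-node cycle from the comment: n1 -> n2 -> n3 -> n1
print(who_cancels({"n1": "n2", "n2": "n3", "n3": "n1"}))  # -> ['n1']
```

As in the trace above, only n1 cancels and retries; n2 and n3 keep waiting and get resolved once n1 picks a node outside the cycle.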

Instead of retrying with a different node, I suppose one could also pick the "best" node of the three (based on clock and then node ordering) and let it resolve the pending requests.

I'll attempt to do a PR for that when I have some more spare time.

A quick workaround for Kubernetes-based clusters would be a StatefulSet (with a proper readiness probe), since that would guarantee that nodes start one by one.
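A sketch of that workaround, assuming the app exposes an HTTP health endpoint on port 4000 (the names, image, and probe path are all placeholders):

```yaml
# Hypothetical StatefulSet: podManagementPolicy defaults to OrderedReady,
# so pod N+1 is not created until pod N passes its readiness probe.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: my-swarm-app              # placeholder name
spec:
  serviceName: my-swarm-app-headless
  replicas: 4
  selector:
    matchLabels:
      app: my-swarm-app
  template:
    metadata:
      labels:
        app: my-swarm-app
    spec:
      containers:
        - name: app
          image: my-swarm-app:latest   # placeholder image
          readinessProbe:              # gates startup of the next pod
            httpGet:
              path: /health            # assumed health endpoint
              port: 4000
            initialDelaySeconds: 5
            periodSeconds: 5
```

Since pods start strictly one at a time, no two trackers should ever be in the :syncing state concurrently, which sidesteps the cycle entirely.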

EDIT: Now that I think of it, this would only work for a simple 3-node sync cycle. It's going to be a little more complicated if it were to handle bigger cycles.

@seanmcevoy
Hi, we just hit this too. We're due to go live this weekend, eek!
Is there any quick fix for this, a change of strategy, or some other config we can tweak to make it less likely?
