stability: long-lasting gossip connection thrashing #9819
@spencerkimball, any theories here? There are some metrics we collect but don't graph yet; I'll add those to grafana shortly.

@tamird this graph is pretty horrible and doesn't scale at all. Any magical ideas?

https://prometheus.io/blog/2015/06/18/practical-anomaly-detection/ seems promising, but I haven't managed to massage the example there into something useful. I'll take a deeper look on Monday.

Gossip thrashing has been very apparent in @mberhault's scalability tests, in which he doubles the number of nodes in a cluster every N minutes: on the node with the lease on r1, 2,186 of its 19,907 log lines are "refusing gossip" errors. Noticed while looking into #12591.

That's not good. At what cluster scale does the thrashing begin to occur?

The start of thrashing around 20:15 in that graph is when the cluster jumped from 16 to 32 nodes. The additional thrashing that picks up around 20:45 is when it jumped from 32 to 64 nodes.

This appears to be back, judging by our recent 64-node tests: #12741 (comment).

See #15003 (comment), which shows an increase in gossip thrashing on 4/16, although that incident appears to have been precipitated by a change from stop-the-world to rolling restarts.
Continuing the thread from #15003 (comment) on why node 2 kept trying to connect to itself, it appears as though what was going on was this:

- Node 2 held the lease for the system config range. It had gossiped the config out to other nodes in the cluster at some point.
- A rolling restart happens. It's worth noting that after a restart, a node starts off without any gossiped info, since gossip state isn't persisted to disk. Thus, when node 2 starts up again, it gets all of its gossip info (other than its own node and store descriptors) from other nodes.
- This means that node 2 gets the system config from another node, with the original nodeID set to node 2 and some non-zero number of hops.
- After the restart, node 2 maintains its lease on the system config range. However, it doesn't re-gossip the system config because it's unchanged from the value that node 2 received via gossip (https://github.com/cockroachdb/cockroach/blob/master/pkg/storage/replica.go#L4527).
- This can repeat multiple times as long as the value of the system config doesn't change, which it essentially never does on our test clusters. The more it repeats, the higher the number of hops gets.

We never hit this during stop-the-world migrations because the system config's leaseholder would have to re-gossip it when it started up. This could also cause much greater gossip thrashing on the other nodes, since they'll always think that the system config's leaseholder is further away than it actually is, and thus they'll periodically try to open a connection to it.

There are a couple of changes we should make here. First, we should either get a little less conservative about not gossiping the system config during startup, or we should exclude the system config key from `mostDistant` calculations. Second, each node should exclude itself from `mostDistant` calculations as a backup check. I'll send out a PR.

Nice work. Would it make sense to set `gossip.Info.Hops` to zero if the node ID matches the recipient's node ID? I think that's a more general solution, but I might be missing something.
That would fix the problem of a node trying to connect to itself, yeah. We'd have to re-gossip the modified info to ensure other nodes don't get an unrealistic idea of how far away the source node is, but as long as we did that, I think it would work. The only downside is that we'd lose the history of how long a particular piece of info has been floating around, but I'm not sure that has much value beyond this sort of debugging.
> We'd have to re-gossip the modified info in order to ensure other nodes don't get an unrealistic idea of how far away the source node is, but I think that as long as we did that, it would work.

Can you help me understand why that's necessary, given that we don't already do that today? That is, isn't it possible today for an Info to be created at node 1, passed to node 2, and then to node 3, and then for node 3 to connect directly to node 1? I believe node 1 will not then re-gossip that Info, leaving node 3 believing the Info's Hops to be 2 rather than 1.
Ah, right. I'd be curious to understand the reasoning behind the high watermarks design -- I imagine it's purely to minimize the amount of data passed around? It seems dangerous from a thrashing perspective without a mechanism in place for nodes to periodically re-gossip all their infos (and thus reset all Hops). The current behavior is really bad, since if no new infos are gossiped, inflated Hops values never get corrected.
Alright, we already gossip the node descriptors periodically, so we just need to make sure the NodeID infos are re-gossiped periodically (and on restart) as well.
NodeID infos are reliably re-gossiped periodically and on restart, making it much less likely that process restarts can lead to large Hops values, which has shown itself to be a problem for the system config info. Fixes cockroachdb#9819
FWIW, the high water marks design was added in 984c1e5, if that's any help.
The high-water timestamps are meant to prevent excess gossip traffic. They reduce it considerably in various simulations -- in some cases by almost an order of magnitude. I think the proper solution to this is to replace the concept of
From #9749. We should figure out what was happening there, and make sure we find out if it starts happening again. See also #9817.
Btw, gossip forwarding is somewhat unhappy. A 100-node cluster should reach a steady state in which connections aren't thrashed as much, but they are. I don't think it interferes with operations much, but it's awkward.