
stability: long-lasting gossip connection thrashing #9819

Closed · tbg opened this issue Oct 8, 2016 · 17 comments · Fixed by #12880
Labels: S-1-stability (Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting)

tbg (Member) commented Oct 8, 2016

From #9749. We should figure out what was happening there and make sure we find out if it's happening again. See also #9817.

Btw, gossip forwarding is somewhat unhappy. A 100 node cluster should reach a steady state where connections aren't thrashed as much, but they are. I don't think it interferes with operations much, but it's awkward.

104.197.7.85: I161006 21:40:32.818083 1505518 gossip/server.go:254  [n29] node 29: refusing gossip from node 40 (max 4 conns); forwarding to 72 ({tcp cockroach-omega-072:26257})
104.197.7.85: I161006 21:40:32.822056 1505518 gossip/server.go:254  [n29] node 29: refusing gossip from node 40 (max 4 conns); forwarding to 93 ({tcp cockroach-omega-093:26257})
104.197.7.85: I161006 21:40:32.822703 1505518 gossip/server.go:254  [n29] node 29: refusing gossip from node 40 (max 4 conns); forwarding to 72 ({tcp cockroach-omega-072:26257})
104.197.140.88: I161006 21:40:32.885913 1743625 gossip/server.go:254  [n12] node 12: refusing gossip from node 40 (max 4 conns); forwarding to 42 ({tcp cockroach-omega-042:26257})
104.197.140.88: I161006 21:40:32.903174 1743625 gossip/server.go:254  [n12] node 12: refusing gossip from node 40 (max 4 conns); forwarding to 7 ({tcp cockroach-omega-007:26257})
104.198.242.162: I161006 21:40:32.970412 1022050 gossip/server.go:254  [n5] node 5: refusing gossip from node 40 (max 4 conns); forwarding to 24 ({tcp cockroach-omega-024:26257})
104.198.242.162: I161006 21:40:32.973392 1022050 gossip/server.go:254  [n5] node 5: refusing gossip from node 40 (max 4 conns); forwarding to 24 ({tcp cockroach-omega-024:26257})
104.198.242.162: I161006 21:40:32.974201 1022050 gossip/server.go:254  [n5] node 5: refusing gossip from node 40 (max 4 conns); forwarding to 24 ({tcp cockroach-omega-024:26257})
130.211.189.253: I161006 21:40:32.977045 1474501 gossip/server.go:254  [n24] node 24: refusing gossip from node 40 (max 4 conns); forwarding to 51 ({tcp cockroach-omega-051:26257})
130.211.189.253: I161006 21:40:32.979703 1474501 gossip/server.go:254  [n24] node 24: refusing gossip from node 40 (max 4 conns); forwarding to 34 ({tcp cockroach-omega-034:26257})
130.211.189.253: I161006 21:40:32.981100 1474501 gossip/server.go:254  [n24] node 24: refusing gossip from node 40 (max 4 conns); forwarding to 33 ({tcp cockroach-omega-033:26257})
104.155.137.78: I161006 21:40:33.861200 1894980 gossip/server.go:254  [n15] node 15: refusing gossip from node 69 (max 4 conns); forwarding to 65 ({tcp cockroach-omega-065:26257})
104.155.137.78: I161006 21:40:33.868766 1894980 gossip/server.go:254  [n15] node 15: refusing gossip from node 69 (max 4 conns); forwarding to 12 ({tcp cockroach-omega-012:26257})
104.198.223.39: I161006 21:40:34.595144 1645601 gossip/server.go:254  [n23] node 23: refusing gossip from node 69 (max 4 conns); forwarding to 85 ({tcp cockroach-omega-085:26257})
104.198.223.39: I161006 21:40:34.599871 1645601 gossip/server.go:254  [n23] node 23: refusing gossip from node 69 (max 4 conns); forwarding to 22 ({tcp cockroach-omega-022:26257})
104.198.223.39: I161006 21:40:34.631990 1645601 gossip/server.go:254  [n23] node 23: refusing gossip from node 69 (max 4 conns); forwarding to 89 ({tcp cockroach-omega-089:26257})
104.198.223.39: I161006 21:40:34.633451 1645601 gossip/server.go:254  [n23] node 23: refusing gossip from node 69 (max 4 conns); forwarding to 89 ({tcp cockroach-omega-089:26257})
104.198.223.39: I161006 21:40:34.643484 1645601 gossip/server.go:254  [n23] node 23: refusing gossip from node 69 (max 4 conns); forwarding to 89 ({tcp cockroach-omega-089:26257})
104.155.137.78: I161006 21:40:34.653518 1895834 gossip/server.go:254  [n15] node 15: refusing gossip from node 69 (max 4 conns); forwarding to 37 ({tcp cockroach-omega-037:26257})
104.198.223.39: I161006 21:40:34.656862 1645601 gossip/server.go:254  [n23] node 23: refusing gossip from node 69 (max 4 conns); forwarding to 89 ({tcp cockroach-omega-089:26257})
104.155.137.78: I161006 21:40:34.664758 1895834 gossip/server.go:254  [n15] node 15: refusing gossip from node 69 (max 4 conns); forwarding to 12 ({tcp cockroach-omega-012:26257})
104.155.137.78: I161006 21:40:34.667654 1895834 gossip/server.go:254  [n15] node 15: refusing gossip from node 69 (max 4 conns); forwarding to 12 ({tcp cockroach-omega-012:26257})
104.155.137.78: I161006 21:40:34.722797 1895834 gossip/server.go:254  [n15] node 15: refusing gossip from node 69 (max 4 conns); forwarding to 65 ({tcp cockroach-omega-065:26257})
104.198.255.99: I161006 21:40:34.912909 1384896 gossip/server.go:254  [n26] node 26: refusing gossip from node 69 (max 4 conns); forwarding to 79 ({tcp cockroach-omega-079:26257})
104.198.33.57: I161006 21:40:34.986429 1378355 gossip/server.go:254  [n9] node 9: refusing gossip from node 69 (max 4 conns); forwarding to 66 ({tcp cockroach-omega-066:26257})
104.198.33.57: I161006 21:40:35.001922 1378355 gossip/server.go:254  [n9] node 9: refusing gossip from node 69 (max 4 conns); forwarding to 66 ({tcp cockroach-omega-066:26257})
104.198.33.57: I161006 21:40:35.010396 1378355 gossip/server.go:254  [n9] node 9: refusing gossip from node 69 (max 4 conns); forwarding to 61 ({tcp cockroach-omega-061:26257})
104.198.40.234: I161006 21:40:35.174897 1202176 gossip/server.go:254  [n30] node 30: refusing gossip from node 69 (max 4 conns); forwarding to 61 ({tcp cockroach-omega-061:26257})
104.198.40.234: I161006 21:40:35.184023 1202176 gossip/server.go:254  [n30] node 30: refusing gossip from node 69 (max 4 conns); forwarding to 88 ({tcp cockroach-omega-088:26257})
104.197.143.153: I161006 21:40:36.075797 2002615 gossip/server.go:254  [n11] node 11: refusing gossip from node 75 (max 4 conns); forwarding to 3 ({tcp cockroach-omega-003:26257})
104.198.241.226: I161006 21:40:36.131057 1318331 gossip/server.go:254  [n4] node 4: refusing gossip from node 75 (max 4 conns); forwarding to 10 ({tcp cockroach-omega-010:26257})
104.198.242.162: I161006 21:40:36.149840 1024407 gossip/server.go:254  [n5] node 5: refusing gossip from node 75 (max 4 conns); forwarding to 24 ({tcp cockroach-omega-024:26257})
130.211.189.253: I161006 21:40:36.182970 1478389 gossip/server.go:254  [n24] node 24: refusing gossip from node 75 (max 4 conns); forwarding to 51 ({tcp cockroach-omega-051:26257})
104.197.143.153: I161006 21:40:36.191283 2002615 gossip/server.go:254  [n11] node 11: refusing gossip from node 75 (max 4 conns); forwarding to 44 ({tcp cockroach-omega-044:26257})
104.197.143.153: I161006 21:40:36.200922 2002615 gossip/server.go:254  [n11] node 11: refusing gossip from node 75 (max 4 conns); forwarding to 52 ({tcp cockroach-omega-052:26257})
104.154.87.20: I161006 21:40:36.268781 1458432 gossip/server.go:254  [n14] node 14: refusing gossip from node 75 (max 4 conns); forwarding to 99 ({tcp cockroach-omega-099:26257})
104.198.57.154: I161006 21:40:36.306849 1662653 gossip/server.go:254  [n2] node 2: refusing gossip from node 75 (max 4 conns); forwarding to 36 ({tcp cockroach-omega-036:26257})
104.198.63.222: I161006 21:40:36.318390 1798429 gossip/server.go:254  [n3] node 3: refusing gossip from node 75 (max 4 conns); forwarding to 63 ({tcp cockro
tbg added the S-1-stability label on Oct 8, 2016
tbg (Member, Author) commented Oct 8, 2016

@spencerkimball, any theories here? There are some metrics we collect but don't graph yet, I'll add those to grafana shortly.

tbg (Member, Author) commented Oct 8, 2016

Basic graph added which shows that we churned for quite a while until it stabilized.

[image: graph of gossip connections churning for a while before stabilizing]

tbg (Member, Author) commented Oct 8, 2016

@tamird this graph is pretty horrible and doesn't scale at all. Any magical ideas?

tamird (Contributor) commented Oct 9, 2016

https://prometheus.io/blog/2015/06/18/practical-anomaly-detection/ seems promising, but I haven't managed to massage the example there into something useful. I'll take a deeper look on Monday.

a-robinson (Contributor) commented

Gossip thrashing has been very apparent in @mberhault's scalability tests where he doubles the number of nodes in a cluster every N minutes:

[screenshot (2016-12-28): gossip thrashing during the node-doubling scalability test]

On the node with the lease on r1, 2186 of its 19907 log lines are "refusing gossip" errors.

Noticed while looking into #12591

petermattis (Collaborator) commented

That's not good. At what cluster scale does that thrashing begin to occur?

a-robinson (Contributor) commented

The start of thrashing around 20:15 in that graph is when the cluster jumped from 16 to 32 nodes. The additional thrashing that picks up around 20:45 is when it jumped from 32 to 64 nodes.

a-robinson (Contributor) commented

This appears to be back, judging by our recent 64 node tests - #12741 (comment)

bdarnell (Contributor) commented

See #15003 (comment), which shows an increase in gossip thrashing on 4/16, although that incident appears to be precipitated by a change from stop-the-world to rolling restarts.

a-robinson (Contributor) commented Apr 30, 2017

Continuing the thread from #15003 (comment) on why node 2 kept trying to connect to itself, it appears that what was going on was this:

  • Node 2 held the lease for the system config range. It had gossiped it out to other nodes in the cluster at some point.
  • A rolling restart happens. It's worth noting that after a restart, a node starts off without any gossiped info since it isn't persisted to disk. Thus, when node 2 starts up again, it gets all of its gossip info (other than its own node and store descriptors) from other nodes.
  • This means that node 2 gets the system config from another node with the original nodeID set to node 2 and some non-zero number of hops.
  • After the restart, node 2 maintains its lease on the system config range. However, it doesn't re-gossip the system config because it's unchanged from the value that node 2 received via gossip.
  • This can repeat multiple times so long as the value of the system config doesn't change, which it essentially never does on our test clusters. The more it repeats, the higher the number of hops gets.
  • Doing anything that modifies the system config range fixes the problem (until it builds up again after more rolling restarts). That's why my creating a new database fixed the problem on cyan.

We never hit this during stop-the-world migrations because the system config's leaseholder would have to re-gossip it when it started up.

This could also make for much greater gossip thrashing on the other nodes, since they'll always think that the system config leaseholder is further away than it actually is, and thus they'll periodically try to open a connection to it.

There are a couple of changes we should make here. First, we should either get a little less conservative about not gossiping the system config during startup, or we should exclude the system config key from mostDistant calculations. Second, each node should exclude itself from mostDistant calculations as a backup check. I'll send out a PR.
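
To make the failure mode above concrete, here's a minimal sketch (hypothetical types, not the actual gossip code) of how the hop count on node 2's own system config info only ever grows across rolling restarts, since the unchanged value is never re-gossiped at zero hops:

```go
// Sketch of hop accumulation across rolling restarts (illustrative only).
package main

import "fmt"

type Info struct {
	NodeID int    // original source of the info
	Hops   uint32 // hops the info took to reach its current holder
}

// relearnAfterRestart models node `self` restarting with an empty gossip
// store and receiving its own previously-gossiped info back from a peer
// that is peerDistance hops away. Because the value is unchanged, it is
// never re-gossiped at Hops=0, so the count only grows.
func relearnAfterRestart(info Info, peerDistance uint32) Info {
	info.Hops += peerDistance
	return info
}

func main() {
	sysCfg := Info{NodeID: 2, Hops: 0} // node 2 gossips the system config
	for restart := 1; restart <= 3; restart++ {
		sysCfg = relearnAfterRestart(sysCfg, 2) // assume peers ~2 hops away
		fmt.Printf("after restart %d: node 2 sees its own info at %d hops\n",
			restart, sysCfg.Hops)
	}
	// Other nodes now think node 2 is farther away than it really is and
	// keep trying to tighten connections to it.
}
```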

tamird (Contributor) commented Apr 30, 2017 via email

a-robinson (Contributor) commented

That would fix the problem of a node trying to connect to itself, yeah. We'd have to re-gossip the modified info in order to ensure other nodes don't get an unrealistic idea of how far away the source node is, but I think that as long as we did that, it would work. The only downside is that we'd lose the history of how long a particular piece of info has been floating around, but I'm not sure that has much value other than for this sort of debugging.

tamird (Contributor) commented Apr 30, 2017 via email

a-robinson (Contributor) commented

Ah, right, that behavior actually makes controlling MaxHops pretty tricky, since MaxHops can easily be incorrect and isn't really correctable given how we use the high watermarks. Node restarts could lead to some very long gossip paths if the cluster isn't lucky about which nodes pass on which data to the restarted nodes.

I'd be curious to understand the reasoning behind the high watermarks design -- I imagine it's purely to minimize the amount of data passed around? Because it seems dangerous from a thrashing perspective without a mechanism in place for nodes to periodically re-gossip all their infos (and thus reset all MaxHops counts down to something more accurate) or to recompute their MaxHops to each node in some other way.

The current behavior is really bad: if no new Infos are being passed around and more than maxPeers nodes think that one particular node is far away, then each of those nodes will regularly try to open a connection to it, but they can't all succeed, so one or more of them will keep retrying in perpetuity.
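
For illustration, here is a simplified sketch of delta gossip with high-water timestamps (the types and behavior are assumptions, not the real implementation) showing why a stale hop count is hard to correct: a peer only sends infos newer than the requester's high-water mark, so an unchanged info keeps its inflated Hops indefinitely.

```go
// Sketch of high-water-mark delta gossip (illustrative only).
package main

import "fmt"

type Info struct {
	Key       string
	Timestamp int64
	Hops      uint32
}

// delta returns the infos a peer would send to a requester whose
// high-water mark for this peer is highWater.
func delta(peerInfos []Info, highWater int64) []Info {
	var out []Info
	for _, i := range peerInfos {
		if i.Timestamp > highWater {
			out = append(out, i)
		}
	}
	return out
}

func main() {
	peer := []Info{
		{Key: "system-config", Timestamp: 100, Hops: 7}, // unchanged, stale Hops
		{Key: "node:9", Timestamp: 250, Hops: 1},        // freshly re-gossiped
	}
	// The requester already saw everything up to timestamp 200, so only the
	// node descriptor comes back; the inflated system-config Hops persists.
	for _, i := range delta(peer, 200) {
		fmt.Printf("sent %s (hops=%d)\n", i.Key, i.Hops)
	}
}
```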

a-robinson (Contributor) commented

Alright, we already gossip the node descriptors periodically, so we just need to make mostDistant calculations only rely on the node descriptor infos.
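
A minimal sketch of that approach, combined with the self-exclusion check mentioned earlier (the "node:" key prefix and all type names here are assumptions for illustration, not the actual gossip code):

```go
// Sketch: compute the most distant node from node descriptor infos only,
// and never report ourselves (illustrative only).
package main

import (
	"fmt"
	"strings"
)

type Info struct {
	Key    string
	NodeID int
	Hops   uint32
}

const nodeDescPrefix = "node:" // hypothetical key prefix for node descriptors

func mostDistantFromNodeDescs(selfID int, infos []Info) (int, uint32) {
	var maxID int
	var maxHops uint32
	for _, i := range infos {
		if !strings.HasPrefix(i.Key, nodeDescPrefix) {
			continue // ignore system config and other infos with stale Hops
		}
		if i.NodeID == selfID {
			continue // backup check: never try to connect to ourselves
		}
		if i.Hops > maxHops {
			maxID, maxHops = i.NodeID, i.Hops
		}
	}
	return maxID, maxHops
}

func main() {
	infos := []Info{
		{Key: "system-db", NodeID: 2, Hops: 9}, // stale hops survive restarts
		{Key: "node:7", NodeID: 7, Hops: 3},
		{Key: "node:9", NodeID: 9, Hops: 4},
	}
	id, hops := mostDistantFromNodeDescs(2, infos)
	fmt.Printf("n%d at %d hops\n", id, hops) // n9 at 4 hops
}
```

Since node descriptor infos are re-gossiped periodically and on restart, their Hops values stay close to the true network distance, unlike the long-lived system config info.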

a-robinson added a commit to a-robinson/cockroach that referenced this issue May 1, 2017
NodeID infos are reliably re-gossiped periodically and on restart,
making it much less likely that process restarts can lead to large
Hops values, which has shown itself to be a problem for the system
config info.

Fixes cockroachdb#9819
tamird (Contributor) commented May 1, 2017

FWIW, the high water marks design was added in 984c1e5, if that's any help.

spencerkimball (Member) commented

The high-water timestamps are meant to prevent excess gossip traffic. They reduce it considerably in various simulations – in some cases almost an order of magnitude.

I think the proper solution to this is to replace the concept of info.Hops with a slice of node IDs that indicate the shortest path between two nodes for info passing. See #15538.
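
A rough sketch of how that could look (field and method names are hypothetical, not taken from #15538): carry the path of node IDs an info traveled instead of a bare hop counter, so the hop count falls out as the path length and a node can cheaply detect infos that originated with or passed through itself.

```go
// Sketch: replace a hop counter with the path of node IDs (illustrative only).
package main

import "fmt"

type NodeID int

type Info struct {
	Value string
	Path  []NodeID // nodes the info passed through, source first
}

// forward records the receiving node on the path as the info propagates.
func forward(i Info, via NodeID) Info {
	cp := append([]NodeID(nil), i.Path...)
	cp = append(cp, via)
	return Info{Value: i.Value, Path: cp}
}

// hops is simply the path length minus the source itself.
func (i Info) hops() int { return len(i.Path) - 1 }

// passedThrough lets a node recognize infos that already visited it.
func (i Info) passedThrough(id NodeID) bool {
	for _, n := range i.Path {
		if n == id {
			return true
		}
	}
	return false
}

func main() {
	cfg := Info{Value: "system-config", Path: []NodeID{2}} // originated on n2
	cfg = forward(cfg, 5)
	cfg = forward(cfg, 9)
	fmt.Println(cfg.hops(), cfg.passedThrough(2)) // 2 true: n2 can discount it
}
```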
