stability: long-lasting gossip connection thrashing #9819
@spencerkimball, any theories here? There are some metrics we collect but don't graph yet; I'll add those to grafana shortly.

@tamird this graph is pretty horrible and doesn't scale at all. Any magical ideas?

https://prometheus.io/blog/2015/06/18/practical-anomaly-detection/ seems promising, but I haven't managed to massage the example there into something useful. I'll take a deeper look on Monday.

Gossip thrashing has been very apparent in @mberhault's scalability tests, in which he doubles the number of nodes in a cluster every N minutes: on the node with the lease on r1, 2,186 of its 19,907 log lines are "refusing gossip" errors. Noticed while looking into #12591.

That's not good. At what cluster scale does the thrashing begin to occur?

The start of thrashing around 20:15 in that graph is when the cluster jumped from 16 to 32 nodes. The additional thrashing that picks up around 20:45 is when it jumped from 32 to 64 nodes.

This appears to be back, judging by our recent 64-node tests: #12741 (comment).

See #15003 (comment), which shows an increase in gossip thrashing on 4/16, although that incident appears to have been precipitated by a change from stop-the-world to rolling restarts.
Continuing the thread from #15003 (comment) on why node 2 kept trying to connect to itself, it appears as though what was going on was this:

- Node 2 held the lease for the system config range. It had gossiped the config out to other nodes in the cluster at some point.
- A rolling restart happens. It's worth noting that after a restart, a node starts off without any gossiped info, since gossip state isn't persisted to disk. Thus, when node 2 starts up again, it gets all of its gossip info (other than its own node and store descriptors) from other nodes.
- This means that node 2 gets the system config from another node, with the original nodeID set to node 2 and some non-zero number of hops.
- After the restart, node 2 maintains its lease on the system config range. However, it doesn't re-gossip the system config because it's unchanged from the value that node 2 received via gossip (https://github.com/cockroachdb/cockroach/blob/master/pkg/storage/replica.go#L4527).
- This can repeat multiple times as long as the value of the system config doesn't change, which it essentially never does on our test clusters. The more it repeats, the higher the number of hops gets.

We never hit this during stop-the-world migrations because the system config's leaseholder would have to re-gossip it when it started up. This could also cause much greater gossip thrashing on the other nodes, since they'll always think that the system config's leaseholder is further away than it actually is, and thus they'll periodically try to open a connection to it.

There are a couple of changes we should make here. First, we should either get a little less conservative about not gossiping the system config during startup, or we should exclude the system config key from `mostDistant` calculations. Second, each node should exclude itself from `mostDistant` calculations as a backup check. I'll send out a PR.

Nice work. Would it make sense to set `gossip.Info.Hops` to zero if the node ID matches the recipient's node ID? I think that's a more general solution, but I might be missing something.
That would fix the problem of a node trying to connect to itself, yeah. We'd have to re-gossip the modified info to ensure other nodes don't get an unrealistic idea of how far away the source node is, but as long as we did that, I think it would work. The only downside is that we'd lose the history of how long a particular piece of info has been floating around, but I'm not sure that has much value beyond this sort of debugging.
> We'd have to re-gossip the modified info in order to ensure other nodes don't get an unrealistic idea of how far away the source node is, but I think that as long as we did that, it would work.

Can you help me understand why that's necessary, given that we don't already do that today? That is, isn't it possible today for an Info to be created at node 1, passed to node 2, and then to node 3, and then for node 3 to connect directly to node 1? I believe node 1 will not then re-gossip that Info, leaving node 3 believing the Info's Hops to be 2 rather than 1.
Ah, right. I'd be curious to understand the reasoning behind the high watermarks design -- I imagine it's purely to minimize the amount of data passed around? It seems dangerous from a thrashing perspective without a mechanism in place for nodes to periodically re-gossip all their infos (and thus reset all Hops). The current behavior is really bad, since if no new infos are gossiped, inflated Hops values never get corrected.
Alright, we already gossip the node descriptors periodically, so we just need to make sure the NodeID infos are re-gossiped periodically (and on restart) as well.
NodeID infos are reliably re-gossiped periodically and on restart, making it much less likely that process restarts can lead to large Hops values, which has shown itself to be a problem for the system config info. Fixes cockroachdb#9819
FWIW, the high water marks design was added in 984c1e5, if that's any help.
The high-water timestamps are meant to prevent excess gossip traffic. They reduce it considerably in various simulations -- in some cases by almost an order of magnitude. I think the proper solution to this is to replace the concept of
From #9749. We should figure out what was happening there, and make sure we find out if it starts happening again. See also #9817.
Btw, gossip forwarding is somewhat unhappy. A 100-node cluster should reach a steady state in which connections aren't thrashed as much, but they are. I don't think it interferes with operations much, but it's awkward.