Consul net latency #2757
Conversation
A mechanism for detecting leadership changes that can (and in fact should) be enabled on all server nodes in a cluster. It will emit a single event from the new leader.
A smaller diff than the previous version, and it now keeps track of `prev_consul_leader`, so overall a cleaner change.
Adds a new option to the consul check that brings node-to-node and datacenter-to-datacenter latency metrics to the consul integration.
Hi @ross, thanks a lot for this. We have something similar shipping with the agent: https://github.com/DataDog/go-metro. But this is great if customers have already deployed consul and don't want any of the sniffing overhead.
It's a pretty big contribution, so we'll do our best to review it as diligently as possible. Thanks!
That looks pretty interesting. Yeah, the nice thing about the consul stuff is that it comes for free (once you're running consul).
Most of the changes are actually part of #2647, as mentioned above. This one is entirely new code that's only turned on if configured. I've been running both for a couple of weeks now without any problems. The two Travis CI failures seem to be unrelated bootstrapping issues; the majority seemed to pass. I assume you all can re-run them if need be.
Hi @ross, there's a small flake8 issue here:
Really awesome stuff! Thanks a lot @ross!!! Just a couple of minor comments/questions, and that flake warning we should address before merging.
@@ -15,6 +16,27 @@
import requests


def distance(a, b):
Could we add a reference to https://www.consul.io/docs/internals/coordinates.html here? Just to make some more sense of the calculations/coordinates.
Done.
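For reference, the page linked above describes the distance estimate as a Euclidean distance over the Vivaldi coordinate vectors plus height and adjustment terms. A minimal Python sketch of that calculation, assuming `a` and `b` are the `Coord` dicts Consul's coordinate endpoints return (with `Vec`, `Height`, and `Adjustment` keys); this mirrors the docs, not necessarily the PR's exact code:

```python
import math

def estimate_rtt(a, b):
    # Euclidean distance between the two Vivaldi coordinate vectors
    sumsq = 0.0
    for x, y in zip(a['Vec'], b['Vec']):
        sumsq += (x - y) ** 2
    rtt = math.sqrt(sumsq) + a['Height'] + b['Height']
    # Apply both nodes' adjustment terms, unless that would go non-positive
    adjusted = rtt + a['Adjustment'] + b['Adjustment']
    if adjusted > 0.0:
        rtt = adjusted
    return rtt  # estimated round-trip time, in seconds
```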
perform_self_leader_check = instance.get('self_leader_check',
                                         self.init_config.get('self_leader_check', False))

if perform_new_leader_checks and perform_self_leader_check:
Is having both of these overkill? I'm not sure I like the idea of getting multiple events for the same leader change (i.e. `new_leader_checks` is `True` and `self_leader_checks` is `False`). I do like how you've handled it though, giving precedence to the self-checks (and having it set to `True` by default). Maybe I'm just overthinking.
So the former behavior would've already reported multiple events in the event of a leader change... We might want to preserve that default behavior (although I'm not a fan 😅 - I prefer what you've implemented) - for backward "compatibility" so to speak - I'll discuss and get back to you.
Yeah, this was all an attempt to preserve the existing behavior as-is. There would be other options, but it wasn't immediately clear how I'd preserve the old behavior for people who might be relying on it.
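To make the precedence rule being discussed concrete, here's a minimal standalone sketch: if both flags are enabled, `self_leader_check` wins so only one event is emitted per leadership change. The flag lookups mirror the diff above; the function wrapper and warning text are illustrative, not the PR's exact implementation.

```python
def resolve_leader_check_flags(instance, init_config, log):
    # Instance-level settings override init_config, defaulting to False
    new_leader = instance.get('new_leader_checks',
                              init_config.get('new_leader_checks', False))
    self_leader = instance.get('self_leader_check',
                               init_config.get('self_leader_check', False))
    if new_leader and self_leader:
        # Give precedence to the self-check to avoid duplicate events
        log.warn('Both new_leader_checks and self_leader_check are enabled; '
                 'only self_leader_check will emit events')
        new_leader = False
    return new_leader, self_leader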
# Not us, skip
continue
# Inter-datacenter
for other in datacenters:
We can probably `break` after this for loop.
Not sure what you mean here.
@ross so, because we're only interested in `agent_dc == name`, I think that after this for loop completes, in line 406, we can probably `break` the outer for loop as there's no reason to waste any more cycles there - the linear search is done. `datacenters` is probably going to be small so there's no big performance gain, that's true.
Ah. It's the other way around. This is saying: if the other datacenter is ourselves, `continue` and skip it. Otherwise, process the timing (between us and this round's `other`).
Sorry, too many for loops there. I think I see what you're saying in the first one. I'll rework this in the morning.
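For anyone following along, a sketch of the control flow being talked through above. The shape is hypothetical and the dict key follows Consul's `/v1/coordinate/datacenters` output (each entry names a `Datacenter`); the `continue` in the inner loop is the skip @ross describes, and the final `break` is the reviewer's suggestion for the outer loop.

```python
def dc_latency_pairs(datacenters, agent_dc):
    """Yield (us, other) pairs for inter-datacenter latency calculations."""
    for us in datacenters:
        if us['Datacenter'] != agent_dc:
            continue  # not our datacenter; keep searching
        # Inter-datacenter: compare our DC against every *other* DC
        for other in datacenters:
            if other['Datacenter'] == agent_dc:
                continue  # don't pair us with ourselves
            yield us, other
        break  # our DC was found and processed; no need to keep scanning
```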
median = (latencies[half_n - 1] + latencies[half_n]) / 2
self.gauge('consul.net.dc-latency.min', latencies[0], hostname='', tags=tags)
self.gauge('consul.net.dc-latency.median', median, hostname='', tags=tags)
self.gauge('consul.net.dc-latency.max', latencies[-1], hostname='', tags=tags)
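As an aside on this hunk: the median line shown handles the even-count case. A standalone sketch of the full min/median/max computation, including the odd-count branch (illustrative; only the even-count line comes from the diff):

```python
def latency_stats(latencies):
    """Return (min, median, max) of a list of float latency estimates."""
    latencies = sorted(latencies)
    half_n = len(latencies) // 2
    if len(latencies) % 2 == 0:
        # Even count: average the two middle values (the line in the diff)
        median = (latencies[half_n - 1] + latencies[half_n]) / 2
    else:
        median = latencies[half_n]
    return latencies[0], median, latencies[-1]
```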
I'm going to have to check how this `hostname=''` parameter might play with our backend. I get why you're overriding it (because these aren't host metrics), but `key:value` tags with an empty value might not be desirable.
Should we maybe namespace these as `consul.net.dc.latency.*` and `consul.net.node.latency`?
> I'm going to have to check how this `hostname=''` parameter might play with our backend. I get why you're overriding it (because these aren't host metrics), but `key:value` tags with an empty value might not be desirable.

I've done it quite frequently in any case where the metric doesn't belong to a host; it avoids having the metric tagged with things that make no sense (all the other host-level tags). Until we did so, those tags seemed to confuse people at times. I tested things out and it all seems to work from the outside. I guess it might be blowing up/causing problems internally or something 😁

> Should we maybe namespace these as `consul.net.dc.latency.*` and `consul.net.node.latency`?

I can make that change if you like.
Thanks for modifying the namespacing. I checked regarding the blank hostname tags; we should be fine. :)
@@ -370,3 +474,68 @@ def test_new_leader_event(self):
        self.assertEqual(event['event_type'], 'consul.new_leader')
        self.assertIn('prev_consul_leader:My Old Leader', event['tags'])
        self.assertIn('curr_consul_leader:My New Leader', event['tags'])

    def test_self_leader_event(self):
💯 great stuff, thank you so much for the tests.
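A hedged sketch of what the `test_self_leader_event` body might cover, modeled directly on `test_new_leader_event` above. `MOCK_CONFIG_SELF_LEADER_CHECK` is a hypothetical fixture name, and the real test in the PR may differ:

```python
    def test_self_leader_event(self):
        # MOCK_CONFIG_SELF_LEADER_CHECK is assumed to enable self_leader_check
        self.run_check(MOCK_CONFIG_SELF_LEADER_CHECK)
        # Only the new leader should emit a single event
        self.assertEqual(len(self.events), 1)
        event = self.events[0]
        self.assertEqual(event['event_type'], 'consul.new_leader')
        self.assertIn('prev_consul_leader:My Old Leader', event['tags'])
        self.assertIn('curr_consul_leader:My New Leader', event['tags'])
```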
# happens. It is safe/expected to enable this on all nodes in a consul
# cluster since only the new leader will emit the (single) event. This
# flag takes precedence over new_leader_checks.
self_leader_check: yes
As mentioned earlier, I like this a lot better than what we were previously doing. However, this would modify the check's behavior and could throw some people off. Don't make any changes to this (or the code) just yet - but we might have to (maybe in a quick separate PR).
self.event({
    "timestamp": int(datetime.now().strftime("%s")),
    "event_type": "consul.new_leader",
Again, not your fault at all, but `consul.new_leader` should probably go in a constant. It's not your job to clean this up, so feel free to ignore.
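For illustration, the refactor might look something like this. The constant and helper names are hypothetical; only the payload shape (timestamp, event type, tag names) comes from the diff and the tests above.

```python
from datetime import datetime

# Hypothetical constant name; the event-type string itself is from the diff
NEW_LEADER_EVENT_TYPE = 'consul.new_leader'

def new_leader_event(agent_dc, prev_leader, curr_leader):
    # strftime("%s") mirrors the diff (platform-dependent epoch seconds)
    return {
        'timestamp': int(datetime.now().strftime('%s')),
        'event_type': NEW_LEADER_EVENT_TYPE,
        'tags': [
            'prev_consul_leader:{0}'.format(prev_leader),
            'curr_consul_leader:{0}'.format(curr_leader),
            'consul_datacenter:{0}'.format(agent_dc),
        ],
    }
```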
"consul_datacenter:{0}".format(agent_dc)] | ||
}) | ||
# There was a leadership change | ||
if perform_new_leader_checks or (perform_self_leader_check and agent == leader): |
If we have `perform_self_leader_check`, should we also send an event when we relinquish leadership? Just trying to cover the scenario where perhaps the agent on the new leader, for whatever reason, fails to submit the new leadership event....
My assumption was that you'd want to run the new check on all server nodes so that if another node picked up leadership you'd see the event from there. If no leader was elected, you'd probably find out about that another way. It's easy enough for the former leader to emit an event when leadership changes; the downside would be that you'd get two events. Not the end of the world, so up to you.
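To make the trade-off concrete, a hedged sketch of the two-event variant discussed above, where the old leader also announces the change. All names are hypothetical; the PR as written emits only from the new leader.

```python
def should_emit_leader_event(agent, prev_leader, curr_leader):
    """Decide whether this agent should emit an event for a leadership change."""
    if prev_leader == curr_leader:
        return False  # no leadership change at all
    if agent == curr_leader:
        return True   # the PR's behavior: the new leader announces itself
    if agent == prev_leader:
        # The extra case under discussion: the former leader announces the
        # change too, covering the scenario where the new leader's agent
        # fails to submit the event (at the cost of a duplicate event)
        return True
    return False
```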
The CI failures are a known issue with starting the SQL server and are unrelated. 👍 Thanks a lot for this @ross
What does this PR do?
This PR adds a new option to the consul check to include calculated network latencies based on https://www.consul.io/docs/internals/coordinates.html. Both inter- and intra-datacenter metrics are included. The datacenter-to-datacenter metrics are under `consul.net.dc-latency`, include `source_datacenter`/`dest_datacenter` tags, and are host-less. The node-to-node metrics are under `consul.net.latency`, are emitted per-host, and include a `consul_datacenter` tag.

Motivation
I previously had access to this data in Graphite and found it interesting/useful at times. It's nice to be able to see network latencies between hosts, racks, availability zones, roles, etc. using the host-level tags.
Additional Notes
This PR is a follow-on to #2647, which should be merged prior to merging this one.