
Consul net latency #2757

Merged: 9 commits into DataDog:master from consul-net-latency, Oct 6, 2016

Conversation


@ross ross commented Aug 12, 2016

What does this PR do?

This PR adds a new option to the consul check to include calculated network latencies based on https://www.consul.io/docs/internals/coordinates.html. Both inter and intra dc metrics are included. The dc to dc metrics are under consul.net.dc-latency, include source_datacenter/dest_datacenter tags, and are host-less. The node to node metrics are under consul.net.latency, are emitted per-host, and include a consul_datacenter tag.
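The metrics build on Consul's network coordinates (the Vivaldi model documented at the link above). As a rough illustration of how an RTT estimate is derived from two coordinates, here is a sketch; the field names mirror the Coord objects returned by Consul's /v1/coordinate endpoints, and this is not the check's actual code:

```python
import math

def estimate_rtt(a, b):
    """Estimate round-trip time (seconds) between two Consul network
    coordinates, per the model described in Consul's coordinates docs.
    `a` and `b` are dicts shaped like Consul's Coord objects."""
    # Euclidean distance across the coordinate vectors
    dist = math.sqrt(sum((x - y) ** 2 for x, y in zip(a['Vec'], b['Vec'])))
    # Add the non-Euclidean height components (local latency effects)
    dist += a['Height'] + b['Height']
    # Apply the adjustment terms, but never let the estimate go negative
    adjusted = dist + a['Adjustment'] + b['Adjustment']
    return adjusted if adjusted > 0 else dist
```

The check would compute pairwise estimates like this and then aggregate them (min, median, percentiles) per node or per datacenter pair.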

Motivation

Previously had access to this data in graphite and found it interesting/useful at times. It's nice to be able to see network latencies between hosts, racks, availability-zones, roles, etc using the host-level tags.

Additional Notes

This PR is a follow on to #2647 which should be merged prior to merging this.

ross added 5 commits June 30, 2016 15:30
A mechanism for detecting leadership changes that can, and in fact should, be
enabled on all server nodes in a cluster. It will emit a single event from
the new leader.
Smaller diff from previous version and now keeps track of
prev_consul_leader so overall a cleaner change.
Adds a new option to the consul check that includes node-to-node and
datacenter-to-datacenter latency metrics in the consul integration.

ross commented Aug 12, 2016

Example of min, median, and 75th percentile latencies for a single node in AWS:

[Screenshot: screen shot 2016-08-12 at 1 42 13 pm]

@truthbk truthbk self-assigned this Aug 12, 2016

truthbk commented Aug 12, 2016

Hi @ross, thanks a lot for this. We have something similar shipping with the agent https://github.com/DataDog/go-metro. But this is great if customers have already deployed consul and don't want any of the sniffing overhead.


truthbk commented Aug 12, 2016

It's a pretty big contribution, so we'll do our best to review it as diligently as possible. Thanks!


ross commented Aug 12, 2016

Hi @ross, thanks a lot for this. We have something similar shipping with the agent https://github.com/DataDog/go-metro. But this is great if customers have already deployed consul and don't want any of the sniffing overhead.

That looks pretty interesting. Yeah, nice thing about the consul stuff is that it comes for free (once you're running consul.)

It's a pretty big contribution, so we'll do our best to review it as diligently as possible. Thanks!

Most of the changes are actually part of #2647 as mentioned above. This one is entirely new code that's only turned on if configured. I've been running both for a couple weeks now without any problems.

The 2 Travis CI failures seem to be unrelated bootstrapping issues; the majority passed. I assume you all can re-run them if need be.

@truthbk truthbk modified the milestones: 5.10.0, Triage Oct 4, 2016

truthbk commented Oct 4, 2016

Hi @ross, there's a small flake8 issue here: ./checks.d/consul.py:31:7: E111 indentation is not a multiple of four.


@truthbk truthbk left a comment


Really awesome stuff! Thanks a lot @ross!!! Just a couple of minor comments/questions, and that flake warning we should address before merging.

@@ -15,6 +16,27 @@
import requests


def distance(a, b):
Member


Could we add a reference to https://www.consul.io/docs/internals/coordinates.html here? Just to make some more sense of the calculations/coordinates.

Contributor Author


Done.

perform_self_leader_check = instance.get('self_leader_check',
self.init_config.get('self_leader_check', False))

if perform_new_leader_checks and perform_self_leader_check:
Member


Is having both of these overkill? I'm not sure I like the idea of getting multiple events for the same leader change (i.e. new_leader_checks is True and self_leader_checks is False). I do like how you've handled it though, giving precedence to the self-checks (and having it set to True by default). Maybe I'm just overthinking.


@truthbk truthbk Oct 4, 2016


So the former behavior would've already reported multiple events in the event of a leader change... We might want to preserve that default behavior (although I'm not a fan 😅 - I prefer what you've implemented) - for backward "compatibility" so to speak - I'll discuss and get back to you.

Contributor Author


Yeah, this was all an attempt to preserve the existing behavior as-is. There would be other options, but it wasn't immediately clear how I'd preserve the old behavior for people who might be relying on it.
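For reference, the precedence being discussed can be sketched like this (the function name is hypothetical; the variable names and False defaults follow the config snippet above):

```python
def leader_check_mode(instance, init_config):
    """Resolve which leader-change check to run for an instance.
    Instance-level settings fall back to init_config, then to False."""
    new_leader = instance.get('new_leader_checks',
                              init_config.get('new_leader_checks', False))
    self_leader = instance.get('self_leader_check',
                               init_config.get('self_leader_check', False))
    # self_leader_check takes precedence when both are enabled,
    # so only the new leader emits a single event
    if self_leader:
        return 'self'
    if new_leader:
        return 'new'
    return None
```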

# Not us, skip
continue
# Inter-datacenter
for other in datacenters:
Member


We can probably break after this for loop

Contributor Author


Not sure what you mean here.


@truthbk truthbk Oct 4, 2016


@ross so, because we're only interested in agent_dc == name, I think that after this for loop completes, in line 406, we can probably break the outer for loop as there's no reason to waste any more cycles there - the linear search is done. datacenters is probably going to be small so there's no big performance gain, that's true.

Contributor Author


Ah. It's the other way around. This is saying: if the other datacenter is ourself, continue and skip it. Otherwise process the timing (between us and this round's other).

Contributor Author


Sorry, too many for loops there. I think I see what you're saying in the first one. I'll rework this in the morning.
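The optimization truthbk describes can be sketched as follows (names are hypothetical, assuming `datacenters` is a list of dicts keyed by 'Datacenter' as in Consul's coordinate API): once the linear search finds the agent's own datacenter, the outer loop can stop scanning.

```python
def dc_pairs(datacenters, agent_dc):
    """Collect (source, dest) datacenter pairs for the agent's own DC,
    breaking out of the outer scan once it has been found."""
    pairs = []
    for dc in datacenters:
        if dc['Datacenter'] != agent_dc:
            continue  # not us, keep scanning
        for other in datacenters:
            if other['Datacenter'] == agent_dc:
                continue  # skip ourselves
            pairs.append((agent_dc, other['Datacenter']))
        break  # our DC was found and handled; no reason to waste more cycles
    return pairs
```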

median = (latencies[half_n - 1] + latencies[half_n]) / 2
self.gauge('consul.net.dc-latency.min', latencies[0], hostname='', tags=tags)
self.gauge('consul.net.dc-latency.median', median, hostname='', tags=tags)
self.gauge('consul.net.dc-latency.max', latencies[-1], hostname='', tags=tags)
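As an aside on the snippet above: the `median` line shown handles the even-length case. A helper covering both parities might look like this (a sketch assuming a sorted, non-empty input; not the PR's actual code):

```python
def median(sorted_latencies):
    """Median of an already-sorted, non-empty list of latencies."""
    n = len(sorted_latencies)
    half = n // 2
    if n % 2:
        return sorted_latencies[half]  # odd length: middle element
    # even length: mean of the two middle elements, as in the snippet above
    return (sorted_latencies[half - 1] + sorted_latencies[half]) / 2.0
```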
Member


I'm going to have to check how this hostname='' parameter might play with our backend. I get why you're overriding (because they're not host metrics), but key:value tags with an empty value might not be desirable.


@truthbk truthbk Oct 4, 2016


should we maybe namespace as consul.net.dc.latency.* and consul.net.node.latency?

Contributor Author


I'm going to have to check how this hostname='' parameter might play with our backend. I get why you're overriding (because they're not host metrics), but key:value tags with an empty value might not be desirable.

I've done it quite frequently in cases where the metric doesn't belong to a host; it avoids having the metric tagged with things that make no sense (all the other host-level tags). Until we did so, those tags seemed to confuse people at times. I tested things out and it all seems to work from the outside. I guess it might be blowing up/causing problems internally or something 😁

should we maybe namespace as consul.net.dc.latency.* and consul.net.node.latency?

I can make that change if you like.

Member


Thanks for modifying the namespacing. I checked regarding the blank hostname tags. We should be fine. :)

@@ -370,3 +474,68 @@ def test_new_leader_event(self):
self.assertEqual(event['event_type'], 'consul.new_leader')
self.assertIn('prev_consul_leader:My Old Leader', event['tags'])
self.assertIn('curr_consul_leader:My New Leader', event['tags'])

def test_self_leader_event(self):
Member


💯 great stuff, thank you so much for the tests.

# happens. It is safe/expected to enable this on all nodes in a consul
# cluster since only the new leader will emit the (single) event. This
# flag takes precedence over new_leader_checks.
self_leader_check: yes
Member


As mentioned earlier, I like this a lot better than what we were previously doing. However this would modify the check behavior and could throw some people off. Don't make any changes to this (or the code) just yet - but we might have to (maybe in a quick separate PR).


self.event({
"timestamp": int(datetime.now().strftime("%s")),
"event_type": "consul.new_leader",

@truthbk truthbk Oct 4, 2016


Again, not your fault at all, but consul.new_leader should probably go in a constant. It's not your job to clean this up, so feel free to ignore.
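The suggested cleanup might look roughly like this (the helper function is hypothetical; only hoisting the string into a constant reflects the suggestion):

```python
# Module-level constant so every call site (event emission, tests)
# refers to the same name instead of a repeated magic string.
NEW_LEADER_EVENT_TYPE = 'consul.new_leader'

def make_leader_event(prev_leader, curr_leader, timestamp):
    # Hypothetical helper; the real check builds this dict inline.
    return {
        'timestamp': timestamp,
        'event_type': NEW_LEADER_EVENT_TYPE,
        'tags': [
            'prev_consul_leader:{0}'.format(prev_leader),
            'curr_consul_leader:{0}'.format(curr_leader),
        ],
    }
```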

"consul_datacenter:{0}".format(agent_dc)]
})
# There was a leadership change
if perform_new_leader_checks or (perform_self_leader_check and agent == leader):
Member


If we have perform_self_leader_check should we also send an event when we relinquish leadership? Just trying to cover the scenario where perhaps the agent on the new leader, for whatever reason, fails to submit the new leadership event....

Contributor Author


My assumption was that you'd want to run the new way on all leaders so that if another picked up leadership you'd see the message from there. If no leader was elected you'd probably find out about that another way. It's easy enough for the former leader to emit an event when it changes. Downside would be that you'd get two events. Not the end of the world so up to you.


truthbk commented Oct 6, 2016

CI failures are a known issue with starting the SQL server and are unrelated. 👍

Thanks a lot for this @ross

@truthbk truthbk merged commit a33aa3f into DataDog:master Oct 6, 2016
@ross ross deleted the consul-net-latency branch October 6, 2016 22:14