Consul net latency #2757
Conversation
A mechanism for detecting leadership changes that can (and in fact should) be enabled on all server nodes in a cluster. It will emit a single event from the new leader.
A smaller diff than the previous version, and it now keeps track of `prev_consul_leader`, so overall a cleaner change.
Adds a new option to the consul check that brings node-to-node and datacenter-to-datacenter latency metrics to the consul integration.
Hi @ross, thanks a lot for this. We have something similar shipping with the agent: https://github.com/DataDog/go-metro. But this is great if customers have already deployed consul and don't want any of the sniffing overhead.
It's a pretty big contribution, so we'll do our best to review it as diligently as possible. Thanks!
That looks pretty interesting. Yeah, the nice thing about the consul stuff is that it comes for free (once you're running consul).
Most of the changes are actually part of #2647, as mentioned above. This one is entirely new code that's only turned on if configured. I've been running both for a couple of weeks now without any problems. The two Travis CI failures seem to be unrelated bootstrapping issues; the majority seemed to pass. I assume you all can re-run them if need be.
Hi @ross, there's a small flake8 issue here:
Really awesome stuff! Thanks a lot @ross!!! Just a couple of minor comments/questions, and that flake warning we should address before merging.
@@ -15,6 +16,27 @@
import requests


def distance(a, b):
Could we add a reference to https://www.consul.io/docs/internals/coordinates.html here? Just to make some more sense of the calculations/coordinates.
Done.
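For reference, the page linked above describes the distance estimate as a Euclidean distance over the Vivaldi coordinate vectors plus height and adjustment terms. A minimal Python sketch of that calculation, assuming `a` and `b` are the `Coord` dicts Consul's coordinate endpoints return (with `Vec`, `Height`, and `Adjustment` keys); this mirrors the docs, not necessarily the PR's exact code:

```python
import math

def estimate_rtt(a, b):
    # Euclidean distance between the two Vivaldi coordinate vectors
    sumsq = 0.0
    for x, y in zip(a['Vec'], b['Vec']):
        sumsq += (x - y) ** 2
    rtt = math.sqrt(sumsq) + a['Height'] + b['Height']
    # Apply both nodes' adjustment terms, unless that would go non-positive
    adjusted = rtt + a['Adjustment'] + b['Adjustment']
    if adjusted > 0.0:
        rtt = adjusted
    return rtt  # estimated round-trip time, in seconds
```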
perform_self_leader_check = instance.get('self_leader_check',
                                         self.init_config.get('self_leader_check', False))

if perform_new_leader_checks and perform_self_leader_check:
Is having both of these overkill? I'm not sure I like the idea of getting multiple events for the same leader change (i.e. `new_leader_checks` is `True` and `self_leader_checks` is `False`). I do like how you've handled it though, giving precedence to the self-checks (and having it set to `True` by default). Maybe I'm just overthinking.
So the former behavior would've already reported multiple events in the event of a leader change... We might want to preserve that default behavior (although I'm not a fan 😅 - I prefer what you've implemented) - for backward "compatibility" so to speak - I'll discuss and get back to you.
Yeah, this was all an attempt to preserve the existing behavior as-is. There would be other options, but it wasn't immediately clear how I'd preserve the old behavior for people who might be relying on it.
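To make the precedence rule being discussed concrete, here's a minimal standalone sketch: if both flags are enabled, `self_leader_check` wins so only one event is emitted per leadership change. The flag lookups mirror the diff above; the function wrapper and warning text are illustrative, not the PR's exact implementation.

```python
def resolve_leader_check_flags(instance, init_config, log):
    # Instance-level settings override init_config, defaulting to False
    new_leader = instance.get('new_leader_checks',
                              init_config.get('new_leader_checks', False))
    self_leader = instance.get('self_leader_check',
                               init_config.get('self_leader_check', False))
    if new_leader and self_leader:
        # Give precedence to the self-check to avoid duplicate events
        log.warn('Both new_leader_checks and self_leader_check are enabled; '
                 'only self_leader_check will emit events')
        new_leader = False
    return new_leader, self_leader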
# Not us, skip
continue
# Inter-datacenter
for other in datacenters:
We can probably `break` after this for loop.
Not sure what you mean here.
@ross so, because we're only interested in `agent_dc == name`, I think that after this for loop completes, in line 406, we can probably `break` the outer for loop as there's no reason to waste any more cycles there - the linear search is done. `datacenters` is probably going to be small so there's no big performance gain, that's true.
Ah. It's the other way around. This is saying: if the other datacenter is ourselves, `continue` and skip it. Otherwise, process the timing (between us and this round's `other`).
Sorry, too many for loops there. I think I see what you're saying in the first one. I'll rework this in the morning.
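For anyone following along, a sketch of the control flow being talked through above. The shape is hypothetical and the dict key follows Consul's `/v1/coordinate/datacenters` output (each entry names a `Datacenter`); the `continue` in the inner loop is the skip @ross describes, and the final `break` is the reviewer's suggestion for the outer loop.

```python
def dc_latency_pairs(datacenters, agent_dc):
    """Yield (us, other) pairs for inter-datacenter latency calculations."""
    for us in datacenters:
        if us['Datacenter'] != agent_dc:
            continue  # not our datacenter; keep searching
        # Inter-datacenter: compare our DC against every *other* DC
        for other in datacenters:
            if other['Datacenter'] == agent_dc:
                continue  # don't pair us with ourselves
            yield us, other
        break  # our DC was found and processed; no need to keep scanning
```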
median = (latencies[half_n - 1] + latencies[half_n]) / 2
self.gauge('consul.net.dc-latency.min', latencies[0], hostname='', tags=tags)
self.gauge('consul.net.dc-latency.median', median, hostname='', tags=tags)
self.gauge('consul.net.dc-latency.max', latencies[-1], hostname='', tags=tags)
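As an aside on this hunk: the median line shown handles the even-count case. A standalone sketch of the full min/median/max computation, including the odd-count branch (illustrative; only the even-count line comes from the diff):

```python
def latency_stats(latencies):
    """Return (min, median, max) of a list of float latency estimates."""
    latencies = sorted(latencies)
    half_n = len(latencies) // 2
    if len(latencies) % 2 == 0:
        # Even count: average the two middle values (the line in the diff)
        median = (latencies[half_n - 1] + latencies[half_n]) / 2
    else:
        median = latencies[half_n]
    return latencies[0], median, latencies[-1]
```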
I'm going to have to check how this `hostname=''` parameter might play with our backend. I get why you're overriding it (because these aren't host metrics), but `key:value` tags with an empty value might not be desirable.
Should we maybe namespace these as `consul.net.dc.latency.*` and `consul.net.node.latency`?
> I'm going to have to check how this `hostname=''` parameter might play with our backend. I get why you're overriding it (because these aren't host metrics), but `key:value` tags with an empty value might not be desirable.

I've done it quite frequently in any case where the metric doesn't belong to a host; it avoids having the metric tagged with things that make no sense (all the other host-level tags). Until we did so, those tags seemed to confuse people at times. I tested things out and it all seems to work from the outside. I guess it might be blowing up/causing problems internally or something 😁

> Should we maybe namespace these as `consul.net.dc.latency.*` and `consul.net.node.latency`?

I can make that change if you like.
Thanks for modifying the namespacing. I checked regarding the blank hostname tags; we should be fine. :)
@@ -370,3 +474,68 @@ def test_new_leader_event(self):
        self.assertEqual(event['event_type'], 'consul.new_leader')
        self.assertIn('prev_consul_leader:My Old Leader', event['tags'])
        self.assertIn('curr_consul_leader:My New Leader', event['tags'])

    def test_self_leader_event(self):
💯 great stuff, thank you so much for the tests.
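A hedged sketch of what the `test_self_leader_event` body might cover, modeled directly on `test_new_leader_event` above. `MOCK_CONFIG_SELF_LEADER_CHECK` is a hypothetical fixture name, and the real test in the PR may differ:

```python
    def test_self_leader_event(self):
        # MOCK_CONFIG_SELF_LEADER_CHECK is assumed to enable self_leader_check
        self.run_check(MOCK_CONFIG_SELF_LEADER_CHECK)
        # Only the new leader should emit a single event
        self.assertEqual(len(self.events), 1)
        event = self.events[0]
        self.assertEqual(event['event_type'], 'consul.new_leader')
        self.assertIn('prev_consul_leader:My Old Leader', event['tags'])
        self.assertIn('curr_consul_leader:My New Leader', event['tags'])
```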
# happens. It is safe/expected to enable this on all nodes in a consul
# cluster since only the new leader will emit the (single) event. This
# flag takes precedence over new_leader_checks.
self_leader_check: yes
As mentioned earlier, I like this a lot better than what we were previously doing. However, this would modify the check's behavior and could throw some people off. Don't make any changes to this (or the code) just yet - but we might have to (maybe in a quick separate PR).
self.event({
    "timestamp": int(datetime.now().strftime("%s")),
    "event_type": "consul.new_leader",
Again, not your fault at all, but `consul.new_leader` should probably go in a constant. It's not your job to clean this up, so feel free to ignore.
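For illustration, the refactor might look something like this. The constant and helper names are hypothetical; only the payload shape (timestamp, event type, tag names) comes from the diff and the tests above.

```python
from datetime import datetime

# Hypothetical constant name; the event-type string itself is from the diff
NEW_LEADER_EVENT_TYPE = 'consul.new_leader'

def new_leader_event(agent_dc, prev_leader, curr_leader):
    # strftime("%s") mirrors the diff (platform-dependent epoch seconds)
    return {
        'timestamp': int(datetime.now().strftime('%s')),
        'event_type': NEW_LEADER_EVENT_TYPE,
        'tags': [
            'prev_consul_leader:{0}'.format(prev_leader),
            'curr_consul_leader:{0}'.format(curr_leader),
            'consul_datacenter:{0}'.format(agent_dc),
        ],
    }
```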
"consul_datacenter:{0}".format(agent_dc)] | ||
}) | ||
# There was a leadership change | ||
if perform_new_leader_checks or (perform_self_leader_check and agent == leader): |
If we have `perform_self_leader_check`, should we also send an event when we relinquish leadership? Just trying to cover the scenario where perhaps the agent on the new leader, for whatever reason, fails to submit the new leadership event....
My assumption was that you'd want to run the new check on all server nodes so that if another node picked up leadership you'd see the event from there. If no leader was elected, you'd probably find out about that another way. It's easy enough for the former leader to emit an event when leadership changes; the downside would be that you'd get two events. Not the end of the world, so up to you.
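To make the trade-off concrete, a hedged sketch of the two-event variant discussed above, where the old leader also announces the change. All names are hypothetical; the PR as written emits only from the new leader.

```python
def should_emit_leader_event(agent, prev_leader, curr_leader):
    """Decide whether this agent should emit an event for a leadership change."""
    if prev_leader == curr_leader:
        return False  # no leadership change at all
    if agent == curr_leader:
        return True   # the PR's behavior: the new leader announces itself
    if agent == prev_leader:
        # The extra case under discussion: the former leader announces the
        # change too, covering the scenario where the new leader's agent
        # fails to submit the event (at the cost of a duplicate event)
        return True
    return False
```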
The CI failures are a known issue with starting the SQL server and are unrelated. 👍 Thanks a lot for this @ross
What does this PR do?
This PR adds a new option to the consul check to include calculated network latencies based on https://www.consul.io/docs/internals/coordinates.html. Both inter- and intra-datacenter metrics are included. The datacenter-to-datacenter metrics are under `consul.net.dc-latency`, include `source_datacenter`/`dest_datacenter` tags, and are host-less. The node-to-node metrics are under `consul.net.latency`, are emitted per-host, and include a `consul_datacenter` tag.

Motivation
I previously had access to this data in Graphite and found it interesting/useful at times. It's nice to be able to see network latencies between hosts, racks, availability zones, roles, etc. using the host-level tags.
Additional Notes
This PR is a follow-on to #2647, which should be merged prior to merging this one.