[envoy] new integration #1156
Changes from all commits
bb281ac
a6e3d7f
6c60a2e
6b37fc2
440f683
7df7799
a1aae8b
9c6bcd2
099b15f
c0ede9e
352d30c
5240eac
c89dac8
8685c92
a26832d
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,10 @@ | ||
# CHANGELOG - Envoy | ||
|
||
1.0.0 / Unreleased | ||
================== | ||
|
||
### Changes | ||
|
||
* [FEATURE] add Envoy integration. See #1156 | ||
|
||
<!--- The following link definition list is generated by PimpMyChangelog ---> |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,6 @@ | ||
include README.md | ||
include requirements.in | ||
include requirements.txt | ||
include requirements-dev.txt | ||
graft datadog_checks | ||
graft tests |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,141 @@ | ||
# Agent Check: Envoy | ||
## Overview | ||
|
||
This check collects distributed system observability metrics from [Envoy](https://www.envoyproxy.io). | ||
|
||
## Setup | ||
### Installation | ||
|
||
The Envoy check is packaged with the Agent, so simply [install the Agent](https://app.datadoghq.com/account/settings#agent) on your server. | ||
|
||
If you need the newest version of the Envoy check, install the `dd-check-envoy` package; this package's check overrides the one packaged with the Agent. See the [integrations-core repository README.md for more details](https://docs.datadoghq.com/agent/faq/install-core-extra/). | ||
|
||
### Configuration | ||
|
||
Create a file `envoy.yaml` in the Datadog Agent's `conf.d` directory. See the [sample envoy.yaml](https://github.com/DataDog/integrations-core/blob/master/envoy/conf.yaml.example) for all available configuration options. | ||
|
||
There are two ways to set up the `/stats` endpoint: | ||
|
||
#### Unsecured stats endpoint | ||
|
||
Be sure the Datadog Agent can access Envoy's [admin endpoint](https://www.envoyproxy.io/docs/envoy/latest/operations/admin). Here's an example Envoy admin configuration: | ||
|
||
```yaml | ||
admin: | ||
  access_log_path: "/dev/null" | ||
  address: | ||
    socket_address: | ||
      address: 0.0.0.0 | ||
      port_value: 8001 | ||
``` | ||
|
||
#### Secured stats endpoint | ||
|
||
Create a listener/vhost that routes to the admin endpoint (Envoy connecting to itself) but only has a route for `/stats`; all other routes get a static/error response. This setup also integrates cleanly with L3 filters, for example for auth. | ||
|
||
Here's an example config (from [this gist](https://gist.github.com/ofek/6051508cd0dfa98fc6c13153b647c6f8)): | ||
|
||
```json | ||
{ | ||
"listeners": [ | ||
{ | ||
"address": "tcp://0.0.0.0:80", | ||
"filters": [ | ||
{ | ||
"type": "read", | ||
"name": "http_connection_manager", | ||
"config": { | ||
"codec_type": "auto", | ||
"stat_prefix": "ingress_http", | ||
"route_config": { | ||
"virtual_hosts": [ | ||
{ | ||
"name": "backend", | ||
"domains": ["*"], | ||
"routes": [ | ||
{ | ||
"timeout_ms": 0, | ||
"prefix": "/stats", | ||
"cluster": "service_stats" | ||
} | ||
] | ||
} | ||
] | ||
}, | ||
"filters": [ | ||
{ | ||
"type": "decoder", | ||
"name": "router", | ||
"config": {} | ||
} | ||
] | ||
} | ||
} | ||
] | ||
} | ||
], | ||
"admin": { | ||
"access_log_path": "/dev/null", | ||
"address": "tcp://0.0.0.0:8001" | ||
}, | ||
"cluster_manager": { | ||
"clusters": [ | ||
{ | ||
"name": "service_stats", | ||
"connect_timeout_ms": 250, | ||
"type": "logical_dns", | ||
"lb_type": "round_robin", | ||
"hosts": [ | ||
{ | ||
"url": "tcp://127.0.0.1:8001" | ||
} | ||
] | ||
} | ||
] | ||
} | ||
} | ||
``` | ||
|
||
### Validation | ||
|
||
[Run the Agent's `status` subcommand](https://docs.datadoghq.com/agent/faq/agent-commands/#agent-status-and-information) and look for `envoy` under the Checks section: | ||
|
||
``` | ||
Checks | ||
====== | ||
[...] | ||
|
||
envoy | ||
----- | ||
- instance #0 [OK] | ||
- Collected 244 metrics, 0 events & 1 service check | ||
|
||
[...] | ||
``` | ||
|
||
## Compatibility | ||
|
||
The Envoy check is compatible with all platforms. | ||
|
||
## Data Collected | ||
### Metrics | ||
|
||
See [metadata.csv](https://github.com/DataDog/integrations-core/blob/master/envoy/metadata.csv) for a list of metrics provided by this check. | ||
See [metrics.py](https://github.com/DataDog/integrations-core/blob/master/envoy/datadog_checks/envoy/metrics.py) for a list of tags sent with each metric. | ||
|
||
### Events | ||
|
||
The Envoy check does not include any events at this time. | ||
|
||
### Service Checks | ||
|
||
`envoy.can_connect`: | ||
|
||
Returns `CRITICAL` if the Agent cannot connect to Envoy to collect metrics; otherwise returns `OK`. | ||
|
||
## Troubleshooting | ||
|
||
Need help? Contact [Datadog Support](http://docs.datadoghq.com/help/). | ||
|
||
## Further Reading | ||
Learn more about infrastructure monitoring and all our integrations on [our blog](https://www.datadoghq.com/blog/) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,38 @@ | ||
# This file is overwritten upon Agent upgrade. | ||
# To modify the check configuration, copy this file | ||
# to `envoy.yaml` and make your changes in that file. | ||
|
||
init_config: | ||
|
||
instances: | ||
# For every instance, you need a `stats_url` and can optionally | ||
# supply a list of tags. The admin endpoint must be accessible. | ||
# https://www.envoyproxy.io/docs/envoy/latest/operations/admin | ||
|
||
- stats_url: http://localhost:80/stats | ||
|
||
# tags: | ||
# - instance:foo | ||
In case you missed it, you can probably use fixed-string tags to distinguish Envoy instances: https://www.envoyproxy.io/docs/envoy/latest/api-v2/config/metrics/v2/stats.proto#config-metrics-v2-tagspecifier I don't know whether those tags work with the admin endpoint.
Thanks! We actually allow users to specify any tag they wish, which, as you said, is especially useful for distinguishing between different instances of an integration for metric aggregation. |
||
|
||
# <<<Note>>> The Envoy admin endpoint does not support auth until: | ||
# https://github.com/envoyproxy/envoy/issues/2763 | ||
# For an alternative, see: | ||
# https://gist.github.com/ofek/6051508cd0dfa98fc6c13153b647c6f8 | ||
# | ||
# If the stats page is behind basic auth: | ||
# username: USERNAME | ||
# password: PASSWORD | ||
|
||
# The (optional) verify_ssl parameter instructs the check to validate SSL | ||
# certificates when connecting to Envoy. It defaults to true; set it to false | ||
# to disable SSL certificate validation. | ||
# | ||
# verify_ssl: true | ||
|
||
# The (optional) skip_proxy parameter will bypass any proxy | ||
# settings enabled and attempt to reach Envoy directly. | ||
# | ||
# skip_proxy: false | ||
|
||
# If you need to specify a custom timeout in seconds (default is 20): | ||
# timeout: 20 |
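As a reading aid for the options in this sample file, here is a minimal sketch of how one `instances` entry could be turned into keyword arguments for `requests.get`. The helper name `build_request_kwargs` is hypothetical and not part of the PR; `skip_proxy` is left out because the check resolves proxies through `get_instance_proxy`.

```python
def build_request_kwargs(instance):
    """Translate one `instances` entry into keyword arguments for `requests.get`.

    Hypothetical helper for illustration only; the check in this PR does this inline.
    """
    if 'stats_url' not in instance:
        raise ValueError('Envoy configuration setting `stats_url` is required')

    username = instance.get('username')
    password = instance.get('password')

    return {
        'url': instance['stats_url'],
        # Basic auth is only used when both credentials are present.
        'auth': (username, password) if username and password else None,
        # Certificate validation stays on unless explicitly disabled.
        'verify': instance.get('verify_ssl', True),
        # Timeout in seconds; 20 mirrors the default in the sample config.
        'timeout': int(instance.get('timeout', 20)),
    }
```

With only `stats_url` set, the defaults apply: no auth, certificate validation on, and a 20 second timeout.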
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
# (C) Datadog, Inc. 2018 | ||
# All rights reserved | ||
# Licensed under a 3-clause BSD style license (see LICENSE) | ||
__path__ = __import__('pkgutil').extend_path(__path__, __name__) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
# (C) Datadog, Inc. 2018 | ||
# All rights reserved | ||
# Licensed under a 3-clause BSD style license (see LICENSE) | ||
|
||
__version__ = '1.0.0' |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,10 @@ | ||
# (C) Datadog, Inc. 2018 | ||
# All rights reserved | ||
# Licensed under a 3-clause BSD style license (see LICENSE) | ||
from .__about__ import __version__ | ||
from .envoy import Envoy | ||
|
||
__all__ = [ | ||
'__version__', | ||
'Envoy' | ||
] |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,80 @@ | ||
# (C) Datadog, Inc. 2018 | ||
# All rights reserved | ||
# Licensed under a 3-clause BSD style license (see LICENSE) | ||
import requests | ||
|
||
from datadog_checks.checks import AgentCheck | ||
|
||
from .errors import UnknownMetric | ||
from .parser import parse_metric | ||
|
||
|
||
class Envoy(AgentCheck): | ||
    SERVICE_CHECK_NAME = 'envoy.can_connect' | ||
|
||
    def __init__(self, name, init_config, agentConfig, instances=None): | ||
        super(Envoy, self).__init__(name, init_config, agentConfig, instances) | ||
        self.unknown_metrics = set() | ||
|
||
    def check(self, instance): | ||
        custom_tags = instance.get('tags', []) | ||
|
||
        try: | ||
            stats_url = instance['stats_url'] | ||
        except KeyError: | ||
            msg = 'Envoy configuration setting `stats_url` is required' | ||
            self.service_check(self.SERVICE_CHECK_NAME, AgentCheck.CRITICAL, message=msg, tags=custom_tags) | ||
            self.log.error(msg) | ||
            return | ||
|
||
        username = instance.get('username', None) | ||
        password = instance.get('password', None) | ||
        auth = (username, password) if username and password else None | ||
        verify_ssl = instance.get('verify_ssl', True) | ||
        proxies = self.get_instance_proxy(instance, stats_url) | ||
        timeout = int(instance.get('timeout', 20)) | ||
|
||
        try: | ||
            request = requests.get( | ||
                stats_url, auth=auth, verify=verify_ssl, proxies=proxies, timeout=timeout | ||
            ) | ||
        except requests.exceptions.Timeout: | ||
            msg = 'Envoy endpoint `{}` timed out after {} seconds'.format(stats_url, timeout) | ||
            self.service_check(self.SERVICE_CHECK_NAME, AgentCheck.CRITICAL, message=msg, tags=custom_tags) | ||
            self.log.exception(msg) | ||
            return | ||
        except (requests.exceptions.RequestException, requests.exceptions.ConnectionError): | ||
            msg = 'Error accessing Envoy endpoint `{}`'.format(stats_url) | ||
            self.service_check(self.SERVICE_CHECK_NAME, AgentCheck.CRITICAL, message=msg, tags=custom_tags) | ||
            self.log.exception(msg) | ||
            return | ||
|
||
        if request.status_code != 200: | ||
            msg = 'Envoy endpoint `{}` responded with HTTP status code {}'.format(stats_url, request.status_code) | ||
            self.service_check(self.SERVICE_CHECK_NAME, AgentCheck.CRITICAL, message=msg, tags=custom_tags) | ||
            self.log.warning(msg) | ||
            return | ||
|
||
        # Avoid repeated global lookups. | ||
        get_method = getattr | ||
|
||
        for line in request.content.decode().splitlines(): | ||
            try: | ||
                envoy_metric, value = line.split(': ') | ||
            except ValueError: | ||
                continue | ||
|
||
            value = int(value) | ||
|
||
            try: | ||
                metric, tags, method = parse_metric(envoy_metric) | ||
            except UnknownMetric: | ||
                if envoy_metric not in self.unknown_metrics: | ||
                    self.log.debug('Unknown metric `{}`'.format(envoy_metric)) | ||
                    self.unknown_metrics.add(envoy_metric) | ||
                continue | ||
|
||
            tags.extend(custom_tags) | ||
            get_method(self, method)(metric, value, tags=tags) | ||
|
||
        self.service_check(self.SERVICE_CHECK_NAME, AgentCheck.OK, tags=custom_tags) |
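One observation on the parsing loop in this file: `int(value)` sits outside any `try`, so a line whose value is not a plain integer would raise `ValueError` out of `check`. The sketch below shows a more defensive variant of the line splitting, assuming such lines should simply be skipped. The function name `iter_stats` and the sample metric names are hypothetical, not part of the PR.

```python
def iter_stats(payload):
    """Yield (name, value) pairs from `name: value` lines of stats text.

    Hypothetical helper; assumes non-integer values should be skipped.
    """
    for line in payload.splitlines():
        # Split on the first separator only, so the value part is kept whole.
        name, sep, raw_value = line.partition(': ')
        if not sep:
            # No `: ` separator at all, so this is not a stat line.
            continue
        try:
            value = int(raw_value)
        except ValueError:
            # The value is not a plain integer; skip it instead of failing the run.
            continue
        yield name, value


sample = (
    'cluster.foo.upstream_cx_total: 12\n'
    'malformed line\n'
    'server.uptime: 3\n'
    'some.stat: not-a-number\n'
)
parsed = list(iter_stats(sample))
```

Whether skipping is the right policy is a design choice; logging the skipped line at debug level, as the check already does for unknown metrics, would be a reasonable alternative.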
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,7 @@ | ||
# (C) Datadog, Inc. 2018 | ||
# All rights reserved | ||
# Licensed under a 3-clause BSD style license (see LICENSE) | ||
|
||
|
||
class UnknownMetric(Exception): | ||
nit: can we add the license header?
Done! |
||
    pass |
I thought we had a stats sink for emitting to DataDog, https://github.com/envoyproxy/data-plane-api/blob/master/envoy/config/metrics/v2/stats.proto#L157, this is a push model. How come we have pull here? I'd echo the same security concerns as in envoyproxy/data-plane-api#523. CC: @taiki45
edit: fixed tag to Taiki
Hi @htuch and @taiki45,
Thanks for sharing your concerns with us! We understand them, and this is what we are doing at the moment to address them:
- [...] `Envoy`'s `/stats` endpoint.
- [...] `envoy` integration supports SSL verification of the `Envoy` endpoint (and this requires `Envoy` to support SSL). If the SSL verification fails for any reason when enabled, the integration will not collect metrics.
- [...] `Envoy` to implement basic authentication).
- [...] `Envoy` implements it, our integration will use it transparently (via basic authentication).

At the moment, we do not see the benefit of supporting the `dogstatsd` connector, because that still does not prevent the admin endpoint on `Envoy` from being exposed to the entire trusted network. A customer could use that to be more secure in some network scenarios, but we would also like to provide our customers with a working agent integration for `Envoy` in its current form.

We also need to do some massaging of metrics in the integration for our backend, which the `dogstatsd` connector currently does not allow. Additionally, we are finding our parser a bit more resilient to stat name collisions. @jmarantz mentioned that he has a similar implementation for which he’ll issue a PR to `Envoy` soon.

We would be happy to enforce authentication by default, provided that `Envoy` implements it. We hope this addresses your concerns. Thanks again for sharing your thoughts with us!

Regards,
The Agent team at Datadog
Yeah, this all seems very reasonable. I'd add a couple of points for consideration:
@htuch That's an interesting idea. Would I simply change `"address": "tcp://0.0.0.0:8001"` to `"address": "tcp://127.0.0.1:8001"`?
@ofek Yes, that's right. And I assume that prevents all of the admin endpoints from being exposed.
I agree with moving to pull-based stats because it scales more easily than push-based stats, but I'd like to leave some comments to clarify a few points.
AFAIK, the Datadog Agent runs on the customer's host (the same host that runs the Envoy process). If that's correct, we might be able to send stats to the Datadog Agent without opening any Envoy admin endpoints. The scenario is:
I suppose you could make those modifications by writing a statsd relay component that receives stats, modifies them, and then re-pushes them somewhere (probably the Datadog Agent?).
IIUC, the current admin `/stats` endpoint is missing histogram metrics (envoyproxy/envoy#1947). This can be resolved with envoyproxy/envoy#2736, which will implement pull-based stats with a stats sink (I haven't fully looked at the PR yet).
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ofek ack.
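The loopback suggestion discussed in this thread (changing the admin address from `tcp://0.0.0.0:8001` to `tcp://127.0.0.1:8001`) can be sanity-checked with a few lines of Python. This is a minimal sketch using only the standard library; the function name `admin_is_loopback_only` is hypothetical, and it assumes the address is written as `tcp://<ip>:<port>` with a literal IP rather than a hostname.

```python
import ipaddress
from urllib.parse import urlsplit


def admin_is_loopback_only(address):
    """Return True if a `tcp://ip:port` admin address binds to a loopback IP.

    Hypothetical helper; raises ValueError if the host is missing or is not a literal IP.
    """
    host = urlsplit(address).hostname
    if host is None:
        raise ValueError('no host found in {!r}'.format(address))
    # 127.0.0.0/8 and ::1 count as loopback; 0.0.0.0 (all interfaces) does not.
    return ipaddress.ip_address(host).is_loopback
```

For the two addresses from the thread, `tcp://127.0.0.1:8001` is reported as loopback-only and `tcp://0.0.0.0:8001` is not.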