[envoy] new integration #1156

Merged
merged 15 commits into from
Mar 13, 2018
10 changes: 10 additions & 0 deletions envoy/CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,10 @@
# CHANGELOG - Envoy

1.0.0 / Unreleased
==================

### Changes

* [FEATURE] add Envoy integration. See #1156

<!--- The following link definition list is generated by PimpMyChangelog --->
6 changes: 6 additions & 0 deletions envoy/MANIFEST.in
@@ -0,0 +1,6 @@
include README.md
include requirements.in
include requirements.txt
include requirements-dev.txt
graft datadog_checks
graft tests
141 changes: 141 additions & 0 deletions envoy/README.md
@@ -0,0 +1,141 @@
# Agent Check: Envoy
## Overview

This check collects distributed system observability metrics from [Envoy](https://www.envoyproxy.io).

## Setup
### Installation

The Envoy check is packaged with the Agent, so simply [install the Agent](https://app.datadoghq.com/account/settings#agent) on your server.

If you need the newest version of the Envoy check, install the `dd-check-envoy` package; this package's check overrides the one packaged with the Agent. See the [integrations-core repository README.md for more details](https://docs.datadoghq.com/agent/faq/install-core-extra/).

### Configuration

Create a file `envoy.yaml` in the Datadog Agent's `conf.d` directory. See the [sample envoy.yaml](https://github.com/DataDog/integrations-core/blob/master/envoy/conf.yaml.example) for all available configuration options.

There are two ways to set up the `/stats` endpoint:

#### Unsecured stats endpoint

Be sure the Datadog Agent can access Envoy's [admin endpoint](https://www.envoyproxy.io/docs/envoy/latest/operations/admin). Here's an example Envoy admin configuration:

```yaml
admin:
  access_log_path: "/dev/null"
  address:
    socket_address:
      address: 0.0.0.0
      port_value: 8001
```

#### Secured stats endpoint

Create a listener/vhost that routes to the admin endpoint (Envoy connecting to itself) but only has a route for `/stats`; all other routes get a static/error response. This setup also integrates well with L3 filters, for example for auth.

Here's an example config (from [this gist](https://gist.github.com/ofek/6051508cd0dfa98fc6c13153b647c6f8)):

```json
{
  "listeners": [
    {
      "address": "tcp://0.0.0.0:80",
      "filters": [
        {
          "type": "read",
          "name": "http_connection_manager",
          "config": {
            "codec_type": "auto",
            "stat_prefix": "ingress_http",
            "route_config": {
              "virtual_hosts": [
                {
                  "name": "backend",
                  "domains": ["*"],
                  "routes": [
                    {
                      "timeout_ms": 0,
                      "prefix": "/stats",
                      "cluster": "service_stats"
                    }
                  ]
                }
              ]
            },
            "filters": [
              {
                "type": "decoder",
                "name": "router",
                "config": {}
              }
            ]
          }
        }
      ]
    }
  ],
  "admin": {
    "access_log_path": "/dev/null",
    "address": "tcp://0.0.0.0:8001"
  },
  "cluster_manager": {
    "clusters": [
      {
        "name": "service_stats",
        "connect_timeout_ms": 250,
        "type": "logical_dns",
        "lb_type": "round_robin",
        "hosts": [
          {
            "url": "tcp://127.0.0.1:8001"
          }
        ]
      }
    ]
  }
}
```

### Validation

[Run the Agent's `status` subcommand](https://docs.datadoghq.com/agent/faq/agent-commands/#agent-status-and-information) and look for `envoy` under the Checks section:

```
  Checks
  ======
    [...]

    envoy
    -----
      - instance #0 [OK]
      - Collected 244 metrics, 0 events & 1 service check

    [...]
```

## Compatibility

The Envoy check is compatible with all platforms.

## Data Collected
### Metrics

See [metadata.csv](https://github.com/DataDog/integrations-core/blob/master/envoy/metadata.csv) for a list of metrics provided by this check.
See [metrics.py](https://github.com/DataDog/integrations-core/blob/master/envoy/datadog_checks/envoy/metrics.py) for a list of tags sent with each metric.

### Events

The Envoy check does not include any events at this time.

### Service Checks

`envoy.can_connect`:

Returns CRITICAL if the Agent cannot connect to Envoy to collect metrics, otherwise OK.

## Troubleshooting

Need help? Contact [Datadog Support](http://docs.datadoghq.com/help/).

## Further Reading
Learn more about infrastructure monitoring and all our integrations on [our blog](https://www.datadoghq.com/blog/).
38 changes: 38 additions & 0 deletions envoy/conf.yaml.example
@@ -0,0 +1,38 @@
# This file is overwritten upon Agent upgrade.
# To make modifications to the check configuration, please copy this file
# to `envoy.yaml` and make your changes on that file.

init_config:

instances:
    # For every instance, you need a `stats_url` and can optionally
    # supply a list of tags. The admin endpoint must be accessible.
    # https://www.envoyproxy.io/docs/envoy/latest/operations/admin

  - stats_url: http://localhost:80/stats
@htuch htuch Mar 8, 2018

I thought we had a stats sink for emitting to DataDog, https://github.com/envoyproxy/data-plane-api/blob/master/envoy/config/metrics/v2/stats.proto#L157, this is a push model. How come we have pull here? I'd echo the same security concerns as in envoyproxy/data-plane-api#523. CC: @taiki45

edit: fixed tag to Taiki

Contributor Author

@ofek ofek Mar 12, 2018

Hi @htuch and @taiki45,

Thanks for sharing your concerns with us! We understand them, and this is what we are doing at the moment to address them:

  1. Until Admin endpoint security envoyproxy/envoy#2763 is resolved, we are offering our customers an alternative way to access Envoy's /stats endpoint.
  2. Our envoy integration supports SSL verification of the Envoy endpoint (and this requires Envoy to support SSL). If the SSL verification fails for any reason when enabled, the integration will not collect metrics.
  3. Our integration has the option to support basic authentication (but this requires Envoy to implement basic authentication).
  4. Our integration does not need to be aware of RBAC, but when Envoy implements it, our integration will use it transparently (via basic authentication).
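As a rough illustration of points 2 and 3, the sketch below shows how the per-instance options could be turned into keyword arguments for `requests.get`. The helper name `build_request_kwargs` is hypothetical; the option names (`username`, `password`, `verify_ssl`, `timeout`) and their defaults follow `conf.yaml.example` and `envoy.py` in this PR.

```python
def build_request_kwargs(instance):
    """Translate one check instance (a dict) into kwargs for requests.get().

    Hypothetical helper for illustration; the merged check builds these
    values inline inside `check`.
    """
    username = instance.get('username')
    password = instance.get('password')
    return {
        # Basic auth is only sent when both credentials are present.
        'auth': (username, password) if username and password else None,
        # Certificate verification stays on unless explicitly disabled.
        'verify': instance.get('verify_ssl', True),
        # Timeout in seconds, defaulting to 20 as in the sample config.
        'timeout': int(instance.get('timeout', 20)),
    }


if __name__ == '__main__':
    print(build_request_kwargs({'username': 'USERNAME', 'password': 'PASSWORD'}))
```

With `verify_ssl` left at its default, a failed certificate check makes the request raise, so no metrics are collected for that run.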

At the moment, we do not see the benefit of supporting the dogstatsd connector, because that still does not prevent the admin endpoint on Envoy from being exposed to the entire trusted network. A customer could use that to be more secure in some network scenarios, but we would also like to provide our customers with a working agent integration for Envoy in its current form.

We also need to do some massaging of metrics in the integration for our backend, which the dogstatsd connector currently does not allow. Additionally, we are finding our parser a bit more resilient to stat name collisions. @jmarantz mentioned that he has a similar implementation for which he’ll issue a PR to Envoy soon.

We would be happy to enforce authentication by default, provided that Envoy implements it. We hope this addresses your concerns. Thanks again for sharing your thoughts with us!

Regards,
The Agent team at Datadog

Yeah, this all seems very reasonable. I'd add a couple of points for consideration:

Contributor Author

@htuch That's an interesting idea. Would I simply change "address": "tcp://0.0.0.0:8001" to "address": "tcp://127.0.0.1:8001"?

@ofek Yes, that's right. And I assume that prevents the admin endpoint as a whole from being exposed.

I agree with moving to pull-based stats because it scales more easily than a push-based model, but I'd like to leave some comments to clarify a few points.

> At the moment, we do not see the benefit of supporting the dogstatsd connector, because that still does not prevent the admin endpoint on Envoy from being exposed to the entire trusted network. A customer could use that to be more secure in some network scenarios, but we would also like to provide our customers with a working agent integration for Envoy in its current form.

AFAIK, the Datadog Agent runs on the customer's host (the same host that runs the Envoy process). If that's correct, we might be able to send stats to the Datadog Agent without opening any Envoy admin endpoints. The scenario is:

  • Configure the Envoy admin endpoint to bind to the local loopback address (127.0.0.1 in IPv4).
  • Send Envoy stats to the Datadog Agent running on the same host over UDP with the dog_statsd sink.
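To make the second bullet concrete, here is a minimal sketch of what a DogStatsD datagram sent to a listener on the loopback interface looks like. Envoy's dog_statsd sink does this itself; the helper names `format_dogstatsd` and `send_metric` are hypothetical, and port 8125 is assumed to be where the Agent's DogStatsD server listens (its usual default).

```python
import socket


def format_dogstatsd(name, value, metric_type='g', tags=None):
    """Render one metric in the DogStatsD format: name:value|type|#tag1,tag2."""
    payload = '{}:{}|{}'.format(name, value, metric_type)
    if tags:
        payload += '|#' + ','.join(tags)
    return payload


def send_metric(payload, host='127.0.0.1', port=8125):
    """Send a single UDP datagram to a statsd listener on the given host."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(payload.encode('utf-8'), (host, port))
    finally:
        sock.close()


if __name__ == '__main__':
    # The metric name and tag are illustrative only.
    send_metric(format_dogstatsd('envoy.server.uptime', 42, 'g', ['instance:foo']))
```

Because UDP is fire-and-forget, nothing on the sending side depends on the admin endpoint being reachable from outside the host.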

> We also need to do some massaging of metrics in the integration for our backend, which the dogstatsd connector currently does not allow.

I suppose you could make those modifications by writing a statsd relay component that receives stats, modifies them, and then re-pushes them somewhere (probably to the Datadog Agent?).
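The transform step of such a relay could look roughly like the sketch below. It is only an assumption about what "massaging" might involve (prefixing the metric name and appending tags); the function name `rewrite_datagram` and the `envoy.` prefix are hypothetical.

```python
def rewrite_datagram(datagram, prefix='envoy.', extra_tags=()):
    """Prefix the metric name and append tags to one DogStatsD line.

    Input and output use the name:value|type|#tag1,tag2 format. Lines
    without a ':' separator are passed through unchanged.
    """
    name, sep, rest = datagram.partition(':')
    if not sep:
        return datagram
    kept, tags = [], []
    for field in rest.split('|'):
        if field.startswith('#'):
            # Collect existing tags so new ones can be merged in.
            tags.extend(t for t in field[1:].split(',') if t)
        else:
            kept.append(field)
    tags.extend(extra_tags)
    out = prefix + name + ':' + '|'.join(kept)
    if tags:
        out += '|#' + ','.join(tags)
    return out


if __name__ == '__main__':
    print(rewrite_datagram('cluster.upstream_rq:3|c|#envoy_cluster:foo',
                           extra_tags=['instance:foo']))
```

A full relay would wrap this in a UDP receive loop and forward each rewritten line to the Agent.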

IIUC, the current admin /stats endpoint is missing histogram metrics (envoyproxy/envoy#1947). This can be resolved with envoyproxy/envoy#2736, which will implement pull-based stats with a stats sink (I haven't fully looked at the PR yet).

@ofek ack.


    # tags:
    #   - instance:foo

In case you missed it, you can probably use fixed string tags to distinguish Envoy instances: https://www.envoyproxy.io/docs/envoy/latest/api-v2/config/metrics/v2/stats.proto#config-metrics-v2-tagspecifier. I don't know whether the tag works with admin /stats, but I'm using the fixed tag with the dog_statsd sink. Please note the fixed tag feature is only available with Envoy >= v1.6.0.

Contributor Author

@ofek ofek Mar 13, 2018

Thanks! We actually allow users to specify any tag they wish, which, as you said, is especially useful for distinguishing between different instances of an integration for metric aggregation.


    # <<<Note>>> The Envoy admin endpoint does not support auth until:
    # https://github.com/envoyproxy/envoy/issues/2763
    # For an alternative, see:
    # https://gist.github.com/ofek/6051508cd0dfa98fc6c13153b647c6f8
    #
    # If the stats page is behind basic auth:
    # username: USERNAME
    # password: PASSWORD

    # The (optional) verify_ssl parameter will instruct the check to validate SSL
    # certificates when connecting to Envoy. Defaulting to true, set to false if
    # you want to disable SSL certificate validation.
    #
    # verify_ssl: true

    # The (optional) skip_proxy parameter will bypass any proxy
    # settings enabled and attempt to reach Envoy directly.
    #
    # skip_proxy: false

    # If you need to specify a custom timeout in seconds (default is 20):
    # timeout: 20
4 changes: 4 additions & 0 deletions envoy/datadog_checks/__init__.py
@@ -0,0 +1,4 @@
# (C) Datadog, Inc. 2018
# All rights reserved
# Licensed under a 3-clause BSD style license (see LICENSE)
__path__ = __import__('pkgutil').extend_path(__path__, __name__)
5 changes: 5 additions & 0 deletions envoy/datadog_checks/envoy/__about__.py
@@ -0,0 +1,5 @@
# (C) Datadog, Inc. 2018
# All rights reserved
# Licensed under a 3-clause BSD style license (see LICENSE)

__version__ = '1.0.0'
10 changes: 10 additions & 0 deletions envoy/datadog_checks/envoy/__init__.py
@@ -0,0 +1,10 @@
# (C) Datadog, Inc. 2018
# All rights reserved
# Licensed under a 3-clause BSD style license (see LICENSE)
from .__about__ import __version__
from .envoy import Envoy

__all__ = [
    '__version__',
    'Envoy'
]
80 changes: 80 additions & 0 deletions envoy/datadog_checks/envoy/envoy.py
@@ -0,0 +1,80 @@
# (C) Datadog, Inc. 2018
# All rights reserved
# Licensed under a 3-clause BSD style license (see LICENSE)
import requests

from datadog_checks.checks import AgentCheck

from .errors import UnknownMetric
from .parser import parse_metric


class Envoy(AgentCheck):
    SERVICE_CHECK_NAME = 'envoy.can_connect'

    def __init__(self, name, init_config, agentConfig, instances=None):
        super(Envoy, self).__init__(name, init_config, agentConfig, instances)
        self.unknown_metrics = set()

    def check(self, instance):
        custom_tags = instance.get('tags', [])

        try:
            stats_url = instance['stats_url']
        except KeyError:
            msg = 'Envoy configuration setting `stats_url` is required'
            self.service_check(self.SERVICE_CHECK_NAME, AgentCheck.CRITICAL, message=msg, tags=custom_tags)
            self.log.error(msg)
            return

        username = instance.get('username', None)
        password = instance.get('password', None)
        auth = (username, password) if username and password else None
        verify_ssl = instance.get('verify_ssl', True)
        proxies = self.get_instance_proxy(instance, stats_url)
        timeout = int(instance.get('timeout', 20))

        try:
            request = requests.get(
                stats_url, auth=auth, verify=verify_ssl, proxies=proxies, timeout=timeout
            )
        except requests.exceptions.Timeout:
            msg = 'Envoy endpoint `{}` timed out after {} seconds'.format(stats_url, timeout)
            self.service_check(self.SERVICE_CHECK_NAME, AgentCheck.CRITICAL, message=msg, tags=custom_tags)
            self.log.exception(msg)
            return
        except (requests.exceptions.RequestException, requests.exceptions.ConnectionError):
            msg = 'Error accessing Envoy endpoint `{}`'.format(stats_url)
            self.service_check(self.SERVICE_CHECK_NAME, AgentCheck.CRITICAL, message=msg, tags=custom_tags)
            self.log.exception(msg)
            return

        if request.status_code != 200:
            msg = 'Envoy endpoint `{}` responded with HTTP status code {}'.format(stats_url, request.status_code)
            self.service_check(self.SERVICE_CHECK_NAME, AgentCheck.CRITICAL, message=msg, tags=custom_tags)
            self.log.warning(msg)
            return

        # Avoid repeated global lookups.
        get_method = getattr

        for line in request.content.decode().splitlines():
            try:
                envoy_metric, value = line.split(': ')
            except ValueError:
                continue

            value = int(value)

            try:
                metric, tags, method = parse_metric(envoy_metric)
            except UnknownMetric:
                if envoy_metric not in self.unknown_metrics:
                    self.log.debug('Unknown metric `{}`'.format(envoy_metric))
                    self.unknown_metrics.add(envoy_metric)
                continue

            tags.extend(custom_tags)
            get_method(self, method)(metric, value, tags=tags)

        self.service_check(self.SERVICE_CHECK_NAME, AgentCheck.OK, tags=custom_tags)
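The line-splitting step of the parsing loop in `check` can be tried in isolation with a sketch like the one below. The function name `iter_stats` and the sample text are illustrative assumptions. Unlike the merged code, this sketch also skips lines whose value is not an integer instead of letting `int()` raise.

```python
def iter_stats(text):
    """Yield (name, value) pairs from plain-text `name: value` stats output."""
    for line in text.splitlines():
        try:
            # Exactly one ': ' separator is expected per metric line.
            name, value = line.split(': ')
            value = int(value)
        except ValueError:
            # Blank lines, extra separators, and non-integer values are skipped.
            continue
        yield name, value


if __name__ == '__main__':
    sample = (
        'cluster.service_stats.upstream_cx_total: 12\n'
        'server.uptime: 3600\n'
        '\n'
        'not a metric line\n'
    )
    for name, value in iter_stats(sample):
        print(name, value)
```

In the check itself, each name would then go through `parse_metric` to obtain the Datadog metric name, tags, and submission method.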
7 changes: 7 additions & 0 deletions envoy/datadog_checks/envoy/errors.py
@@ -0,0 +1,7 @@
# (C) Datadog, Inc. 2018
# All rights reserved
# Licensed under a 3-clause BSD style license (see LICENSE)


class UnknownMetric(Exception):
Contributor

nit: can we add the license header?

Contributor Author

Done!

    pass