Support Openmetrics metrics collection #10752

Merged: 46 commits merged into master from cc/envoy-prom on Dec 8, 2021 (diff shown from 45 commits).

Commits
20a5700
Add logic for Envoy Openmetricsv2
ChristineTChen Nov 11, 2021
4403dfc
Add label remapper
ChristineTChen Nov 11, 2021
5be56bc
Add new metrics
ChristineTChen Nov 18, 2021
8a5ae52
Finish adding other metrics
ChristineTChen Nov 18, 2021
80aacd9
reorganize metrics that should be transformed
ChristineTChen Nov 23, 2021
44f7aa3
Introduce openmetrics_endpoint config option
ChristineTChen Nov 23, 2021
97ad876
Add watchdog metrics transformers
ChristineTChen Nov 30, 2021
53ca624
Add some more label extraction metrics
ChristineTChen Nov 30, 2021
35480d3
refactor tests to move to legacy
ChristineTChen Nov 30, 2021
81fc8c5
Add legacy and non legacy fixtures to test files
ChristineTChen Nov 30, 2021
3f0a250
Update readme
ChristineTChen Nov 30, 2021
42bcff2
Bump base req
ChristineTChen Nov 30, 2021
04b47dd
Fix style
ChristineTChen Nov 30, 2021
7ec43c7
Mark legacy metrics
ChristineTChen Nov 30, 2021
6d87299
Fix watchdog counter name
ChristineTChen Nov 30, 2021
907fdcd
document prometheus metrics in metadata csv
ChristineTChen Nov 30, 2021
79a622b
Fix metadata csv
ChristineTChen Nov 30, 2021
bf062bf
Add e2e test
ChristineTChen Nov 30, 2021
075a008
FIx style
ChristineTChen Nov 30, 2021
cec9639
Fix metadata format for validation
ChristineTChen Nov 30, 2021
bf714df
Flaky metrics
ChristineTChen Dec 1, 2021
fd330e3
Only support openmetrics in latest api v3
ChristineTChen Dec 1, 2021
f13db5a
Fix test imports
ChristineTChen Dec 1, 2021
443a473
Enable Openmetrics option by default
ChristineTChen Dec 1, 2021
2ebef10
Fix import
ChristineTChen Dec 1, 2021
10240ce
Fix style
ChristineTChen Dec 1, 2021
1be4872
Update readme
ChristineTChen Dec 1, 2021
6d8c195
Update config stats_url wording
ChristineTChen Dec 1, 2021
15142ca
Fix envoy import
ChristineTChen Dec 1, 2021
567ecb7
Remove py27 for openmetrics version
ChristineTChen Dec 1, 2021
a0548ee
Openmetrics endpoint should be optional
ChristineTChen Dec 1, 2021
3bc70c1
Account for flaky metrics
ChristineTChen Dec 1, 2021
8228c93
Document service checks
ChristineTChen Dec 1, 2021
4babc2c
Use unique name
ChristineTChen Dec 1, 2021
f14cdd9
Update envoy/tests/legacy/test_bench.py
ChristineTChen Dec 7, 2021
557807e
Move metrics map to metrics.py
ChristineTChen Dec 7, 2021
7d16e13
Merge branch 'cc/envoy-prom' of github.com:DataDog/integrations-core …
ChristineTChen Dec 7, 2021
bc91fd5
Update with feedback
ChristineTChen Dec 7, 2021
08acfa3
Use lambda
ChristineTChen Dec 7, 2021
a9b76d0
Merge branch 'master' into cc/envoy-prom
ChristineTChen Dec 7, 2021
561e6b7
simplify match
ChristineTChen Dec 7, 2021
adfca59
Merge branch 'cc/envoy-prom' of github.com:DataDog/integrations-core …
ChristineTChen Dec 8, 2021
9c88b2f
Refactor metadata utils
ChristineTChen Dec 8, 2021
e11f689
Support metadata collection in V2
ChristineTChen Dec 8, 2021
071d5d3
Use urlunparse
ChristineTChen Dec 8, 2021
5856804
Reintroduce legacy config options as hidden
ChristineTChen Dec 8, 2021
48 changes: 12 additions & 36 deletions envoy/README.md
@@ -12,7 +12,12 @@ The Envoy check is included in the [Datadog Agent][2] package, so you don't need

#### Istio

If you are using Envoy as part of [Istio][3], be sure to use the appropriate [Envoy admin endpoint][4] for the `stats_url`.
If you are using Envoy as part of [Istio][3], configure the Envoy integration to collect metrics from the Istio proxy metrics endpoint.

```yaml
instances:
- openmetrics_endpoint: localhost:15090/stats/prometheus
```

#### Standard

@@ -100,45 +105,16 @@ To configure this check for an Agent running on a host:
init_config:

instances:
## @param stats_url - string - required
## The admin endpoint to connect to. It must be accessible:
## https://www.envoyproxy.io/docs/envoy/latest/operations/admin
## Add a `?usedonly` on the end if you wish to ignore
## unused metrics instead of reporting them as `0`.
#
- stats_url: http://localhost:80/stats
## @param openmetrics_endpoint - string - required
## The URL exposing metrics in the OpenMetrics format.
#
- openmetrics_endpoint: http://localhost:8001/stats/prometheus

```

2. Check that the Datadog Agent can access Envoy's [admin endpoint][5] (a quick manual reachability check is sketched after these steps).
3. [Restart the Agent][9].
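
Assuming Python and the `requests` package are available on the Agent host, a quick manual check of the metrics endpoint could look like the snippet below (illustrative only; the Agent performs its own scrape):

```python
# Illustrative reachability check for the endpoint configured above.
# The URL matches the example instance config; adjust it to your deployment.
import requests

response = requests.get("http://localhost:8001/stats/prometheus", timeout=5)
response.raise_for_status()  # a non-2xx status means the Agent scrape will likely fail too
print("\n".join(response.text.splitlines()[:5]))  # show the first few exposed metric lines
```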

###### Metric filtering

Metrics can be filtered with the parameters `included_metrics` or `excluded_metrics` using regular expressions. If both parameters are used, `included_metrics` is applied first, then `excluded_metrics` is applied to the resulting set.

The filtering occurs before tag extraction, so tag values that are still embedded in the raw metric names can be used to decide whether a metric is kept or ignored. An exhaustive list of all metrics and tags can be found in [metrics.py][10]. Let's walk through an example of Envoy metric tagging!

```python
...
    'cluster.grpc.success': {
        'tags': (
            ('<CLUSTER_NAME>', ),
            ('<GRPC_SERVICE>', '<GRPC_METHOD>', ),
            (),
        ),
        ...
    },
...
```

Here there are `3` tag sequences: `('<CLUSTER_NAME>', )`, `('<GRPC_SERVICE>', '<GRPC_METHOD>', )`, and the empty `()`. The number of sequences corresponds exactly to the number of metric parts. For this metric there are `3` parts: `cluster`, `grpc`, and `success`. Envoy joins everything with a `.`, so the final metric name would be:

`cluster.<CLUSTER_NAME>.grpc.<GRPC_SERVICE>.<GRPC_METHOD>.success`

If you care only about the cluster name and grpc service, you would add this to your `included_metrics`:

`^cluster\.<CLUSTER_NAME>\.grpc\.<GRPC_SERVICE>\.`
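
To make the order of operations concrete, here is a minimal, hypothetical sketch (not the integration's actual implementation) of how an include-then-exclude regex filter behaves:

```python
import re


def filter_metric_names(names, included_patterns, excluded_patterns):
    """Illustrative only: keep names matching any included pattern, then drop excluded ones."""
    if included_patterns:
        names = [n for n in names if any(re.search(p, n) for p in included_patterns)]
    if excluded_patterns:
        names = [n for n in names if not any(re.search(p, n) for p in excluded_patterns)]
    return names


# Example: keep cluster gRPC metrics, then drop failure counters.
names = [
    'cluster.my_cluster.grpc.my_service.my_method.success',
    'cluster.my_cluster.grpc.my_service.my_method.failure',
    'http.admin.downstream_cx_active',
]
print(filter_metric_names(names, [r'^cluster\..*\.grpc\.'], [r'\.failure$']))
# ['cluster.my_cluster.grpc.my_service.my_method.success']
```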

##### Log collection

<!-- partial
@@ -180,7 +156,7 @@ For containerized environments, see the [Autodiscovery Integration Templates][11
| -------------------- | ------------------------------------------- |
| `<INTEGRATION_NAME>` | `envoy` |
| `<INIT_CONFIG>` | blank or `{}` |
| `<INSTANCE_CONFIG>` | `{"stats_url": "http://%%host%%:80/stats"}` |
| `<INSTANCE_CONFIG>` | `{"openmetrics_endpoint": "http://%%host%%:80/stats/prometheus"}` |

##### Log collection

Empty file added envoy/__init__.py
Empty file.
99 changes: 12 additions & 87 deletions envoy/assets/configuration/spec.yaml
@@ -8,101 +8,26 @@ files:
- template: init_config/default
- template: instances
options:
- template: instances/openmetrics
overrides:
openmetrics_endpoint.value.example: http://localhost:80/stats/prometheus
openmetrics_endpoint.display_priority: 1
openmetrics_endpoint.required: false
openmetrics_endpoint.enabled: true
- name: stats_url
required: true
display_priority: 3
display_priority: 1
description: |
The admin endpoint to connect to. It must be accessible:
The check will collect and parse metrics from the admin /stats/ endpoint.
It must be accessible:
https://www.envoyproxy.io/docs/envoy/latest/operations/admin
Add a `?usedonly` on the end if you wish to ignore
unused metrics instead of reporting them as `0`.

Note: see the configuration options specific to this option here,
https://github.com/DataDog/integrations-core/blob/7.33.x/envoy/datadog_checks/envoy/data/conf.yaml.example
value:
example: http://localhost:80/stats
type: string
- name: included_metrics
description: |
Includes metrics using regular expressions.
The filtering occurs before tag extraction, so tag values that are still
embedded in the raw metric names can be used to decide whether a metric is kept or ignored.
For an exhaustive list of all metrics and tags, see:
https://github.com/DataDog/integrations-core/blob/master/envoy/datadog_checks/envoy/metrics.py

If you surround patterns by quotes, be sure to escape backslashes with an extra backslash.

The example list below will include:
- cluster.in.0000.lb_subsets_active
- cluster.out.alerting-event-evaluator-test.datadog.svc.cluster.local
value:
type: array
items:
type: string
example:
- cluster\.(in|out)\..*
- name: excluded_metrics
description: |
Excludes metrics using regular expressions.
The filtering occurs before tag extraction, so tag values that are still
embedded in the raw metric names can be used to decide whether a metric is kept or ignored.
For an exhaustive list of all metrics and tags, see:
https://github.com/DataDog/integrations-core/blob/master/envoy/datadog_checks/envoy/metrics.py

If you surround patterns by quotes, be sure to escape backslashes with an extra backslash.

The example list below will exclude:
- http.admin.downstream_cx_active
- http.http.rds.0000.control_plane.rate_limit_enforced
value:
type: array
items:
type: string
example:
- ^http\..*
- name: cache_metrics
description: |
Results are cached by default to decrease CPU utilization, at
the expense of some memory. Disable by setting this to false.
value:
type: boolean
example: true
- name: parse_unknown_metrics
description: |
Attempt to parse unknown metrics that would otherwise be skipped.
value:
type: boolean
example: false
- name: collect_server_info
description: |
Collect Envoy version by accessing the `/server_info` endpoint.
Disable this if this endpoint is not reachable by the agent.
value:
type: boolean
example: true
- name: disable_legacy_cluster_tag
description: |
Enable to stop submitting the tags `cluster_name` and `virtual_cluster_name`,
which have been renamed to `envoy_cluster` and `virtual_envoy_cluster`.
enabled: true
value:
type: boolean
display_default: false
example: true
- template: instances/default
- template: instances/http
overrides:
username.description: |
The username to use if services are behind basic auth.
Note: The Envoy admin endpoint does not support authentication until this is resolved:
https://github.com/envoyproxy/envoy/issues/2763
For an alternative, see:
https://gist.github.com/ofek/6051508cd0dfa98fc6c13153b647c6f8
username.display_priority: 2
password.description: |
The password to use if services are behind basic or NTLM auth.
Note: The Envoy admin endpoint does not support authentication until this is resolved:
https://github.com/envoyproxy/envoy/issues/2763
For an alternative, see:
https://gist.github.com/ofek/6051508cd0dfa98fc6c13153b647c6f8
password.display_priority: 1
- template: logs
example:
- type: file
14 changes: 14 additions & 0 deletions envoy/assets/service_checks.json
@@ -12,5 +12,19 @@
],
"name": "Can Connect",
"description": "Returns `CRITICAL` if the agent can't connect to Envoy to collect metrics, otherwise `OK`."
},
{
"agent_version": "7.34.0",
"integration": "Envoy",
"check": "envoy.openmetrics.health",
"statuses": [
"ok",
"critical"
],
"groups": [
"endpoint"
],
"name": "Openmetrics Can Connect",
"description": "Returns `CRITICAL` if the agent can't connect to Envoy to collect metrics, otherwise `OK`."
}
]
156 changes: 156 additions & 0 deletions envoy/datadog_checks/envoy/check.py
@@ -0,0 +1,156 @@
# (C) Datadog, Inc. 2021-present
# All rights reserved
# Licensed under a 3-clause BSD style license (see LICENSE)
import re
from collections import defaultdict

from six.moves.urllib.parse import urljoin, urlparse, urlunparse

from datadog_checks.base import AgentCheck, OpenMetricsBaseCheckV2

from .metrics import PROMETHEUS_METRICS_MAP
from .utils import _get_server_info

ENVOY_VERSION = {'istio_build': {'type': 'metadata', 'label': 'tag', 'name': 'version'}}

LABEL_MAP = {
'cluster_name': 'envoy_cluster',
'envoy_cluster_name': 'envoy_cluster',
'envoy_http_conn_manager_prefix': 'stat_prefix', # tracing
'envoy_listener_address': 'address', # listener
'envoy_virtual_cluster': 'virtual_envoy_cluster', # vhost
'envoy_virtual_host': 'virtual_host_name', # vhost
}


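# Metrics whose names embed a label value (for example a thread name, circuit breaker
# priority, or listener handler). Each pattern's first capture group is re-submitted as
# the configured `label_name` tag on the renamed metric; see
# configure_transformer_label_in_name below.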
METRIC_WITH_LABEL_NAME = {
r'^envoy_server_(.+\_.+)_watchdog_miss$': {
'label_name': 'thread_name',
'metric_type': 'monotonic_count',
'new_name': 'server.watchdog_miss.count',
},
r'^envoy_server_(.+\_.+)_watchdog_mega_miss$': {
'label_name': 'thread_name',
'metric_type': 'monotonic_count',
'new_name': 'server.watchdog_mega_miss.count',
},
r'^envoy_(.+\_.+)_watchdog_miss$': {
'label_name': 'thread_name',
'metric_type': 'monotonic_count',
'new_name': 'watchdog_miss.count',
},
r'^envoy_(.+\_.+)_watchdog_mega_miss$': {
'label_name': 'thread_name',
'metric_type': 'monotonic_count',
'new_name': 'watchdog_mega_miss.count',
},
r'^envoy_cluster_circuit_breakers_(\w+)_cx_open$': {
'label_name': 'priority',
'metric_type': 'gauge',
'new_name': 'cluster.circuit_breakers.cx_open',
},
r'^envoy_cluster_circuit_breakers_(\w+)_cx_pool_open$': {
'label_name': 'priority',
'metric_type': 'gauge',
'new_name': 'cluster.circuit_breakers.cx_pool_open',
},
r'^envoy_cluster_circuit_breakers_(\w+)_rq_open$': {
'label_name': 'priority',
'metric_type': 'gauge',
'new_name': 'cluster.circuit_breakers.rq_open',
},
r'^envoy_cluster_circuit_breakers_(\w+)_rq_pending_open$': {
'label_name': 'priority',
'metric_type': 'gauge',
'new_name': 'cluster.circuit_breakers.rq_pending_open',
},
r'^envoy_cluster_circuit_breakers_(\w+)_rq_retry_open$': {
'label_name': 'priority',
'metric_type': 'gauge',
'new_name': 'cluster.circuit_breakers.rq_retry_open',
},
r'^envoy_listener_admin_(.+\_.+)_downstream_cx_active$': {
'label_name': 'handler',
'metric_type': 'gauge',
'new_name': 'listener.admin.downstream_cx_active',
},
r'^envoy_listener_(.+\_.+)_downstream_cx_active$': {
'label_name': 'handler',
'metric_type': 'gauge',
'new_name': 'listener.downstream_cx_active',
},
r'^envoy_listener_admin_(.+\_.+)_downstream_cx$': {
'label_name': 'handler',
'metric_type': 'monotonic_count',
'new_name': 'listener.admin.downstream_cx.count',
},
r'^envoy_listener_(.+)_downstream_cx$': {
'label_name': 'handler',
'metric_type': 'monotonic_count',
'new_name': 'listener.downstream_cx.count',
},
}


class EnvoyCheckV2(OpenMetricsBaseCheckV2):
    __NAMESPACE__ = 'envoy'

    DEFAULT_METRIC_LIMIT = 0

    def __init__(self, name, init_config, instances):
        super().__init__(name, init_config, instances)
        self.check_initializations.append(self.configure_additional_transformers)
        openmetrics_endpoint = self.instance.get('openmetrics_endpoint')
        self.base_url = None
        try:
            parts = urlparse(openmetrics_endpoint)
            self.base_url = urlunparse(parts[:2] + ('', '', None, None))

        except Exception as e:
            self.log.debug("Unable to determine the base url for version collection: %s", str(e))

    def check(self, _):
        self._collect_metadata()
        super(EnvoyCheckV2, self).check(None)

    def get_default_config(self):
        return {
            'metrics': [PROMETHEUS_METRICS_MAP],
            'rename_labels': LABEL_MAP,
        }

    def configure_transformer_label_in_name(self, metric_pattern, new_name, label_name, metric_type):
        method = getattr(self, metric_type)
        cached_patterns = defaultdict(lambda: re.compile(metric_pattern))

        def transform(metric, sample_data, runtime_data):
            for sample, tags, hostname in sample_data:
                parsed_sample_name = sample.name
                if sample.name.endswith("_total"):
                    parsed_sample_name = re.match("(.*)_total$", sample.name).groups()[0]
                label_value = cached_patterns[metric_pattern].match(parsed_sample_name).groups()[0]

                tags.append('{}:{}'.format(label_name, label_value))
                method(new_name, sample.value, tags=tags, hostname=hostname)

        return transform

    def configure_additional_transformers(self):
        for metric, data in METRIC_WITH_LABEL_NAME.items():
            self.scrapers[self.instance['openmetrics_endpoint']].metric_transformer.add_custom_transformer(
                metric, self.configure_transformer_label_in_name(metric, **data), pattern=True
            )

    @AgentCheck.metadata_entrypoint
    def _collect_metadata(self):
        # Replace in favor of built-in Openmetrics metadata when PR is available
        # https://github.com/envoyproxy/envoy/pull/18991
        if not self.base_url:
            self.log.debug("Skipping server info collection due to malformed url: %s", self.base_url)
            return
        # From http://domain/thing/stats to http://domain/thing/server_info
        server_info_url = urljoin(self.base_url, 'server_info')
        raw_version = _get_server_info(server_info_url, self.log, self.http)

        if raw_version:
            self.set_metadata('version', raw_version)