
Limit Prometheus/OpenMetrics checks to 2000 metrics per run by default #2093

Merged
merged 14 commits into from
Aug 31, 2018

Conversation


@xvello xvello commented Aug 22, 2018

What does this PR do?

Implement an optional hard limit on the number of metric contexts an integration instance can send per run. The design is based on jmxfetch's implementation, for consistency:

  • every new context increments a counter
  • when the counter reaches the limit, an integration warning is sent
  • subsequent metric submissions return immediately
  • the counter is reset after every run

We need to track the number of distinct contexts, not metric names, which is why we compute a context uid. To reduce the performance cost, we introduce a fast path for gauges, rates and monotonic counts: since it makes no sense to call these methods several times per context, we increment the context count on every call instead of looking up whether the context is new. For other metric types, we keep the seen contexts in a hashmap that is cleared at the end of the check run.
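As a sketch, the counting logic described above could look like the following. This is a hedged illustration: the class name, constructor signature and method names are assumptions, not the actual datadog_checks implementation.

```python
class Limiter(object):
    """Illustrative per-run context limiter with the fast path described above."""

    def __init__(self, check_name, object_name, object_limit, warning_func=None):
        self.check = check_name
        self.name = object_name
        self.limit = object_limit
        self.warning = warning_func
        self.reached_limit = False
        self.count = 0
        self.seen = set()

    def is_reached(self, uid=None):
        """Return True when the submission should be dropped.

        Fast path: pass uid=None for types submitted once per context
        (gauge, rate, monotonic count) -- every call counts as a new context.
        Slow path: pass a context uid, and only unseen uids increment the counter.
        """
        if self.reached_limit:
            return True
        if uid is None:
            self.count += 1          # fast path: each call is a new context
        elif uid not in self.seen:
            self.seen.add(uid)       # slow path: only new contexts count
            self.count += 1
        if self.count > self.limit:
            self.reached_limit = True
            if self.warning:
                self.warning("Exceeded limit of %s %s, ignoring next ones" % (self.limit, self.name))
            return True
        return False

    def reset(self):
        """Called at the end of every check run."""
        self.count = 0
        self.seen.clear()
        self.reached_limit = False
```

The warning text mirrors the agent status output shown later in this PR.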

The initial implementation of the context hash is a simple string format call; we should benchmark its impact before investigating optimisations. For the initial target of the prometheus checks, the generic metric mapper only uses fast-path metric types, and our integrations seem to do so for >95% of submissions (to be double-checked).
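For illustration, a string-format context uid along those lines could look like this. The function name and field order are hypothetical; the only requirement is that the same metric name, tag set and hostname always map to the same uid.

```python
def context_uid(mtype, name, tags, hostname):
    """Hypothetical context uid built with a plain string format call."""
    # Sort tags so that a different tag ordering does not create a distinct context
    return "{}-{}-{}-{}".format(mtype, name, ",".join(sorted(tags or [])), hostname or "")
```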

The logic is currently implemented for metrics, but AgentCheck could later instantiate additional Limiter objects for events and service checks.

Configurability

Checks can set their default limit via the self.DEFAULT_METRIC_LIMIT class attribute, which allows subchecks to set a different limit. For 6.5, let's go with disabling the limit for all our subchecks and focus on the limit for custom checks.

Users can override the limit via the max_returned_metrics instance field.
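Put together, the limit resolution could be sketched as below. The helper name is hypothetical; the DEFAULT_METRIC_LIMIT attribute and max_returned_metrics option are from this PR, including the rule (discussed in the review) that a class-set limit cannot be removed by setting 0.

```python
def resolve_metric_limit(instance, default_limit):
    """Resolve the effective limit; default_limit is the class's DEFAULT_METRIC_LIMIT (0 = unlimited)."""
    try:
        limit = int(instance.get("max_returned_metrics", default_limit))
    except Exception:
        limit = default_limit
    # A user cannot disable a limit that the check class enforces
    if limit <= 0 < default_limit:
        return default_limit
    return limit
```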

Limitations

This implementation does not keep track of passed/denied contexts between runs; it assumes metrics are submitted in a consistent order across check runs. This is the case for "classic" Python integrations, and for Prometheus metrics too, as:

  • we process metrics in the order they are exposed in the file.
  • the prom spec indicates metrics should be listed in alphabetical order

Being resilient to random submission order while always sending the same timeseries would require us to:

  • populate the set for all contexts (disabling the gauge fast path)
  • keep this set across check runs (increased baseline memory usage)
  • run logic to expire contexts that have not been submitted in x runs, to allow for churn
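Such an expiry pass, were it ever implemented, might look like the sketch below. This is entirely hypothetical design exploration, not part of this PR.

```python
def expire_contexts(last_seen_run, current_run, max_idle_runs=3):
    """Drop contexts not submitted within the last max_idle_runs runs.

    last_seen_run maps context uid -> run number of the last submission.
    """
    for uid, run in list(last_seen_run.items()):
        if current_run - run > max_idle_runs:
            del last_seen_run[uid]
```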

Motivation

Generic Prometheus checks can send a lot of contexts, because metrics are tagged by the emitter. As these are sent as custom metrics, we want a safeguard to avoid billing issues, as we do with JMX.

As this could be used by other checks (goexpvar?), I'm implementing it in the AgentCheck class instead of in the Prometheus logic.

Review checklist

  • PR has a meaningful title or PR has the no-changelog label attached
  • Feature or bugfix has tests
  • Git history is clean
  • If PR impacts documentation, docs team has been notified or an issue has been opened on the documentation repo

Output

With the following kubelet.yaml:

init_config:

instances:
  - max_returned_metrics: 60

The check stops at 60 metric samples and emits a warning:

    kubelet (1.4.0)
    ---------------
      Total Runs: 2
      Metric Samples: 60, Total: 120
      Events: 0, Total: 0
      Service Checks: 1, Total: 2
      Average Execution Time : 464ms

      Warning: Exceeded limit of 60 metrics, ignoring next ones

With the following kubelet.yaml:

init_config:

instances:
  - max_returned_metrics: 6000

The check completes with no error:

    kubelet (1.4.0)
    ---------------
      Total Runs: 2
      Metric Samples: 95, Total: 190
      Events: 0, Total: 0
      Service Checks: 1, Total: 2
      Average Execution Time : 510ms

With the following kubelet.yaml:

init_config:

instances:
  - max_returned_metrics: 0

We get a warning during the check load:

2018-08-27 13:33:07 UTC | WARN | (datadog_agent.go:135 in LogMessage) | (base.py:339) | Setting max_returned_metrics to zero is not allowed, reverting to the default of 6000 metrics

Additional Notes

Anything else we should know when reviewing?

@xvello xvello requested a review from a team as a code owner August 22, 2018 11:30
@xvello xvello force-pushed the xvello/metric-limit branch from 475cf65 to e4fbbc5 Compare August 22, 2018 11:52
@ofek ofek self-assigned this Aug 22, 2018
@xvello xvello changed the title WIP: Allow checks to limit the number of metric contexts they submit Allow checks to limit the number of metric contexts they submit Aug 27, 2018
@xvello xvello force-pushed the xvello/metric-limit branch from bd91004 to 383389b Compare August 27, 2018 11:45
@hkaj hkaj left a comment


Great work! LGTM. I think we should discuss the logic around removing the limit before merging, but that would be a minor change anyway. I'm also in favor of adding a short README, or some top-level comment in limiter.py, that explains briefly why we added this, how it works (cut-off, no sampling, configurable, special case of some metrics, etc.) and links to this PR for reference. Does that make sense to you?

    try:
        metric_limit = self.instances[0].get("max_returned_metrics", self.DEFAULT_METRIC_LIMIT)
    except Exception:
        metric_limit = self.DEFAULT_METRIC_LIMIT
    if metric_limit > 0:
Member:
That means setting the limit to 0 will remove the limit, right? Let's discuss it at today's standup, I want to make sure this is okay, product wise.

xvello (author):
I wanted to handle the 0 case because AgentCheck does not have a limit, but it's not that hard to enforce a limit when the class sets one.
Updated the code to forbid removing the limit if it is set by the class via DEFAULT_METRIC_LIMIT.

@@ -0,0 +1,41 @@
# (C) Datadog, Inc. 2010-2016
Member:

nostalgia? 😄

assert len(check.get_warnings()) == 1
assert len(aggregator.metrics("metric")) == 10

def test_metric_limit_count(self):
Member:
That's a great test 💯 It covers the internal logic of the limiter very well, and provides an example that helps in understanding how the Limiter works.

xvello commented Aug 27, 2018

@hkaj added pydoc to the Limiter and AgentCheck classes

hkaj
hkaj previously approved these changes Aug 27, 2018
@xvello xvello changed the title Allow checks to limit the number of metric contexts they submit Limit Prometheus/OpenMetrics checks to 350 metrics per run Aug 27, 2018
@xvello xvello changed the title Limit Prometheus/OpenMetrics checks to 350 metrics per run Limit Prometheus/OpenMetrics checks to 350 metrics per run by default Aug 27, 2018
@xvello xvello requested a review from a team as a code owner August 27, 2018 14:10
@@ -1,7 +1,10 @@
# (C) Datadog, Inc. 2018
# All rights reserved
# Licensed under a 3-clause BSD style license (see LICENSE)
import unittest
Contributor:
Let's use pytest


class TestLimits(unittest.TestCase):
def tearDown(self):
aggregator.reset()
Contributor:
@pytest.fixture
def aggregator():
    from datadog_checks.stubs import aggregator
    aggregator.reset()
    return aggregator

@xvello xvello force-pushed the xvello/metric-limit branch 4 times, most recently from cb07060 to 82ccdd2 Compare August 31, 2018 08:02
@xvello xvello force-pushed the xvello/metric-limit branch from 82ccdd2 to fd0a0d4 Compare August 31, 2018 08:02
@xvello xvello changed the title Limit Prometheus/OpenMetrics checks to 350 metrics per run by default Limit Prometheus/OpenMetrics checks to 2000 metrics per run by default Aug 31, 2018
@codecov-io

codecov-io commented Aug 31, 2018

Codecov Report

Merging #2093 into master will decrease coverage by 9.74%.
The diff coverage is 96.42%.


hkaj
hkaj previously approved these changes Aug 31, 2018
LGTM minus a wording nitpick. We will also need an entry in 6.5's changelog; can you take care of it?

Due to the nature of this integration, it is possible to submit an extremely high number of metrics
directly to Datadog. To avoid billing issues on configuration errors or input changes, the check
limits itself to 2000 metric contexts (different metric name or different tags). You can increase
this limit, if needed, by setting the `max_returned_metrics` option.
Member:
What do you think of:

  • it is possible to submit an extremely high number of metrics directly to Datadog --> it is possible to submit a high number of custom metrics to Datadog
  • To avoid billing issues --> to prevent overage charges

hkaj
hkaj previously approved these changes Aug 31, 2018
@xvello xvello force-pushed the xvello/metric-limit branch from 814f125 to 914719d Compare August 31, 2018 15:18
@xvello xvello merged commit 2027f26 into master Aug 31, 2018
@xvello xvello deleted the xvello/metric-limit branch August 31, 2018 16:21
nmuesch pushed a commit that referenced this pull request Nov 1, 2018
(#2093)

* allow checks to limit the number of metric contexts they submit
* set limit for prom checks to 2000
* set limits on all prom children checks to 0 (unlimited)
* make the metric limit configurable
* do not allow to disable limit if class has set one
@spiliopoulos

@hkaj @xvello is there any reasonable way to alert on these warnings or do we need to manually go through the agent status pages to check for integration issues?

@olivielpeau
Member

@spiliopoulos You can see these warnings in-app on the affected hosts in the Infrastructure List and Host Map, but we don't have a recommended way to alert on these warnings at the moment.

We're currently evaluating options to allow alerting on these warnings, and will update this issue once we have a recommendation.
