
Limit Prometheus/OpenMetrics checks to 2000 metrics per run by default #2093

Merged
merged 14 commits into from
Aug 31, 2018

Conversation


@xvello xvello commented Aug 22, 2018

What does this PR do?

Implement an optional hard limit on the number of metric contexts an integration instance can send per run. The design is based on jmxfetch's implementation, for consistency:

  • every new context increments a counter
  • when the counter reaches the limit, an integration warning is sent
  • subsequent metric submissions return immediately
  • the counter is reset after every run

We need to track the number of distinct contexts, not metric names, which is why we compute a context uid. To reduce the performance cost, we introduce a fast path for gauges, rates and monotonic counts: since it makes no sense to call these methods several times per context, we increment the context count on every call instead of looking up whether the context is new. For other metric types, we keep the seen contexts in a hashmap that is cleared at the end of the check run.
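As a sketch, the counting logic described above could look like the following. This is a hedged illustration: the class name, constructor signature and method names are assumptions, not the actual datadog_checks implementation.

```python
class Limiter(object):
    """Illustrative per-run context limiter with the fast path described above."""

    def __init__(self, check_name, object_name, object_limit, warning_func=None):
        self.check = check_name
        self.name = object_name
        self.limit = object_limit
        self.warning = warning_func
        self.reached_limit = False
        self.count = 0
        self.seen = set()

    def is_reached(self, uid=None):
        """Return True when the submission should be dropped.

        Fast path: pass uid=None for types submitted once per context
        (gauge, rate, monotonic count) -- every call counts as a new context.
        Slow path: pass a context uid, and only unseen uids increment the counter.
        """
        if self.reached_limit:
            return True
        if uid is None:
            self.count += 1          # fast path: each call is a new context
        elif uid not in self.seen:
            self.seen.add(uid)       # slow path: only new contexts count
            self.count += 1
        if self.count > self.limit:
            self.reached_limit = True
            if self.warning:
                self.warning("Exceeded limit of %s %s, ignoring next ones" % (self.limit, self.name))
            return True
        return False

    def reset(self):
        """Called at the end of every check run."""
        self.count = 0
        self.seen.clear()
        self.reached_limit = False
```

The warning text mirrors the agent status output shown later in this PR.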

The initial implementation of the context hash is a simple string format call; we should benchmark its impact before investigating optimisations. For the initial target of the prometheus checks, the generic metric mapper only uses fast-path metric types, and our integrations seem to do so for >95% of submissions (to be double-checked).
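For illustration, a string-format context uid along those lines could look like this. The function name and field order are hypothetical; the only requirement is that the same metric name, tag set and hostname always map to the same uid.

```python
def context_uid(mtype, name, tags, hostname):
    """Hypothetical context uid built with a plain string format call."""
    # Sort tags so that a different tag ordering does not create a distinct context
    return "{}-{}-{}-{}".format(mtype, name, ",".join(sorted(tags or [])), hostname or "")
```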

The logic is currently implemented for metrics, but AgentCheck could later instantiate additional Limiter objects for events and service checks.

Configurability

Checks can set their default limit via the self.DEFAULT_METRIC_LIMIT class attribute, which allows subchecks to set a different limit. For 6.5, let's go with disabling the limit for all our subchecks and focus on the limit for custom checks.

Users can override the limit via the max_returned_metrics instance field.
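Put together, the limit resolution could be sketched as below. The helper name is hypothetical; the DEFAULT_METRIC_LIMIT attribute and max_returned_metrics option are from this PR, including the rule (discussed in the review) that a class-set limit cannot be removed by setting 0.

```python
def resolve_metric_limit(instance, default_limit):
    """Resolve the effective limit; default_limit is the class's DEFAULT_METRIC_LIMIT (0 = unlimited)."""
    try:
        limit = int(instance.get("max_returned_metrics", default_limit))
    except Exception:
        limit = default_limit
    # A user cannot disable a limit that the check class enforces
    if limit <= 0 < default_limit:
        return default_limit
    return limit
```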

Limitations

This implementation does not keep track of passed/denied contexts between runs; it assumes metrics are submitted in a consistent order across check runs. This is the case for "classic" Python integrations, and for Prometheus metrics too, as:

  • we process metrics in the order they are exposed in the file.
  • the prom spec indicates metrics should be listed in alphabetical order

Being resilient to random submission order while always sending the same timeseries would require us to:

  • populate the set for all contexts (disabling the gauge fast path)
  • keep this set across check runs (increased baseline memory usage)
  • run logic to expire contexts that have not been submitted in x runs, to allow for churn
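Such an expiry pass, were it ever implemented, might look like the sketch below. This is entirely hypothetical design exploration, not part of this PR.

```python
def expire_contexts(last_seen_run, current_run, max_idle_runs=3):
    """Drop contexts not submitted within the last max_idle_runs runs.

    last_seen_run maps context uid -> run number of the last submission.
    """
    for uid, run in list(last_seen_run.items()):
        if current_run - run > max_idle_runs:
            del last_seen_run[uid]
```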

Motivation

Generic Prometheus checks can send a lot of contexts, because metrics are tagged by the emitter. As these are sent as custom metrics, we want a safeguard to avoid billing issues, as we do with JMX.

As this could be used by other checks (goexpvar?), I'm implementing it in the AgentCheck class instead of in the Prometheus logic.

Review checklist

  • PR has a meaningful title or PR has the no-changelog label attached
  • Feature or bugfix has tests
  • Git history is clean
  • If PR impacts documentation, docs team has been notified or an issue has been opened on the documentation repo

Output

With the following kubelet.yaml:

init_config:

instances:
  - max_returned_metrics: 60

The check stops at 60 metric samples and emits a warning:

    kubelet (1.4.0)
    ---------------
      Total Runs: 2
      Metric Samples: 60, Total: 120
      Events: 0, Total: 0
      Service Checks: 1, Total: 2
      Average Execution Time : 464ms

      Warning: Exceeded limit of 60 metrics, ignoring next ones

With the following kubelet.yaml:

init_config:

instances:
  - max_returned_metrics: 6000

The check completes with no error:

    kubelet (1.4.0)
    ---------------
      Total Runs: 2
      Metric Samples: 95, Total: 190
      Events: 0, Total: 0
      Service Checks: 1, Total: 2
      Average Execution Time : 510ms

With the following kubelet.yaml:

init_config:

instances:
  - max_returned_metrics: 0

We get a warning during the check load:

2018-08-27 13:33:07 UTC | WARN | (datadog_agent.go:135 in LogMessage) | (base.py:339) | Setting max_returned_metrics to zero is not allowed, reverting to the default of 6000 metrics

Additional Notes

Anything else we should know when reviewing?

@xvello xvello requested a review from a team as a code owner August 22, 2018 11:30
@xvello xvello force-pushed the xvello/metric-limit branch from 475cf65 to e4fbbc5 Compare August 22, 2018 11:52
@ofek ofek self-assigned this Aug 22, 2018
@xvello xvello changed the title WIP: Allow checks to limit the number of metric contexts they submit Allow checks to limit the number of metric contexts they submit Aug 27, 2018
@xvello xvello force-pushed the xvello/metric-limit branch from bd91004 to 383389b Compare August 27, 2018 11:45
@hkaj hkaj left a comment


Great work! LGTM. I think we should discuss the logic around removing the limit before merging, but that would be a minor change anyway. I'm also in favor of adding a short README, or some top-level comment in limiter.py, that explains briefly why we added this, how it works (cut-off, no sampling, configurable, special case of some metrics, etc.) and links to this PR for reference. Does that make sense to you?

    try:
        metric_limit = self.instances[0].get("max_returned_metrics", self.DEFAULT_METRIC_LIMIT)
    except Exception:
        metric_limit = self.DEFAULT_METRIC_LIMIT
    if metric_limit > 0:
Member:
That means setting the limit to 0 will remove the limit, right? Let's discuss it at today's standup, I want to make sure this is okay, product wise.

xvello (author):
I wanted to handle the 0 case because AgentCheck does not have a limit, but it's not that hard to enforce a limit when the class sets one.
Updated the code to forbid removing the limit if it is set by the class via DEFAULT_METRIC_LIMIT.

@@ -0,0 +1,41 @@
# (C) Datadog, Inc. 2010-2016
Member:

nostalgia? 😄

assert len(check.get_warnings()) == 1
assert len(aggregator.metrics("metric")) == 10

def test_metric_limit_count(self):
Member:
That's a great test 💯 It covers the internal logic of the limiter very well, and provides an example that helps in understanding how the Limiter works.

xvello commented Aug 27, 2018

@hkaj added pydoc to the Limiter and AgentCheck classes

hkaj
hkaj previously approved these changes Aug 27, 2018
@xvello xvello changed the title Allow checks to limit the number of metric contexts they submit Limit Prometheus/OpenMetrics checks to 350 metrics per run Aug 27, 2018
@xvello xvello changed the title Limit Prometheus/OpenMetrics checks to 350 metrics per run Limit Prometheus/OpenMetrics checks to 350 metrics per run by default Aug 27, 2018
@xvello xvello requested a review from a team as a code owner August 27, 2018 14:10
@@ -1,7 +1,10 @@
# (C) Datadog, Inc. 2018
# All rights reserved
# Licensed under a 3-clause BSD style license (see LICENSE)
import unittest
Contributor:
Let's use pytest


class TestLimits(unittest.TestCase):
def tearDown(self):
aggregator.reset()
Contributor:
@pytest.fixture
def aggregator():
    from datadog_checks.stubs import aggregator
    aggregator.reset()
    return aggregator

@xvello xvello force-pushed the xvello/metric-limit branch 4 times, most recently from cb07060 to 82ccdd2 Compare August 31, 2018 08:02
@xvello xvello force-pushed the xvello/metric-limit branch from 82ccdd2 to fd0a0d4 Compare August 31, 2018 08:02
@xvello xvello changed the title Limit Prometheus/OpenMetrics checks to 350 metrics per run by default Limit Prometheus/OpenMetrics checks to 2000 metrics per run by default Aug 31, 2018
@codecov-io

codecov-io commented Aug 31, 2018

Codecov Report

Merging #2093 into master will decrease coverage by 9.74%.
The diff coverage is 96.42%.


hkaj
hkaj previously approved these changes Aug 31, 2018
LGTM minus a wording nitpick. We will also need an entry in 6.5's changelog; can you take care of it?

Due to the nature of this integration, it is possible to submit an extremely high number of metrics
directly to Datadog. To avoid billing issues on configuration errors or input changes, the check
limits itself to 2000 metric contexts (different metric name or different tags). You can increase
this limit, if needed, by setting the `max_returned_metrics` option.
Member:
What do you think of:

  • it is possible to submit an extremely high number of metrics directly to Datadog --> it is possible to submit a high number of custom metrics to Datadog
  • To avoid billing issues --> to prevent overage charges

hkaj
hkaj previously approved these changes Aug 31, 2018
@xvello xvello force-pushed the xvello/metric-limit branch from 814f125 to 914719d Compare August 31, 2018 15:18
@xvello xvello merged commit 2027f26 into master Aug 31, 2018
@xvello xvello deleted the xvello/metric-limit branch August 31, 2018 16:21
nmuesch pushed a commit that referenced this pull request Nov 1, 2018
(#2093)

* allow checks to limit the number of metric contexts they submit
* set limit for prom checks to 2000
* set limits on all prom children checks to 0 (unlimited)
* make the metric limit configurable
* do not allow to disable limit if class has set one
@spiliopoulos

@hkaj @xvello is there any reasonable way to alert on these warnings or do we need to manually go through the agent status pages to check for integration issues?

@olivielpeau
Member

@spiliopoulos You can see these warnings in-app on the affected hosts in the Infrastructure List and Host Map, but we don't have a recommended way to alert on these warnings at the moment.

We're currently evaluating options to allow alerting on these warnings, and will update this issue once we have a recommendation.
