Limit Prometheus/OpenMetrics checks to 2000 metrics per run by default #2093
Conversation
Great work! LGTM. I think we should discuss the logic around removing the limit before merging, but that would be a minor change anyway. I'm also in favor of adding a short README, or a top-level comment in limiter.py, that explains briefly why we added this, how it works (cut-off, no sampling, configurable, special-casing of some metrics, etc.) and that links to this PR for reference. Does that make sense to you?
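Something along these lines could serve as that top-level comment in limiter.py (hypothetical wording, sketched from the points listed above):

```python
# limiter.py -- hypothetical module docstring sketched from the review ask above
"""Cap the number of metric contexts a check can submit per run.

Why: generic checks (prometheus, ...) can emit an unbounded number of
custom metrics; a hard cap protects users from overage charges caused by
configuration errors or input changes.

How: a plain cut-off, no sampling. Contexts submitted after the limit is
reached are dropped until the end of the run. The limit is configurable
via the `max_returned_metrics` instance option, and some metric types use
a fast path that skips context deduplication. See the PR that introduced
this module for the full design discussion.
"""
```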
```python
try:
    metric_limit = self.instances[0].get("max_returned_metrics", self.DEFAULT_METRIC_LIMIT)
except Exception:
    metric_limit = self.DEFAULT_METRIC_LIMIT
if metric_limit > 0:
```
That means setting the limit to 0 will remove the limit, right? Let's discuss it at today's standup; I want to make sure this is okay, product-wise.
I wanted to handle the `0` case because `AgentCheck` does not have a limit, but it's not that hard to actually enforce a limit if set by the class. Updated the code to forbid removing the limit if it is set by the class via `DEFAULT_METRIC_LIMIT`.
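Based on the snippet above and this comment, the guard could look roughly like this (a sketch; only `DEFAULT_METRIC_LIMIT` and `max_returned_metrics` are taken from the PR, the other names and the import path are assumptions):

```python
# import path assumed from this PR's layout; it may differ in later versions
from datadog_checks.utils.limiter import Limiter


class AgentCheck(object):
    # 0 means no limit; generic prometheus checks override this with 2000
    DEFAULT_METRIC_LIMIT = 0

    def __init__(self, name, init_config, instances):
        self.instances = instances
        try:
            metric_limit = self.instances[0].get("max_returned_metrics", self.DEFAULT_METRIC_LIMIT)
            # forbid removing a limit that the class itself has set
            if metric_limit <= 0 and self.DEFAULT_METRIC_LIMIT > 0:
                metric_limit = self.DEFAULT_METRIC_LIMIT
        except Exception:
            metric_limit = self.DEFAULT_METRIC_LIMIT
        self.metric_limiter = Limiter("metrics", metric_limit) if metric_limit > 0 else None
```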
```python
# (C) Datadog, Inc. 2010-2016
```
nostalgia? 😄
```python
assert len(check.get_warnings()) == 1
assert len(aggregator.metrics("metric")) == 10


def test_metric_limit_count(self):
```
That's a great test 💯 It covers the internal logic of the limiter very well, and provides an example that helps readers understand how the `Limiter` works.
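For readers without the diff at hand, a limiter-level test of that shape might look roughly like this (a sketch; the import path and the `is_reached` behavior follow the `Limiter` sketch given later in the PR description, not the exact merged code):

```python
from datadog_checks.utils.limiter import Limiter  # import path assumed


def test_metric_limit_count():
    limiter = Limiter("metrics", 3)
    # three distinct contexts fit under the limit
    assert limiter.is_reached("uid1") is False
    assert limiter.is_reached("uid2") is False
    assert limiter.is_reached("uid3") is False
    # an already-seen context is still accepted: plain cut-off, no sampling
    assert limiter.is_reached("uid1") is False
    # a fourth distinct context pushes past the limit and is dropped
    assert limiter.is_reached("uid4") is True
```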
@hkaj added pydoc to the `Limiter` and `AgentCheck` classes
✅
```python
# (C) Datadog, Inc. 2018
# All rights reserved
# Licensed under a 3-clause BSD style license (see LICENSE)
import unittest
```
Let's use pytest
```python
class TestLimits(unittest.TestCase):
    def tearDown(self):
        aggregator.reset()
```
```python
@pytest.fixture
def aggregator():
    from datadog_checks.stubs import aggregator

    aggregator.reset()
    return aggregator
```
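With that fixture in place, the unittest-style class above could become plain pytest functions, e.g. (a sketch; `LimitedCheck` is a hypothetical `AgentCheck` subclass with `DEFAULT_METRIC_LIMIT = 10`, mirroring the assertions quoted earlier):

```python
def test_metric_limit(aggregator):
    check = LimitedCheck("test", {}, [{}])
    # submit twice as many contexts as the limit allows
    for i in range(20):
        check.gauge("metric", 1, tags=["id:{}".format(i)])
    assert len(check.get_warnings()) == 1
    assert len(aggregator.metrics("metric")) == 10
```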
read the max_returned_metrics instance field, to mimic jmxfetch behaviour
LGTM minus a wording nitpick. We will need an entry in 6.5's changelog as well; can you take care of it?
prometheus/README.md
Due to the nature of this integration, it is possible to submit an extremely high number of metrics directly to Datadog. To avoid billing issues on configuration errors or input changes, the check limits itself to 2000 metric contexts (different metric name or different tags). You can increase this limit, if needed, by setting the `max_returned_metrics` option.
What do you think of:

"it is possible to submit an extremely high number of metrics directly to Datadog" → "it is possible to submit a high number of custom metrics to Datadog"

"To avoid billing issues" → "to prevent overage charges"
Limit Prometheus/OpenMetrics checks to 2000 metrics per run by default (#2093)

* allow checks to limit the number of metric contexts they submit
* set limit for prom checks to 2000
* set limits on all prom children checks to 0 (unlimited)
* make the metric limit configurable
* do not allow to disable limit if class has set one
@spiliopoulos You can see these warnings in-app on the affected hosts in the Infrastructure List and Host Map, but we don't have a recommended way to alert on these warnings at the moment. We're currently evaluating options to allow alerting on these warnings, and will update this issue once we have a recommendation.
What does this PR do?
Implement an optional hard limit on the number of metric contexts an integration instance can send per run. The design is based on jmxfetch's implementation, for consistency: once the limit is reached, the metric submission methods `return` directly.

We need to track the number of different contexts, not metric names, which is why we compute a context uid. To reduce the performance cost, we introduce a fast path for gauges, rates and monotonic counters: as it makes no sense to call these methods several times per context, we increment the context count on every call instead of looking up whether the context is new. For other metric types, we keep the seen contexts in a hashmap that is cleared at the end of the check run.

The initial implementation of the context hash is a simple string-format call; we should benchmark its impact before investigating optimisations. For the initial target of the prometheus checks, the generic metric mapper only uses fast-path metric types, and our integrations seem to do so for >95% of their metrics (to be double-checked).
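A condensed sketch of that mechanism (the method names, signature, and warning message are illustrative, not the exact code merged in this PR):

```python
class Limiter(object):
    """Limits the number of distinct contexts submitted during one check run."""

    def __init__(self, object_name, limit, warning_func=None):
        self.object_name = object_name
        self.limit = limit
        self.warning = warning_func
        self.reached_limit = False
        self.count = 0
        self.seen = set()

    def is_reached(self, uid=None):
        """Return True when the sample must be dropped.

        Gauges, rates and monotonic counters use the fast path (no uid):
        every call counts as one new context. Other metric types pass a
        context uid that is deduplicated through the `seen` set.
        """
        if self.reached_limit:
            return True
        if uid is None:
            self.count += 1  # fast path: one call == one new context
        elif uid not in self.seen:
            self.seen.add(uid)
            self.count += 1
        if self.count > self.limit:
            self.reached_limit = True
            if self.warning:
                self.warning("Reached the limit of {} {}, ignoring the rest".format(self.limit, self.object_name))
            return True
        return False

    def reset(self):
        """Called at the end of each check run to clear the seen contexts."""
        self.reached_limit = False
        self.count = 0
        self.seen.clear()
```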
Logic is currently implemented for metrics, but `AgentCheck` could also later instantiate additional `Limiter` objects for events and service checks.

Configurability
Checks can set their default limit via the `self.DEFAULT_METRIC_LIMIT` class attribute. This allows subchecks to set a different limit. For 6.5, let's go with disabling the limit for all our subchecks, and focus on the limit for custom checks.

Users can override the limit via the `max_returned_metrics` instance field.
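Put together, a subcheck and a user override could look roughly like this (a sketch; `MySubCheck`, the `GenericPrometheusCheck` base-class name, and the instance values are illustrative):

```python
class MySubCheck(GenericPrometheusCheck):
    # subchecks can override the class default; per the plan above,
    # 6.5 ships all our subchecks with 0 (unlimited)
    DEFAULT_METRIC_LIMIT = 0


# users raise (not disable) the limit through the instance config;
# this is the dict equivalent of the yaml `max_returned_metrics` option
instance = {
    "prometheus_url": "http://localhost:9090/metrics",
    "max_returned_metrics": 4000,
}
```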
instance field.Limitations
This implementation does not keep track of passed/denied contexts between runs; it assumes metrics are submitted in a consistent order across check runs. This is the case for "classic" python integrations, and for prometheus metrics too.
Being resilient to random-order submission while always sending the same timeseries would require keeping track of allowed contexts between runs, and expiring them after `x` runs to allow for churn.

Motivation
Generic prometheus checks can send a lot of contexts, because metrics are tagged by the emitter. As these are sent as custom metrics, we want to put a safeguard in place to avoid billing issues, as we do with JMX.
As this could be used by other checks (goexpvar?), I'm implementing it in the `AgentCheck` class instead of in the prometheus logic.

Review checklist
`no-changelog` label attached

Output
With the following kubelet.yaml:
The check completes with no error:
With the following kubelet.yaml:
We get a warning during the check load:
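The yaml contents and console output are not reproduced above; in instance-dict form, the two scenarios map roughly to the following (a sketch with illustrative values, assuming the check keeps a class-set limit in the second case):

```python
# scenario 1: limit raised explicitly -- the check completes with no error
ok_instance = {
    "prometheus_url": "http://localhost:10255/metrics",  # illustrative endpoint
    "max_returned_metrics": 4000,
}

# scenario 2: trying to set the limit to 0 while the class enforces one --
# the check loads but emits a warning and keeps DEFAULT_METRIC_LIMIT
warning_instance = {
    "prometheus_url": "http://localhost:10255/metrics",
    "max_returned_metrics": 0,
}
```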
Additional Notes
Anything else we should know when reviewing?