
DatadogReporter is incorrectly reporting counters to Datadog #45

Open · vasileiosC opened this issue Mar 7, 2016 · 11 comments

@vasileiosC

To reproduce

At time X increase the counter's count value by 5. Datadog should (correctly) display the value 5 for time X.

At a later time (X + 10), increase the counter's count by 5 again. Datadog should display the value 5 for time X + 10, but it instead displays the cumulative value 10.

Additionally, for reporting intervals between X and X + 10 the DatadogReporter should report the value 0 to Datadog (as no changes to the counter occurred), but it instead reports 5.

This behaviour occurs because the DatadogReporter treats counters as gauges at the reporting level.

private void reportCounter(String name, Counter counter, long timestamp, List<String> tags)
      throws IOException {
    // A Metrics counter is actually a Datadog Gauge.  Datadog Counters are for rates which is
    // similar to the Metrics Meter type.  Metrics counters have increment and decrement
    // functionality, which implies they are instantaneously measurable, which implies they are
    // actually a gauge. The Metrics documentation agrees, stating:
    // "A counter is just a gauge for an AtomicLong instance. You can increment or decrement its
    // value. For example, we may want a more efficient way of measuring the pending job in a queue"
    request.addGauge(new DatadogGauge(name, counter.getCount(), timestamp, host, tags));
}

Changing the reporter to treat counters as what they really are solves the problem:

private void reportCounter(String name, Counter counter, long timestamp, List<String> tags)
      throws IOException {
    request.addCounter(new DatadogCounter(name, counter.getCount(), timestamp, host, tags));
}
This works because addCounter() keeps a history of the reported counters and reports counter values as relative deltas rather than absolute totals.

if (lastSeenCounters.containsKey(finalMetricsSeenName)) {
  // If we've seen this counter before, calculate the difference by
  // subtracting the previously seen value from the new one. StatsD expects
  // a relative counter, not an absolute one!
  finalValue = Math.max(0, value - lastSeenCounters.get(finalMetricsSeenName));
}
// Store the last value we saw so that the next addCounter call can compute
// the proper relative value.
lastSeenCounters.put(finalMetricsSeenName, value);
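
To make the relative-value logic concrete, here is a minimal standalone sketch (the RelativeCounterDemo class, the toRelative helper, and the "jobs" metric name are invented for illustration and are not part of the library) that replays the reproduction steps from the top of this issue:

import java.util.HashMap;
import java.util.Map;

// Standalone illustration of the relative-value logic quoted above. With a
// 10-second reporting period, a counter incremented by 5 at time X and by 5
// again at X + 20 yields reported deltas of 5, 0, 5 instead of the absolute
// values 5, 5, 10 that the gauge-based reporter sends today.
public class RelativeCounterDemo {
    private static final Map<String, Long> lastSeenCounters = new HashMap<>();

    static long toRelative(String name, long absoluteValue) {
        long delta = absoluteValue;
        if (lastSeenCounters.containsKey(name)) {
            // Subtract the previously seen value from the new one; negative
            // deltas (e.g. after a process restart) are clamped to zero.
            delta = Math.max(0, absoluteValue - lastSeenCounters.get(name));
        }
        lastSeenCounters.put(name, absoluteValue);
        return delta;
    }

    public static void main(String[] args) {
        System.out.println(toRelative("jobs", 5));   // t = X       -> reports 5
        System.out.println(toRelative("jobs", 5));   // t = X + 10  -> reports 0
        System.out.println(toRelative("jobs", 10));  // t = X + 20  -> reports 5
    }
}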
@umcodemonkey

I'm not sure this is going to work. Transport.Request is created anew for each reporting period, so the logic in Request.addCounter won't help between reports. It would be great if the Datadog reporter tracked the history of counter reports across reporting periods so this could be done correctly, but that is a larger change, though I think it was originally intended.
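
For illustration, reporter-level tracking could look roughly like this (CounterDeltaTracker and its delta method are invented names, not the library's API): the reporter, which lives across reporting periods, would own the last-seen map and hand deltas to each freshly created Request.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch only: a reporter-scoped delta tracker that survives across reporting
// periods, unlike the per-report Transport.Request instance.
class CounterDeltaTracker {
    private final Map<String, Long> lastSeen = new ConcurrentHashMap<>();

    // Returns the increase since the previous report for this counter name,
    // clamping to zero if the counter went backwards (e.g. after a restart).
    long delta(String name, long currentCount) {
        Long previous = lastSeen.put(name, currentCount);
        return previous == null ? currentCount : Math.max(0, currentCount - previous);
    }
}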

@nickdella

@vasileiosC Thanks for raising this dilemma. We've been back and forth on this quite a bit. See #32 and #22 for context

@xunnanxu

I came across this thread, but the conversation confuses me. I don't think any of the proposed solutions is correct, because none of them addresses the root problem: the Metrics library introduces an unnecessary extra layer of aggregation when statsd is used. The solution, it seems to me, is simply not to use Counter (or not to use the Metrics library at all) and let statsd do the aggregation.
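
For example, skipping the Metrics Counter and emitting increments straight to dogstatsd might look roughly like this (a sketch using java-dogstatsd-client; the prefix, metric name, and port are placeholders, and the exact constructor varies by client version):

import com.timgroup.statsd.NonBlockingStatsDClient;
import com.timgroup.statsd.StatsDClient;

public class DirectStatsdExample {
    public static void main(String[] args) {
        // Send raw increments to the local dogstatsd agent and let it (and
        // Datadog) do the aggregation, instead of pre-aggregating in a Metrics
        // Counter and then aggregating again in statsd.
        StatsDClient statsd = new NonBlockingStatsDClient("myapp", "localhost", 8125);
        statsd.count("jobs.enqueued", 5);          // one increment of 5
        statsd.incrementCounter("jobs.enqueued");  // or single increments
    }
}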

@mcordone

Wondering if this issue has been addressed? I'm experiencing a similar problem: the reported counter metric seems to be inaccurate, and there is an inconsistency between intervals in Datadog. For example, when looking at the "last day" or "last week", the count is lower than when looking at the "last hour"; the numbers make no sense.

@navneetpandey

According to the mapping of metric types specified by Datadog, a statsd counter should be Datadog's rate. According to statsd, its counter is events per second, i.e. a rate.

Therefore a Metrics Meter should be mapped to statsd's counter, which gets converted to Datadog's rate. I often plot the raw count by applying the .as_count() function in Datadog, which is only possible when the metric is a rate. As Datadog states:

At present, for metrics submitted as rates or counters via statsd, appending .as_count() or .as_rate() will function correctly. For other metrics, including gauges submitted by statsd, .as_count() and .as_rate() will have no effect.

Hence my proposal would be to change #32: let Metrics' histogram.count and meter.count be reported as statsd counters, while other histogram and meter metrics, such as percentiles and average rates, can be treated as gauges.
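
For illustration, a rough sketch of that mapping inside the reporter might look like this (treat it as pseudocode for the proposal: reportMeter, DatadogCounter, and DatadogGauge mirror the snippets earlier in this thread, and the metric-name suffixes are placeholders):

private void reportMeter(String name, Meter meter, long timestamp, List<String> tags)
      throws IOException {
    // Proposal: the monotonically increasing count goes out as a counter, so
    // Datadog stores it as a rate and .as_count()/.as_rate() work on it.
    request.addCounter(new DatadogCounter(name + ".count", meter.getCount(), timestamp, host, tags));
    // The pre-computed rates stay gauges, since they are already instantaneous
    // values aggregated by the Metrics library.
    request.addGauge(new DatadogGauge(name + ".1MinuteRate", meter.getOneMinuteRate(), timestamp, host, tags));
    request.addGauge(new DatadogGauge(name + ".meanRate", meter.getMeanRate(), timestamp, host, tags));
}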

@umcodemonkey

It's useful to note that there is a "pure" statsd reporter for Metrics that takes another approach: reporting everything as a gauge. See ReadyTalk/metrics-statsd#15 for further details. Reading the explanation, I think they got it right, and the UDP transport needs to send all stats as gauges. I don't know what this means for the HTTP transport.

@navneetpandey

Agree with the argument.

we actually send everything as a gauge because codahale metrics actually keeps the count internally. We don't want StatsD aggregating them again or deleting them after each flush because we've already aggregated them. Instead, we send everything as a gauge because we want StatsD to consider it an arbitrary value we're setting at a point in time. This is also why we don't send durations as StatsD timers because the metrics library has already calculated the values that StatsD would attempt to calculate.

With UDP, events pass through both Metrics and statsd aggregation, each with different semantics, i.e.

Your application -> Metric (aggregated here) -> this library -> (dog)statsd (aggregated again)-> datadog server

With HTTP, events do not go through double aggregation, i.e.
Your application -> Metric (aggregated here) -> this library -> datadog server

Hence sending with compatible (Metrics and Datadog) semantics is a better idea.

Side note: in my forked repo I send "count" as a counter to statsd (rather than a gauge) and it works perfectly. Not sure why; more investigation is required.

@xunnanxu

xunnanxu commented Feb 11, 2018

If you are reading this, I suggest replacing your usage of this repo with micrometer-metrics. You can see what a proper implementation of a counter should look like here: statsd reporting is treated differently from in-app reporting by sending the actual value to statsd directly as the COUNT type.

The problem with this repo is that it treats HTTP reporting and statsd as the same thing, which causes double counting or missed counts (depending on the difference between this library's reporting interval and statsd's) when you use dogstatsd.

This is accurate as of when this comment was posted, as shown here.
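
For reference, a minimal Micrometer setup along those lines might look like the following (a sketch assuming the micrometer-registry-statsd module with the DATADOG flavor; the metric name is a placeholder):

import io.micrometer.core.instrument.Clock;
import io.micrometer.core.instrument.Counter;
import io.micrometer.statsd.StatsdConfig;
import io.micrometer.statsd.StatsdFlavor;
import io.micrometer.statsd.StatsdMeterRegistry;

public class MicrometerStatsdExample {
    public static void main(String[] args) {
        StatsdConfig config = new StatsdConfig() {
            @Override
            public String get(String key) {
                return null; // fall back to defaults (localhost:8125)
            }

            @Override
            public StatsdFlavor flavor() {
                return StatsdFlavor.DATADOG; // dogstatsd line protocol
            }
        };

        // Increments are shipped to dogstatsd as COUNT deltas, so the agent
        // does the aggregation instead of this process.
        StatsdMeterRegistry registry = new StatsdMeterRegistry(config, Clock.SYSTEM);
        Counter counter = registry.counter("jobs.processed");
        counter.increment(5);
    }
}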

@mpeckWhisker

I believe this is happening to me as well. v1.1.4 works fine, but with anything newer (I tried each version from 1.1.5 through 1.1.13) the value keeps climbing.

You can see where I had 1.1.13 running until 15:08 or so, and then put 1.1.4 back on:
[screenshot of the Datadog graph omitted]

The above is (obviously) a dropwizard app, and is not using any metrics from my app, only from jetty.

I found that 1.1.4 pulls in com.datadoghq:java-dogstatsd-client:jar:2.1.1 and 1.1.13 pulls in com.datadoghq:java-dogstatsd-client:jar:2.3; however, in the 1.1.4 deploy seen in the graph, it is using 2.3.

Does this mean that 1.1.4 is broken, or that everything after 1.1.4 is broken?

@panga

panga commented Sep 18, 2018

1.1.4+ is broken; I had to revert the version because of the same issue.

ThuTrinh added a commit to zendesk/maxwell that referenced this issue Nov 4, 2019
dropwizard metrics Meter is mapped to statsd's Counter.
If we use dropwizard Counter then the metric will end up as a gauge in Datadog.
Reference coursera/metrics-datadog#45

The referenced issue also mentioned a version downgrade for metrics-datadog. I tried it locally and the version didn't seem to make a difference.
ThuTrinh added a commit to zendesk/maxwell that referenced this issue Nov 6, 2019
dropwizard metrics Meter is mapped to statsd's Counter.
If we use dropwizard Counter then the metric will end up as a gauge in Datadog.
Reference coursera/metrics-datadog#45

The referenced issue also mentioned a version downgrade for metrics-datadog. I tried it locally and the version didn't seem to make a difference.
@ikr0m

ikr0m commented Jul 28, 2020

For me version 1.1.2 worked; from version 1.1.3 the addCounter method is not used anymore. This PR is the reason: https://github.com/coursera/metrics-datadog/pull/32/files
