Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ScheduledReporter may have missed reports #2664

Closed
dkaukov opened this issue Jun 9, 2022 · 2 comments
Closed

ScheduledReporter may have missed reports #2664

dkaukov opened this issue Jun 9, 2022 · 2 comments
Labels

Comments

@dkaukov
Copy link
Contributor

dkaukov commented Jun 9, 2022

Problem Statement

I think, fix for #1524 has undesired side-effects in our environment. Because executor.scheduleWithFixedDelay accumulating errors (due network latency, for example) when it goes over reporting period, we have "holes" in graphite metrics:
image

Possible solution

In our case, report bursts are probably lesser evil than holes, so maybe we can make it configurable.

@mihalyr
Copy link

mihalyr commented Aug 3, 2022

We are seeing the same problem which is a result of #1524 and the switch to scheduleWithFixedDelay

We are sending counter metrics every minute to Graphite, Carbon Aggregator and we use nonNegativeDerivative to show only the changes per minute and we are seeing "fake" metrics.

image

The derivative function is sensitive to missing metrics, if one of the hosts aggregated is not sending and then sends again, it will show up as an increase in the metric for that minute. We are aggregating to hours in the above graph and the results are completely wrong, there is nothing happening on the server and all those bars are caused by missing metrics.

Here is an example of missing metrics showing up from two hosts:

image

I've captured the metrics on the Graphite server with tcp dump and saw the following reports from a single source server:

  • arrived 08:08:59.087611 metric payload timestamp 1659514138 (08:08:58)
  • arrived 08:09:59.343252 metric payload timestamp 1659514199 (08:09:59)
  • arrived 08:10:59.597794 metric payload timestamp 1659514259 (08:10:59)
  • arrived 08:11:59.856127 metric payload timestamp 1659514319 (08:11:59)
  • arrived 08:13:00.118483 metric payload timestamp 1659514380 (08:13:00)

You can see that we never receive data for 08:12, that's 1-min gap on the charts and one spike in derivatives.
First I suspected network latency issues and missing aggregation buckets on Graphite, but from the timestamps added by GraphiteReporter it is clear that the reporter is sending them late and missing the buckets. The increasing delay was an immediate sign of scheduleWithFixedDelay which was confirmed in the codebase.

Current metric reporting using fixed delay is problematic as the reporting interval is unstable and a function of how long does reporting itself take. The small delays add up and every few hours we see a missed interval and fake values with derivatives.

+1 for making this configurable.

mikebell90 added a commit to mikebell90/metrics that referenced this issue Oct 15, 2022
dropwizard#1524 introduced a fix to queuing issues and ends up doing atFixedDelay rather than atFixedRate.

Issues with binning exist in Graphite (and probably others). See dropwizard#2664
joschi pushed a commit to mikebell90/metrics that referenced this issue Nov 16, 2022
dropwizard#1524 introduced a fix to queuing issues and ends up doing atFixedDelay rather than atFixedRate.

Issues with binning exist in Graphite (and probably others). See dropwizard#2664
joschi pushed a commit to mikebell90/metrics that referenced this issue Dec 9, 2022
dropwizard#1524 introduced a fix to queuing issues and ends up doing atFixedDelay rather than atFixedRate.

Issues with binning exist in Graphite (and probably others). See dropwizard#2664
joschi pushed a commit that referenced this issue Dec 9, 2022
#1524 introduced a fix to queuing issues and ends up doing atFixedDelay rather than atFixedRate.

Issues with binning exist in Graphite (and probably others).

See #2664
@github-actions
Copy link

This issue is stale because it has been open 180 days with no activity. Remove the "stale" label or comment or this will be closed in 14 days.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants