Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cloudwatch output is blocked by metric values close to 0 #2523

Closed
kamalvpm777 opened this issue Mar 9, 2017 · 15 comments · Fixed by #3183
Closed

Cloudwatch output is blocked by metric values close to 0 #2523

kamalvpm777 opened this issue Mar 9, 2017 · 15 comments · Fixed by #3183
Labels
bug unexpected problem or unintended behavior
Milestone

Comments

@kamalvpm777
Copy link

kamalvpm777 commented Mar 9, 2017

##Our telegraf is integrated with cloudwatch output, in the last 4 to 5 few weeks it was working fine . Suddenly in the last 2 days we do see an inconsistency where metrics are not being pushed to cloudwatch output though telegraf setup is still same. Based on the analysis looking for your opinions and troubleshooting for the same. Let me know if something is not clear, I can add more details

Scenario 1 -

As per below telegraf conf we have metrics limit set to 10000 and flush interval of every 20 sec. Though after reaching the limit of 10000 buffer its never getting flushed and gathered metrics keeps accumulating which increases the count of dropped metrics subsequently.
i) What dropped metrics means here, does it mean these many(195956) metrics are not sent to cloudwatch output.
ii) we do generate 222 metrics /min , so per hour its total of 13320 metrics / hour. In the past also we have captured similar volume of metrics with same buffer limit of 10000 and we never had issue in the last 2 months posting the stats to cloudwatch.
iii) Does going above the buffer fullness causes metrics not sent to cloudwatch?

image

2017/03/09 16:39:00 Output [cloudwatch] buffer fullness: 10074 / 10000 metrics. Total gathered metrics: 206030. Total dropped metrics: 195956.
2017/03/09 16:39:20 Output [cloudwatch] buffer fullness: 10074 / 10000 metrics. Total gathered metrics: 206104. Total dropped metrics: 196030.

Scenario 2

i) To troubleshoot if reaching the buffer limit is an issue , disabled few metrics to reduce the volume. With this change we generate only 36 metrics/min, and our total gathered metrics is still under the the limit of 10000.
ii) With this change telegraf just connected successfully to cloudwatch at 12:09 PM and reported the metrics once at 12:10 PM and again stopped reporting afterwards though we under the limit of 10000

2017/03/09 17:34:20 Output [cloudwatch] buffer fullness: 1116 / 10000 metrics. Total gathered metrics: 1116. Total dropped metrics: 0.
2017/03/09 17:34:40 Output [cloudwatch] buffer fullness: 1128 / 10000 metrics. Total gathered metrics: 1128. Total dropped metrics: 0.

Telegraf conf

Configuration for telegraf agent

[agent]

Default data collection interval for all inputs

interval = "20s"
#####Rounds collection interval to 'interval'

ie, if interval="10s" then always collect on :00, :10, :20, etc.

round_interval = true

Telegraf will cache metric_buffer_limit metrics for each output, and will
flush this buffer on a successful write.

metric_buffer_limit = 10000
#####Flush the buffer whenever full, regardless of flush_interval.
flush_buffer_when_full = true
#####Collection jitter is used to jitter the collection by a random amount.
collection_jitter = "0s"

Default flushing interval for all outputs. You shouldn't set this below
interval. Maximum flush_interval will be flush_interval + flush_jitter

flush_interval = "20s"
#####ie, a jitter of 5s and interval 10s means flushes will happen every 10-15s
flush_jitter = "0s"

Run telegraf in debug mode

debug = true

Run telegraf in quiet mode

quiet = false

Override default hostname, if empty use os.Hostname()

hostname = ""

Configuration for cloudwatch api to send metrics to

[[outputs.cloudwatch]]

Amazon Region (required)

region = 'us-east-1'

Current behavior:

Telegraf logs looks fine , but its not reporting /updating the metrics to cloudwatch

Use case: Currently our production monitoring with telegraf is completely broken, need to get this fixed as soon as possible

Telegraf logs

2017/03/09 21:45:40 Input [httpjson] gathered metrics, (20s interval) in 7.261833ms
2017/03/09 21:45:40 Input [mesos] gathered metrics, (20s interval) in 7.050785ms
2017/03/09 21:45:40 Input [httpjson] gathered metrics, (20s interval) in 2.375689ms
2017/03/09 21:45:40 Input [httpjson] gathered metrics, (20s interval) in 3.618668ms
2017/03/09 21:45:40 Input [httpjson] gathered metrics, (20s interval) in 6.121121ms
2017/03/09 21:45:40 Input [httpjson] gathered metrics, (20s interval) in 5.218891ms
2017/03/09 21:45:40 Input [httpjson] gathered metrics, (20s interval) in 11.845889ms
2017/03/09 21:45:40 Input [httpjson] gathered metrics, (20s interval) in 14.763179ms
2017/03/09 21:45:40 Input [httpjson] gathered metrics, (20s interval) in 10.747364ms
2017/03/09 21:45:40 Input [httpjson] gathered metrics, (20s interval) in 15.15531ms
2017/03/09 21:45:40 Input [httpjson] gathered metrics, (20s interval) in 23.128821ms
2017/03/09 21:45:40 Input [httpjson] gathered metrics, (20s interval) in 24.935601ms
2017/03/09 21:45:40 Output [cloudwatch] buffer fullness: 10012 / 10000 metrics. Total gathered metrics: 10164. Total dropped metrics: 152.

@danielnelson
Copy link
Contributor

iii) Does going above the buffer fullness causes metrics not sent to cloudwatch?

Yes, the buffer is emptied on every successful push, and should remain empty. Need to determine why nothing is being successfully output to cloudwatch.

@danielnelson
Copy link
Contributor

What version of telegraf are you using? What type of AWS credentials are you using?

@kamalvpm777
Copy link
Author

kamalvpm777 commented Mar 10, 2017

Thanks for your reply, provided the answer below hope it helps
What version of telegraf are you using?
Telegraf - version 0.13.2

What type of AWS credentials are you using?
Access Key and Secret Access Key

#####Scenario 1:
if i understand correctly once it reach the buffer limit of 10000, after the flush triggers this buffer value is supposed to reset to 0. But looking at telegraf logs it doesnt reports anything about flush and dropped metrics count only keeps accumulating . Also below is the AWS cloud watch console where stats updated inconsistenly till March 09, 11 AM.

2017/03/10 04:43:40 Output [cloudwatch] buffer fullness: 10012 / 10000 metrics. Total gathered metrics: 25212. Total dropped metrics: 15200.

image

#####Scenario 2:
Though even after restarting the telegraf when total gathered metrics is just 100 out of buffer limit of 10000 still it wrote metrics only once at 12:09 PM then it doesn't update any metrics to cloudwatch at all as shown in below cloudwatch console. Though looking at telegraf log from 12 - 1 PM accumulated metrics was still under buffer limit and telegraf successfully connected to AWS cloudwatch output as per the telegraf logs but still no metrics sent to AWS

image

It will be great if you can provide some approach for troubleshooting to identify the root cause and fix this stuff ASAP to get our production monitoring live back

@danielnelson
Copy link
Contributor

There have been a lot of changes since this version, can you test with the latest 1.2.1 release?

@kamalvpm777
Copy link
Author

kamalvpm777 commented Mar 13, 2017

Sure , I just upgraded to version 1.2.1. With new version we still see the issue but we are able to narrow down the particular rootcause which is causing metrics not to be wriiten in AWS cloudwatch. Could we get some suggestion on why we are getting this error and how to fix this isssue ?

2017-03-11T07:56:42Z E! CloudWatch: Unable to write to CloudWatch : InvalidParameterValue: The value 0 for parameter MetricData.member.10.Value is invalid.
The value 0 for parameter MetricData.member.12.Value is invalid.

For every spring boot application we are retrieving drop wizard metrics through HTTP JSON INPUT in telegraf like below. In the past for all the apps we were able to ship the metrics successfully to cloudwatch , but now the same telegraf conf works for 1 app but fails for other app metrics

WORKS (Writes metrics to cloudwatch)

[[inputs.httpjson]]
name = "registration"
servers = [
"http://localhost:5481/metrics"
]

HTTP method to use (case-sensitive)

method = "GET"

TELEGRAF Log : Output Buffer (Reset on successful write)

2017-03-11T08:02:00Z D! Output [cloudwatch] buffer fullness: 68 / 10000 metrics.
2017-03-11T08:02:06Z D! Output [cloudwatch] wrote batch of 68 metrics in 6.065242049s
2017-03-11T08:02:20Z D! Output [cloudwatch] buffer fullness: 68 / 10000 metrics.
2017-03-11T08:02:26Z D! Output [cloudwatch] wrote batch of 68 metrics in 6.19724333s

NOT WORKS (Failed to write metrics to cloudwatch)

[[inputs.httpjson]]
name = "Login"
servers = [
"http://localhost:5482/metrics"
]

HTTP method to use (case-sensitive)

method = "GET"

TELEGRAF LOG : Output Buffer (Keeps accumulating without reset)

2017-03+A314:A324-11T07:56:40Z D! Output [cloudwatch] buffer fullness: 212 / 10000 metrics.
2017-03-11T07:56:42Z E! CloudWatch: Unable to write to CloudWatch : InvalidParameterValue: The value 0 for parameter MetricData.member.10.Value is invalid.
The value 0 for parameter MetricData.member.12.Value is invalid.
2017-03-11T07:56:42Z E! Error writing to output [cloudwatch]: InvalidParameterValue: The value 0 for parameter MetricData.member.10.Value is invalid.
The value 0 for parameter MetricData.member.12.Value is invalid.
2017-03-11T07:57:00Z D! Output [cloudwatch] buffer fullness: 280 / 10000 metrics.
2017-03-11T07:57:02Z E! CloudWatch: Unable to write to CloudWatch : InvalidParameterValue: The value 0 for parameter MetricData.member.12.Value is invalid.
2017-03-11T07:57:02Z E! Error writing to output [cloudwatch]: InvalidParameterValue: The value 0 for parameter MetricData.member.12.Value is invalid.

#Attached the sample HTTP JSON METRICS which is captured by Telegraf
HTTPJSON_Metrics.DOCX

@danielnelson
Copy link
Contributor

I found this blacklocus/metrics-cloudwatch#13 which seems to indicate that if you have a very small number you can receive this error from Cloudwatch.

In your sample JSON, there is one value that is 4.784335455263847e-50, I wonder if this is what is causing the problem, are you able to change the http server to return 0 for very small numbers?

@kamalvpm777
Copy link
Author

kamalvpm777 commented Mar 15, 2017

Thanks. It is even okay for cloudwatch to reject these small values , but am still curious why telegraf not ignoring these cloudwatch errors for very small number(e-50) and just write other metrics which are closer and greater than 0's to cloudwatch?

I believe that will probably fix all this issue, right now when this errors comes as per logs "output [cloudwatch] buffer fullness keeps accumulating without reset" and in the logs it says "Unable to Write to cloudwatch" . At the same time it doesn't tell anything about whether the metrics >0 written to cloudwatch or not.

Telegraf log (with errors - Unable to write to cloudwatch / output buffer accumulating 73 -> 146)

2017-03-15T00:27:20Z D! Output [cloudwatch] buffer fullness: 73 / 10000 metrics.
2017-03-15T00:27:26Z E! CloudWatch: Unable to write to CloudWatch : InvalidParameterValue: The value 0 for parameter MetricData.member.19.Value is invalid.
status code: 400, request id: 6c39d867-0916-11e7-ae62-41eb59694862
2017-03-15T00:27:26Z E! Error writing to output [cloudwatch]: InvalidParameterValue: The value 0 for parameter MetricData.member.19.Value is invalid.
status code: 400, request id: 6c39d867-0916-11e7-ae62-41eb59694862
2017-03-15T00:27:40Z D! Output [cloudwatch] buffer fullness: 146 / 10000 metrics.

Telegraf log (without errors - Wrote batch of ## metrics in seconds / output buffer reset 73 -> 73)

2017-03-14T22:16:00Z D! Output [cloudwatch] buffer fullness: 74 / 10000 metrics.
2017-03-14T22:16:10Z D! Output [cloudwatch] wrote batch of 74 metrics in 10.476617767s

Please advise on how to get telegraf simply ignore this very small numbers and go-ahead with other metrics those values closer and > than 0's. Thanks for your help on this

It would be great if you can provide some references/approach for achieving HTTP server to return 0 for small numbers. Also just found in the below link where it says CloudWatch rejects values that are either too small or too large. Values must be in the range of 8.515920e-109 to 1.174271e+108 for cloudwatch putmetrics.
http://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_PutMetricData.html

@danielnelson
Copy link
Contributor

It is even okay for cloudwatch to reject these small values , but am still curious why telegraf not ignoring these cloudwatch errors for very small number(e-50) and just write other metrics which are closer and greater than 0's to cloudwatch?

I think what telegraf should do is round these close to zero values down to exactly zero. In order to do this we need to determine the correct cutoff point for the conversion.

It appears that the inability to push the bad cloudwatch metric is blocking telegraf from processes the good metrics, at some point the bad values will drop off the buffer if they only occur on occasion. If the error occurs infrequently you could lower the metric_buffer_limit in your telegraf.conf, this will cause telegraf to become stuck for shorter periods of time.

Please advise on how to get telegraf simply ignore this very small numbers and go-ahead with other metrics those values closer and > than 0's.

The quickest solution for you may be to update the server that your httpjson input uses as its source to adjust the values before they enter telegraf. We will try to get a fix done for this before the next release, if someone wants to work on this it would be great.

@danielnelson danielnelson added the bug unexpected problem or unintended behavior label Mar 15, 2017
@danielnelson danielnelson added this to the 1.3.0 milestone Mar 15, 2017
@danielnelson danielnelson changed the title [Issue]Telegraf stopped sending metrics to cloudwatch output, total dropped metrics keeps accumulating though flush interval is set ON Cloudwatch output is blocked by metric values close to 0 Mar 15, 2017
@kamalvpm777
Copy link
Author

Thanks for the response and appreciate you taking this care as part of 1.3.0 milestone release.

Regarding the quickest solution could you please elaborate little more. Here is our env currently looks like, any suggestions /recommendations what needs to done on the source server (mesos master) to adjust the values to zero

  • Springboot apps running on Mesos slaves which expose metics on its each port
  • Apps metrics on mesos slave are directly available in mesos master by calling REST endpoint ("http://localhost:2960/metrics")
  • Telegraf conf setup and running on Mesos master which calls each app metrics through the port.
  • Local host here refers to host that is the mesos master which telegraf listens as source.
    [[inputs.httpjson]]
    servers = [
    "http://localhost:2960/metrics"
    ]

Meanwhile this small numbers( E-179) comes very often and consistently in our case. Currently we have a buffer metrics limit of 10000 even on lowering to 1000 not sure if bad values will drop off completely from the buffer or not. But I will definitely give it a try and let you know with buffer limit of default 1000 and let you know how it goes.

@danielnelson
Copy link
Contributor

Based on the metrics above, the problematic one is:

"com.wipr.lac.shop.rest.controller.ControllerV1.getproductsproductData.oneMinuteRate": 4.784335455263847e-50,

Perhaps you can round this where this metric is produced, presumably in your code somewhere.

Another suggestion is that you filter the problematic metric, should be something like:

[[inputs.httpjson]]
  name = "registration"
  servers = [
    "http://localhost:5481/metrics"
  ]
  method = "GET"
  fielddrop = ["com.wipr.lac.shop.rest.controller.ControllerV1.getproductsproductData.oneMinuteRate"]

You may have to filter additional fields, here are additional docs on filtering: https://github.com/influxdata/telegraf/blob/master/docs/CONFIGURATION.md#measurement-filtering

@danielnelson danielnelson modified the milestones: 1.3.0, 1.4.0 Apr 27, 2017
@danielnelson danielnelson modified the milestones: 1.4.0, 1.5.0 Aug 14, 2017
@allingeek
Copy link
Contributor

I'm encountering the same issue with inputs.http_listener and outputs.cloudwatch:

8/14/2017 4:06:30 PM2017-08-14T23:06:30Z E! Error writing to output [cloudwatch]: InvalidParameterValue: The value 0 for parameter MetricData.member.4.Value is invalid.
8/14/2017 4:06:30 PMThe value 0 for parameter MetricData.member.13.Value is invalid.
8/14/2017 4:06:30 PMThe value 0 for parameter MetricData.member.14.Value is invalid.

@allingeek
Copy link
Contributor

allingeek commented Aug 28, 2017

The linked PR adds code to actually enforce the constraints on CloudWatch metrics. Data points that fail those constraints are omitted in the same way the plugin omits metrics with unsupported data types.

I've tested the change in an environment that always encounters this problem and it seems to have been fixed. Boundary checking is important.

@allingeek
Copy link
Contributor

I don't suppose we could pull this change into a 1.3.X release and 1.4.X-rcs?

@danielnelson
Copy link
Contributor

@allingeek I was somewhat hoping to avoid another 1.3.X release, but only because it takes me several hours to prepare a release. I'm planning to release 1.4 by end of week, assuming this, would you still want it as a new 1.3 release?

@allingeek
Copy link
Contributor

1.4 would be fine. Thanks for the quick merge. :)

@danielnelson danielnelson modified the milestones: 1.5.0, 1.4.0 Aug 29, 2017
@danielnelson danielnelson modified the milestones: 1.4.0, 1.5.0 Aug 29, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug unexpected problem or unintended behavior
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants