-
Notifications
You must be signed in to change notification settings - Fork 814
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[riakcs] support Riak CS 2.1+ stats format #2920
Conversation
|
||
STATS_METHODS = { | ||
"one": "count", | ||
"total": "monotonic_count", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For metrics ending in _one
or _total
I guessed they would be better tracked as counts than a as simple gauge. It wasn't entirely clear which method was the best for each of these. More info on various types of stats here: https://github.com/basho/riak_cs/wiki/%5BRFC%5D-Riak-CS-and-Stanchion-metrics-2.1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I'm reviewing these again and after reviewing the stats reported in my test, I think total
maybe should use gauge
rather than monotonic_count
. Thoughts?
Thanks a lot @millerdev Could you update this test with the new format ? Also i'm a bit worried about the number of new metrics it collects. Looks like it will start collecting thousands of metrics while it was just collecting in the hundreds. Any possibility to trim that down and to focus on the most relevant metrics ? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@remh I pared down the list of collected metrics and added tests for the new stats format.
# config; the value should be a list of metric names. | ||
# | ||
# Helpful references: | ||
# - https://github.com/basho/riak_cs/wiki/Riak-cs-and-stanchion-metrics |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I chose to only collect a limited set of S3 API and memory metrics by default. I'm not sure how to identify the set that most people will want, so I just went with a set that seemed reasonable for my usage (although I haven't done anything with it yet, so this may evolve). Here are some reasons for each of the exclusions listed above:
bucket_acl_(get|put)
andobject_acl_(get|put)
- I'm not using ACLs.bucket_policy_(get|put|delete)
- I'm not doing any automated bucket policy updates.*_in_(one|total)
- guessing this is a similar metric to the_out_
variants.*_time_error_*
- it seems like the timing of errors is less important than the count of errors.*_time_100
- had a hard time figuring out what this even means. Is it 99.9th percentile or worst case (max) latency? Decided to exclude it since it was not in the old metrics format.
The total number of metrics collected now (by default) is around 150.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you!
What does this PR do?
Fixes #2340
Motivation
Our Riak CS integration has been saying "no data received" for a long time. This adapts the existing collector to the new stats format introduced in Riak CS 2.1.
Testing Guidelines
There are no tests for the
riakcs
stats check. However, I droppedriakcs.py
from this PR on one of our servers and verified that it works and that the stats can be queried in the web interface.Additional Notes
The new stats format includes many more stats (>1000), and the app link for riakcs in our infrastructure list on the website now takes FOREVER to load, and it freezes the whole page while it's loading. Seems like that could be optimized, but it's outside the scope of this change.