
[riakcs] support Riak CS 2.1+ stats format #2920

Merged 2 commits on Nov 17, 2016
Conversation

millerdev (Contributor)

What does this PR do?

Fixes #2340

Motivation

Our Riak CS integration has been saying "no data received" for a long time. This adapts the existing collector to the new stats format introduced in Riak CS 2.1.

Testing Guidelines

There are no tests for the riakcs stats check. However, I dropped riakcs.py from this PR on one of our servers and verified that it works and that the stats can be queried in the web interface.

Additional Notes

The new stats format includes many more stats (>1000), and the app link for riakcs in our infrastructure list on the website now takes FOREVER to load, and it freezes the whole page while it's loading. Seems like that could be optimized, but it's outside the scope of this change.


STATS_METHODS = {
    "one": "count",
    "total": "monotonic_count",
}
millerdev (author) commented:

For metrics ending in _one or _total, I guessed they would be better tracked as counts than as a simple gauge. It wasn't entirely clear which method was best for each of these. More info on the various stat types here: https://github.com/basho/riak_cs/wiki/%5BRFC%5D-Riak-CS-and-Stanchion-metrics-2.1
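To illustrate the idea, here is a minimal sketch (not the actual riakcs.py code; `method_for` is a hypothetical helper) of how a suffix-to-method mapping like this could pick the submission method for each stat:

```python
# Map a metric-name suffix to a Datadog submission method.
# <name>_one counts events in the last interval; <name>_total is an
# ever-increasing counter, hence monotonic_count.
STATS_METHODS = {
    "one": "count",
    "total": "monotonic_count",
}

def method_for(metric_name, default="gauge"):
    """Pick a submission method based on the metric name's suffix,
    falling back to a plain gauge for everything else."""
    suffix = metric_name.rsplit("_", 1)[-1]
    return STATS_METHODS.get(suffix, default)
```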

millerdev (author) commented:

As I'm reviewing these again, and after reviewing the stats reported in my test, I think total should perhaps use gauge rather than monotonic_count. Thoughts?
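For context on why the choice matters, a toy illustration (values are made up): a gauge reports each raw value as-is, while a monotonic count submits the delta between successive raw values of an increasing counter.

```python
# Successive scrapes of some hypothetical <name>_total stat.
raw_totals = [100, 140, 190]

# gauge: each raw value is reported unchanged.
gauge_points = raw_totals

# monotonic_count: only the increase between scrapes is reported.
monotonic_points = [b - a for a, b in zip(raw_totals, raw_totals[1:])]
```

If the upstream value resets (e.g. on a Riak CS restart), a monotonic count typically ignores the negative delta, whereas a gauge would show the drop.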

@masci masci added this to the Triage milestone Oct 18, 2016
@remh remh modified the milestones: 5.10.0, Triage Oct 21, 2016
remh commented Oct 25, 2016

Thanks a lot @millerdev!

Could you add some tests? The Riak CS tests are mocked (https://github.com/DataDog/dd-agent/blob/master/tests/checks/mock/test_riakcs.py), since setting up Riak CS in a test environment is not straightforward. Could you update that test with the new format?

Also, I'm a bit worried about the number of new metrics this collects. It looks like it will start collecting thousands of metrics where it previously collected hundreds. Is there any possibility to trim that down and focus on the most relevant metrics?

@remh remh self-assigned this Oct 25, 2016
@remh remh modified the milestones: 5.10.1, 5.10.0 Oct 27, 2016
millerdev (author) left a comment:

@remh I pared down the list of collected metrics and added tests for the new stats format.

# config; the value should be a list of metric names.
#
# Helpful references:
# - https://github.com/basho/riak_cs/wiki/Riak-cs-and-stanchion-metrics
millerdev commented Nov 10, 2016:

I chose to collect only a limited set of S3 API and memory metrics by default. I'm not sure how to identify the set that most people will want, so I went with a set that seemed reasonable for my usage (although I haven't done anything with it yet, so this may evolve). Here are some reasons for each of the exclusions listed above:

  • bucket_acl_(get|put) and object_acl_(get|put) - I'm not using ACLs.
  • bucket_policy_(get|put|delete) - I'm not doing any automated bucket policy updates.
  • *_in_(one|total) - guessing this is a similar metric to the _out_ variants.
  • *_time_error_* - it seems like the timing of errors is less important than the count of errors.
  • *_time_100 - had a hard time figuring out what this even means. Is it 99.9th percentile or worst case (max) latency? Decided to exclude it since it was not in the old metrics format.

The total number of metrics collected now (by default) is around 150.
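The exclusions above can be expressed as glob-style patterns. A hypothetical sketch (not the actual riakcs.py code) of filtering the 1000+ incoming stat names down to the default set:

```python
# Filter stat names against glob-style exclusion patterns using the
# stdlib fnmatch module. Pattern list mirrors the exclusions discussed.
from fnmatch import fnmatch

EXCLUDE_PATTERNS = [
    "bucket_acl_*", "object_acl_*",   # ACL operations
    "bucket_policy_*",                # automated bucket policy updates
    "*_in_one", "*_in_total",         # assumed similar to _out_ variants
    "*_time_error_*",                 # error timings (counts matter more)
    "*_time_100",                     # unclear meaning; not in old format
]

def keep_metric(name):
    """Return True if the stat name matches no exclusion pattern."""
    return not any(fnmatch(name, pat) for pat in EXCLUDE_PATTERNS)
```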

remh left a comment:

Thank you!

@remh remh merged commit 20485a6 into DataDog:master Nov 17, 2016