-
Notifications
You must be signed in to change notification settings - Fork 36
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Initial stab at a new aggregate output check. #52
Conversation
Here's an example of it running against my staging sensu:
|
Thanks for talking through some of the stuff in slack and submitting this. I will take a closer look shortly. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall I think this looks good, I do have some suggestions some in the form of preventing bugs others are things that we can do to enhance the check.
description: 'Collects all non-ok outputs', | ||
default: true | ||
|
||
option :warning, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While I realize this is following the existing pattern it might be a good time to make some would be breaking changes and make this a bit better. Rather than having four options I believe we can reduce this by having --warning
, --critical
, and --alert-on
which we can default to whatever the community thinks makes the most sense. This reduces the required knobs to get the check working.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I read through this again and maybe it does make sense as is so that you can monitor both. For example you might want to alert on one handler definition for SLA threshold but another for do early detection. This really is expanding on the differences between severity and urgency. I do think that it would still be better served as 2 seperate check definitions as it provides more control and clarity to operators.
short: '-C PERCENT', | ||
long: '--critical PERCENT', | ||
description: 'PERCENT critical before critical (can be change with --ignore-severity)', | ||
proc: proc(&:to_i) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same comment regarding floats vs integers.
short: '-W PERCENT', | ||
long: '--warning PERCENT', | ||
description: 'PERCENT warning before warning (can be change with --ignore-severity)', | ||
proc: proc(&:to_i) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these should be floats rather than integers so that you can report and alert on SLA figures which are usually represented in the number of 9's. 99% of 1,000,000 events is not the same as 99.99% of the same number of events.
description: 'PERCENT warning before warning (can be change with --ignore-severity)', | ||
proc: proc(&:to_i) | ||
|
||
option :warning_count, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While I realize this is following the existing pattern it might be a good time to make some would be breaking changes and make this a bit better. Rather than having four options I believe we can reduce this by having --warning
, --critical
, and something like --alert-on
to specify count or percentages. I would suggest default of percentages as we can make better sane defaults for small and larger setups. This reduces the required knobs to get the check working.
description: 'PERCENT critical before critical (can be change with --ignore-severity)', | ||
proc: proc(&:to_i) | ||
|
||
option :critical_count, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same comment about combining options.
verify_ssl: verify_mode) | ||
JSON.parse(request.get) | ||
rescue Errno::ECONNREFUSED | ||
warning 'Connection refused' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need to change anything here but in general it is better to return a status and a message from a function rather than executing immediately as it makes it harder to write tests for small units of code.
end | ||
|
||
def output_results | ||
severities = %w(critical warning unknown) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
might make more sense to leverage something like this for future proofing new severities, unlike the previous one we would want to reject ok
. While this is unlikely to occur in the short term as sensu has adhered to the check spec of nagios. With sensu 2.x on the horizon I believe they will likely enhance upon the nagios check spec at some point in the future.
|
||
def aggregate_results | ||
results = api_request("/aggregates/#{config[:check]}?max_age=#{config[:age]}")['results'] | ||
warning "No aggregates found in last #{config[:age]} seconds" if %w(ok warning critical unknown).all? { |x| results[x].zero? } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
might make more sense to leverage something like this for future proofing new severities. While this is unlikely to occur in the short term as sensu has adhered to the check spec of nagios with sensu 2.x on the horizon it's possible they might want to enhance on top of the nagios check spec.
closing due to lack of response, to get this through you can either re-open the PR or open a new one. |
Pull Request Checklist
Is this in reference to an existing issue?
This is related to issue #19,
check-aggregate.rb
does not work with newer Sensu servers to show collected output. That script looks like it has grown quite a lot over time and I don't use most of those features, so I wanted to get something out that supports output and some minimum number of default options. To that end, I've created this as a new check so that it won't break anybody that is currently relying on the features in the current check.General
Update Changelog following the conventions laid out on Our CHANGELOG Guidelines
Update README with any necessary configuration snippets
[N/A] Binstubs are created if needed
RuboCop passes
Existing tests pass
New Plugins
Tests
Add the plugin to the README
Does it have a complete header as outlined here
Purpose
Very similar to
check-aggregate.rb
but it actually works with newer versions of the Sensu API to support output. It does not support all of the same features ofcheck-aggregate.rb
mostly because I don't use all of those features and some of thecheck-aggregate.rb
code depends on API functionality that no longer exists. (such as the summarize feature) I'm not opposed to adding those features to this check, and have tried to leave room to enable that to be simple, but I have not taken that on in this PR.Known Compatibility Issues
Requires Sensu 0.24 or greater