[kubernetes_state] - Collect job metrics #686

Merged: 4 commits merged into DataDog:master on Aug 21, 2017

Conversation

@bksteiny (Contributor) commented Aug 17, 2017

What does this PR do?

Add basic Job metrics from Kubernetes State

See #653 for background info

Motivation

We'd currently like to monitor when jobs fail. In the future, we may want to monitor duration, but that's out of scope.

Testing Guidelines

An overview of testing is available in our contribution guidelines.

Versioning

  • Bumped the version check in manifest.json
  • Updated CHANGELOG.md

Additional Notes

Anything else we should know when reviewing?

@bits-bot (Collaborator)

@bksteiny, thanks for your PR! By analyzing the history of the files in this pull request, we identified @hkaj, @gmmeyer and @xvello to be potential reviewers.

@xvello self-assigned this Aug 17, 2017
@xvello added this to the 5.17 milestone Aug 17, 2017
@xvello (Contributor) left a comment

Thanks for rebasing, I hate when that happens. Sorry.

Last one, then we're ready for final testing and merging.
Once the tagging logic is reworked, can you please make sure the job and namespace tags are reported as they should be? I'll work on some unit tests during QA week.

@@ -217,17 +223,37 @@ def kube_job_complete(self, message, **kwargs):
         for metric in message.metric:
             tags = []
             for label in metric.label:
-                tags.append(self._format_tag(label.name, label.value))
+                trimmed_job = self._trim_job_tag(label.value)
+                tags.append(self._format_tag(label.name, trimmed_job))
@xvello (Contributor) commented on the diff:

Last thing before merge: the good tag logic was the one you deleted. You should:

  • iterate on labels
  • if label == job, trim, format and append
  • else format and append

In your version of the code, if a namespace value matches the pattern, it will be incorrectly trimmed.
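As an illustration, here is a minimal sketch of the loop the reviewer is describing, reusing the helper names visible in the diff above; the exact label name ('job') and the surrounding handler code are assumptions.

```python
# Sketch only: trim the value of the 'job' label and pass every other label
# through untouched, so a namespace value that happens to match the trim
# pattern is never altered.
for metric in message.metric:
    tags = []
    for label in metric.label:
        if label.name == 'job':
            trimmed_job = self._trim_job_tag(label.value)
            tags.append(self._format_tag(label.name, trimmed_job))
        else:
            tags.append(self._format_tag(label.name, label.value))
```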

@bksteiny (Contributor, Author) replied:

Apologies. I misunderstood your original comment, but as I read it again, I see what you're talking about.

Thanks for your patience

@xvello previously approved these changes Aug 18, 2017

@xvello (Contributor) left a comment

Thanks for the changes, that LGTM!
I'll test it Monday morning and merge, in time for 5.17.

@xvello (Contributor) commented Aug 21, 2017

Hi Chris!

I tested how the metric reacts with CronJobs, and it is not great: k-s-m does not flush previous job instances, and the count just increments on every run.

One can plot the value difference to get the right value, but that is unintuitive. I'll patch the check to submit a rate instead of a counter. For that, we need to count the instances, then submit the rates after processing the whole file. I'll push to your branch before merging, to keep a cleaner history.

In the meantime, do you think submitting a rate will work for your use case? One can go back to the absolute number by computing the integral of the rate.
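As a sketch of the "count first, submit after processing the whole file" approach described here (the next comment ends up using monotonic_count instead), with hypothetical metric name, tag handling, and gauge field access:

```python
from collections import defaultdict

# Hypothetical sketch: accumulate completed-job instances per tag set while
# walking the kube-state-metrics payload...
counts = defaultdict(int)
for metric in message.metric:
    tag_set = tuple(sorted(self._format_tag(label.name, label.value)
                           for label in metric.label))
    counts[tag_set] += int(metric.gauge.value)

# ...then submit one rate per tag set once the whole payload has been parsed.
for tag_set, value in counts.items():
    self.rate('kubernetes_state.job.complete', value, tags=list(tag_set))
```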

@xvello (Contributor) commented Aug 21, 2017

Hi Chris,

I just pushed a commit using the monotonic_count feature: it computes the delta from the last counter value client-side and sends it as a counter value. It's more appropriate than a rate for this data type. Here are a few screencaps:

I'll be merging before the feature freeze, but I'd love some feedback during this week, as we will be in the release-candidate phase.

Thanks for your contribution, and sorry for the back and forth; ksm metrics are quite tricky.
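For reference, a minimal sketch of what such a monotonic_count submission could look like inside the handler; the metric name and the gauge field access are assumptions, while monotonic_count itself is the AgentCheck method the comment refers to.

```python
# tags are assumed to be built as in the tag loop discussed above
# (the 'job' label trimmed, every other label passed through unchanged).
self.monotonic_count(
    'kubernetes_state.job.succeeded',  # illustrative metric name
    metric.gauge.value,                # raw, ever-growing value from k-s-m
    tags=tags,
)
# monotonic_count remembers the last raw value per metric/tag set and submits
# only the positive delta, so the growing counter becomes a per-interval count.
```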

@xvello (Contributor) left a comment

🚢

@xvello merged commit 56100e3 into DataDog:master Aug 21, 2017
@xvello mentioned this pull request Aug 21, 2017
@xvello deleted the k8s-job-metrics branch August 21, 2017 17:21
@bksteiny (Contributor, Author)

Hey Xavier,

Thanks for adjusting and merging this. I haven't used the CronJob feature, since it's still in alpha (we have alpha features disabled); my testing and use cases were specifically for the Job feature. I can't say whether a CronJob behaves the same as a Job, but I found that if you delete failed pods created by a job, the failed counter decreases by the number of removed pods. We wouldn't keep the failed pods around, but some folks might. I had planned on using the "Change Alert" monitor, but I'm open to using other monitors.

I will pull these changes down and give them a go. Thanks again for your help on this and merging it before the 5.17 freeze.

@xvello (Contributor) commented Aug 22, 2017

Hi Chris,

monotonic_count, like rate, ignores negative values, so you should be fine unless, during a single run (15 seconds), we have for a given job name:

  • one old succeeded job pod deleted
  • one new job entering succeeded state

To handle that possible race condition, we'd need to store old job names and ignore them once they've been reported as succeeded, but that would be non-trivial.

I'd love some feedback on whether the current code works OK for your use case.

Regards
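A rough, purely hypothetical sketch of the "store old job names and ignore them" idea mentioned above; this is not what the merged check does.

```python
class SucceededJobTracker(object):
    """Hypothetical helper: remember job names already reported as succeeded."""

    def __init__(self):
        self._reported = set()

    def should_report(self, job_name):
        # Once a job name has been reported, ignore it afterwards, so a deleted
        # old pod and a new success landing in the same collection run cannot
        # cancel each other out in the submitted delta.
        if job_name in self._reported:
            return False
        self._reported.add(job_name)
        return True
```

In practice this state would also need to be pruned over time, which is part of why the comment calls the approach non-trivial.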

@bksteiny (Contributor, Author)

Hey @xvello - I tried this out and it should work for us. Thanks again for your input and help.

Is there an expected release date for 5.17? I assume it's soon.

Thanks

@xvello (Contributor) commented Aug 25, 2017

Thanks for your input, Chris!

RC testing is progressing steadily, so 5.17 should be out in a few days.

Regards

gml3ff pushed a commit that referenced this pull request May 14, 2020