[kubernetes_state] - Collect job metrics #686

Merged: 4 commits merged into DataDog:master on Aug 21, 2017

Conversation

@bksteiny (Contributor) commented Aug 17, 2017

What does this PR do?

Add basic Job metrics from Kubernetes State

See #653 for background info

Motivation

We'd currently like to monitor when jobs fail. In the future, we may want to monitor duration, but that's out of scope.

Testing Guidelines

An overview of testing is available in our contribution guidelines.

Versioning

  • Bumped the version check in manifest.json
  • Updated CHANGELOG.md

Additional Notes

Anything else we should know when reviewing?

@bits-bot (Collaborator)

@bksteiny, thanks for your PR! By analyzing the history of the files in this pull request, we identified @hkaj, @gmmeyer and @xvello to be potential reviewers.

@xvello self-assigned this Aug 17, 2017
@xvello added this to the 5.17 milestone Aug 17, 2017
@xvello (Contributor) left a comment

Thanks for rebasing, I hate when that happens. Sorry.

Last one, then we're ready for final testing and merging.
Once the tagging logic is reworked, can you please make sure the job and namespace tags are reported as they should be? I'll work on some unit tests during QA week.

@@ -217,17 +223,37 @@ def kube_job_complete(self, message, **kwargs):
         for metric in message.metric:
             tags = []
             for label in metric.label:
-                tags.append(self._format_tag(label.name, label.value))
+                trimmed_job = self._trim_job_tag(label.value)
+                tags.append(self._format_tag(label.name, trimmed_job))
@xvello (Contributor) commented on the diff:

Last thing before merge: the good tag logic was the one you deleted. You should:

  • iterate on labels
  • if label == job, trim, format and append
  • else format and append

In your version of the code, if a namespace value matches the pattern, it will be incorrectly trimmed.
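As an illustration, here is a minimal sketch of the loop the reviewer is describing, reusing the helper names visible in the diff above; the exact label name ('job') and the surrounding handler code are assumptions.

```python
# Sketch only: trim the value of the 'job' label and pass every other label
# through untouched, so a namespace value that happens to match the trim
# pattern is never altered.
for metric in message.metric:
    tags = []
    for label in metric.label:
        if label.name == 'job':
            trimmed_job = self._trim_job_tag(label.value)
            tags.append(self._format_tag(label.name, trimmed_job))
        else:
            tags.append(self._format_tag(label.name, label.value))
```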

@bksteiny (Contributor, Author) replied:

Apologies. I misunderstood your original comment, but as I read it again, I see what you're talking about.

Thanks for your patience

@xvello previously approved these changes Aug 18, 2017

@xvello (Contributor) left a comment

Thanks for the changes, that LGTM!
I'll test it Monday morning and merge, in time for 5.17.

@xvello (Contributor) commented Aug 21, 2017

Hi Chris!

I tested how the metric reacts with CronJobs, and it is not great: k-s-m does not flush previous job instances, and the count just increments on every run.

One can plot the value difference to get the right value, but that is unintuitive. I'll patch the check to submit a rate instead of a counter. For that, we need to count the instances, then submit the rates after processing the whole file. I'll push to your branch before merging, to keep a cleaner history.

In the meantime, do you think submitting a rate will work for your use case? One can go back to the absolute number by computing the integral of the rate.
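As a sketch of the "count first, submit after processing the whole file" approach described here (the next comment ends up using monotonic_count instead), with hypothetical metric name, tag handling, and gauge field access:

```python
from collections import defaultdict

# Hypothetical sketch: accumulate completed-job instances per tag set while
# walking the kube-state-metrics payload...
counts = defaultdict(int)
for metric in message.metric:
    tag_set = tuple(sorted(self._format_tag(label.name, label.value)
                           for label in metric.label))
    counts[tag_set] += int(metric.gauge.value)

# ...then submit one rate per tag set once the whole payload has been parsed.
for tag_set, value in counts.items():
    self.rate('kubernetes_state.job.complete', value, tags=list(tag_set))
```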

@xvello (Contributor) commented Aug 21, 2017

Hi Chris,

I just pushed a commit using the monotonic_count feature: it computes the delta from the last counter value client-side and sends it as a counter value. It's more appropriate than a rate for this data type. Here are a few screencaps:

I'll be merging before the feature freeze, but I'd love some feedback during this week, as we will be in the release-candidate phase.

Thanks for your contribution, and sorry for the back and forth; ksm metrics are quite tricky.
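For reference, a minimal sketch of what such a monotonic_count submission could look like inside the handler; the metric name and the gauge field access are assumptions, while monotonic_count itself is the AgentCheck method the comment refers to.

```python
# tags are assumed to be built as in the tag loop discussed above
# (the 'job' label trimmed, every other label passed through unchanged).
self.monotonic_count(
    'kubernetes_state.job.succeeded',  # illustrative metric name
    metric.gauge.value,                # raw, ever-growing value from k-s-m
    tags=tags,
)
# monotonic_count remembers the last raw value per metric/tag set and submits
# only the positive delta, so the growing counter becomes a per-interval count.
```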

@xvello (Contributor) left a comment

🚢

@xvello merged commit 56100e3 into DataDog:master Aug 21, 2017
@xvello mentioned this pull request Aug 21, 2017
@xvello deleted the k8s-job-metrics branch August 21, 2017 17:21
@bksteiny (Contributor, Author)

Hey Xavier,

Thanks for adjusting and merging this. I haven't used the CronJob feature, since it's still in alpha (we have alpha features disabled); my testing and use cases were specifically for the Job feature. I can't say whether a CronJob behaves the same as a Job, but I found that if you delete failed pods created by a job, the failed counter decreases by the number of removed pods. We wouldn't keep the failed pods around, but some folks might. I had planned on using the "Change Alert" monitor, but I'm open to using other monitors.

I will pull these changes down and give them a go. Thanks again for your help on this and merging it before the 5.17 freeze.

@xvello (Contributor) commented Aug 22, 2017

Hi Chris,

monotonic_count, like rate, ignores negative values, so you should be fine unless, during a single run (15 seconds), we have for a given job name:

  • one old succeeded job pod deleted
  • one new job entering succeeded state

To handle that possible race condition, we'd need to store old job names and ignore them once they've been reported as succeeded, but that would be non-trivial.

I'd love some feedback on whether the current code works OK for your use case.

Regards
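A rough, purely hypothetical sketch of the "store old job names and ignore them" idea mentioned above; this is not what the merged check does.

```python
class SucceededJobTracker(object):
    """Hypothetical helper: remember job names already reported as succeeded."""

    def __init__(self):
        self._reported = set()

    def should_report(self, job_name):
        # Once a job name has been reported, ignore it afterwards, so a deleted
        # old pod and a new success landing in the same collection run cannot
        # cancel each other out in the submitted delta.
        if job_name in self._reported:
            return False
        self._reported.add(job_name)
        return True
```

In practice this state would also need to be pruned over time, which is part of why the comment calls the approach non-trivial.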

@bksteiny (Contributor, Author)

Hey @xvello - I tried this out and it should work for us. Thanks again for your input and help.

Is there an expected release date for 5.17? I assume it's soon.

Thanks

@xvello (Contributor) commented Aug 25, 2017

Thanks for your input, Chris!

RC testing is progressing steadily, so 5.17 should be out in a few days.

Regards

gml3ff pushed a commit that referenced this pull request May 14, 2020