Kubernetes: Add CPU and memory capacity reporting #2935

Merged: 5 commits, Oct 21, 2016

markine commented Oct 19, 2016

What does this PR do?

Add CPU and memory capacity reporting to the Kubernetes check.

This is a rebase of #2766 with part of the code moved to KubeUtil.
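
As a rough illustration of what capacity reporting involves (a sketch under stated assumptions, not the code in this PR): the snippet below assumes a machine-info payload shaped like a dict with `num_cores` and `memory_capacity` keys, names borrowed from the variables in the diff under review. The function `capacity_metrics` and the sample payload are hypothetical.

```python
NAMESPACE = "kubernetes"  # metric prefix discussed in this PR


def capacity_metrics(machine_info, tags=None):
    """Build (name, value, tags) gauge tuples from a machine-info dict.

    Assumption: `machine_info` carries `num_cores` and `memory_capacity`
    keys. A missing key is skipped rather than reported as zero.
    """
    tags = list(tags or [])
    metrics = []
    for key, suffix in (("num_cores", ".cpu.capacity"),
                        ("memory_capacity", ".memory.capacity")):
        value = machine_info.get(key)
        if value is None:
            continue
        metrics.append((NAMESPACE + suffix, float(value), tags))
    return metrics


# Hypothetical payload: 4 cores, 16 GiB of memory.
sample = {"num_cores": 4, "memory_capacity": 16 * 1024 ** 3}
for name, value, metric_tags in capacity_metrics(sample, ["env:test"]):
    print(name, value, metric_tags)
```

Returning plain tuples keeps the sketch independent of any agent class, so whichever gauge function the check uses could consume them.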

Motivation

Monitoring cluster resources from a scheduling/allocation perspective is important. For example, over-allocating CPU to deployments can cause pod scheduling failures even when actual CPU usage in the cluster is low. We want to get ahead of these errors by monitoring capacity data in Datadog.
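
To make the over-allocation concern concrete, here is a minimal sketch of the kind of ratio one might track once capacity is available next to requests. The helper names and the 0.9 threshold are assumptions for illustration only; they are not part of the check.

```python
def allocation_ratio(requested, capacity):
    """Return requested / capacity, or None when capacity is missing or zero."""
    if not capacity:
        return None
    return float(requested) / float(capacity)


def over_allocated(requested, capacity, threshold=0.9):
    """True when the requested share of capacity reaches the threshold.

    The default threshold is an arbitrary illustrative value.
    """
    ratio = allocation_ratio(requested, capacity)
    return ratio is not None and ratio >= threshold


# Hypothetical node: 7.5 cores requested out of 8, regardless of actual usage.
print(allocation_ratio(7.5, 8))
print(over_allocated(7.5, 8))
```

The ratio depends only on requests and capacity, which is why it can flag scheduling pressure even when measured CPU usage is low.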

Testing

  • Unit tests updated.
  • Live-tested on Kubernetes 1.2.4 in AWS with this diff against dd-agent 5.9.1.

markine commented Oct 19, 2016

FYI I'll update the tests.

markine commented Oct 19, 2016

Update: tests are ready, please review. I am not familiar with your build system but I don't think the Travis CI failure is related to my change.

masci self-assigned this Oct 21, 2016

masci left a comment

This LGTM, just one comment on metric names


tags = instance.get('tags', [])
self.publish_gauge(self, NAMESPACE + '.cpu.capacity', float(num_cores), tags)
self.publish_gauge(self, NAMESPACE + '.memory.capacity', float(memory_capacity), tags)
Contributor

Could you add a node. prefix to those metrics, so that we have kubernetes.node.*?

Contributor Author

Thanks for the review, @masci. Are you sure about this? The current name goes nicely with the other metrics:

kubernetes.cpu.limits
kubernetes.cpu.requests
kubernetes.cpu.capacity
kubernetes.memory.limits
kubernetes.memory.requests
kubernetes.memory.capacity

Contributor

@markine you're right, thanks for pointing that out. I underestimated the number of node-related metrics we already have. I hope we'll be able to refactor the naming scheme eventually; for now, let's keep those as they are in your code!

masci merged commit de381f7 into DataDog:master Oct 21, 2016

markine commented Oct 21, 2016

Thank you @masci

jcejohnson pushed a commit to EFXCIA/dd-agent that referenced this pull request Oct 24, 2016
* Add patch from DataDog#2908 to better handle units.

* Port change from DataDog/dd-agent DataDog#2766

* Move machine info URL management into kubeutil

* Update kubernetes tests for capacity data.
markine deleted the feature/capacity-details branch March 22, 2017 17:19