Support consumers to access the metrics through Kubernetes metrics API #279
@ihac Hi An, thanks for reviewing our proposal! I noted down some thoughts below. (One thing I should mention first is that my experience is mostly on the node OS side (Container-Optimized OS); Lantao has much more experience with the Kubernetes ecosystem.)
Seems that npd-metrics-adapter sits on a different node from NPD. I wonder why not let NPD directly report to the apiserver using the metrics API? That would mean fewer dependencies for NPD and fewer single points of failure. I'm happy to see that the metrics exported by NPD can be consumed by the apiserver as well. I totally agree that there could/should be an exporter that can export NPD metrics to the apiserver somehow. That way, even users who do not have a third-party metrics-based monitoring solution (Stackdriver, Datadog, etc.) can still get metrics-based monitoring support from Kubernetes. I'm not sure what the right way to implement that is, but if we can find a good way, I think this kind of exporter will be valuable.
I'm not sure if this is the right purpose for NPD. NPD is designed to monitor only node health, for many reasons. One that I really care about is that allowing NPD to monitor other stuff (e.g. pods) may make NPD too big and complex, which makes it harder to guarantee its resource consumption and reliability. Can you describe what kind of object you have in mind? In my opinion:
I would also like to hear opinions from the current maintainers: @dchen1107 @andyxning @Random-Liu
If the code is open sourced, could you share the extended-NPD code? I'd like to learn from the abstract data model and the unified exporter interface. They may help me improve my current PR #275. Thanks!
Hi @xueweiz Thanks for the feedback.
We're using
Couldn't agree more.
Sorry for not being clear. We agree that NPD specializes in detecting and reporting node problems, and we're not going to handle pod metrics with NPD. However, some problems are related to the pods/containers running on the node, and we want to expose some basic pod/container information if possible. Of course, the problem itself still lies within the node ecosystem, but it'd be better if we could attach that extra information to it.
Currently it is for internal use and we're still trying to improve it (actually, we discussed it briefly with @Random-Liu a few days ago). Anyway, we do have a plan to share it soon (hopefully it will help), along with the npd-metrics-adapter.
@xueweiz Thank you for your prompt reply. I totally agree that the goal of NPD is detecting node problems.
I see. Thanks for the explanation! I think NPD has the capability to support this.
That would be great :) Thanks a lot! And just for reference: in my new proposal, I plan to allow flexible problem daemon registration in NPD; see the "More Pluggable Problem Daemon" section. So in case your code has private APIs that are not suitable for the open source NPD project, there is still an option to maintain a private pluggable problem daemon in a downstream NPD while keeping rebases easy (rough sketch of the idea below).
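To illustrate the downstream option (just a sketch of the registration pattern; the names and signatures here are hypothetical, not the actual API in the proposal), a private problem daemon could live in its own package and register itself at init time, so rebasing onto upstream stays cheap:

```go
package exampledaemon

// Sketch of a pluggable problem-daemon registry; the real interface in the
// proposal may look different. All names here are hypothetical.

// Monitor is a stand-in for whatever interface NPD would expect a problem daemon to implement.
type Monitor interface {
	Start() error
	Stop()
}

// registry maps a problem-daemon type to a factory that builds it from a config path.
var registry = map[string]func(configPath string) Monitor{}

// Register lets each problem-daemon package plug itself in.
func Register(name string, factory func(configPath string) Monitor) {
	registry[name] = factory
}

// A private, downstream-only daemon registers itself without touching upstream code.
func init() {
	Register("veth-leak-monitor", func(configPath string) Monitor {
		return &vethLeakMonitor{configPath: configPath}
	})
}

type vethLeakMonitor struct{ configPath string }

func (m *vethLeakMonitor) Start() error { return nil } // detection logic would go here
func (m *vethLeakMonitor) Stop()        {}
```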
@CodeJuan
Sounds fair. In today's NPD model, I guess you would report an event and provide the Pod name in the message field. I don't know whether you want to report metrics on this problem as well. If so, how do you want the metric to look? In my mind, it could take a couple of different shapes. By the way, just curious, could you share some context about this "container end of veth pair leak to host caused by a kernel bug"? For example, how do you detect it (by watching the kernel log? the docker log?)? Is it temporary or permanent? Thanks!
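For illustration only — a minimal sketch with made-up metric and tag names, assuming an OpenCensus-style data model — the two shapes could be modeled like this:

```go
package main

import (
	"go.opencensus.io/stats"
	"go.opencensus.io/stats/view"
	"go.opencensus.io/tag"
)

// Hypothetical measure: number of leaked veth devices detected on this node.
var leakedVeths = stats.Int64("node/leaked_veth_count",
	"Leaked veth devices on the node", stats.UnitDimensionless)

// Shape 1: a plain node-level gauge, e.g. exported as leaked_veth_count 3.
var nodeView = &view.View{
	Name:        "leaked_veth_count",
	Measure:     leakedVeths,
	Description: "Leaked veth devices on the node",
	Aggregation: view.LastValue(), // gauge semantics
}

// Shape 2: the same measure broken out by the affected Pod,
// e.g. exported as leaked_veth_count_by_pod{pod="nginx-0"} 1.
var podKey = tag.MustNewKey("pod")
var perPodView = &view.View{
	Name:        "leaked_veth_count_by_pod",
	Measure:     leakedVeths,
	Description: "Leaked veth devices attributed to a Pod",
	Aggregation: view.LastValue(),
	TagKeys:     []tag.Key{podKey},
}

func main() {
	// Register both views; an exporter (Prometheus, Stackdriver, ...) would then pick them up.
	if err := view.Register(nodeView, perPodView); err != nil {
		panic(err)
	}
}
```

Whether shape 2 is acceptable probably comes back to the earlier question of how much pod information a node-scoped daemon should expose.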
Hi xuewei,
I'm sorry for my incorrect description. There are two types of exceptions.
# The first type: a veth is attached to the bridge, but no running container owns it.
veths_in_use=()
for c in $(docker ps -q); do                              # iterate all running containers
  # One way to resolve the host end (assumes docker): eth0's iflink is the host-side ifindex.
  idx=$(docker exec "$c" cat /sys/class/net/eth0/iflink 2>/dev/null)
  # Map the ifindex back to the host-side veth name and record it as "in use".
  [[ -n "$idx" ]] && veths_in_use+=("$(ip -o link | awk -F': ' -v i="$idx" '$1 == i {print $2}' | cut -d@ -f1)")
done
veths_host=($(brctl show | grep -o 'veth[^[:space:]]*'))  # all veths attached to the bridge
# Compare the two sets: a veth on the bridge that no container uses is a leaked veth.
for v in "${veths_host[@]}"; do
  [[ " ${veths_in_use[*]} " != *" $v "* ]] && echo "leaked veth: $v"
done
# The second type: a veth exists on the host but is not attached to any bridge.
veths=$(ip -o link | grep -o 'veth[^@:]*')                # all veth devices on the host
for v in $veths; do
  if [[ $(brctl show | grep -c "$v") -eq 0 ]]; then
    echo "unattached veth: $v"                            # candidate for re-attachment
  fi
done
The second type is clearer. Then the remedy system would re-attach the veth to the container or migrate the container by the pod name. But I'm not sure
We do plan to support Gauge typed metrics. See the "Extending Today’s Problem Daemon to Report Metrics" section of the proposal.
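As a rough sketch of what Gauge-style reporting could look like with the OpenCensus library and its Prometheus exporter (my own illustration with hypothetical names and port, not text from the proposal):

```go
package main

import (
	"context"
	"log"
	"net/http"

	"contrib.go.opencensus.io/exporter/prometheus"
	"go.opencensus.io/stats"
	"go.opencensus.io/stats/view"
)

// Hypothetical gauge: current count of a given problem on this node.
var problemGauge = stats.Int64("node_problem_gauge",
	"Current problem count on the node", stats.UnitDimensionless)

func main() {
	// LastValue aggregation gives the measure gauge semantics.
	err := view.Register(&view.View{
		Name:        "node_problem_gauge",
		Measure:     problemGauge,
		Description: "Current problem count on the node",
		Aggregation: view.LastValue(),
	})
	if err != nil {
		log.Fatal(err)
	}

	// Expose the registered views on a Prometheus /metrics endpoint.
	pe, err := prometheus.NewExporter(prometheus.Options{Namespace: "npd"})
	if err != nil {
		log.Fatal(err)
	}
	view.RegisterExporter(pe)

	// A problem daemon would call stats.Record each time it re-checks the node.
	stats.Record(context.Background(), problemGauge.M(2))

	http.Handle("/metrics", pe)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```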
@xueweiz
Oh, I didn't realize that :) If you have any ideas, it would be best if you could comment on the Google Doc, so that everyone can participate in the discussion. Node-Problem-Detector_ Metrics Exporting Option (EXTERNAL).pdf
/cc @wangzhen127
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Rotten issues close after 30d of inactivity. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
@fejta-bot: Closing this issue.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Hi @xueweiz @Random-Liu
Based on the proposal you shared recently, I'd like to share a typical use case at Alibaba. Hopefully it will help you refine your design.

We all agree that reporting metrics from NPD has become a hard requirement for production use. NPD should be able to transform node status into metrics and expose them to Prometheus or any other external monitoring system. What's more, we expect to do more in the upper layers based on these metrics. One typical scenario is that when something goes wrong on a node, we want to remove the node from the Kubernetes cluster and then repair it with our internal IDC management system. To solve this kind of problem, we'd like to extend NPD to collect and expose custom metrics, and also develop an NPD-specific implementation of the Kubernetes metrics API (named npd-metrics-adapter) which allows consumers (e.g. a remedy system) to access the metrics through the Kubernetes apiserver.

One thing to note, though: sometimes we want to report custom metrics that describe a Kubernetes object other than a node, which places a higher demand on the abstraction of the underlying data model in NPD. I'm not sure whether your design is able to cover this need.

In fact, we've already implemented a simple but workable version of the extended NPD and the npd-metrics-adapter. Our design is in close agreement with your proposal: we proposed an abstract data model in NPD to cover all kinds of node statuses (events, conditions and simple metrics) and a unified exporter interface to adapt to multiple different upstream systems (e.g. Kubernetes, Prometheus). I think you've done a remarkable job by utilizing the OpenCensus lib to support flexible metrics collection and aggregation; that is very much the capability we want, allowing users to collect and export arbitrary metrics with NPD so that our upstream systems (e.g. the HPA controller, the remedy system) can accomplish more complex tasks.

FYI, the figure below demonstrates how our system works based on the extended NPD and npd-metrics-adapter. We would greatly appreciate it if you could give us some feedback.
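To make the consumer side concrete, here is a rough sketch (my own illustration; the node name, metric name, and the assumption that the adapter serves the standard custom.metrics.k8s.io group are examples, not guarantees about npd-metrics-adapter) of how a remedy system could read such a metric through the apiserver:

```go
package main

import (
	"context"
	"fmt"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the local kubeconfig (in-cluster config would also work).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// Read a hypothetical node-scoped custom metric through the aggregated API.
	raw, err := client.CoreV1().RESTClient().Get().
		AbsPath("/apis/custom.metrics.k8s.io/v1beta1/nodes/node-1/node_problem_gauge").
		DoRaw(context.TODO())
	if err != nil {
		panic(err)
	}
	fmt.Println(string(raw))
}
```

In this setup the remedy system never talks to NPD directly; it only depends on the apiserver.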