k8s w/ ksm integration issues #1853
Comments
Hi @SleepyBrett, first, please note that you can disable the KSM installed by our chart by setting the corresponding chart option. As for the KSM-monitoring agents, could you please:
If the issue persists, we'll need a full debug agent flare sent to our support team to investigate further. Cheers
Ok, so I can deploy one less KSM as long as I continue to run it in my namespace, but that doesn't really answer my question: "How do you do KSM service discovery?" What if my Prometheus KSM is configured differently from the one you deploy? You should be doing discovery by labels and applying Datadog ownership labels to your own KSM. We are running 6.2; your performance on the node w/ KSM is still abysmal, and I fear that, because of the high number of agent crashes, I'm losing other metrics on that node.
What exactly does a check runner do? This documentation doesn't actually explain what's happening or what tradeoffs I'm making when I increase this value. Wait... https://github.com/DataDog/datadog-agent/wiki/Python-Check-Runner So this is going to cause more resource utilization on ALL pods just so one of them can keep up with KSM? This seems like a bad solution to a bad design. Why can't your runner realize that it's falling behind on a given node and scale its own additional runners? Or better yet, realize that KSM is a big hunk of metrics on any cluster that isn't tiny and sidecar a special agent into that pod to run separately from the main agent DaemonSet?
Hi @SleepyBrett, our autodiscovery process is documented at https://docs.datadoghq.com/agent/autodiscovery/ and works across all namespaces by locally querying the kubelet. Our support team will be happy to help you set it up for your cluster's specifics.
So wait, you are just essentially checking for Docker container names that contain 'kube-state-metrics' and then hitting their metrics endpoint? I run a multi-tenant cluster; I can't guarantee that other teams won't run their own kube-state-metrics for their own purposes and pollute the data. I suggest any of the following approaches:
Of all those options, 1 still seems the most sensible. It works around a number of problems: it allows for one special agent that can be tweaked for resources and scheduling, it guarantees that other collectors will not be interrupted by overloading the primary node agent, and it means you wouldn't have to do autodiscovery at all, since KSM will be on localhost with the special agent. It seems like I can probably implement this workaround myself.
Hey @SleepyBrett, I just wanted to weigh in on this thread - apologies for the slight delay!

You are correct: in this case, we will be looking for pods running a container image called kube-state-metrics. As you probably saw in the doc @xvello mentioned, this is one of the several processes used to do Autodiscovery (others rely on annotations or on KV stores). I understand that given your environment this may not be the desired behavior; luckily it is possible to disable it via a ConfigMap.

This default behavior is enabled via the file backend for Autodiscovery. You can see the configuration in the check's auto_conf.yaml file, which defines the configuration of the check as well as the identifiers used to discover the containers/pods the integration should monitor.

You can disable this behavior by adding a ConfigMap to the datadog-agent which replaces that auto_conf.yaml file. You can then optionally enable the integration for the KSM instances you do want to monitor by adding the following annotations to their manifest:
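(The original annotation snippet is not preserved in this copy of the thread; the following is a minimal sketch of what annotation-based Autodiscovery for a KSM pod typically looks like, assuming the container is named kube-state-metrics and serves metrics on KSM's default port 8080. The kube_state_url instance key and the %%host%% template variable are the usual kubernetes_state check parameters.)

```yaml
# Sketch only: annotation-based Autodiscovery for a KSM pod you DO want monitored.
# Assumes the container in the pod is named "kube-state-metrics" and serves
# metrics on KSM's default port 8080.
apiVersion: v1
kind: Pod
metadata:
  name: kube-state-metrics
  annotations:
    ad.datadoghq.com/kube-state-metrics.check_names: '["kubernetes_state"]'
    ad.datadoghq.com/kube-state-metrics.init_configs: '[{}]'
    ad.datadoghq.com/kube-state-metrics.instances: '[{"kube_state_url": "http://%%host%%:8080/metrics"}]'
spec:
  containers:
    - name: kube-state-metrics
      image: quay.io/coreos/kube-state-metrics:v1.5.0
      ports:
        - containerPort: 8080
```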
You can find more details on file-based and annotation-based autodiscovery in our documentation at https://docs.datadoghq.com/agent/autodiscovery/#template-source-kubernetes-pod-annotations

That being said, as @xvello mentioned, Datadog does not require our own instance of kube-state-metrics. This is an optional dependency and we're happy to monitor the existing instances on your cluster.

In addition, we wanted to share a bit more about the performance challenges being seen with KSM in your cluster, since there have been a few issues on this topic. There are presently two bottlenecks with the kube-state-metrics check which are impacting your environment:

1/ Payload size in large clusters

The size of the output generated when hitting the KSM metrics endpoint grows with the number of objects in the cluster. There are a few ways to address this:
So even if you set this in your DaemonSet, only the agent running the KSM check will have an increased number of runners.
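(For reference, a rough sketch of what raising the runner count on the DaemonSet looks like, assuming the setting being discussed here is the agent's check_runners option exposed through the DD_CHECK_RUNNERS environment variable; that mapping is an assumption, not quoted from this thread:)

```yaml
# Sketch: in the agent DaemonSet's container spec. Assumes the setting in
# question is check_runners / DD_CHECK_RUNNERS. Every agent pod gets the env
# var, but only the one actually scheduling the KSM check benefits from it.
env:
  - name: DD_CHECK_RUNNERS
    value: "4"
```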
We agree that the performance here is not ideal and are working on a number of solutions to upstream kube-state-metrics and prometheus_client to allow them to scale as your clusters grow.

2/ Processing limitations of the upstream library

One of the most recent improvements we've made was to contribute a 4.5x performance improvement to the standard Python prometheus-client library. You can find the PR at prometheus/client_python#282.

I hope this helps provide a better understanding of the behavior you are seeing and offers some insight into our plans moving forward. We would be happy to discuss any ideas or suggestions you may have on how to further improve the experience. Thank you for your patience and feedback on the check and integration. Best,
This seems to be a problem with your agent. As a test I spun up a pod with KSM + veneur-prometheus + veneur and it can parse and ship the metrics without fail (~75 nodes, ~7,500 pods), not to mention my cluster-local Prometheus is scraping it quite happily. You do a significant amount of munging of the data gathered from KSM; I'd suggest that the proper direction for you to move in is to create your own "KSM" that does all the transforms you need and ships directly from that pod.
I believe we're hitting the same issue with our cluster. Is there a definitive way to confirm that this is what we're running into? Our basic symptoms are that any Datadog agent living on the same node as the datadog-kube-state-metrics instance has lots of restarts, which result in gaps in data collection of kube-state-metrics.
@endzyme thank you for your report and apologies for the headache. We still have this on our roadmap; it has not been an easy thing to fix, and we decided against forking KSM (or re-writing our own version of it).

The current workaround is to have one agent, with higher memory/CPU specs, deployed as a sidecar of KSM and solely running the KSM check, as suggested earlier in this thread (a rough sketch of this setup follows below). If this is still not enough, we can split the KSM deployment into 2 collectors (one with pod-level data and the other with the rest) and have the agent run the KSM checks at different frequencies in order to provide stable and consistent collection.

While these are workarounds, we are planning a stable solution that would benefit from local metrics collected from the kubelet and cluster-level metrics collected from KSM, in order to eliminate as much as possible the overlap that exists between them. Best,
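A rough, purely illustrative sketch of that sidecar workaround, assuming a standard KSM 1.5 image and an Agent 6 image; all names, resource numbers, and the referenced Secret/ConfigMap are placeholders, not a recommended configuration:

```yaml
# Sketch: KSM Deployment with a dedicated, beefier Datadog agent sidecar that
# only runs the kubernetes_state check (RBAC and the referenced Secret /
# ConfigMap are omitted for brevity).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kube-state-metrics
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kube-state-metrics
  template:
    metadata:
      labels:
        app: kube-state-metrics
    spec:
      serviceAccountName: kube-state-metrics
      containers:
        - name: kube-state-metrics
          image: quay.io/coreos/kube-state-metrics:v1.5.0
          ports:
            - containerPort: 8080
        - name: datadog-agent-ksm
          image: datadog/agent:6.13.0
          env:
            - name: DD_API_KEY
              valueFrom:
                secretKeyRef:
                  name: datadog-secret
                  key: api-key
            # Turn off everything this sidecar does not need.
            - name: DD_APM_ENABLED
              value: "false"
            - name: DD_LOGS_ENABLED
              value: "false"
            - name: DD_PROCESS_AGENT_ENABLED
              value: "false"
          resources:
            requests:
              memory: 512Mi
              cpu: 200m
            limits:
              memory: 1Gi
          volumeMounts:
            # Shadows the bundled kubernetes_state.d directory (including its
            # auto_conf.yaml) with a single static config pointing at localhost.
            - name: ksm-check-conf
              mountPath: /etc/datadog-agent/conf.d/kubernetes_state.d
      volumes:
        - name: ksm-check-conf
          configMap:
            name: ksm-check-conf  # conf.yaml with kube_state_url: http://localhost:8080/metrics
```

Note that the node-level DaemonSet agents would still discover this pod via the image-based auto_conf template, so the empty-file override discussed later in this thread is still needed there.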
Thanks @CharlyF for the workarounds! These will be helpful as we scale. I am thinking of contributing to the existing datadog-agent helm chart to allow for a separate agent deployment dedicated to kube-state-metrics.
Of course!
Thanks for the heads up - maybe we'll run into each other at KubeCon. Sounds like I should hold off on contributing until you all can come up with a game plan. We appreciate you all digging into this!
We are running into this as well. Our clusters are growing to sizes where the output of the KSM metrics endpoint is becoming a problem. I discovered something that can help a lot to alleviate this issue until a more final fix is found:
|
I think a good, easy solution for this and other Kubernetes issues is to create a stripped-down agent that ONLY does checks: ship it with all the checks off and allow us to turn them on. This container should not perform any other checks, including the built-in system checks. Also, it should be set up to not run as root (a huge problem with your other images). In this way we could sidecar that agent onto KSM, and customers on our cluster could sidecar it onto their nginx pods and whatnot to get your integrations. Those agents should also be set up to just emit statsd (or be configurable to do that), with configuration for the statsd target host and port.
FYI, KSM 1.5 was released a couple of weeks ago. KSM's performance was massively improved in kubernetes/kube-state-metrics#498, so that should help with the above issue too. In our case it was an almost 10x improvement.
So I followed the advice here (no CPU limits on v1.5.0 KSM), but the co-located datadog-agent pod still restarts non-stop due to OOM errors, and I do not want to increase memory requests just because one pod of the DaemonSet is restarting. Any suggestions? Logs say:
|
I think it's been suggested a few times above to try deploying a separate set of agents specifically for kube-state-metrics collection. That way you can manage those resources differently from your normal agents for metrics and statsd collection.
Hmm, seeing that this approach isn't properly included in the helm chart, I opted to just increase the memory limit of the agent DaemonSet to allow the one pod that interacts with KSM to consume more memory. So far it yields good results, i.e. 0 restarts.
Hi all, quick update on this issue.

With the cluster agent, it is now possible to run the kubernetes_state check as a cluster-level check (more info here). The main advantage of this is that the cluster check can be run by a separate agent deployment: https://github.com/DataDog/datadog-agent/blob/6.12.2/Dockerfiles/manifests/agent-clusterchecks-only.yaml - which can be sized appropriately, with 2 or 3 replicas limited to a few GB of memory usage, for example. This is especially useful for checks like KSM that are resource intensive and create imbalance in the agent DaemonSet workload.

Since kubernetes_state is autodiscovered, you will also need to mount an empty file in place of the autodiscovery template (the auto_conf.yaml discussed above) in the agent DaemonSet, to avoid a normal agent running the check as well.

This is what we are using in production for kubernetes_state checks in large clusters (the check needs more memory than we recommend allocating in large clusters), and it works very well, allowing us to reduce memory requests and limits for the agent DaemonSet and making the placement of its pods easier. We are working on describing this process more precisely in the documentation; please reach out on Slack or to support if you need more details in the meantime.
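A rough sketch of that empty-file override, assuming the standard Agent 6 layout where the template lives at conf.d/kubernetes_state.d/auto_conf.yaml (the path and names are assumptions, not quoted from this thread):

```yaml
# Sketch: blank out the bundled kubernetes_state autodiscovery template on the
# node agents so only the dedicated cluster-check deployment runs the check.
apiVersion: v1
kind: ConfigMap
metadata:
  name: ksm-auto-conf-override
data:
  auto_conf.yaml: ""   # empty file shadows the shipped template
# Then, in the agent DaemonSet's pod spec:
#   volumes:
#     - name: ksm-auto-conf-override
#       configMap:
#         name: ksm-auto-conf-override
#   (agent container) volumeMounts:
#     - name: ksm-auto-conf-override
#       mountPath: /etc/datadog-agent/conf.d/kubernetes_state.d/auto_conf.yaml
#       subPath: auto_conf.yaml
```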
Got OOM Killed with 256Mi memory limit:
Image: datadog/agent:6.13.0
Deployed from https://github.com/helm/charts/tree/master/stable/datadog
Only one pod crashes constantly.
I guess this happened because priorityClassName hadn't been set and the node had a lot of pods after the cluster rollout, so I updated it. UP: I found that the chart has the relevant lines in its templates:
So, according to the comment, I replaced them so that an empty file is mounted instead:
UP2: even after the empty file was configured, one pod still crashes with the same error message. @hkaj, could you help with the issue, please? I can't find how the check can be disabled.

UP3: Finally fixed (many thanks to @hkaj for the help) by adding these settings to the chart:
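(The exact settings used above aren't preserved in this copy of the thread. For readers landing here later, a values override in the spirit of the fix might look roughly like the sketch below; every key is an assumption about the stable/datadog chart of that era, so check your chart version's values.yaml before copying anything.)

```yaml
# Illustrative only: run kubernetes_state as a cluster check on dedicated
# cluster-check runners instead of the node DaemonSet. Keys and the KSM URL
# are assumptions, not taken from this thread.
clusterAgent:
  enabled: true
clusterchecksDeployment:
  enabled: true
datadog:
  confd:
    kubernetes_state.yaml: |-
      cluster_check: true
      init_config:
      instances:
        - kube_state_url: http://kube-state-metrics.kube-system:8080/metrics
```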
|
Hi @kivagant-ba - the issue here is that the agent running the kubernetes_state check needs more memory than that to complete it. This is due to the amount of metrics that KSM exposes (which is both a good thing for observability and a not-so-good thing for resource usage 😄). The two solutions you have are:
What I would consider if I were you is the overall memory usage of your cluster, i.e. if you have 3 nodes and need to add 64 MB of RAM to the DaemonSet to not OOM, that's 192 MB total, which is less than what a separate agent deployment would need. If you have a larger cluster, or need more RAM to not OOM, the side deployment is better. Here's the PR for the docs on how to do that, btw: DataDog/documentation#5013 - it's very early stage, but the instructions are correct.
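As a concrete illustration of the first option (numbers purely for example, continuing the 256Mi case reported above):

```yaml
# Sketch: bump the agent container's memory in the DaemonSet by ~64Mi so the
# pod co-located with KSM stops OOMing.
resources:
  requests:
    memory: 320Mi
  limits:
    memory: 320Mi
```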
@hkaj, the documentation link really helped! I updated the original message to collect everything together.
Output of the info page (if this is a bug)
Describe what happened:
K8S 1.9.6 (though I don't think it matters)
I have datadog deployed as a daemonset (using your stable chart) in the same namespace as my current monitoring stack (prometheus 2.x + node exporter + ksm + ...). I have the KSM integration enabled in the chart.
All the pods come up fine, though two of them have significantly higher CPU usage and crash a lot. They seem to be getting liveness-killed at a very high rate (2,200+ crashes over 14 days). It just so happens that those two pods are on the same nodes as the two KSM pods (one mine, one yours).
So I'll probably move DD to its own namespace, though I'm not sure that will resolve the discovery of the KSM I want it to ignore.
So my question is: how are you doing KSM discovery? And maybe it makes more sense, if your KSM scraper can't handle this size of cluster (50ish m4.10x nodes, 2500ish services), to deploy a separate agent outside the DaemonSet specifically for scraping that KSM (maybe co-located in the same pod), so that I don't lose other node-specific metrics due to the high rate of crashing.
Describe what you expected:
DD k8s w/ KSM to behave.
Steps to reproduce the issue:
Build a reasonably sized cluster and try to run dd w/ ksm
Additional environment details (Operating System, Cloud provider, etc):
coreos, aws, k8s 1.9.6