Performance Optimization Proposal #498
Great! I would love to see this happen in v1.5.0. :) |
/cc @caesarxuchao |
This doesn't really have anything to do with client-go, or with there being a problem in it. We are just proposing to change the architecture by replacing the informer cache, which we currently build metrics from, with a cache of our own that holds the pre-computed results; the reflector is still intended to be used. |
IIUC, the informer cache is inside client-go. @brancz |
Correct, but the reflector is also in client-go and is independently usable; it just won't write into the informer cache, but directly into the metrics cache. |
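For illustration, a hedged sketch of what "reflector writing directly into a metrics cache" could look like. This is not the actual kube-state-metrics code; the metricsStore type and renderPodMetrics helper are hypothetical, and only the client-go APIs (Reflector, cache.Store, NewListWatchFromClient) are real.

```go
// Hedged sketch only: a client-go Reflector feeding a custom cache.Store that
// keeps pre-rendered metric text instead of objects. metricsStore and
// renderPodMetrics are hypothetical names.
package metricsstore

import (
	"fmt"
	"sync"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/fields"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

// metricsStore satisfies cache.Store but stores rendered metric lines keyed by
// object UID, so a scrape only has to concatenate strings.
type metricsStore struct {
	mu      sync.RWMutex
	metrics map[string]string
}

func newMetricsStore() *metricsStore {
	return &metricsStore{metrics: map[string]string{}}
}

// renderPodMetrics is a stand-in for the real per-resource metric generators.
func renderPodMetrics(pod *v1.Pod) string {
	return fmt.Sprintf("kube_pod_info{namespace=%q,pod=%q} 1\n", pod.Namespace, pod.Name)
}

func (s *metricsStore) Add(obj interface{}) error {
	if pod, ok := obj.(*v1.Pod); ok {
		s.mu.Lock()
		s.metrics[string(pod.UID)] = renderPodMetrics(pod)
		s.mu.Unlock()
	}
	return nil
}

func (s *metricsStore) Update(obj interface{}) error { return s.Add(obj) }

func (s *metricsStore) Delete(obj interface{}) error {
	if pod, ok := obj.(*v1.Pod); ok {
		s.mu.Lock()
		delete(s.metrics, string(pod.UID))
		s.mu.Unlock()
	}
	return nil
}

// Replace is called after a re-list; rebuild the whole cache.
func (s *metricsStore) Replace(list []interface{}, _ string) error {
	s.mu.Lock()
	s.metrics = map[string]string{}
	s.mu.Unlock()
	for _, obj := range list {
		if err := s.Add(obj); err != nil {
			return err
		}
	}
	return nil
}

// The remaining cache.Store methods are not needed for metric generation.
func (s *metricsStore) List() []interface{}                        { return nil }
func (s *metricsStore) ListKeys() []string                         { return nil }
func (s *metricsStore) Get(interface{}) (interface{}, bool, error) { return nil, false, nil }
func (s *metricsStore) GetByKey(string) (interface{}, bool, error) { return nil, false, nil }
func (s *metricsStore) Resync() error                              { return nil }

// runPodReflector wires the store to a Reflector; the reflector list/watches
// pods and writes straight into the metrics cache, bypassing the informer cache.
func runPodReflector(cfg *rest.Config, store cache.Store, stopCh <-chan struct{}) {
	client := kubernetes.NewForConfigOrDie(cfg)
	lw := cache.NewListWatchFromClient(client.CoreV1().RESTClient(), "pods", "", fields.Everything())
	cache.NewReflector(lw, &v1.Pod{}, store, 0).Run(stopCh)
}
```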
Is there room to think about adding some kind of horizontal scaling? |
This indicates that we need to change client-go or add a metrics cache plugin to replace the informer cache, right? |
No other program will ever need what we intend to build, and client-go's reflector is already pluggable 🙂 , so I don't think there is anything to do in client-go for this concrete proposal. I think there are valid things that we can look at separately to improve client-go but they don't directly influence the proposal.
I think this is a valid point, although somewhat different to the problems the proposal addresses. I would like to investigate whether we would fulfill the Kubernetes scalability requirements simply with functional sharding, basically by having one kube-state-metrics with all cluster-wide collectors enabled and one kube-state-metrics per namespace/tenant (maybe multiple namespaces). If that doesn't suffice we could look into (consistent) hashing of object UUIDs to distribute them among kube-state-metrics instances. Any further optimizations would require changes to the Kubernetes API to support sharding list/watch somehow. I would do improvements roughly in the order I laid out, starting with this proposal. |
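To make the UUID-hashing idea above concrete, here is a hedged sketch. This is not an existing kube-state-metrics feature; belongsToShard, shard and totalShards are hypothetical, and a real setup would likely use consistent hashing to limit reshuffling when the shard count changes.

```go
// Hedged sketch only: deciding shard membership by hashing an object's UID.
package sharding

import (
	"hash/fnv"

	"k8s.io/apimachinery/pkg/types"
)

// belongsToShard reports whether this kube-state-metrics instance (shard) is
// responsible for generating metrics for the object with the given UID.
func belongsToShard(uid types.UID, shard, totalShards uint64) bool {
	h := fnv.New64a()
	h.Write([]byte(uid)) // UID is a string type, so this yields a stable byte hash
	return h.Sum64()%totalShards == shard
}
```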
@directionless on a side note with metric white- and blacklisting you could already do horizontal sharding based on metric names. I hope to follow up with a small proof-of-concept via Kubernetes client-go reflectors soon. |
For my own clusters, I was thinking about sharding based on the collectors (secrets vs. pods vs. nodes). But I think you both call out better patterns, either sharding via metric names or global plus per-namespace, which feels more scalable. I'm glad it's in your thoughts. |
Sharding according to resource objects is an option, but may not be the optimal one, since pods are by far the most numerous of all resource objects; even a single kube-state-metrics instance dedicated to collecting all the pods may not work properly. The white- and blacklist feature should help with this, but it is harder to maintain, as we need to split at the more concrete metric-name level instead of the resource level. A trade-off has to be made. We need to add documentation about how to shard kube-state-metrics to support larger clusters with more resource objects. |
/cc @jeremyeder @mrsiano |
I performed a few performance measurements on SoundCloud's main K8s cluster (roughly 400 nodes, 13k cores, 10k pods). I also compared the metrics output to find out whether the metrics provided are still correct. The versions compared are the following:
I'll report the results in different comments instead of one super-large comment. |
Scrape duration and resource usage

Measurements

Note that all these values are taken from a bare-metal cluster with some hardware diversity and of course different load state on each node. Thus, all these numbers have a certain amount of noise included. However, the general trends are clearly visible. Each KSM gets scraped every 30s by a single Prometheus server. |
Conclusion

The improvements are remarkable. Personally, I think the scrape duration reduction is most relevant, as the question whether I need 0.2 or 1.0 cores to monitor 13k cores is fairly moot. It would take a very large cluster to hit single-machine limits in terms of cores or memory. However, the scraping time issue has real practical relevance, as it limits the resolution of data collection. Simply vendoring the current state of prometheus/common and prometheus/client_golang into v1.4 already gives a nice boost, but the (much more invasive) changes in master of KSM yield about the same boost again. Another important insight is that disabling gzip in the given scenario (where network bandwidth is presumably not a bottleneck) is quite a gain in almost every dimension. |
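For reference, a hedged sketch of how compression could be switched off, assuming the /metrics endpoint is served through client_golang's promhttp package (kube-state-metrics may wire its handler differently):

```go
// Hedged sketch only: disabling gzip on a client_golang metrics handler.
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	reg := prometheus.NewRegistry()
	handler := promhttp.HandlerFor(reg, promhttp.HandlerOpts{
		// Skip gzip when network bandwidth is not the bottleneck.
		DisableCompression: true,
	})
	http.Handle("/metrics", handler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```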
Metrics comparison

Observations

First of all, the mxinden_* versions by design do not create any HELP/TYPE lines. mxinden_* has some metrics that don't exist at all in v1.4. I assume they have been added in master after the v1.4 release (but I didn't check the commits). The metrics are:
kube_pod_info is a metric that has an additional label (

There is one instance of a tiny deviation in kube_pod_container_info / kube_pod_container_status_* that looks like drift from slightly different times of data collection.

So far, I think the deviations are benign. The following is something that needs investigation (paging @mxinden ;o):
Conclusion

Overall, I think the mxinden_* code is doing the right thing, assuming that my idea about certain metric types accumulating over time is correct. (This raises, however, a concern whether those metrics are ever GC'd, and whether the semantics of accumulating events over the lifetime of the KSM binary that are then exposed forever are sane.) It would be cool if HELP/TYPE could be brought back (the Prometheus server is just starting to do something with that information). I personally really like the sorting, but I could imagine it is difficult to implement and/or computationally expensive. |
Memory profiling

Observations

This is just a short summary. A lot could be investigated in more detail here.

For v1.4, the worst offender for allocations is

For newprom_gzip, the

For newprom_nogzip, the situation is essentially the same. The gzip encoding doesn't make a dent in the allocation profile.

mxinden_gzip and mxinden_nogzip look very similar allocation-wise, too. The worst offender is

Conclusion

In newprom_*, the

The mxinden_* code caches metrics, but it's not using the client_golang const metrics for it, but a home-grown solution, which creates the text format directly. That not only saves the re-creation of metrics on each scrape, but also avoids the “in between” creation of proto messages. It already performs very well. The

In general, the principle of creating metrics upon K8s changes and not upon scrapes of KSM could have been applied while still using client_golang. However, this would have only eliminated one of the two large allocators ( |
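A hedged sketch of that alternative: keep using client_golang, but build const metrics when Kubernetes objects change and only replay the cached values at scrape time. cachedCollector, podRef and Update are hypothetical names, not the actual kube-state-metrics implementation; only the client_golang APIs (NewDesc, MustNewConstMetric, the Collector interface) are real.

```go
// Hedged sketch only: a Collector that serves metrics cached on watch events.
package collector

import (
	"sync"

	"github.com/prometheus/client_golang/prometheus"
)

var podInfoDesc = prometheus.NewDesc(
	"kube_pod_info",
	"Information about pod.",
	[]string{"namespace", "pod"},
	nil,
)

type podRef struct{ Namespace, Name string }

type cachedCollector struct {
	mu      sync.RWMutex
	metrics []prometheus.Metric // rebuilt on watch events, not on scrape
}

// Update is called from the watch/reflector path whenever pods change.
func (c *cachedCollector) Update(pods []podRef) {
	fresh := make([]prometheus.Metric, 0, len(pods))
	for _, p := range pods {
		fresh = append(fresh, prometheus.MustNewConstMetric(
			podInfoDesc, prometheus.GaugeValue, 1, p.Namespace, p.Name))
	}
	c.mu.Lock()
	c.metrics = fresh
	c.mu.Unlock()
}

func (c *cachedCollector) Describe(ch chan<- *prometheus.Desc) { ch <- podInfoDesc }

// Collect hands out the pre-built metrics; no allocation-heavy work per scrape.
func (c *cachedCollector) Collect(ch chan<- prometheus.Metric) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	for _, m := range c.metrics {
		ch <- m
	}
}
```

As the comment above notes, this would still leave one of the two large allocators in place, which is why the home-grown text cache performs better.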
I haven't done any CPU profiling, but I don't think it would add to the insights already gained. If anybody is interested in CPU profiling, please let me know soon (before I tear my whole test setup down). |
We have a high-level measurement comparison for 10K pods on top of 250 nodes,
but it would be nice to run CPU benchmarking with pprof. |
Note that the pprof CPU profiling gives relative measures (where do I spend all those CPU cycles) but doesn't add to the comparison of total CPU cycles. That's already done in the “Scrape duration and resource usage” section above. |
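For anyone who wants to reproduce such a profile, a hedged sketch of exposing Go's pprof endpoints; whether and where kube-state-metrics already exposes /debug/pprof may differ by version, and the port here is an assumption.

```go
// Hedged sketch only: making a Go binary profilable with pprof.
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* on the default mux
)

func main() {
	// A 30s CPU profile can then be captured while a scrape is in flight:
	//   go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```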
@mrsiano BTW: What is “perf8”? What exactly are you comparing? And how can Prometheus burn 431 cores if there are only 40? Or is that a percentage? |
Right, it's a percentage and it's relative to 40 cores (in other words, something like less than 1 core). Perf8 is another image that we've tested; as far as I remember it's the one with gzip encoding disabled. Max, can you confirm that, @mxinden? |
CPU profiling

So I did CPU profiling after all, with some interesting results.

Observations

Approach: I ran the standard 30s pprof run, during which I performed exactly one scrape (using

The percentage of time spent in various functions is relative to the sampled CPU time, not to the 30s the sampling took.
Conclusion
Overall, mxinden_nogzip has pretty much maxed out (pun is totally intended ;o) the potential of CPU usage optimization. Most CPU cycles are spent in the raw serving of data via HTTP. Even optimizing |
Thanks a lot, that is indeed very useful information. |
Why does the newer prometheus client test |
What master branch are you referring to? kube-state-metrics master branch should behave more or less the same as mxinden_gzip. |
@andyxning I am sorry for the confusion. There have been two concurrent performance optimizations happening.
mxinden_* is the first optimization. newprom_* is based on kube-state-metrics v1.4.0 with the second performance optimization (Prometheus client_golang) vendored. Why does newprom_* have worse performance than mxinden_* (current master)? While newprom_* is faster than v1.4.0 (optimized text exposition), it does not have the caching improvement of the first performance optimization. mxinden_* (equal to current master) has the new caching (the first optimization) but misses the major text exposition optimization (the second). Those text exposition optimizations will be added to kube-state-metrics via #567. Let me know if this explanation is of any help. |
One should also add that the “major optimizations of the Prometheus client_golang text exposition logic” were actually fairly minor, while @mxinden's change of caching the metrics (and updating them only on K8s state changes rather than recreating them on each scrape) was much more involved work and had a much higher gain (as can be seen from the analysis above). |
With #601 merged, all major refactorings should be done and we are back to being compliant with the Prometheus exposition format. In case anyone wants to give this a shot:
I am planning on preparing the release notes for a first alpha tomorrow. //CC @ehashman |
@mxinden I have tested this in QA and the missing quota metric stats have returned on upgrade. I am still not seeing the

Will be doing a rollout against a production cluster to gather some perf stats on a large (~200 node) cluster tonight. Looking forward to the 1.5.0 release candidate! |
FWIW I should also attach the metrics I collected from the following KSM releases, all running in parallel:
Note that I didn't collect scrape duration info as it was going to be too much of a pain to set up. Anecdotally, the scrape times I measured are massively reduced (5-20s for 1.4.0 vs 0.5-1.2s for perf.6 and 0.6-2.2s for perf.gzip.3) |
Given v1.5.0 was released with the new performance optimizations I am closing here. Thanks everyone for the great help and feedback on this! |
/kind improvement
I have been brainstorming with @brancz on how we can improve kube-state-metrics' performance in terms of response time and memory usage.
I welcome any feedback on this optimization proposal.
Related issues: #493 #257