
Performance Optimization Proposal #498

Closed · mxinden opened this issue Jul 23, 2018 · 31 comments

@mxinden
Contributor

mxinden commented Jul 23, 2018

/kind improvement

I have been brainstorming with @brancz on how we can improve kube-state-metrics' performance in terms of response time and memory usage.

I welcome any feedback on this optimization proposal.

Related issues: #493 #257

@mxinden changed the title from “Performance Obtimization Proposal” to “Performance Optimization Proposal” on Jul 23, 2018
@andyxning
Member

Great!
We should work with the developers focused on client-go to work this out. I have been a little busy recently, but I am willing to give some help or ideas if possible.

I would love to see this happen in v1.5.0. :)

@andyxning
Member

/cc @caesarxuchao

@brancz
Member

brancz commented Jul 24, 2018

This doesn't really have anything to do with client-go, or with there being a problem in it. We are just proposing to change the architecture by replacing the informer cache, which we currently build metrics from, with a cache of our own that holds the pre-computed results. The reflector is still intended to be used.

@andyxning
Member

replacing the informer cache, which we currently build metrics from, with a cache of our own

IIUC, the informer cache is inside client-go. @brancz

@brancz
Member

brancz commented Jul 24, 2018

Correct, but the reflector is also in client-go and is independently usable; it just won't write into the informer cache, but directly into the metrics cache.
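
To make the idea concrete, here is a minimal sketch (not the actual kube-state-metrics code; names like metricsStore and renderMetrics are hypothetical) of a store that satisfies client-go's cache.Store interface but keeps pre-rendered metric strings, so a cache.Reflector can feed it directly instead of an informer cache:

```go
package metricsstore

import (
	"sync"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/cache"
)

// metricsStore implements cache.Store, but instead of keeping the Kubernetes
// objects themselves it keeps pre-rendered metric lines keyed by object UID.
type metricsStore struct {
	mu            sync.RWMutex
	metrics       map[string][]string
	renderMetrics func(obj interface{}) []string
}

func newMetricsStore(render func(obj interface{}) []string) *metricsStore {
	return &metricsStore{metrics: map[string][]string{}, renderMetrics: render}
}

func (s *metricsStore) Add(obj interface{}) error {
	o := obj.(metav1.Object)
	s.mu.Lock()
	defer s.mu.Unlock()
	s.metrics[string(o.GetUID())] = s.renderMetrics(obj)
	return nil
}

func (s *metricsStore) Update(obj interface{}) error { return s.Add(obj) }

func (s *metricsStore) Delete(obj interface{}) error {
	o := obj.(metav1.Object)
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.metrics, string(o.GetUID()))
	return nil
}

// Replace is called by the reflector after a re-list.
func (s *metricsStore) Replace(list []interface{}, _ string) error {
	s.mu.Lock()
	s.metrics = map[string][]string{}
	s.mu.Unlock()
	for _, o := range list {
		if err := s.Add(o); err != nil {
			return err
		}
	}
	return nil
}

// The remaining cache.Store methods are not needed for metrics generation.
func (s *metricsStore) List() []interface{}                        { return nil }
func (s *metricsStore) ListKeys() []string                         { return nil }
func (s *metricsStore) Get(interface{}) (interface{}, bool, error) { return nil, false, nil }
func (s *metricsStore) GetByKey(string) (interface{}, bool, error) { return nil, false, nil }
func (s *metricsStore) Resync() error                              { return nil }

var _ cache.Store = &metricsStore{}

// The store is then handed to a reflector instead of an informer cache,
// e.g. cache.NewReflector(listerWatcher, &v1.Pod{}, store, 0).
```

On each scrape the HTTP handler can then simply concatenate the cached strings, which is where the pre-computed results come from.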

@directionless

Is there somewhere to think about adding some kind of horizontal scaling?

@andyxning
Member

but the reflector is also in client-go and is independently usable; it just won't write into the informer cache, but directly into the metrics cache

This indicates that we need to change client-go or add a metrics cache plugin to replace the informer cache, right?

@brancz
Member

brancz commented Jul 24, 2018

This indicates that we need to change client-go or add a metrics cache plugin to replace the informer cache, right?

No other program will ever need what we intend to build, and client-go's reflector is already pluggable 🙂 , so I don't think there is anything to do in client-go for this concrete proposal. I think there are valid things that we can look at separately to improve client-go but they don't directly influence the proposal.

Is there somewhere to think about adding some kind of horizontal scaling?

I think this is a valid point, although somewhat different to the problems the proposal addresses. I would like to investigate whether we would fulfill the Kubernetes scalability requirements simply with functional sharding, basically by having one kube-state-metrics with all cluster-wide collectors enabled and one kube-state-metrics per namespace/tenant (maybe multiple namespaces). If that doesn't suffice we could look into (consistent) hashing of object UUIDs to distribute them among kube-state-metrics instances. Any further optimizations would require changes to the Kubernetes API to support sharding list/watch somehow.

I would do improvements roughly in the order I laid out, starting with this proposal.
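
As an illustration of the hashing idea (a simple hash-mod scheme rather than true consistent hashing; this is a hypothetical sketch, not an existing kube-state-metrics feature), each instance could be configured with a shard index and a total shard count and only keep objects whose hashed UID falls into its shard:

```go
package sharding

import (
	"hash/fnv"

	"k8s.io/apimachinery/pkg/types"
)

// belongsToShard reports whether the object with the given UID should be
// handled by the instance with index shard out of totalShards instances.
func belongsToShard(uid types.UID, shard, totalShards uint64) bool {
	h := fnv.New64a()
	h.Write([]byte(uid)) // types.UID is a string type
	return h.Sum64()%totalShards == shard
}
```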

@mxinden
Contributor Author

mxinden commented Jul 24, 2018

Is there somewhere to think about adding some kind of horizontal scaling?

@directionless On a side note: with metric white- and blacklisting you could already do horizontal sharding based on metric names.


I hope to follow up with a small proof-of-concept via Kubernetes client-go reflectors soon.

@directionless

For my own clusters, I was thinking about sharding based on the collectors (secrets vs. pods vs. nodes). But I think you both call out better patterns, either sharding via metric names or global plus per-namespace instances, which feel more scalable. I'm glad it's in your thoughts.

@andyxning
Member

For my own clusters, I was thinking about sharding based on the collectors (secrets vs. pods vs. nodes). But I think you both call out better patterns, either sharding via metric names or global plus per-namespace instances, which feel more scalable. I'm glad it's in your thoughts.

Sharding according to resource type is an option, but it may not be the optimal one, since pods are by far the most numerous resource objects; even a single kube-state-metrics instance dedicated to collecting all the pods may not work properly.

The white- and blacklist feature should help with this, but it is harder to maintain, as we need to split at the level of concrete metric names instead of at the resource level. A trade-off has to be made.

We need to add documentation about how to shard kube-state-metrics to support larger clusters with more resource objects.

@mrsiano

mrsiano commented Oct 3, 2018

/cc @jeremyeder @mrsiano

@beorn7
Contributor

beorn7 commented Oct 22, 2018

I performed a few performance measurements on SoundCloud's main K8s cluster (roughly 400 nodes, 13k cores, 10k pods). I also compared the metrics output to find out whether the metrics provided are still correct.

The versions compared are the following:

  • v1.4: the official v1.4.0 release
  • newprom_gzip / newprom_nogzip: v1.4.0 with the current state of prometheus/common and prometheus/client_golang vendored, with and without gzip compression of the exposition
  • mxinden_gzip / mxinden_nogzip: the then-current master with the metrics-caching refactoring (#534), with and without gzip compression of the exposition

I'll report the results in different comments instead of one super-large comment.

@beorn7
Contributor

beorn7 commented Oct 22, 2018

Scrape duration and resource usage

Measurements

Note that all these values are taken from a bare-metal cluster with some hardware diversity and of course different load state on each node. Thus, all these numbers have a certain amount of noise included. However, the general trends are clearly visible.

Each KSM gets scraped every 30s by a single Prometheus server.

| Version        | Scrape duration | CPU usage | RSS peak | % in GC |
|----------------|-----------------|-----------|----------|---------|
| v1.4           | 22s             | 1.0       | 2.5 GiB  | 0.015   |
| newprom_gzip   | 16s             | 0.8       | 2.2 GiB  | 0.010   |
| newprom_nogzip | 13s             | 0.6       | 2.4 GiB  | 0.014   |
| mxinden_gzip   | 11s             | 0.3       | 0.81 GiB | 0.006   |
| mxinden_nogzip | 7s              | 0.16      | 0.73 GiB | 0.004   |

Conclusion

The improvements are remarkable. Personally, I think the scrape duration reduction is the most relevant, as the question of whether I need 0.2 or 1.0 cores to monitor 13k cores is fairly moot. It would take a very large cluster to hit single-machine limits in terms of cores or memory. However, the scrape duration issue has real practical relevance, as it limits the resolution of data collection.

Simply vendoring the current state of prometheus/common and prometheus/client_golang into v1.4 already gives a nice boost, but the (much more invasive) changes in master of KSM yield about the same boost again.

Another important insight is that disabling gzip in the given scenario (where network bandwidth is presumably not a bottleneck) is quite a gain in almost every dimension.

@beorn7
Contributor

beorn7 commented Oct 22, 2018

Metrics comparison

Observations

First of all, the mxinden_* versions by design do not create any HELP or TYPE strings and do not sort the metrics. That's still valid Prometheus exposition format, but both features are “nice to have” if their cost is not prohibitive.

mxinden_* has some metrics that don't exist at all in v1.4. I assume they have been added in master after the v1.4 release (but I didn't check the commits). The metrics are:

  • kube_pod_container_status_last_terminated_reason
  • kube_replicaset_owner

kube_pod_info is a metric that has an additional label (uuid) in mxinden_* but is otherwise identical. Again I assume it's a deliberate new feature in master.

There is one instance of a tiny deviation in kube_pod_container_info / kube_pod_container_status_* that looks like drift from slightly different times of data collection.

So far, I think the deviations are benign. The following is something that needs investigation (paging @mxinden ;o):

  • v1.4 has 1101 occurrences each of various kube_deployment_* metrics, while mxinden_* has only 630. Possibly this is information that accrues (each time a deployment is actually created) but gets lost on a restart. (The scraped v1.4 instance has indeed run for much longer than the mxinden_* instance.) But it's just a wild guess and needs confirmation by a KSM expert.
  • Similar issue with kube_endpoint_* metrics, but 1143 vs. 326.
  • Similar issue with kube_secret_*, but 466 vs. 465.
  • Similar issue with kube_service_*, but 1141 vs. 324.

Conclusion

Overall, I think the mxinden_* code is doing the right thing, assuming that my idea about certain metric types accumulating over time is correct. (This raises, however, the concern whether those metrics are ever GC'd, and whether the semantics of accumulating events over the lifetime of the KSM binary and then exposing them forever are sane.)

It would be cool if HELP/TYPE could be brought back (the Prometheus server is just starting to do something with that information). I personally really like the sorting, but I could imagine it is difficult to implement and/or computationally expensive.
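
For reference, this is roughly what an exposition with HELP and TYPE lines looks like (the metric is one of those mentioned above; label set and value are illustrative only):

```
# HELP kube_pod_info Information about pod.
# TYPE kube_pod_info gauge
kube_pod_info{namespace="monitoring",pod="kube-state-metrics-0"} 1
```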

@beorn7
Contributor

beorn7 commented Oct 22, 2018

Memory profiling

Observations

This is just a short summary. A lot could be investigated in more detail here.

For v1.4, the worst offender for allocations is prometheus.makeLabelPairs (25%), followed by prometheus.processMetrics (21%). Note that expfmt.MetricFamilyToText causes 13% of allocations, of which most (11%) is expfmt.labelPairsToText.

For newprom_gzip, the expfmt.MetricFamilyToText allocs have almost disappeared. That raises the percentage of the other offenders prometheus.makeLabelPairs (28%) and prometheus.processMetrics (23%).

For newprom_nogzip, the situation is essentially the same. The gzip encoding doesn't make a dent in the allocation profile.

mxinden_gzip and mxinden_nogzip look very similar allocation-wise, too. The worst offender is MetricsStore.GetAll, followed by metrics.NewMetric (26%).

Conclusion

In newprom_*, prometheus.processMetrics is allocation-heavy because it creates the proto messages, which are then transformed into the text format. The latter doesn't cause significant allocations anymore with the improved client_golang (or common/expfmt). The proto message allocation could be optimized, but the current Metric interface makes that hard (ironically, it was made the way it is in the hope of enabling performance optimizations, but now it's more the opposite, which has been known for a while – this will be addressed in the upcoming v0.10 of client_golang). Thus, there is no “low hanging fruit” way out of this. However, the bulk of the proto message payload in this scenario, the label pairs, is already pre-allocated and readily set in the const metrics of client_golang. This behavior of client_golang is not leveraged by KSM v1.4 because it re-creates those const labels on each scrape. Avoiding the re-creation would save the 28% of allocations we see in prometheus.makeLabelPairs.

The mxinden_* code caches metrics, but it's not using the client_golang const metrics for it; it uses a home-grown solution which creates the text format directly. That not only saves the re-creation of metrics on each scrape, but also the “in between” creation of proto messages. It already performs very well. The metrics.NewMetric allocations could be reduced a lot by using the same principles that the improved client_golang and common/expfmt use to create the text format (as already discussed in person with @mxinden). The GetAll allocations are IMHO really just the large slice of all metrics, which is probably unavoidable (but also, I guess, GC friendly). A trivial optimization would be to allocate that slice with the right capacity.

In general, the principle of creating metrics upon K8s changes rather than upon scrapes of KSM could have been applied while still using client_golang. However, this would have only eliminated one of the two large allocators (prometheus.makeLabelPairs). The other part (prometheus.processMetrics) could only be addressed partially by further improvements of client_golang v0.9. Avoiding allocations of proto messages altogether is technically easy for const metrics, but it requires breaking changes to the Metric interface, which are only planned for v0.10. Thus, the second half of the present KSM optimizations, namely replacing client_golang const metrics with homegrown metrics, yields a significant performance gain in the current context. However, it also comes with a maintenance burden. Any future changes (e.g. support of OpenMetrics, or the re-introduction of protobuf exposition, should it be preferred by some clients) have to happen in the custom KSM code, while they would come “for free” with future versions of client_golang.
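
For the hypothetical client_golang-only variant mentioned above, a minimal sketch (names like podCollector and OnPodUpdate are made up; this is not KSM code) of calling NewConstMetric on K8s changes instead of on every scrape could look like this:

```go
package collector

import (
	"sync"

	"github.com/prometheus/client_golang/prometheus"
	v1 "k8s.io/api/core/v1"
)

var podInfoDesc = prometheus.NewDesc(
	"kube_pod_info",
	"Information about pod.",
	[]string{"namespace", "pod"}, nil,
)

type podCollector struct {
	mu     sync.RWMutex
	cached map[string]prometheus.Metric // keyed by object UID
}

func newPodCollector() *podCollector {
	return &podCollector{cached: map[string]prometheus.Metric{}}
}

// OnPodUpdate is called from the reflector/watch path; the expensive
// metric construction happens here, once per object change.
func (c *podCollector) OnPodUpdate(p *v1.Pod) {
	m := prometheus.MustNewConstMetric(
		podInfoDesc, prometheus.GaugeValue, 1,
		p.Namespace, p.Name,
	)
	c.mu.Lock()
	c.cached[string(p.UID)] = m
	c.mu.Unlock()
}

func (c *podCollector) Describe(ch chan<- *prometheus.Desc) {
	ch <- podInfoDesc
}

// Collect only forwards the pre-built metrics; no allocation-heavy
// NewConstMetric calls on the scrape hot path.
func (c *podCollector) Collect(ch chan<- prometheus.Metric) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	for _, m := range c.cached {
		ch <- m
	}
}
```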

@beorn7
Contributor

beorn7 commented Oct 22, 2018

I haven't done any CPU profiling, but I don't think it would add to the insights already gained. If anybody is interested in CPU profiling, please let me know soon (before I tear my whole test setup down).

@mrsiano

mrsiano commented Oct 23, 2018

I haven't done any CPU profiling, but I don't think it would add to the insights already gained. If anybody is interested in CPU profiling, please let me know soon (before I tear my whole test setup down).

We have a high-level measurement comparison for 10K pods on top of 250 nodes. Both Prometheus and kube-state-metrics run on 40 cores, and the following table represents the pidstat averages:

|                                  | gzip | perf8 | plain |
|----------------------------------|------|-------|-------|
| kube-state-metrics CPU           | 2.27 | 2.39  | 9.8   |
| prometheus CPU                   | 259  | 239   | 431   |
| scrape duration / response time  | 1.17 | 3     | N/A   |

But it would be nice to run CPU benchmarking with pprof.

@beorn7
Contributor

beorn7 commented Oct 23, 2018

Note that the pprof CPU profiling gives relative measures (where do I spend all those CPU cycles) but doesn't add to the comparison of total CPU cycles. That's already done in the “Scrape duration and resource usage” section above.

@beorn7
Contributor

beorn7 commented Oct 23, 2018

@mrsiano BTW: What is “perf8”? What exactly are you comparing? And how can Prometheus burn 431 cores if there are only 40? Or is that percentage?

@mrsiano

mrsiano commented Oct 23, 2018

@mrsiano BTW: What is “perf8”? What exactly are you comparing? And how can Prometheus burn 431 cores if there are only 40? Or is that percentage?

Right, it's a percentage and it's relative to 40 cores (in other words, something like less than 1 core).

Perf8 is another image that we've tested; as far as I remember it's the one with gzip encoding disabled. Max, can you confirm that? @mxinden

@beorn7
Contributor

beorn7 commented Oct 23, 2018

CPU profiling

So I did CPU profiling after all, with some interesting results.

Observations

Approach: I ran the standard 30s pprof run, during which I performed exactly one scrape (using curl). This gives us the additional information of CPU usage for one scrape and reflects how much the CPU is idle in those 30s. The results are in broad agreement with the measurement via Prometheus metrics above.

The percentage of time spent in various functions is relative to the sampled CPU time, not to the 30s the sampling took.
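
As an aside, a hypothetical sketch of exposing the standard net/http/pprof endpoints in a Go binary (port and address are made up) so that such a 30s CPU profile can be captured; the results are summarized in the table below:

```go
package main

import (
	"log"
	"net/http"
	// Importing net/http/pprof registers /debug/pprof/* handlers
	// on http.DefaultServeMux.
	_ "net/http/pprof"
)

func main() {
	// Capture a 30s CPU profile while curling /metrics once:
	//   go tool pprof http://localhost:8081/debug/pprof/profile?seconds=30
	log.Fatal(http.ListenAndServe("localhost:8081", nil))
}
```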

| Version        | curl duration | CPU time sampled by pprof | Compression | Sorting    | MetricFamilyToText (compression deducted) | NewConstMetric / NewMetric |
|----------------|---------------|---------------------------|-------------|------------|-------------------------------------------|----------------------------|
| v1.4           | 21s           | 28s (95%)                 | 4.9s (17%)  | 7.5s (26%) | 5.3s (19%)                                | 1.9s (7%)                  |
| newprom_gzip   | 15s           | 23s (76%)                 | 3.2s (14%)  | 5.5s (24%) | 2.7s (12%)                                | 2.2s (9%)                  |
| newprom_nogzip | 10s           | 18s (59%)                 | n/a         | 4.9s (28%) | 2.6s (15%)                                | 2.0s (11%)                 |
| mxinden_gzip   | 7s            | 10s (33%)                 | 6.4s (66%)  | n/a        | n/a                                       | 0.4s (4%)                  |
| mxinden_nogzip | 3s            | 5s (15%)                  | n/a         | n/a        | n/a                                       | 0.4s (8%)                  |

Conclusion

  1. Compression is a fairly big deal, especially if all the other components have been optimized. mxinden_gzip spends 66% of its cycles in compression. However, as all binaries compress roughly the same amount of data, you can see that newprom_gzip uses least absolute time for it. That might be due to the recent improvements handling gzip compression. Those might be applicable to mxinden_gzip, too. But even with some improvements there, if you are into saving cores and scrape time, once you are as optimized as with mxinden_*, switching off gzip compression will give you a huge relative improvement.
  2. Sorting is indeed expensive in terms of CPU cycles. (Note that this only becomes relevant with the very large number of metrics in a scenario like KSM. Nevertheless, it is a real use case. I filed an issue for client_golang for optional disabling of sorting.) Note that almost half of the CPU savings between newprom_* and mxinden_* are achieved by simply dropping sorting.
  3. The MetricFamilyToText column illustrates the progress made between v1.4 and newprom_*. However, it also shows that many more CPU cycles are used elsewhere. On the other hand, once you have reached the optimization level of mxinden_nogzip, reintroducing the conversion from protobuf to text format would increase the total CPU load by 50%.
  4. The NewConstMetric / NewMetric column is interesting in multiple ways: NewConstMetric is the function to create a metric in v1.4 and newprom_*, while NewMetric is used to create a (non-client_golang) metric in mxinden_*. As explained before, NewMetric is only called if a change is received from the K8s client, while NewConstMetric is called for each metric on each scrape. As also explained before, the same principle could be applied while still using client_golang, i.e. a KSM could be written that uses client_golang metrics but only calls NewConstMetric where mxinden_* calls NewMetric. However, we can see now that this optimization would only save a small percentage of the total CPU time. For example, dropping sorting saves 3x more CPU cycles. Another insight is that optimizing NewMetric in mxinden_* (as suggested in the allocation section above) will only save a very small relative amount of CPU cycles.

Overall, mxinden_nogzip has pretty much maxed out (pun is totally intended ;o) the potential of CPU usage optimization. Most CPU cycles are spent in the raw serving of data via HTTP. Even optimizing NewMetric would only help a tiny bit (as it is essentially already excluded from the hot path of scraping). At that level, disabling gzip is by far the biggest deal. For the hypothetical case of optimizing KSM while still using client_golang, the lowest hanging fruit would be to simply not sort the output, followed by not gzip'ing, and only then followed by caching metrics instead of creating them on each scrape (at about the same level of CPU cycles used to convert proto messages into the text format).
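
To illustrate the gzip point for the client_golang-based variants: promhttp's handler options already allow switching compression off, roughly like this (the handler wiring is a sketch, not the KSM code):

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	handler := promhttp.HandlerFor(prometheus.DefaultGatherer, promhttp.HandlerOpts{
		// Serve the text format uncompressed even if the scraper
		// sends Accept-Encoding: gzip.
		DisableCompression: true,
	})
	http.Handle("/metrics", handler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```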

@mrsiano

mrsiano commented Oct 23, 2018

Thanks a lot, that is indeed very useful information.

@andyxning
Member

andyxning commented Oct 26, 2018

@beorn7

newprom_gzip 16s 0.8 2.2 GiB 0.010
newprom_nogzip 13s 0.6 2.4 GiB 0.014
mxinden_gzip 11s 0.3 0.81 GiB 0.006
mxinden_nogzip 7s 0.16 0.73 GiB 0.004

Why do the newer Prometheus client tests, newprom_gzip and newprom_nogzip, have worse performance than the current master branch? I am curious about this. They should be better, IMHO.

@beorn7
Contributor

beorn7 commented Oct 26, 2018

What master branch are you referring to? kube-state-metrics master branch should behave more or less the same as mxinden_gzip.

@mxinden
Contributor Author

mxinden commented Oct 26, 2018

@andyxning I am sorry for the confusion. There have been two concurrent performance optimizations happening.

  1. I have refactored kube-state-metrics to cache metrics instead of Kubernetes objects and optimized the text exposition a bit ( pkg/collectors: Introduce MetricsStore updated by Reflectors #534)

  2. @beorn7 has done some major optimizations of the Prometheus client_golang text exposition logic ( Make MetricFamilyToText allocation-free prometheus/common#148)

mxinden_* is the first optimization. newprom_* is based on kube-state-metrics v1.4.0 with the second performance optimization (Prometheus client_golang) vendored.

Why does newprom_* have worse performance than mxinden_* (current master)?

While newprom_* is faster than v1.4.0 thanks to the optimized text exposition, it does not have the caching improvement mentioned as the first performance optimization.

mxinden_* (equal to current master) has the new caching (mentioned in 1.) but is missing the major text exposition optimization (2.). Those text exposition optimizations will be added to kube-state-metrics via #567.

Let me know if this explanation is of any help.

@beorn7
Contributor

beorn7 commented Oct 26, 2018

One should also add that the “major optimizations of the Prometheus client_golang text exposition logic” were actually fairly minor, while @mxinden's change of caching the metrics (and updating them only on K8s state changes rather than recreating them on each scrape) was much more involved work and had a much higher gain (as can be seen from the analysis above).

@mxinden
Contributor Author

mxinden commented Nov 29, 2018

With #601 merged, all major refactorings should be done and we are back to being compliant with the Prometheus exposition format.

In case anyone wants to give this a shot:

  • quay.io/mxinden/kube-state-metrics:v1.4.0-perf.10
  • minden/kube-state-metrics:v1.4.0-perf.10

I am planning on preparing the release notes for a first alpha tomorrow.

//CC @ehashman

@ehashman
Member

ehashman commented Dec 3, 2018

@mxinden I have tested this in QA and the missing quota metric stats have returned on upgrade. I still am not seeing the ksm_* metrics, although I don't use them so I don't really mind.

Will be doing a rollout against a production cluster to gather some perf stats on a large (~200 node) cluster tonight.

Looking forward to the 1.5.0 release candidate!

@ehashman
Member

ehashman commented Dec 3, 2018

FWIW I should also attach the metrics I collected from the following KSM releases, all running in parallel:

  • 1.4.0 official release
  • mxinden/perf.6
  • mxinden/perf-gzip.3

[screenshot “ksm1”: metrics collected from the three releases]

Note that I didn't collect scrape duration info, as it was going to be too much of a pain to set up. Anecdotally, the scrape times I measured are massively reduced (5-20s for 1.4.0 vs. 0.5-1.2s for perf.6 and 0.6-2.2s for perf-gzip.3).

@mxinden
Contributor Author

mxinden commented Jan 26, 2019

Given that v1.5.0 was released with the new performance optimizations, I am closing this issue. Thanks everyone for the great help and feedback on this!
