# Heapster Long Term Vision

## Current status

Heapster is an important component of Kubernetes that is responsible for metrics and event
handling. It reads metrics from cluster nodes and writes them to external, permanent
storage. This is the main use case of Heapster.

To support system components of Kubernetes, Heapster calculates aggregated metrics (such as
the sum of containers' CPU usage in a pod) and long-term statistics (average, 95th percentile,
with 1h resolution), keeps them in memory, and exposes them via the Heapster API. This API is
mainly used by the Horizontal Pod Autoscaler, which asks for the most recent performance-related
metrics to adjust the number of pods to the incoming traffic. The API is also used by KubeDash
and will be used by the new UI (which will replace KubeDash) as well.

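To make the read path concrete, the sketch below shows how a consumer such as the Horizontal
Pod Autoscaler or a dashboard could pull recent metrics over HTTP. The service address,
endpoint path and response shape are illustrative assumptions, not the finalized Heapster API.

```go
// Minimal sketch of a Heapster API consumer. The URL and the response shape are
// assumptions made for illustration only.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// MetricPoint is an assumed shape for a single timestamped sample.
type MetricPoint struct {
	Timestamp time.Time `json:"timestamp"`
	Value     int64     `json:"value"`
}

func main() {
	// Hypothetical in-cluster address and path for recent CPU usage of one pod.
	url := "http://heapster.kube-system/api/v1/model/namespaces/default/pods/my-pod/metrics/cpu-usage"

	resp, err := http.Get(url)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var points []MetricPoint
	if err := json.NewDecoder(resp.Body).Decode(&points); err != nil {
		panic(err)
	}
	for _, p := range points {
		fmt.Printf("%s: %d\n", p.Timestamp.Format(time.RFC3339), p.Value)
	}
}
```
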
Additionally, the Heapster API allows listing all active nodes, namespaces, pods, containers,
etc. present in the system.

> Review discussion: What is the use case for this rather than querying the API server
> directly? — This was done for KubeDash, but the list of pods should come from the API
> server only; more information on that will be added.

There is also a HeapsterGKE API dedicated to GKE through which it is possible to get a full
dump of all metrics (spanning the last minute or two).

> Review discussion: last two minutes as of now, to be specific.

Metrics are gathered from cluster nodes, but Heapster developers wanted it to be useful also
in non-Kubernetes clusters. They wrote Heapster in such a way that metrics can be read not
only from Kubernetes nodes (via the Kubelet API) but also from custom deployments via cAdvisor
(with support for CoreOS Fleet and flat file node lists).

Metrics collected by Heapster can be written into multiple kinds of storage - Influxdb,
OpenTSDB, Google Cloud Monitoring, Hawkular, Kafka, Riemann, ElasticSearch (some of them are
not yet submitted).

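The sink boundary is what makes this multiplicity manageable. The sketch below shows one way
such a boundary can look in Go; the interface and type names are assumptions made for this
document, not the actual Heapster sink API.

```go
// Illustrative sketch of a pluggable metrics-sink boundary; not the real Heapster interface.
package sinks

import "time"

// MetricValue is a single sample for one metric of one container or pod.
type MetricValue struct {
	Name      string            // e.g. "cpu/usage"
	Labels    map[string]string // pod, namespace, container, ...
	Timestamp time.Time
	Value     int64
}

// Sink is implemented once per storage backend (Influxdb, OpenTSDB, GCM, Hawkular, ...).
type Sink interface {
	// Name identifies the backend in logs and flags.
	Name() string
	// ExportMetrics pushes a batch of samples; implementations decide on batching,
	// retries and downsampling.
	ExportMetrics(batch []MetricValue) error
	// Stop flushes buffers and releases resources.
	Stop() error
}
```
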
In addition to gathering metrics, Heapster is responsible for handling Kubernetes events - it
reads them from the Kubernetes API server and writes them, without extra processing, to a
selection of persistent storages: Google Cloud Logging, Influxdb, Kafka, OpenTSDB, Hawkular,
ElasticSearch, etc.

> Review discussion: What is meant by "handling"? Events are supposed to be non-operational,
> and listing alone has a big performance impact because events outnumber pods roughly 5-10:1
> on a large-churn cluster. — Heapster listens on events and writes them to external data
> stores (Google Cloud Logging, Influxdb, etc.).

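The event path is essentially "watch events, forward them unchanged to a sink". Below is an
illustrative sketch of that loop, written against the present-day client-go library purely for
illustration; the EventSink interface and logSink type are assumptions made for this document,
and the real implementation may differ.

```go
// Sketch of the event-forwarding loop: watch events from the API server and hand each
// one to a sink without extra processing. Assumed types: EventSink, logSink.
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// EventSink is an assumed interface; real sinks would write to GCL, Influxdb, Kafka, etc.
type EventSink interface {
	ExportEvent(e *corev1.Event) error
}

type logSink struct{}

func (logSink) ExportEvent(e *corev1.Event) error {
	fmt.Printf("%s %s/%s: %s\n", e.LastTimestamp, e.Namespace, e.InvolvedObject.Name, e.Message)
	return nil
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Watch events in all namespaces and forward them unchanged to the sink.
	w, err := client.CoreV1().Events(metav1.NamespaceAll).Watch(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	var sink EventSink = logSink{}
	for update := range w.ResultChan() {
		if ev, ok := update.Object.(*corev1.Event); ok {
			_ = sink.ExportEvent(ev)
		}
	}
}
```
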
There is/was a plan to add resource prediction components (Initial Resources, Vertical
Pod Autoscaling) to the Heapster binary.

> Review discussion: What does "is/was" mean? — Some code that predicts resource consumption
> for a container was written
> (https://github.com/kubernetes/kubernetes/blob/master/docs/proposals/initial-resources.md),
> but further work is suspended at the moment due to higher-priority tasks.

## Separation of Use Cases

From the current state description (see above) the following use cases can be extracted:

* [UC1] Read metrics from nodes and write them to an external storage.
* [UC2] Expose metrics from the last 2-3 minutes (for HPA and GKE).
* [UC3] Read events from the API server and write them to a permanent storage.
* [UC4] Do some long-term (hours, days) metrics analysis to get stats (average, 95th
  percentile) and expected resource usage (see the aggregation sketch below).
* [UC5] Provide CPU and memory metrics for a longer time window for the new Kubernetes
  Dashboard UI (15 min for 1.2, up to 1h later for plots).

> Review discussion on UC2: Is a 2-3 minute queryable period not too short, and does it mean
> raw metrics? — All data is also written to a (hopefully) permanent storage; anyone who needs
> older data should query that storage directly. There is currently no need for a bigger
> in-memory window for all metrics; the UI needs CPU and memory only (maybe disk and some
> network stats in the future), and a bigger window can be added when needed.

> Review discussion on UC3: Why is event storage part of Heapster rather than a separate tool?
> — Right now metrics and events are combined in one tool, but the plan is to split them. The
> original idea of combining events and metrics into richer signals has no immediate plans
> behind it; one reviewer argued events are too heavy to amalgamate and should be left to
> back-end systems to ETL and learn from operational data.

> Review discussion on UC4: The "max" value is also important, because it shows whether there
> was any activity at all (useful for network or custom metrics like hits per second).
> — Currently max, avg and 95th percentile are made available for the last minute, hour and
> day; max will stay.

> Review discussion on UC5: Different queryable windows for different metrics make comparison
> hard. — This is a compromise: storing all data for 15 minutes would consume a lot of memory
> with no use case behind it, while the UI and an experimental scheduler need CPU and memory
> for a longer window (preferably 1h). The retention windows and a whitelist of retained
> metric names will be configurable, and the decision will be revisited once performance tests
> provide hard data (the expectation is that full consistency would make memory consumption
> more than 4x bigger).

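To make UC4 concrete, here is a minimal sketch of the aggregation involved: given raw samples
of one metric over a longer window, compute the average, the max and the 95th percentile. It
illustrates the statistics only and is not code taken from Heapster or Oldtimer.

```go
// Compute average, max and 95th percentile (nearest-rank) for a slice of samples.
package main

import (
	"fmt"
	"sort"
)

type stats struct {
	Average      float64
	Max          int64
	Percentile95 int64
}

func aggregate(samples []int64) stats {
	if len(samples) == 0 {
		return stats{}
	}
	sorted := append([]int64(nil), samples...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })

	var sum int64
	for _, s := range sorted {
		sum += s
	}
	// Nearest-rank 95th percentile; Heapster may use a different convention.
	idx := (95*len(sorted) + 99) / 100 // ceil(0.95 * n), 1-indexed rank
	if idx > 0 {
		idx-- // convert to 0-indexed position
	}
	return stats{
		Average:      float64(sum) / float64(len(sorted)),
		Max:          sorted[len(sorted)-1],
		Percentile95: sorted[idx],
	}
}

func main() {
	cpuMillicores := []int64{120, 80, 95, 300, 110, 105, 90, 250}
	fmt.Printf("%+v\n", aggregate(cpuMillicores))
}
```
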
UC1 and UC2 go together - to expose the most recent metrics the API should be connected
to the metrics stream.
UC3 can be completely separated from UC1, UC2 and UC4 - it reads different data from a
different place and writes it in a slightly different format to different sinks.
UC4 is connected to UC1 and UC2, but it is based more on data from the permanent storage
than on the super-fresh metrics stored in memory.
UC5 can go either with UC1/UC2 or with UC4. As there is no immediate need for UC4 we will
provide basic UC5 together with UC1/UC2, but in the future it will join UC4.

> Review discussion: If events and metrics are ever combined, would the same data be
> aggregated in a separate service? — See the discussion above; happy to chat over a VC.
> Another use case mentioned is letting Heapster act as a source for events instead of etcd,
> and some users have wanted to shard events to different systems, e.g. Kafka.

This separation leads to an idea of splitting Heapster into 3 binaries:

* Core Heapster - covering UC1, UC2 and temporarily UC5
* Eventer - covering UC3
* Oldtimer - covering UC4 and UC5

> Review discussion on Oldtimer: Should UC1, UC2, UC4 and UC5 stay together, since they deal
> with related data? — There are no clear requirements for Oldtimer yet, so it will be kept
> separate for now; Heapster and Oldtimer can be glued together at any time if needed. The
> known consumers of long-term stats (the Initial Resources estimator, an experimental
> scheduler, the UI, KubeDash, and idling checks such as "did this metric ever have a non-zero
> value over the past 24 hours?") are still a bit vague and not top priority. The Heapster API
> should be able to hide Oldtimer and gracefully reduce the amount of available data, and core
> Heapster will temporarily support some stats (avg/max; the 95th percentile is trickier to
> squeeze into a small amount of memory), with backwards compatibility possibly being opt-in.

## Reduction of Responsibility

With 3 possible node sources (Kubernetes API Server, flat file, CoreOS Fleet), 2 metrics
sources (cAdvisor and Kubelet) and a constantly growing number of sinks, we have to separate
the things that the core Heapster/K8S team is responsible for from what is provided as a
plugin/addition and doesn't come in the main release package.

We decided to focus only on:

* Kubernetes API Server node source
* Kubelet metrics source
* Influxdb, GCM, GKE (there is a special endpoint for GKE that exposes all available metrics)
  and Hawkular sinks for Heapster
* Influxdb, GCL sinks for Eventer

> Review discussion on the Kubelet metrics source: kubernetes/kubernetes#18770 proposes using
> cAdvisor directly for metrics. — At the time of writing the plan was to have a
> Kubelet-specific API, but the direction has changed a couple of times; for now Heapster will
> talk to the Kubelet via the old (but fixed) cAdvisor-specific API and see what the Node team
> provides.

> Review discussion on sinks: Can the requirements for submitting a sink to Heapster be
> defined instead? Kafka, for example, is a popular deployment option. — Kafka is not a
> trivial deployment and anyone who runs it can probably build their own image. With 1.2 the
> sinks will remain compiled in; specific requirements (tests, e2e tests, performance checks,
> only "official" client libraries, etc.) can be worked out later.

The rest of the sources/sinks will be available as plugins. Plugins will come in 2 flavors:

* Compiled in - will require the user to rebuild the package and create their own image with
  the desired set of plugins.
* Side-car - Heapster will talk to the plugin's HTTP server to get/pass metrics through a
  well-defined JSON interface. The plugin runs in a separate container (see the sketch below).

> Review discussion: It doesn't necessarily have to be a side-car container - just an endpoint
> that accepts a push of metrics in a defined format, possibly even outside of the cluster.
> — Agreed, it can be outside of the cluster.

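A side-car sink could be as small as the sketch below: an HTTP server that accepts batches of
samples pushed by Heapster as JSON and forwards them to its backend. The /push path, payload
shape and port are assumptions for illustration; the real wire format would be fixed by the
plugin interface described above.

```go
// Illustrative side-car sink plugin: accepts JSON batches of metrics over HTTP.
package main

import (
	"encoding/json"
	"log"
	"net/http"
	"time"
)

// Sample mirrors the assumed JSON payload: one timestamped value for one metric of one pod.
type Sample struct {
	Metric    string            `json:"metric"`
	Labels    map[string]string `json:"labels"`
	Timestamp time.Time         `json:"timestamp"`
	Value     int64             `json:"value"`
}

func main() {
	http.HandleFunc("/push", func(w http.ResponseWriter, r *http.Request) {
		var batch []Sample
		if err := json.NewDecoder(r.Body).Decode(&batch); err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		// A real plugin would forward the batch to its backend (Kafka, Riemann, ...).
		log.Printf("received %d samples", len(batch))
		w.WriteHeader(http.StatusNoContent)
	})
	log.Fatal(http.ListenAndServe(":8099", nil))
}
```
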
The K8s team will explicitly say that it is NOT giving any warranty on the plugins. Plugin e2e
tests can be included in some CI suite, but we will not block our development (too much) if
something breaks Kafka or Riemann. We will also not pay attention to whether a particular sink
scales up.

For now we will keep all of the currently available sinks compiled in by default, to keep the
new Heapster more or less compatible with the old one, but eventually (if the number of sinks
grows) we will migrate some of them to plugins.

## Custom Metrics Status

Heapster is not a generic solution for gathering an arbitrary number of arbitrarily-formatted
custom metrics. The support for custom metrics is focused on auto-scaling and critical
functionality monitoring (and potentially scheduling). Heapster is oriented towards system
metrics, not application/business level metrics.

> Review discussion: Why not, and what if custom metrics are required for the GKE pipeline
> soon? — Heapster cannot scale enough to accept an arbitrary number of custom metrics:
> 100+ metrics per pod with 1000 nodes and 30 pods per node would probably require a
> sharded/clustered Heapster, for which there is unlikely to be a time budget anytime soon.
> If this became a requirement, contributing to existing projects such as Prometheus would
> likely give more mileage.

Kubernetes users and application developers will be able to push any number of their custom
metrics through our pipeline to the storage, but this should be considered a bonus/best-effort
functionality. Custom metrics will not influence our performance targets (no extra fine-tuning
effort to support >5 custom metrics per pod). There will be a flag in Kubelet that will limit
the number of custom metrics.

> Review discussion: What does the performance-target caveat mean? — It means that if we scale
> to 1000 nodes x 30 pods with 5 custom metrics per pod but fail with 6, we will be satisfied
> and will not try to improve. A lot of this is still being debated in various issues and PRs,
> so this section may deserve a TBD.

## Performance Target

The Heapster product family (Core, Eventer and Oldtimer) should follow the same performance
goals as core Kubernetes. As Eventer is fairly simple and Oldtimer is not yet fully defined,
this section will focus only on Core Heapster (for metrics).

> Review discussion: Will Oldtimer be abstracted behind Heapster APIs? — There is no official
> Heapster/metrics API yet; there is some API for stats, but it is not certain it will be
> exactly reproduced in Oldtimer, so whether there will be a common API remains open.

For 1.2 we should scale to 1000 nodes, each running at least 30 pods (100 for 1.3), each
reporting 20 metrics every 1 min (30 sec preferably). That brings us to 600k metrics
per minute, i.e. 10k metrics per second.

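As a quick sanity check on those figures (the last line is an extrapolation to the 1.3 pod
density at the preferred 30 sec resolution, which lands near the stretch goal mentioned below):

```
1000 nodes × 30 pods × 20 metrics               = 600,000 metrics per 1-minute scrape
600,000 metrics / 60 s                          ≈ 10,000 metrics per second
1000 nodes × 100 pods × 20 metrics every 30 s   ≈ 66,000 metrics per second
```
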
The stretch goal (for 1.2/1.3) is 60k metrics per second (possibly with not everything being
written to Influxdb).

> Review discussion: Long-term storage options will probably struggle with this target.
> — Agreed; tests will hopefully show soon whether the targets are doable at all. Storage
> backends can downsample, but it is preferable to downsample/drop metrics in Heapster and
> let sinks filter what is required, rather than push unneeded data over the wire.

On smaller deployments, like 500 nodes with 15-30 pods each, it should be easy to have 30 sec
metrics resolution or smaller.

Memory target - fit into 2 GB with 1000 nodes x 30 pods and 6 GB with 1000 nodes x 100 pods
(~60 KB per pod).

> Review discussion: Can a per-sample memory target be extracted from this for the data
> structures? — Roughly: the budget is no more than ~60 KB per pod (with 1 container) in
> total, which for 15 samples over 15 minutes is about 4 KB per sample; internal estimates of
> the required data structures came out at around 3.5 KB per pod sample.

Latency, measured from the time when we initiate scraping metrics to the moment the metric
change is visible in the API, should be less than 1x the metrics resolution; it mainly depends
on how fast it is possible to get all the metrics through the wire and parse them.

The e2e latency from the moment the metric changes in the container to the moment the change
is visible in the Heapster API is: metric_resolution + heapster_latency.

> Review discussion (on the KubeDash replacement): Can a link be added to an issue that tracks
> the replacement? That issue should also be placed in the kubedash repo to make it clear for
> existing KubeDash users.