consistent metric use for memory #227
The resource dashboards use inconsistent metrics for displaying memory usage. Currently the cluster dashboard uses RSS and the namespace one uses total usage. I would propose that both use working-set-bytes, and the pod dashboard continues to show the distinct types as a stacked graph.
@gouthamve @metalmatze @csmarchbanks @tomwilkie @paulfantom @kakkoyun
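For concreteness, a rough sketch of the kind of queries this proposal implies. It is only a sketch: the label names (`namespace`, `pod`, `container`), the `container!=""` filter, and the Grafana-style `$namespace` variable are assumptions that depend on the cAdvisor/kubelet and dashboard setup in use, not the mixin's actual queries.

```
# Cluster dashboard: total working set across the cluster.
sum(container_memory_working_set_bytes{container!=""})

# Namespace dashboard: working set broken down per namespace.
sum by (namespace) (container_memory_working_set_bytes{container!=""})

# Pod dashboard: keep the distinct memory types as a stacked graph.
sum by (pod) (container_memory_rss{namespace="$namespace"})
sum by (pod) (container_memory_cache{namespace="$namespace"})
sum by (pod) (container_memory_swap{namespace="$namespace"})
```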
I am ok with RSS as well (although unfortunately not as representative as working set bytes for go programs anymore), as long as we're consistent I'm happy :) FWIW we probably should differentiate further between the different types in our Pod dashboard.
I agree the focus should be on consistency. RSS is definitely not as useful as I would like for go >= 1.12. I would be happy to use working set if someone could explain to me how my above graphs show such different values. But otherwise, I think it would be safer to overestimate memory usage by using RSS than have a pod OOM and our reported memory not be close to the limit.
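To make the "close to the limit" concern concrete, here is a minimal PromQL sketch comparing both candidate metrics against the container memory limit. The `kube_pod_container_resource_limits_memory_bytes` metric name (from kube-state-metrics) and the label names are assumptions and may differ between versions.

```
# Fraction of the memory limit consumed per container, using RSS vs. working set.
# Metric and label names are assumptions; adjust to your kube-state-metrics / cAdvisor versions.
sum by (namespace, pod, container) (container_memory_rss{container!=""})
  / on (namespace, pod, container)
    kube_pod_container_resource_limits_memory_bytes

sum by (namespace, pod, container) (container_memory_working_set_bytes{container!=""})
  / on (namespace, pod, container)
    kube_pod_container_resource_limits_memory_bytes
```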
Yeah, I need to dig into the OOMKiller again, and I feel like whatever it uses should be the default that we use for display, and then show all the breakdown(s) in the Pod dashboard.
👍 that sounds ideal. If you get to digging into the OOMKiller before me I would love to hear what you learn!
Reading this, it sounds like
Disclaimer: I am not a virtual memory subsystem expert ;-) Just working on consolidating those metrics. I agree with @brancz on using `container_memory_working_set_bytes`, which has RSSish semantics (as in "accounted resident memory" minus "unused file caches"), although it might include some fuzziness as per https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt.
@csmarchbanks I rechecked your graph and noted that your stack query doesn't apply the same label selector as the heap query. On my cluster the summed Go heap/stack usage is less than `container_memory_working_set_bytes`. The latter also accounts for active (aka non-evictable) filesystem cache memory which is not present in the heap/stack golang metrics.
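For reference, the comparison under discussion looks roughly like this in PromQL. The selectors (`namespace="monitoring"`, `pod=~"prometheus-.*"`) are placeholders, and as the follow-up below notes, the same selector has to be applied to both Go queries for the numbers to be comparable.

```
# Go runtime view of the Prometheus process (placeholder selectors).
sum(go_memstats_heap_inuse_bytes{namespace="monitoring", pod=~"prometheus-.*"})
  + sum(go_memstats_stack_inuse_bytes{namespace="monitoring", pod=~"prometheus-.*"})

# cgroup view of the same pod as reported by cAdvisor.
sum(container_memory_working_set_bytes{namespace="monitoring", pod=~"prometheus-.*", container!=""})
```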
ugh nevermind 🤦‍♂️ the subsequent stack query inherits the label selector from the heap query.
Did we kind of have an agreement on `container_memory_working_set_bytes`?
Also, for another documentation reference about the semantics of that metric: http://www.brendangregg.com/wss.html (courtesy of @paulfantom)
I am ok with moving forward with `container_memory_working_set_bytes`.
Also, @s-urbaniak I do not think the reference you posted by Brendan Gregg is the same working set as reported by cAdvisor. As you said above, the cAdvisor metric has RSSish semantics ("accounted resident memory" minus "unused file caches").
I am going to echo what @s-urbaniak said and say I am also not a virtual memory subsystem expert. Is it possible that the reason I am seeing such low working set size is that prometheus caches things in memory but does not touch them for so long that they would be removed from the working set?
Another datapoint: today I have a prom server showing the same pattern.
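Not an answer to the caching question, but one way to poke at it: if the working set is roughly "usage minus inactive file caches" as described earlier in the thread, then the gap between the two cAdvisor metrics shows how much of the container's accounted memory the kernel currently treats as inactive. A sketch with placeholder selectors and assumed label names:

```
# Approximate inactive (evictable) memory for a pod:
# total cgroup usage minus the reported working set.
sum(container_memory_usage_bytes{namespace="monitoring", pod=~"prometheus-.*", container!=""})
  - sum(container_memory_working_set_bytes{namespace="monitoring", pod=~"prometheus-.*", container!=""})
```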
I spent some more time delving into the inner workings of the kernel and the kubernetes memory management system. From that I would say we have 3 main concerns about choosing the right metric: displaying memory usage consistently, the OOMKiller, and pod eviction.
The first one I hope is self-explanatory, so let's look at the second one, the OOMKiller.

OOMKiller

This beast takes into account only things that can be reliably measured by the kernel and kills the process with the highest oom_score. The score is proportional to RSS + SWAP divided by total available memory [1][2] and also takes into consideration an adjuster in the form of oom_score_adj (important for k8s [3]). Since everything in linux runs in a cgroup, this score can be computed for any container by using the "total available memory" of said cgroup (or of a parent one if the current cgroup doesn't have limits). So if we wanted to go only this route, it seems like choosing RSS (+ SWAP) would be the best way. However, let's look at the third option, pod eviction.

Pod eviction

According to the kubernetes documentation there are 5 signals which might cause pod eviction [4] and only one of them relates to memory. The memory-based eviction signal is derived from cgroups and is known as `memory.available`, which the kubelet calculates as capacity minus the cgroup working set.

What's first?

Under normal conditions pod eviction should happen before an OOMKill, due to how node eviction thresholds [6] are set compared to all available memory. When thresholds are met the kubelet should report memory pressure and processes should avoid an OOMKill. However, due to how the kubelet obtains data [7] there might be a case where it won't see the condition before the OOMKiller kicks in.

Summary

Considering all those findings I would say that our reference metric for "used" memory should be WSS. However we should keep in mind that this makes sense ONLY for kubernetes, due to some additional memory tweaking done by the kubelet on every pod.

[1]: https://github.com/torvalds/linux/blob/master/fs/proc/base.c#L547-L557
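To illustrate the OOMKiller part, the "RSS + SWAP over available memory" proportionality can be roughly approximated per container from existing metrics. This is only a proxy: the limit metric name and labels are assumptions, and oom_score_adj is not modeled at all.

```
# Rough per-container OOM badness proxy: (RSS + swap) relative to the memory limit.
# Metric and label names are assumptions; containers without a limit are not covered here.
(
    sum by (namespace, pod, container) (container_memory_rss{container!=""})
  + sum by (namespace, pod, container) (container_memory_swap{container!=""})
)
/ on (namespace, pod, container)
  kube_pod_container_resource_limits_memory_bytes
```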
Thank you for the in-depth description @paulfantom! One point, I would say I experience far more OOMKills from container limits than pod evictions, but I am sure that depends on your deployment. I am happy to use WSS for now, and see how it goes. Closing this ticket since #238 has already been merged.