High CPU and memory usage on Windows Node #2235
Created Azure support request #2403220050000684 |
Killing the pod "fixed" the issue. I need to find a way to repro this. Things started going downhill in the node after adding a new node pool and moving a pod+disk between the pools several times. |
@david-garcia-garcia do you know which container is consuming more CPU and memory? Could you run the following command when you can repro? Thanks.
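Not the exact command from this thread, but as a hedged example, per-container usage can be checked with something like the following (assumes metrics-server is available and the usual app label on the Windows node daemonset pods):

```console
# Example only (label selector assumed): per-container CPU/memory of the Windows azuredisk node pods
kubectl top pod -n kube-system -l app=csi-azuredisk-node-win --containers
```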
|
@andyzhangx I opened a support ticket again. I am keeping the pod alive this time; hope I can hold it for a while, because it is totally draining a whole CPU full time.
I believe the most straightforward way of finding the bug in the driver would be to take a process dump directly from within the node and analyze what is going on with the CPU. |
@david-garcia-garcia can you share the logs of csi-azure-disk-m-node-win-m2vrh pod first? thanks. |
I removed the subscription id from the logs. |
BTW, memory usage keeps accumulating:
|
@david-garcia-garcia if you have an azure file driver pod on the same node, could you run the following command to get the ETL file?
|
@andyzhangx yes, there is an azure file driver on the same node (I believe these are both deployed automatically with a managed AKS). Here is the ETL file. As per the metrics, the azurefile driver seems to be OK in terms of memory and CPU usage:
|
@david-garcia-garcia: @andyzhangx and I have taken a closer look, and it seems that the issue was caused by a PowerShell command. We zoomed in from 98.55s to 99.25s, with the overall CPU usage attributed to the PowerShell command being 60%. Specifically, the PowerShell process (PID 15096) accounted for 44.15% of the total CPU usage, as shown in the image below. The PowerShell (15096) command line is highlighted below. @andyzhangx and I will investigate the CSI disk components to identify the source of this issue. |
@Howard-Haiyang-Hao thanks for the feedback, I am trying to put things together. I understand from your message that you have identified what might be causing the issue and that further investigation needs to be done. Just as an update - in case this might be helpful - current memory usage for the disk plugin on the node is almost 1.3GB
Maybe a memory dump could help to pinpoint the cause of the issue. |
Thanks, @david-garcia-garcia, for highlighting the memory issues. Could you please collect a new trace for us? Here are the steps:
Thanks, |
|
@david-garcia-garcia Apologies for the incorrect trace command. You're right, we need to use "-filemode": |
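For reference, a sketch of what the corrected trace collection could look like (not the exact command from this thread; the profile name and output path are illustrative, and it assumes wpr.exe is available on the Windows node):

```console
# Illustrative only: run from a shell on the Windows node (or a privileged pod with host access).
# Start a CPU trace logging to a file, reproduce the high CPU for a minute or two, then stop it.
wpr.exe -start CPU -filemode
wpr.exe -stop C:\cpu-trace.etl
```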
@Howard-Haiyang-Hao results here: Use "azure-csi-2024" to open |
@david-garcia-garcia could you get the full logs of the driver pod on that node?
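A typical way to pull those logs, sketched as an example (the pod name is a placeholder and the azuredisk container name is an assumption):

```console
# Example only: dump the full logs of the azuredisk container from the Windows node pod
kubectl logs -n kube-system csi-azuredisk-node-win-xxxxx -c azuredisk > azuredisk-node.log
```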
|
@andyzhangx yes, in my post #2235 (comment). BTW, all the CSI driver pods died tonight; for whatever reason K8S decided to kill them, so I've lost any possibility of extracting relevant logs until the issue surfaces again. |
@david-garcia-garcia I suspect that the high CPU usage is caused by a dead loop in get volume stats: #2267 |
@andyzhangx looking at the PR, having depth protection on a recursive call makes sense, but if that is the cause, then there must be something flawed or an edge case not covered in this recursion logic. Having a recursion error is in any case better than an infinite recursion. Considering that (as I understand the algorithm) the mount argument mutates on every recursion, I'm not sure if we could accumulate all the mutations and include them in the error output for troubleshooting purposes. Something like this (I'm not familiar/comfortable with the golang language and syntax, so I asked an AI for some help):
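A rough sketch of the idea (the function name and the symlink-based resolution step are assumptions, not the actual driver code): carry the chain of intermediate targets through the recursion and report the whole chain when the depth limit is hit.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// resolveTarget is a sketch only (names and the symlink-based resolution are assumptions,
// not the actual driver code). It follows a chain of mount targets with a depth limit,
// accumulating every intermediate value so the full chain can be included in the error.
func resolveTarget(target string, depth int, visited []string) (string, error) {
	if depth <= 0 {
		return "", fmt.Errorf("maximum recursion depth reached, mount chain so far: %s",
			strings.Join(append(visited, target), " -> "))
	}
	fi, err := os.Lstat(target)
	if err != nil {
		return "", err
	}
	if fi.Mode()&os.ModeSymlink == 0 {
		// Not a link anymore: this is the final target.
		return target, nil
	}
	next, err := os.Readlink(target)
	if err != nil {
		return "", err
	}
	if !filepath.IsAbs(next) {
		next = filepath.Join(filepath.Dir(target), next)
	}
	// Recurse with a reduced depth budget and the current target recorded for diagnostics.
	return resolveTarget(next, depth-1, append(visited, target))
}

func main() {
	resolved, err := resolveTarget(`C:\var\lib\kubelet\some-mount-target`, 10, nil)
	fmt.Println(resolved, err)
}
```

That way, when the depth limit triggers, the error already contains the full chain of mutated mount targets for troubleshooting.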
|
@david-garcia-garcia I will publish v1.28.7 with the fix, are you willing to have a try first? just mail me the aks cluster fqdn, I could upgrade csi driver version in your cluster this week, thanks. |
@andyzhangx Sent you an e-mail. Thanks! |
@david-garcia-garcia driver version updated, pls try with windows workloads, thanks. |
finally I have worked out a PR to get rid of the expensive call
@david-garcia-garcia would you mind sharing the trace again? Thanks! |
I am experiencing the same issue. I have between 6 and 16 disks attached on B8ms Windows nodes. I temporarily modified the limits on the csi-azuredisk-node-win DaemonSet to avoid bursting the server. Do I just need to update the driver version in the DaemonSet to get it fixed? @Howard-Haiyang-Hao @andyzhangx |
@MostefaKamalLala what's the result of the following command?
|
@andyzhangx It won't show me the value it was at a couple of hours ago, but I still have the result of the command on my terminal from after changing the limit to 150Mi. Just for information, those showing 0 are from dormant nodes. Here is the result of your command that I executed now: I wanted to get the metrics from Prometheus to show you the history, but it seems like it doesn't scrape them |
@andyzhangx I assume it is better to uninstall the AKS-managed CSI driver and install the latest version myself with the chart, right? |
It's fixed in https://github.com/kubernetes-sigs/azuredisk-csi-driver/releases/tag/v1.28.8; we will roll out v1.28.8 in the next aks-rp release. It's now in our master branch.
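A quick way to confirm which driver version a cluster is actually running, sketched as an example (daemonset name and namespace are the usual AKS defaults):

```console
# Example only: print the images used by the Windows node daemonset to verify the driver version
kubectl -n kube-system get ds csi-azuredisk-node-win -o jsonpath='{.spec.template.spec.containers[*].image}'
```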
What happened:
On a Windows node, the Azure Disk CSI driver reports constant usage of somewhere between 50% and 100% of a single CPU, and almost 1GB of RAM.
This is a test cluster, with barely any activity at all.
Node VM type is:
node.kubernetes.io/instance-type=Standard_D2s_v3
What you expected to happen:
Even though there are memory limits in the configuration (#129), memory-based eviction is not working on Windows nodes:
Azure/AKS#2820
How to reproduce it:
Add a Windows node pool to an AKS cluster (K8S version 1.27.7 with azuredisk-csi-driver 1.28.6).
Attach a disk to a pod (an example manifest is sketched below).
Memory and CPU usage are excessive.
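A minimal manifest along these lines reproduces the setup (illustrative only; the managed-csi storage class and the Windows image are common AKS defaults, assumed here rather than taken from the issue):

```yaml
# Illustrative sketch: an Azure Disk PVC plus a Windows pod that mounts it.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: win-disk-pvc
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: managed-csi
  resources:
    requests:
      storage: 10Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: win-disk-pod
spec:
  nodeSelector:
    kubernetes.io/os: windows
  containers:
    - name: app
      image: mcr.microsoft.com/windows/servercore:ltsc2022
      command: ["powershell", "-Command", "Start-Sleep -Seconds 86400"]
      volumeMounts:
        - name: data
          mountPath: "C:\\mnt\\data"
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: win-disk-pvc
```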
Anything else we need to know?:
Environment:
Kubernetes version (use kubectl version): 1.27.7