High CPU and memory usage on Windows Node #2235

Closed
david-garcia-garcia opened this issue Mar 22, 2024 · 29 comments · Fixed by #2273

Comments

@david-garcia-garcia

david-garcia-garcia commented Mar 22, 2024

What happened:

On a Windows node, the Azure Disk CSI driver shows constant usage of between 50% and 100% of a single CPU and almost 1 GB of RAM.

[image]

[image]

This is a test cluster, with barely any activity at all.

Node VM type is:

node.kubernetes.io/instance-type=Standard_D2s_v3

What you expected to happen:

Even though memory limits are set in the configuration (#129):

[image]

Memory-based eviction is not working on Windows nodes:

Azure/AKS#2820

How to reproduce it:

Add a Windows node pool to an AKS cluster (Kubernetes 1.27.7) with azure-disk-csi driver 1.28.6.

Attach a disk to a pod.

Memory and CPU usage are excessive.

Anything else we need to know?:

Environment:

$ kubectl get po -n kube-system -o yaml | grep mcr | grep azuredisk
      image: mcr.microsoft.com/oss/kubernetes-csi/azuredisk-csi:v1.28.6
      image: mcr.microsoft.com/oss/kubernetes-csi/azuredisk-csi:v1.28.6
      imageID: mcr.microsoft.com/oss/kubernetes-csi/azuredisk-csi@sha256:cb78f2a27b6afedbe4a4accbf7536efb56e947fd28b82d35d156d5504d94f996
      image: mcr.microsoft.com/oss/kubernetes-csi/azuredisk-csi:v1.28.6-windows-hp
      image: mcr.microsoft.com/oss/kubernetes-csi/azuredisk-csi:v1.28.6-windows-hp
      image: mcr.microsoft.com/oss/kubernetes-csi/azuredisk-csi:v1.28.6-windows-hp
      imageID: mcr.microsoft.com/oss/kubernetes-csi/azuredisk-csi@sha256:5eadf08609bc798a35c16eb5e32b28dd2f6926d26b2eb1fd97d05715d729c737
      image: mcr.microsoft.com/oss/kubernetes-csi/azuredisk-csi:v1.28.6-windows-hp
      imageID: mcr.microsoft.com/oss/kubernetes-csi/azuredisk-csi@sha256:5eadf08609bc798a35c16eb5e32b28dd2f6926d26b2eb1fd97d05715d729c737
  • CSI Driver version: 1.28.6
  • Kubernetes version (use kubectl version): 1.27.7
  • OS (e.g. from /etc/os-release): Windows Server 2022
@david-garcia-garcia changed the title from "High CPU and memory usage in Windows Node" to "High CPU and memory usage on Windows Node" on Mar 22, 2024
@david-garcia-garcia
Author

Created Azure support request #2403220050000684

@david-garcia-garcia
Author

Killing the pod "fixed" the issue.

I need to find a way to repro this. Things started going downhill on the node after adding a new node pool and moving a pod+disk between the pools several times.

@andyzhangx
Member

@david-garcia-garcia do you know which container is consuming the most CPU and memory? Could you run the following command when you can repro? Thanks.

kubectl top po --containers -n kube-system | grep disk

@david-garcia-garcia
Author

david-garcia-garcia commented Apr 4, 2024

@andyzhangx I opened a support ticket again. I am keeping the pod alive this time; I hope I can hold onto it for a while, because it is completely draining a whole CPU full time.

PS C:\Windows\System32> kubectl top po --containers -n kube-system | Select-String -Pattern "disk"

csi-azuredisk-node-ln6dj                             azuredisk                 1m           15Mi
csi-azuredisk-node-ln6dj                             liveness-probe            1m           18Mi
csi-azuredisk-node-ln6dj                             node-driver-registrar     3m           21Mi
csi-azuredisk-node-win-m2vrh                         azuredisk                 987m         556Mi
csi-azuredisk-node-win-m2vrh                         node-driver-registrar     0m           5Mi

I believe the most straightforward way of finding the bug in the driver would be to take a process dump directly from within the node and analyze what is going on with the CPU.

@andyzhangx
Member

@david-garcia-garcia can you share the logs of the csi-azuredisk-node-win-m2vrh pod first? Thanks.

@david-garcia-garcia
Author

@andyzhangx

logs.txt

I removed the subscription id from the logs.

@david-garcia-garcia
Author

BTW, memory usage keeps accumulating:

csi-azuredisk-node-ln6dj                             azuredisk                 1m           15Mi
csi-azuredisk-node-ln6dj                             liveness-probe            1m           18Mi
csi-azuredisk-node-ln6dj                             node-driver-registrar     4m           21Mi
csi-azuredisk-node-win-m2vrh                         azuredisk                 800m         703Mi
csi-azuredisk-node-win-m2vrh                         node-driver-registrar     0m           5Mi

@andyzhangx
Member

@david-garcia-garcia if you have an azure file driver pod on the same node, could you run the following commands to get the ETL file?

kubectl exec -it csi-azurefile-node-win-xxx -n kube-system -c azurefile -- cmd
wpr -start cpu
# sleep 30s
wpr -stop 'c:\cpu.etl'
kubectl cp kube-system/csi-azurefile-node-win-xxx:/cpu.etl -c azurefile /tmp/cpu.etl

@david-garcia-garcia
Author

@andyzhangx yes, there is an azure file driver on the same node (I believe these are both deployed automatically with a managed AKS).

Here is the etl file:

https://we.tl/t-EevjXq2MHH

As per the metrics, the azurefile driver seems to be OK in terms of memory and CPU usage:

PS C:\Windows\System32> kubectl top po --containers -n kube-system | Select-String -Pattern "disk|file"

csi-azuredisk-node-win-m2vrh                         azuredisk                 841m         707Mi
csi-azuredisk-node-win-m2vrh                         node-driver-registrar     0m           5Mi
csi-azurefile-node-win-xwp4r                         azurefile                 0m           10Mi
csi-azurefile-node-win-xwp4r                         node-driver-registrar     0m           5Mi

@Howard-Haiyang-Hao

Howard-Haiyang-Hao commented Apr 8, 2024

@david-garcia-garcia: @andyzhangx and I have taken a closer look, and it seems that the issue was caused by a PowerShell command. We zoomed in from 98.55s to 99.25s, with the overall CPU usage attributed to the PowerShell command being 60%. Specifically, the PowerShell process (PID 15096) accounted for 44.15% of the total CPU usage, as shown in the image below:

[image]

The PowerShell (PID 15096) command line is highlighted below:

[image]

@andyzhangx and I will investigate the CSI disk components to identify the source of this issue.

@david-garcia-garcia
Author

@Howard-Haiyang-Hao thanks for the feedback, I am trying to put things together. I understand from your message that you have identified what might be causing the issue and that further investigation needs to be done.

Just as an update - in case this might be helpful - current memory usage for the disk plugin on the node is almost 1.3GB

   Id ProcessName                     WS
   -- -----------                     --
 9032 azurediskplugin         1269940224
13284 MsMpEng                  361623552
 8316 sqlservr                 361082880
13824 w3wp                     323289088
13244 powershell               221679616
 8516 powershell               179961856
 6424 powershell               151060480
  104 Registry                 144760832
 3840 powershell               139792384
 1828 powershell               120958976
 4192 powershell               118669312
16260 powershell               113315840
 3644 kubelet                  101240832
 9956 powershell                96866304
 3664 containerd                95223808
 3360 svchost                   84959232
12856 addon-token-adapter-win   79749120
11136 powershell                78569472
14772 WindowsAzureGuestAgent    76914688
 6096 w3wp                      75304960
 6184 svchost                   70791168
  996 svchost                   61079552
12992 telegraf                  60268544
 9048 powershell                58748928
12440 otelcollector             54804480

Maybe a memory dump could help to pinpoint the cause of the issue.
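
Side note on memory diagnostics: since azurediskplugin is a Go process, a heap profile would also show exactly what is accumulating. A minimal sketch of how a Go binary can expose pprof for that, assuming the driver does not already expose such an endpoint (the port is arbitrary):

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	// A heap profile can then be captured with:
	//   go tool pprof http://localhost:6060/debug/pprof/heap
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}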

@Howard-Haiyang-Hao

Howard-Haiyang-Hao commented Apr 8, 2024

Thanks, @david-garcia-garcia, for highlighting the memory issues. Could you please collect a new trace for us? Here are the steps:

  1. wpr.exe -start "c:\Program Files\Containerd\containerplatform.wprp" -start cpu -start fileio -start diskio -start ResidentSet -start ReferenceSet
  2. Sleep 30 seconds
  3. wpr.exe -stop azurediskws.etl

Thanks,
Howard.

@david-garcia-garcia
Author

david-garcia-garcia commented Apr 8, 2024

@Howard-Haiyang-Hao

PS C:\Users\hostadmin> wpr.exe -start "c:\Program Files\Containerd\containerplatform.wprp" -start cpu -start fileio -start diskio -start ResidentSet -start ReferenceSet
One or more profiles does not support memory mode. Did you forget to add -filemode?
        The Logging Mode does not match with the profile passed.

        Error code: 0xc5584017

@Howard-Haiyang-Hao

@david-garcia-garcia Apologies for the incorrect trace command. You're right, we need to use "-filemode":
wpr.exe -start "c:\Program Files\Containerd\containerplatform.wprp" -start cpu -start fileio -start diskio -start ResidentSet -start ReferenceSet -filemode

@david-garcia-garcia
Author

@Howard-Haiyang-Hao results here:

https://we.tl/t-2QP2sccdDL

Use "azure-csi-2024" to open

@andyzhangx
Member

@david-garcia-garcia could you get the full logs of the driver pod on that node?

kubectl logs csi-azuredisk-node-cvgbs -c azuredisk -n kube-system > csi-azuredisk-node.log

@david-garcia-garcia
Author

@andyzhangx yes, see my earlier post: #2235 (comment)

BTW, all the CSI driver pods died tonight; for whatever reason K8S decided to kill them, so I've lost any possibility of extracting relevant logs until the issue surfaces again.

@andyzhangx
Member

@david-garcia-garcia I suspect that the high CPU usage is caused by an infinite loop in the get volume stats code path: #2267

@david-garcia-garcia
Author

@andyzhangx looking at the PR, having depth protection on a recursive call makes sense, but if that is the cause, then there must be something flawed or an edge case not covered in this recursion logic.

Having a recursion error is in any case better than infinite recursion. Considering that, as far as I understand the algorithm, the mount argument mutates on every recursion, I wonder if we could accumulate all the mutations and include them in the error output for troubleshooting purposes. Something like this (I'm not familiar/comfortable with the Go language and syntax, so I asked an AI for some help):

func getTarget(mount string, depth int, path []string) (string, error) {
	if depth == 0 {
		return "", fmt.Errorf("maximum depth reached on mount %s, path: %v", mount, path)
	}
	cmd := "(Get-Item -Path $Env:mount).Target"
	out, err := azureutils.RunPowershellCmd(cmd, fmt.Sprintf("mount=%s", mount))
	if err != nil || len(out) == 0 {
		return "", fmt.Errorf("error getting volume from mount. cmd: %s, output: %s, error: %v, path: %v", cmd, out, err, path)
	}
	volumeString := strings.TrimSpace(string(out))
	if !strings.HasPrefix(volumeString, "Volume") {
		// Accumulate every intermediate target so the full chain shows up in errors.
		path = append(path, volumeString)
		return getTarget(volumeString, depth-1, path)
	}
	return ensureVolumePrefix(volumeString), nil
}
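
For illustration, a caller could then wrap it with an initial depth budget and an empty path slice; the constant and wrapper below are just my sketch, not the driver's actual code:

// Hypothetical wrapper: bound the number of target hops when resolving a mount.
const maxTargetDepth = 10 // assumed budget, not taken from the driver

func resolveMountVolume(mount string) (string, error) {
	return getTarget(mount, maxTargetDepth, []string{})
}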

@andyzhangx
Member

@david-garcia-garcia I will publish v1.28.7 with the fix. Are you willing to give it a try first? Just mail me the AKS cluster FQDN and I can upgrade the CSI driver version in your cluster this week, thanks.

@david-garcia-garcia
Author

@andyzhangx Sent you an e-mail. Thanks!

@andyzhangx
Member

@david-garcia-garcia driver version updated, please try with your Windows workloads, thanks.

@andyzhangx
Member

andyzhangx commented Apr 12, 2024

Finally, I have worked out a PR to get rid of the expensive (Get-Item -Path $Env:mount).Target call.
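
As an illustration of the idea (not necessarily what the PR actually does), the mount target could be resolved natively in Go instead of spawning a PowerShell process on every volume-stats call; this sketch assumes the mount is a symlink/junction that os.Readlink can resolve to a volume path:

import (
	"fmt"
	"os"
	"strings"
)

// getTargetNative resolves a Windows mount point to its underlying volume
// path by reading the reparse target with os.Readlink, avoiding a PowerShell
// process per call. Sketch only: assumes the mount's target is a
// \\?\Volume{...}\ style path, which may not hold for every mount layout.
func getTargetNative(mount string) (string, error) {
	target, err := os.Readlink(mount)
	if err != nil {
		return "", fmt.Errorf("error reading target of mount %s: %v", mount, err)
	}
	target = strings.TrimSpace(target)
	if !strings.Contains(target, "Volume") {
		return "", fmt.Errorf("unexpected target %q for mount %s", target, mount)
	}
	return target, nil
}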

@Howard-Haiyang-Hao

@david-garcia-garcia would you mind sharing the trace again? Thanks!

https://we.tl/t-2QP2sccdDL

@MostefaKamalLala

I am experiencing the same issue. I have between 6 and 16 disks attached on B8ms Windows nodes. I temporarily modified the limits on the csi-azuredisk-node-win daemonset to avoid bursting the server. Do I just need to update the driver version in the daemonset to get it fixed? @Howard-Haiyang-Hao @andyzhangx
Do I need to update AKS as well? I am running 1.27.7 and the driver is azuredisk-csi:v1.28.7-windows-hp.

@andyzhangx
Member

@MostefaKamalLala what's the result of the following command?

kubectl top po --containers -n kube-system | Select-String -Pattern "disk|file"

@MostefaKamalLala

@andyzhangx It won't show me the value it was at a couple of hours ago, but I still have the result of kubectl top pods -n kube-system in my terminal from before changing the CPU limit on the daemonset. BTW, the 1.2-core usage is on a pod I killed, which is why you don't see it afterwards.
[image]

And after changing the limit to 150Mi:
[image]

Just for information, those showing 0 are on dormant nodes.

Here is the result of your command, which I executed just now:
[image]

I wanted to get the metrics from Prometheus to show you the history, but it seems it doesn't scrape them.
[image]

@MostefaKamalLala

MostefaKamalLala commented May 31, 2024

@andyzhangx I assume it is better to uninstall the AKS-managed CSI driver and install the latest version myself with the chart, right?

@andyzhangx
Member

It's fixed in https://github.com/kubernetes-sigs/azuredisk-csi-driver/releases/tag/v1.28.8; we will roll out v1.28.8 in the next aks-rp release. It's now in our master branch.
