High CPU and memory usage on Windows Node #2235

Closed
david-garcia-garcia opened this issue Mar 22, 2024 · 29 comments · Fixed by #2273

Comments

@david-garcia-garcia

david-garcia-garcia commented Mar 22, 2024

What happened:

On a Windows node, the Azure Disk CSI driver shows constant usage of between 50% and 100% of a single CPU and almost 1 GB of RAM.

[image]

[image]

This is a test cluster, with barely any activity at all.

Node VM type is:

node.kubernetes.io/instance-type=Standard_D2s_v3

What you expected to happen:

Even though memory limits are set in the configuration (#129):

[image]

Memory-based eviction is not working on Windows nodes:

Azure/AKS#2820

How to reproduce it:

Add a Windows node pool to an AKS cluster (Kubernetes 1.27.7) with azure-disk-csi driver 1.28.6.

Attach a disk to a pod.

Memory and CPU usage are excessive.

Anything else we need to know?:

Environment:

$ kubectl get po -n kube-system -o yaml | grep mcr | grep azuredisk
      image: mcr.microsoft.com/oss/kubernetes-csi/azuredisk-csi:v1.28.6
      image: mcr.microsoft.com/oss/kubernetes-csi/azuredisk-csi:v1.28.6
      imageID: mcr.microsoft.com/oss/kubernetes-csi/azuredisk-csi@sha256:cb78f2a27b6afedbe4a4accbf7536efb56e947fd28b82d35d156d5504d94f996
      image: mcr.microsoft.com/oss/kubernetes-csi/azuredisk-csi:v1.28.6-windows-hp
      image: mcr.microsoft.com/oss/kubernetes-csi/azuredisk-csi:v1.28.6-windows-hp
      image: mcr.microsoft.com/oss/kubernetes-csi/azuredisk-csi:v1.28.6-windows-hp
      imageID: mcr.microsoft.com/oss/kubernetes-csi/azuredisk-csi@sha256:5eadf08609bc798a35c16eb5e32b28dd2f6926d26b2eb1fd97d05715d729c737
      image: mcr.microsoft.com/oss/kubernetes-csi/azuredisk-csi:v1.28.6-windows-hp
      imageID: mcr.microsoft.com/oss/kubernetes-csi/azuredisk-csi@sha256:5eadf08609bc798a35c16eb5e32b28dd2f6926d26b2eb1fd97d05715d729c737
  • CSI Driver version: 1.28.6
  • Kubernetes version (use kubectl version): 1.27.7
  • OS (e.g. from /etc/os-release): Windows Server 2022
@david-garcia-garcia changed the title from "High CPU and memory usage in Windows Node" to "High CPU and memory usage on Windows Node" on Mar 22, 2024
@david-garcia-garcia
Author

Created Azure support request #2403220050000684

@david-garcia-garcia
Author

Killing the pod "fixed" the issue.

I need to find a way to repro this. Things started going downhill on the node after adding a new node pool and moving a pod+disk between the pools several times.

@andyzhangx
Member

@david-garcia-garcia do you know which container is consuming the most CPU and memory? Could you run the following command when you can repro? Thanks.

kubectl top po --containers -n kube-system | grep disk

@david-garcia-garcia
Author

david-garcia-garcia commented Apr 4, 2024

@andyzhangx I opened a support ticket again. I am keeping the pod alive this time; I hope I can hold onto it for a while, because it is completely draining a whole CPU full time.

PS C:\Windows\System32> kubectl top po --containers -n kube-system | Select-String -Pattern "disk"

csi-azuredisk-node-ln6dj                             azuredisk                 1m           15Mi
csi-azuredisk-node-ln6dj                             liveness-probe            1m           18Mi
csi-azuredisk-node-ln6dj                             node-driver-registrar     3m           21Mi
csi-azuredisk-node-win-m2vrh                         azuredisk                 987m         556Mi
csi-azuredisk-node-win-m2vrh                         node-driver-registrar     0m           5Mi

I believe the most straightforward way of finding the bug in the driver would be to take a process dump directly from within the node and analyze what is going on with the CPU.

@andyzhangx
Member

@david-garcia-garcia can you share the logs of the csi-azuredisk-node-win-m2vrh pod first? Thanks.

@david-garcia-garcia
Author

@andyzhangx

logs.txt

I removed the subscription id from the logs.

@david-garcia-garcia
Author

BTW, memory usage keeps accumulating:

csi-azuredisk-node-ln6dj                             azuredisk                 1m           15Mi
csi-azuredisk-node-ln6dj                             liveness-probe            1m           18Mi
csi-azuredisk-node-ln6dj                             node-driver-registrar     4m           21Mi
csi-azuredisk-node-win-m2vrh                         azuredisk                 800m         703Mi
csi-azuredisk-node-win-m2vrh                         node-driver-registrar     0m           5Mi

@andyzhangx
Member

@david-garcia-garcia if you have an azure file driver pod on the same node, could you run the following commands to get the ETL file?

kubectl exec -it csi-azurefile-node-win-xxx -n kube-system -c azurefile -- cmd
wpr -start cpu
# sleep 30s
wpr -stop 'c:\cpu.etl'
kubectl cp kube-system/csi-azurefile-node-win-xxx:/cpu.etl -c azurefile /tmp/cpu.etl

@david-garcia-garcia
Author

@andyzhangx yes, there is an azure file driver on the same node (I believe these are both deployed automatically with a managed AKS).

Here is the etl file:

https://we.tl/t-EevjXq2MHH

As per the metrics, the azurefile driver seems to be OK in terms of memory and CPU usage:

PS C:\Windows\System32> kubectl top po --containers -n kube-system | Select-String -Pattern "disk|file"

csi-azuredisk-node-win-m2vrh                         azuredisk                 841m         707Mi
csi-azuredisk-node-win-m2vrh                         node-driver-registrar     0m           5Mi
csi-azurefile-node-win-xwp4r                         azurefile                 0m           10Mi
csi-azurefile-node-win-xwp4r                         node-driver-registrar     0m           5Mi

@Howard-Haiyang-Hao

Howard-Haiyang-Hao commented Apr 8, 2024

@david-garcia-garcia: @andyzhangx and I have taken a closer look, and it seems that the issue was caused by a PowerShell command. We zoomed in from 98.55s to 99.25s, with the overall CPU usage attributed to the PowerShell command being 60%. Specifically, the PowerShell process (PID 15096) accounted for 44.15% of the total CPU usage, as shown in the image below:

[image]

The PowerShell (PID 15096) command line is highlighted below:

[image]

@andyzhangx and I will investigate the CSI disk components to identify the source of this issue.

@david-garcia-garcia
Author

@Howard-Haiyang-Hao thanks for the feedback, I am trying to put things together. I understand from your message that you have identified what might be causing the issue and that further investigation needs to be done.

Just as an update - in case this might be helpful - current memory usage for the disk plugin on the node is almost 1.3GB

   Id ProcessName                     WS
   -- -----------                     --
 9032 azurediskplugin         1269940224
13284 MsMpEng                  361623552
 8316 sqlservr                 361082880
13824 w3wp                     323289088
13244 powershell               221679616
 8516 powershell               179961856
 6424 powershell               151060480
  104 Registry                 144760832
 3840 powershell               139792384
 1828 powershell               120958976
 4192 powershell               118669312
16260 powershell               113315840
 3644 kubelet                  101240832
 9956 powershell                96866304
 3664 containerd                95223808
 3360 svchost                   84959232
12856 addon-token-adapter-win   79749120
11136 powershell                78569472
14772 WindowsAzureGuestAgent    76914688
 6096 w3wp                      75304960
 6184 svchost                   70791168
  996 svchost                   61079552
12992 telegraf                  60268544
 9048 powershell                58748928
12440 otelcollector             54804480

Maybe a memory dump could help to pinpoint the cause of the issue.
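
Side note on memory diagnostics: since azurediskplugin is a Go process, a heap profile would also show exactly what is accumulating. A minimal sketch of how a Go binary can expose pprof for that, assuming the driver does not already expose such an endpoint (the port is arbitrary):

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	// A heap profile can then be captured with:
	//   go tool pprof http://localhost:6060/debug/pprof/heap
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}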

@Howard-Haiyang-Hao

Howard-Haiyang-Hao commented Apr 8, 2024

Thanks, @david-garcia-garcia, for highlighting the memory issues. Could you please collect a new trace for us? Here are the steps:

  1. wpr.exe -start "c:\Program Files\Containerd\containerplatform.wprp" -start cpu -start fileio -start diskio -start ResidentSet -start ReferenceSet
  2. Sleep 30 seconds
  3. wpr.exe -stop azurediskws.etl

Thanks,
Howard.

@david-garcia-garcia
Author

david-garcia-garcia commented Apr 8, 2024

@Howard-Haiyang-Hao

PS C:\Users\hostadmin> wpr.exe -start "c:\Program Files\Containerd\containerplatform.wprp" -start cpu -start fileio -start diskio -start ResidentSet -start ReferenceSet
One or more profiles does not support memory mode. Did you forget to add -filemode?
        The Logging Mode does not match with the profile passed.

        Error code: 0xc5584017

@Howard-Haiyang-Hao

@david-garcia-garcia Apologies for the incorrect trace command. You're right, we need to use "-filemode":
wpr.exe -start "c:\Program Files\Containerd\containerplatform.wprp" -start cpu -start fileio -start diskio -start ResidentSet -start ReferenceSet -filemode

@david-garcia-garcia
Author

@Howard-Haiyang-Hao results here:

https://we.tl/t-2QP2sccdDL

Use "azure-csi-2024" to open

@andyzhangx
Member

@david-garcia-garcia could you get the full logs of the driver pod on that node?

kubectl logs csi-azuredisk-node-cvgbs -c azuredisk -n kube-system > csi-azuredisk-node.log

@david-garcia-garcia
Author

@andyzhangx yes, see my earlier post: #2235 (comment)

BTW, all the CSI driver pods died tonight; for whatever reason K8S decided to kill them, so I've lost any possibility of extracting relevant logs until the issue surfaces again.

@andyzhangx
Member

@david-garcia-garcia I suspect that the high CPU usage is caused by an infinite loop in the get volume stats code path: #2267

@david-garcia-garcia
Author

@andyzhangx looking at the PR, having depth protection on a recursive call makes sense, but if that is the cause, then there must be something flawed or an edge case not covered in this recursion logic.

Having a recursion error is in any case better than infinite recursion. Considering that, as far as I understand the algorithm, the mount argument mutates on every recursion, I wonder if we could accumulate all the mutations and include them in the error output for troubleshooting purposes. Something like this (I'm not familiar/comfortable with the Go language and syntax, so I asked an AI for some help):

func getTarget(mount string, depth int, path []string) (string, error) {
	if depth == 0 {
		return "", fmt.Errorf("maximum depth reached on mount %s, path: %v", mount, path)
	}
	cmd := "(Get-Item -Path $Env:mount).Target"
	out, err := azureutils.RunPowershellCmd(cmd, fmt.Sprintf("mount=%s", mount))
	if err != nil || len(out) == 0 {
		return "", fmt.Errorf("error getting volume from mount. cmd: %s, output: %s, error: %v, path: %v", cmd, out, err, path)
	}
	volumeString := strings.TrimSpace(string(out))
	if !strings.HasPrefix(volumeString, "Volume") {
		// Accumulate every intermediate target so the full chain shows up in errors.
		path = append(path, volumeString)
		return getTarget(volumeString, depth-1, path)
	}
	return ensureVolumePrefix(volumeString), nil
}
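
For illustration, a caller could then wrap it with an initial depth budget and an empty path slice; the constant and wrapper below are just my sketch, not the driver's actual code:

// Hypothetical wrapper: bound the number of target hops when resolving a mount.
const maxTargetDepth = 10 // assumed budget, not taken from the driver

func resolveMountVolume(mount string) (string, error) {
	return getTarget(mount, maxTargetDepth, []string{})
}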

@andyzhangx
Member

@david-garcia-garcia I will publish v1.28.7 with the fix. Are you willing to give it a try first? Just mail me the AKS cluster FQDN and I can upgrade the CSI driver version in your cluster this week, thanks.

@david-garcia-garcia
Author

@andyzhangx Sent you an e-mail. Thanks!

@andyzhangx
Member

@david-garcia-garcia driver version updated, please try with your Windows workloads, thanks.

@andyzhangx
Member

andyzhangx commented Apr 12, 2024

Finally, I have worked out a PR to get rid of the expensive (Get-Item -Path $Env:mount).Target call.
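
As an illustration of the idea (not necessarily what the PR actually does), the mount target could be resolved natively in Go instead of spawning a PowerShell process on every volume-stats call; this sketch assumes the mount is a symlink/junction that os.Readlink can resolve to a volume path:

import (
	"fmt"
	"os"
	"strings"
)

// getTargetNative resolves a Windows mount point to its underlying volume
// path by reading the reparse target with os.Readlink, avoiding a PowerShell
// process per call. Sketch only: assumes the mount's target is a
// \\?\Volume{...}\ style path, which may not hold for every mount layout.
func getTargetNative(mount string) (string, error) {
	target, err := os.Readlink(mount)
	if err != nil {
		return "", fmt.Errorf("error reading target of mount %s: %v", mount, err)
	}
	target = strings.TrimSpace(target)
	if !strings.Contains(target, "Volume") {
		return "", fmt.Errorf("unexpected target %q for mount %s", target, mount)
	}
	return target, nil
}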

@Howard-Haiyang-Hao

@david-garcia-garcia would you mind sharing the trace again? Thanks!

https://we.tl/t-2QP2sccdDL

@MostefaKamalLala

I am experiencing the same issue. I have between 6 and 16 disks attached on B8ms Windows nodes. I temporarily modified the limits on the csi-azuredisk-node-win daemonset to avoid bursting the server. Do I just need to update the driver version in the daemonset to get it fixed? @Howard-Haiyang-Hao @andyzhangx
Do I need to update AKS as well? I am running 1.27.7 and the driver is azuredisk-csi:v1.28.7-windows-hp.

@andyzhangx
Member

@MostefaKamalLala what's the result of the following command?

kubectl top po --containers -n kube-system | Select-String -Pattern "disk|file"

@MostefaKamalLala

@andyzhangx It won't show me the value it was at a couple of hours ago, but I still have the result of kubectl top pods -n kube-system in my terminal from before changing the CPU limit on the daemonset. BTW, the 1.2-core usage is on a pod I killed, which is why you don't see it afterwards.
[image]

And after changing the limit to 150Mi:
[image]

Just for information, those showing 0 are on dormant nodes.

Here is the result of your command, which I executed just now:
[image]

I wanted to get the metrics from Prometheus to show you the history, but it seems it doesn't scrape them.
[image]

@MostefaKamalLala

MostefaKamalLala commented May 31, 2024

@andyzhangx I assume it is better to uninstall the AKS-managed CSI driver and install the latest version myself with the chart, right?

@andyzhangx
Member

It's fixed in https://github.com/kubernetes-sigs/azuredisk-csi-driver/releases/tag/v1.28.8; we will roll out v1.28.8 in the next aks-rp release. It's now in our master branch.
