Skip to content
This repository has been archived by the owner on Oct 24, 2023. It is now read-only.

kubelet failed to start on windows nodes with accelerated networking #2281

Closed
yixwang opened this issue Nov 7, 2019 · 35 comments
Closed

kubelet failed to start on windows nodes with accelerated networking #2281

yixwang opened this issue Nov 7, 2019 · 35 comments
Labels
backlog bug Something isn't working windows
Milestone

Comments

@yixwang
Copy link

yixwang commented Nov 7, 2019

Describe the bug
In one of our newly created cluster, we noticed that some windows nodes are missing from the cluster, when running kubectl get nodes. However, these nodes do exist and run healthy as shown on the Portal.
Example:
Only one node show up. However, all three windows nodes in the node pool of VMSS on Portal are running ok.
image
image

After looking into kubelet.err.log on the windows nodes that are not recognized by Kubernetes. We found it has a lot of the following errors repeat many times:

At C:\k\kubeletstart.ps1:50 char:54
+ ...    $mgmtIPAfterNetworkCreate = Get-NetIPAddress $managementIP -ErrorA ...
+                                                     ~~~~~~~~~~~~~
    + CategoryInfo          : InvalidData: (:) [Get-NetIPAddress], ParameterBindingValidationException
    + FullyQualifiedErrorId : ParameterArgumentValidationError,Get-NetIPAddress
 
Get-NetIPAddress : Cannot validate argument on parameter 'IPAddress'. The argument is null. Provide a valid value for 
the argument, and then try running the command again.

Line 50 corresponds to the 3rd line in the following in the kubeletstart.ps1 file

for ($i=0;$i -lt 60;$i++)
    {
        $mgmtIPAfterNetworkCreate = Get-NetIPAddress $managementIP -ErrorAction SilentlyContinue
        if ($mgmtIPAfterNetworkCreate)
        {
            break
        }
        sleep -Milliseconds 1000
    }

Can someone help identify the root cause in this case?

Steps To Reproduce

Create the cluster with the following apimodel.json

{
  "apiVersion": "vlabs",
  "properties": {
    "orchestratorProfile": {
      "orchestratorType": "Kubernetes",
      "orchestratorVersion": "1.15.4",
      "kubernetesConfig": {
        "networkPolicy": "azure",
        "enableRbac": true,
        "enableDataEncryptionAtRest": true,
        "loadBalancerSku": "Basic",
        "useManagedIdentity": true,
        "excludeMasterFromStandardLB": true,
        "apiServerConfig": {
          "--tls-cipher-suites": "TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256,TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA,TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA,TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA,TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA"
        },
        "addons": [
          {
            "name": "tiller",
            "enabled": true,
            "containers": [
              {
                "name": "tiller",
                "image": "gcr.io/kubernetes-helm/tiller:v2.14.3"
              }
            ]
          }
        ]
      }
    },
    "masterProfile": {
      "count": 3,
      "dnsPrefix": "{{ tpl .Topology.properties.masterDnsPrefix }}",
      "vmSize": "Standard_D8s_v3"
    },
    "agentPoolProfiles": [
      {
        "name": "windowspool",
        "count": 3,
        "vmSize": "Standard_D4s_v3",
        "availabilityProfile": "VirtualMachineScaleSets",
        "osType": "Windows",
        "osDiskSizeGB": 256
      },
      {
        "name": "linuxpool",
        "count": 3,
        "vmSize": "Standard_D4s_v3",
        "availabilityProfile": "VirtualMachineScaleSets"
      }
    ],
    "windowsProfile": {
      "adminUsername": "azureuser",
      "adminPassword": "[TO_BE_REPLACED]",
      "enableAutomaticUpdates": true,
      "sshEnabled": true
    },
    "linuxProfile": {
      "adminUsername": "azureuser",
      "ssh": {
        "publicKeys": [
          {
            "keyData": "[TO_BE_REPLACED]"
          }
        ]
      }
    },
    "servicePrincipalProfile": {
      "clientId": "[TO_BE_REPLACED]",
      "objectId": "[TO_BE_REPLACED]",
      "secret": "[TO_BE_REPLACED]"
    },
    "aadProfile": {
      "serverAppID": "[TO_BE_REPLACED]",
      "clientAppID": "[TO_BE_REPLACED]",
      "tenantID": "[TO_BE_REPLACED]",
      "adminGroupID": "[TO_BE_REPLACED]"
    },
    "certificateProfile": {
      "apiServerCertificate": "",
      "apiServerPrivateKey": "",
      "caCertificate": "",
      "caPrivateKey": "",
      "clientCertificate": "",
      "clientPrivateKey": "",
      "kubeConfigCertificate": "",
      "kubeConfigPrivateKey": "",
      "etcdServerCertificate": "",
      "etcdServerPrivateKey": "",
      "etcdClientCertificate": "",
      "etcdClientPrivateKey": "",
      "etcdPeerCertificates": [ "", "", "" ],
      "etcdPeerPrivateKeys": [ "", "", "" ]
    }
  }
}

Expected behavior
All nodes should be recongnized by the cluster

AKS Engine version
0.43.0

Kubernetes version
1.15.4
Additional context

@yixwang yixwang added the bug Something isn't working label Nov 7, 2019
@CecileRobertMichon
Copy link
Contributor

cc @marosset @PatrickLang

@marosset
Copy link
Contributor

marosset commented Nov 7, 2019

cc @daschott @pradipd
The check to wait for managment IP added by #2049 looks like it timed out here.
Should we increase the timeout?

@pradipd
Copy link
Contributor

pradipd commented Nov 7, 2019

Yes. I would bump to 3 minutes.
We are actively investigating why it takes so long.

@yixwang
Copy link
Author

yixwang commented Nov 12, 2019

@marosset Thanks a lot for making the fix. Just wondering when will it be released? Asking because we are currently blocked since we recreated our cluster using the latest 0.43.1 and we are seeing missing nodes behaviors in all our clusters. We were trying to revert back to 0.41.5 which is the last release before #2049 is introduced. However, 0.41.5 doesn't have the latest security patches for windows image, which is added in #2176. So it would be good to know when this fix will be released as well as any other options to take at this time.

@marosset marosset added this to the v0.44.0 milestone Nov 13, 2019
@marosset
Copy link
Contributor

We are planning on releasing a v0.44.0 mid next week which should include the fix to increase the timeout.
You should also be able to target newer Windows VHDs in your apimodel files if you cannot wait.
https://github.com/Azure/aks-engine/blob/da5ad7c157c0e172703fff00f834cf7c9e3f28c7/docs/topics/windows-and-kubernetes.md has details on update the api model and also searching for the latest image references.

@marosset
Copy link
Contributor

@pradipd have you learned anything from your investigations as to why this is taking a long time?

@pradipd
Copy link
Contributor

pradipd commented Nov 13, 2019

@yixwang Once everything is up and running, can you do the following on a node?
ipconfig /release
ipconfig /renew
One issue we have seen is DHCP taking a long time to give an IP address.

If that is not the issue, my guess is that it takes a while to create the vswitch. We are working with a few team to figure out why. But, that will not be a quick investigation and it won't be a quick fix.

@marosset Is there anyway we can collect logs? We have a script we use to debug these issues:
https://github.com/microsoft/SDN/blob/master/Kubernetes/windows/debug/collectlogs.ps1
https://github.com/microsoft/SDN/blob/master/Kubernetes/windows/debug/starthnstrace.cmd
But, we would have to put them in windowskubeletfunc.ps1 and that would effect everybody. Is there anyway to create private binaries and just have @yixwang run them?
Can we add a debug environment variable or something?

@marosset
Copy link
Contributor

@pradipd We're already downloading the log collection files in kubeneretswindowssetup.ps1 :)

Get-NetworkLogCollectionScripts

They should on the nodes that @yixwang provisioned but if it is possible to just download collectlogs.ps1 with curl. Running that should download the rest.

@pradipd
Copy link
Contributor

pradipd commented Nov 13, 2019

Awesome. Can we run starthnstrace.cmd? That will start hns tracing. Do we want to do that on all deployments? Or should we hide behind some sort of debug flag?

@marosset
Copy link
Contributor

I don't think we want to start hns tracing in all deployments.
For now it might be easiest to just have @yixwang start collecting hns traces and then restart the kubelet service to get a repro.

@pradipd
Copy link
Contributor

pradipd commented Nov 13, 2019

This is going to happen only on the initial deployment when we create the external switch.

@pradipd
Copy link
Contributor

pradipd commented Nov 13, 2019

@yixwang can you share the entire kubelet.err.log .

@marosset
Copy link
Contributor

Creating the external switch happens in c:\k\kubeletstart.ps1 which runs whenever the kubelet service started. You should be able to stop the kubeproxy/kubelet service, delete the existing 'ext' hns network/switch and then start kubelet to repro this I believe.

@yixwang
Copy link
Author

yixwang commented Nov 13, 2019

@marosset @pradipd So to collect the kubelet.err.log, my understanding is that i need to do the following:

  1. download collectlogs.ps1
  2. run collectlogs.ps1 to download starthnstrace.cmd
  3. run starthnstrace.cmd
  4. run kubeletstart.ps1
  5. collect the logs from kubelet.err.log

Let me know if I am missing anything above. Once confirm, I will collect the log and upload here.

@pradipd
Copy link
Contributor

pradipd commented Nov 13, 2019

@yixwang. You should already have kubelet.err.log. You mentioned it in your original post.

@yixwang
Copy link
Author

yixwang commented Nov 13, 2019

Yes, here it is
kubelet_log.zip

@pradipd
Copy link
Contributor

pradipd commented Nov 13, 2019

What is the output of "ipconfig /all" on the node

@yixwang
Copy link
Author

yixwang commented Nov 13, 2019

Note, the output below is collected on a different node from the one where kubelet_log is collected since the previous cluster is already gone. Nevertheless, both nodes are having the same error where kubelet failed to start. If needed, I can collect the kubelet log from the same node where the ipconfig /all command is run and upload here.

Windows IP Configuration
   Host Name . . . . . . . . . . . . : 2005k8s00000001
   Primary Dns Suffix  . . . . . . . :
   Node Type . . . . . . . . . . . . : Hybrid
   IP Routing Enabled. . . . . . . . : No
   WINS Proxy Enabled. . . . . . . . : No
   DNS Suffix Search List. . . . . . : uy13f1yvzouu5lfst0vn3ubbhh.bx.internal.cloudapp.net
Ethernet adapter Ethernet 2:
   Connection-specific DNS Suffix  . : uy13f1yvzouu5lfst0vn3ubbhh.bx.internal.cloudapp.net
   Description . . . . . . . . . . . : Microsoft Hyper-V Network Adapter #2
   Physical Address. . . . . . . . . : 00-0D-3A-8A-69-D2
   DHCP Enabled. . . . . . . . . . . : Yes
   Autoconfiguration Enabled . . . . : Yes
   Link-local IPv6 Address . . . . . : fe80::5a8:ce73:52e1:3b32%7(Preferred)
   IPv4 Address. . . . . . . . . . . : 10.240.0.218(Preferred)
   Subnet Mask . . . . . . . . . . . : 255.240.0.0
   Lease Obtained. . . . . . . . . . : Tuesday, November 12, 2019 2:35:41 AM
   Lease Expires . . . . . . . . . . : Sunday, December 21, 2155 2:42:09 AM
   Default Gateway . . . . . . . . . : 10.240.0.1
   DHCP Server . . . . . . . . . . . : 168.63.129.16
   DHCPv6 IAID . . . . . . . . . . . : 218107194
   DHCPv6 Client DUID. . . . . . . . : 00-01-00-01-25-5B-D1-E8-00-0D-3A-F7-99-46
   DNS Servers . . . . . . . . . . . : 168.63.129.16
   NetBIOS over Tcpip. . . . . . . . : Enabled
Ethernet adapter vEthernet (nat):
   Connection-specific DNS Suffix  . :
   Description . . . . . . . . . . . : Hyper-V Virtual Ethernet Adapter
   Physical Address. . . . . . . . . : 00-15-5D-8A-D7-4D
   DHCP Enabled. . . . . . . . . . . : No
   Autoconfiguration Enabled . . . . : Yes
   Link-local IPv6 Address . . . . . : fe80::e45e:f01:d1d9:83a4%11(Preferred)
   IPv4 Address. . . . . . . . . . . : 172.17.208.1(Preferred)
   Subnet Mask . . . . . . . . . . . : 255.255.240.0
   Default Gateway . . . . . . . . . :
   DHCPv6 IAID . . . . . . . . . . . : 184554845
   DHCPv6 Client DUID. . . . . . . . : 00-01-00-01-25-5B-D1-E8-00-0D-3A-F7-99-46
   DNS Servers . . . . . . . . . . . : fec0:0:0:ffff::1%1
                                       fec0:0:0:ffff::2%1
                                       fec0:0:0:ffff::3%1
   NetBIOS over Tcpip. . . . . . . . : Enabled

@pradipd
Copy link
Contributor

pradipd commented Nov 13, 2019

Can I have the kubelet logs from this node?

@yixwang
Copy link
Author

yixwang commented Nov 14, 2019

@pradipd Here's the one from the same node, 2005k8s00000001:
2005k8s00000001_kubelet_log.zip

@pradipd
Copy link
Contributor

pradipd commented Nov 14, 2019

It looks like the script can't find the IP address for Ethernet 3. Can you run the following in powershell?
$na = @(Get-NetAdapter -Physical)
$na
$managementIP = (Get-NetIPAddress -ifIndex `$na[0].ifIndex -AddressFamily IPv4).IPAddress
$managementIP
$adapterName = $na[0].Name
write-host "Using adapter $adapterName with IP address $managementIP"

@jackfrancis jackfrancis modified the milestones: v0.44.0, v0.45.0 Nov 20, 2019
@yixwang
Copy link
Author

yixwang commented Nov 24, 2019

Hi @pradipd , sorry for taking a bit long to get back. I upgraded to 0.44.0 which has the fix for increasing timeout. But I am still seeing 2 of 3 windows nodes missing.
image
image

Note, I noticed that we do have "acceleratedNetworkingEnabledWindows": true enabled. for some reason, the apimode.json I had in description doesn't have it. Not sure if this play a role in causing the issue through.

Here's the kubelet error log from one of the missing node 4142k8s00000000:
k_log2.zip

Here's the output of ipconfig /all

PS C:\> ipconfig /all

Windows IP Configuration

   Host Name . . . . . . . . . . . . : 4142k8s00000000
   Primary Dns Suffix  . . . . . . . :
   Node Type . . . . . . . . . . . . : Hybrid
   IP Routing Enabled. . . . . . . . : No
   WINS Proxy Enabled. . . . . . . . : No
   DNS Suffix Search List. . . . . . : gyfn2kuqnsie3e1ewgrbnpiwdc.xx.internal.cloudapp.net

Ethernet adapter Ethernet 2:

   Connection-specific DNS Suffix  . : gyfn2kuqnsie3e1ewgrbnpiwdc.xx.internal.cloudapp.net
   Description . . . . . . . . . . . : Microsoft Hyper-V Network Adapter #2
   Physical Address. . . . . . . . . : 00-0D-3A-C2-EC-B3
   DHCP Enabled. . . . . . . . . . . : Yes
   Autoconfiguration Enabled . . . . : Yes
   Link-local IPv6 Address . . . . . : fe80::accb:f0f1:de04:eef2%4(Preferred)
   IPv4 Address. . . . . . . . . . . : 10.240.0.187(Preferred)
   Subnet Mask . . . . . . . . . . . : 255.240.0.0
   Lease Obtained. . . . . . . . . . : Sunday, November 24, 2019 6:30:22 AM
   Lease Expires . . . . . . . . . . : Thursday, January 1, 2156 1:41:37 AM
   Default Gateway . . . . . . . . . : 10.240.0.1
   DHCP Server . . . . . . . . . . . : 168.63.129.16
   DHCPv6 IAID . . . . . . . . . . . : 218107194
   DHCPv6 Client DUID. . . . . . . . : 00-01-00-01-25-6B-DA-E9-00-0D-3A-F7-99-46
   DNS Servers . . . . . . . . . . . : 168.63.129.16
   NetBIOS over Tcpip. . . . . . . . : Enabled

Ethernet adapter vEthernet (nat):

   Connection-specific DNS Suffix  . :
   Description . . . . . . . . . . . : Hyper-V Virtual Ethernet Adapter
   Physical Address. . . . . . . . . : 00-15-5D-4B-4D-1A
   DHCP Enabled. . . . . . . . . . . : No
   Autoconfiguration Enabled . . . . : Yes
   Link-local IPv6 Address . . . . . : fe80::41b6:2a0d:9dd4:9f72%11(Preferred)
   IPv4 Address. . . . . . . . . . . : 172.21.0.1(Preferred)
   Subnet Mask . . . . . . . . . . . : 255.255.240.0
   Default Gateway . . . . . . . . . :
   DHCPv6 IAID . . . . . . . . . . . : 184554845
   DHCPv6 Client DUID. . . . . . . . : 00-01-00-01-25-6B-DA-E9-00-0D-3A-F7-99-46
   DNS Servers . . . . . . . . . . . : fec0:0:0:ffff::1%1
                                       fec0:0:0:ffff::2%1
                                       fec0:0:0:ffff::3%1
   NetBIOS over Tcpip. . . . . . . . : Enabled

Here's the output of the commands from your last comment: $na = @(Get-NetAdapter -Physical).... Note, I think you have an extra ` in the third command before $na[0].ifIndex, So I removed it when I was running the command it, let me know if it was otherwise intended:

PS C:\> $na = @(Get-NetAdapter -Physical)
PS C:\> $na

Name                      InterfaceDescription                    ifIndex Status       MacAddress             LinkSpeed
----                      --------------------                    ------- ------       ----------             ---------
Ethernet 3                Mellanox ConnectX-3 Virtual Function...       6 Up           00-0D-3A-C2-EC-B3        40 Gbps
Ethernet 2                Microsoft Hyper-V Network Adapter #2          4 Up           00-0D-3A-C2-EC-B3        40 Gbps


PS C:\> $managementIP = (Get-NetIPAddress -ifIndex $na[0].ifIndex -AddressFamily IPv4).IPAddress
Get-NetIPAddress : No matching MSFT_NetIPAddress objects found by CIM query for instances of the ROOT/StandardCimv2/MSFT_NetIPAddress class on the  CIM server: SELECT * FROM MSFT_NetIPAddress  WHERE ((InterfaceIndex = 6)) AND
((AddressFamily = 2)). Verify query parameters and retry.
At line:1 char:18
+ ... gementIP = (Get-NetIPAddress -ifIndex $na[0].ifIndex -AddressFamily I ...
+                 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : ObjectNotFound: (MSFT_NetIPAddress:String) [Get-NetIPAddress], CimJobException
    + FullyQualifiedErrorId : CmdletizationQuery_NotFound,Get-NetIPAddress

PS C:\> $managementIP
PS C:\> $adapterName = $na[0].Name
PS C:\> write-host "Using adapter $adapterName with IP address $managementIP"
Using adapter Ethernet 3 with IP address

@pradipd
Copy link
Contributor

pradipd commented Nov 25, 2019

This issue is not because of the timeout.
The acceleratedNetworkingEnabledWindows is the issue. It is adding Ethernet 3 (Mellanox …). I will have to investigate more on why this does not have an IPv4 address. Additionally, I'll have to check on how we handle multiple NIC's.

@jackfrancis jackfrancis modified the milestones: v0.45.0, v0.46.0 Dec 16, 2019
@jackfrancis jackfrancis modified the milestones: v0.46.0, Next Dec 18, 2019
@stale
Copy link

stale bot commented Feb 16, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Feb 16, 2020
@stale stale bot closed this as completed Feb 23, 2020
@craiglpeters
Copy link
Contributor

This is still an issue for some teams. @jackfrancis and @marosset can we fix this

@craiglpeters craiglpeters reopened this Aug 11, 2020
@stale stale bot removed the stale label Aug 11, 2020
@marosset
Copy link
Contributor

The Windows team determined this was an issue with the Mellanox drivers and we need to get new drivers for accelerated networking to work reliably in these scenarios.

Let me try and get the status of the drivers and see if there is a date the new drivers would be included in the Windows VHDs available in Azure.

@craiglpeters
Copy link
Contributor

@marosset did you get any update on the drivers? cc: @immuzz

@marosset
Copy link
Contributor

marosset commented Sep 3, 2020

I have not. @daschott do you know when new Mellanox drivers will be available in the Azure marketplace images?

@jackfrancis
Copy link
Member

ping @daschott

@daschott
Copy link

The driver itself has been released:
https://docs.mellanox.com/display/WINOFv55053000/Bug+Fixes+in+this+Version

I forwarded our last discussions on this over e-mail. I unfortunately don't have many insights into Azure qualification procedure :( Let me know if there is anything in addition that is being asked.

@immuzz
Copy link

immuzz commented Dec 4, 2020

@marosset @craiglpeters @jsturtevant any update when this will be fixed since the Windows drivers are fixed?

@jsturtevant
Copy link
Contributor

@immuzz I think we are waiting for the drivers to be inside the Images produced in the Azure Marketplace. Do you know if these components are in the those images?

@romina2001
Copy link

@jsturtevant any news on the new images containing the drivers for windows node pools where accelerate networking can be enabled?

@jsturtevant
Copy link
Contributor

@romina2001 the driver has been released but has not been validated in k8s and doesn't not come installed in the base images AFAIK. If you have the ability you could give it a test run and report back. This is currently on our backlog to test via Azure cluster api

@bridgetkromhout
Copy link
Contributor

Closing as this was fixed in #4585.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
backlog bug Something isn't working windows
Projects
None yet
Development

No branches or pull requests