nslookup (DNS) for Windows services fails after stopping VMs and restarting a hybrid Kubernetes cluster in Azure #1903
Comments
@jackfrancis @JiangtianLi Any idea how or when this might be resolved? At the moment a hybrid cluster becomes unusable once the VMs are stopped after the initial creation. |
@douglaswaights Can you help to collect more info? From the windows container: |
Hi, sure. I had to create a new cluster though, as I removed the old one. The new one starts up as expected, with the output below:

Name            Type   TTL  Section  NameHost
www.bing.com    CNAME  6    Answer   www-bing-com.a-0001.a-msedge.net
Name : a-0001.a-msedge.net
Name : a-0001.a-msedge.net
Name : a-msedge.net

root@nginx:/# nslookup iis-1-svc
Name: iis-1-svc.default.svc.cluster.local
root@nginx:/# ping iis-1-svc
root@nginx:/#

Unfortunately, after I stop the VMs and then restart them, the Windows VM (despite showing as running in the Azure portal) never seems to get past the Not Ready state in Kubernetes (see below).

C:\git\dm\kubernetes\svc (develop -> origin)

The original cluster at least managed to have the Windows node running after a VM restart, although DNS obviously wasn't working. Anything else I can try? Earlier today I went back to acs-engine 0.8 with the previous version of Windows Server and Kubernetes 1.7.9 (I think it was), and that behaved better after restarting the VMs. I don't know whether the problem is with the new acs-engine, Kubernetes 1.8, Windows RS3, or something else. |
@douglaswaights The issue appears to be with RS3 Windows and the kubelet on that Windows node. If you can RDP to Windows, can you check |
Can you point me in the direction of how to RDP in? Connect doesn't seem to be available/enabled in the usual manner for a VM running in the portal... I've tried adding a new Windows VM to the same vnet so I can RDP into that from my local machine, with the idea of then RDPing into the Windows node from there... the ports, rules, NSG etc. look ok, but it still doesn't want to let me in... I guess I'm missing something. |
@douglaswaights you can establish an RDP connection to your Windows host with a simple SSH tunnel through the master node. Here is how:
Get the Windows node hostname and write down the name of the Windows node.
Establish an SSH tunnel to the Windows host through the master node to gain RDP access to the Windows node (or an equivalent setup if you are using a Windows tool like PuTTY).
From that point you should be able to establish an RDP connection to localhost:33890; it will be redirected to your Windows host.
Beware: this is a Windows Server Core machine, so don't expect any fancy GUI out there... |
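For reference, a minimal sketch of that tunnel; the node name, master FQDN and Linux admin user below are placeholders (adapt them to your cluster, e.g. the dnsPrefix from your api model):

kubectl get nodes -o wide
# note the Windows node's hostname, then open the tunnel: local port 33890 -> <windows-node>:3389 via the master
ssh -L 33890:<windows-node-hostname>:3389 azureuser@<dns-prefix>.<region>.cloudapp.azure.com
# keep the session open and point your RDP client at localhost:33890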
@JiangtianLi I have the very same issue as @douglaswaights, except that I don't even have to reboot anything. Windows pods are unable to resolve any public or cluster (service) IP. Here are my outputs:
And here are my Kubelet logs: |
Thanks a lot @odauby - very helpful! @JiangtianLi here is the service status (SERVICE_NAME: kubelet) and here are the logs. Today I spun the cluster up and the nodes all came up, but again the DNS wasn't working. Yesterday I spun it up one time and the stars aligned and my services could communicate. Would be great to get this fixed! |
@douglaswaights @odauby It appears that kubelet is running on the Windows node, and from the kubelet logs I didn't see a failure to start kubelet. So does the "Not Ready" issue still exist, or is it only the DNS issue? For the DNS issue, can you run a lookup from inside the Windows pod and share the output? |
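A sketch of the kind of lookup meant here, run from inside the Windows pod (the 10.0.0.10 kube-dns cluster IP and the iis-2-svc service name are taken from later comments in this thread):

kubectl exec -it <windows-pod-name> -- powershell
# then, inside the pod:
Resolve-DnsName www.bing.com
Resolve-DnsName iis-2-svc.default.svc.cluster.local -Server 10.0.0.10
nslookup iis-2-svc.default.svc.cluster.local 10.0.0.10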
Looks like the DNS query times out, and no, waiting 15 minutes does not improve the situation on my side. The commands below were executed from a Windows pod:
|
I've fired up the cluster again and this time all the nodes are in the Ready state. However, I have the same problem with the DNS. Waiting 15 minutes does not help.

Windows IP Configuration
Host Name . . . . . . . . . . . . : iis-1709-1
Ethernet adapter vEthernet (a5135afeb2fc1e344f71a02117ba9b5ba9b8daa8d7fb469aac3fe2286f8138be_l2bridge):
Connection-specific DNS Suffix . :

nslookup iis-2-svc.default.svc.cluster.local
DNS request timed out.

PS C:\> Resolve-DnsName www.bing.com
PS C:\>

I also tried creating new pods and services after waiting for 20 minutes or so, to see if that made a difference with something changing in the Windows node networking, but got the same result. |
This looks like a different issue. Does Test-NetConnection to the DNS service IP work from inside the container? |
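For context, a minimal sketch of such a connectivity check from inside the Windows container; 10.0.0.10 is the kube-dns cluster IP, and the ports are assumptions (53 for DNS, 80 for the probe discussed below):

Test-NetConnection -ComputerName 10.0.0.10 -Port 53   # TCP reachability of the cluster DNS service
Test-NetConnection -ComputerName 10.0.0.10 -Port 80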
Test-NetConnection fails in the container:
ComputerName : 10.0.0.10

Inside the Windows node:
PS C:\Users\sdm> Get-HnsEndpoint
ActivityId : 9b63aa27-a6d2-4ea0-bb30-3da5f4f913e0
ActivityId : 98875e14-c13a-4362-93ff-882afee7de45
ActivityId : dbccdc66-5354-41b7-b6a5-e4a6fd3ea7c3
ActivityId : 6650c5bf-b80e-4ddb-b00a-d2ecc1bde529
ActivityId : 3c37a58f-f7ab-4491-b01b-fcc999c71c80
ActivityId : 26777281-0024-4d87-ab6e-69e31e5e5582
ActivityId : b4171851-4987-4ecb-89ff-94baee7dd576
ActivityId : 15aef9d6-453a-4ef1-a4db-9c1185f10256
ActivityId : f8da6e9f-b5ff-4ff1-ab2f-64dda6087702
ActivityId : cf5ad1fb-232a-417b-a114-4ef4aecfdf40

PS C:\Users\sdm> Get-HNSNetwork
ActivityId : 42102725-f95c-4370-b01e-0819bd367057
ActivityId : 8bbb780a-30ce-4c3a-93ef-bcb99e1e3bdb

PS C:\Users\sdm> |
@JiangtianLi Here is my input. TL;DR: I don't think the problem is DNS, but TCP and maybe even IP routing!
Facts I observed:
Assumption:
Before the reboot:
Windows pod default gateway is in another subnet, this is weird.
Port TCP/80 looks closed on 10.0.0.10:
I assume you wanted to see if port 53 was open, right? (DNS uses UDP/53 and TCP/53.)
Service short name resolution fails:
Service FQDN resolution and TCP handshaking work:
On Linux pods, service short names do resolve:
This is because they have proper search suffixes:
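(For context, a sketch of how those suffixes can be checked from a Linux pod; the values in the comments are the usual Kubernetes defaults, not this cluster's captured output.)

kubectl exec -it <linux-pod-name> -- cat /etc/resolv.conf
# typically:
#   nameserver 10.0.0.10
#   search default.svc.cluster.local svc.cluster.local cluster.local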
Get-HnsEndpoint on Windows host:
Get-HNSNetwork on Windows host:
After the reboot:
Windows pods still have no DNS suffix and a weird gateway:
Windows pods can't reach TCP/80 on 10.0.0.10 (but I assume this is ok):
But now, they can't reach TCP/53 on 10.0.0.10 anymore:
Service short names are still an issue:
Service long names, too:
Even connection to the service ip address fails:
While Linux pods do not have any DNS or IP issue:
Get-HnsEndpoint on Windows host:
Get-HNSNetwork on Windows host:
Kind regards, O. |
@odauby I see the exact same thing with an acs-engine build from master today and Kubernetes 1.9.1. @JiangtianLi Let me know if you need more debug information regarding this issue. I will be happy to help. |
I'm also having the issue described. I originally deployed a hybrid Kubernetes cluster running 1.8.4 and everything worked great for a week, but then I had to reboot the Windows machines, and after the restart DNS and TCP stopped working. I've also tried upgrading to version 1.9.1 to see if it would solve the problem, but no luck there. However, in my case it is only related to outgoing traffic: I have a service that points to a pod running IIS, and it can serve traffic fine, but only for pages that do not try to access an external database or the like, for obvious reasons. I'm not sure if the same applies for @brunsgaard. I've also noticed that if I log in to the host via RDP, I cannot ping any external addresses; it properly resolves the DNS, and if I run
|
@KaptenMorot For the issue with reboot, we are aware of it and are working with the Windows team on it. Meanwhile, one mitigation is to restart the HNS network on Windows, e.g. as sketched below. For the ping issue, I think the ping packets are blocked from the Azure VM node; I can't ping www.google.com from the master node either. |
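A hedged sketch of that mitigation, since the exact commands were not included above; it assumes restarting the hns Windows service and then the kubelet service (which runs as a Windows service here, per the SERVICE_NAME: kubelet output earlier) is what was meant:

# in an elevated PowerShell session on the Windows node
Restart-Service hns -Force    # restart the Host Network Service
Restart-Service kubelet       # assumption: restarting kubelet afterwards so pod networking state is rebuilt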
Just tried with the freshly released acs-engine v0.12.0, same result.
|
@odauby There is currently an issue with service VIPs on Windows nodes, so using the pod IP instead of the cluster IP is indeed the workaround. We are going to roll out the patch ASAP. |
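For reference, a minimal sketch of that workaround (the label and names are hypothetical):

kubectl get pods -o wide -l app=iis-1    # note the pod IP in the IP column
# then, from the Windows client pod, target the pod IP directly instead of the iis-1-svc cluster IP, e.g.:
# Invoke-WebRequest -UseBasicParsing http://<pod-ip>/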
@JiangtianLi - we are (intermittently) running into the issue where our Windows containers cannot communicate with service IPs. This issue can occur on a fresh node that didn't previously have an HNS interface created. Do you have any more information on the patch that you referenced in your Jan 17th comment? Thanks. Our environment:
|
@jbiel Can you use https://github.com/Microsoft/SDN/blob/master/Kubernetes/windows/hns.psm1 to run the HNS diagnostics on the Windows node and share the output? |
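A sketch of how that module might be used; the specific cmdlets asked for were lost from the quoted comment, so this falls back to the Get-HnsNetwork/Get-HnsEndpoint calls used earlier in the thread (Get-HnsPolicyList is assumed to be exported by the module as well):

# on the Windows node
Invoke-WebRequest -UseBasicParsing https://github.com/Microsoft/SDN/raw/master/Kubernetes/windows/hns.psm1 -OutFile hns.psm1
Import-Module .\hns.psm1
Get-HnsNetwork
Get-HnsEndpoint
Get-HnsPolicyList   # assumption: service VIP policies show up here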
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contribution. Note that acs-engine is deprecated; see https://github.com/Azure/aks-engine instead. |
Hi!
Is this a request for help?:
yes
Is this an ISSUE or FEATURE REQUEST? (choose one):
issue
What version of acs-engine?:
0.10.0
Orchestrator and version (e.g. Kubernetes, DC/OS, Swarm)
kubernetes version 1.8.4 with acs-engine 0.10
What happened:
I have created a hybrid Kubernetes cluster with acs-engine (1 master, 1 Linux node, 1 Windows node), and to begin with, the out-of-the-box Windows IIS 1709 pods can see each other through nslookup after exec'ing into the pods (although I have to use the FQDN, as per bug Azure/ACS#94). Everything works as expected.
I then shut down the VMs in the cluster and later restart them. Now nslookup fails to see the Windows pods from one to the other if I exec into them again. If I deploy nginx on the Linux node and expose it with a LoadBalancer service, it is visible fine from the outside world.
What you expected to happen:
The cluster should return to its original state as it was when created, with DNS and service discovery working.
How to reproduce it (as minimally and precisely as possible):
Spin up a hybrid cluster in Azure and add a couple of IIS pods and the corresponding services for each. Confirm they can see each other (see the sketch below). Turn off the VMs. Turn them back on again; although everything looks OK on the surface, DNS is now broken.
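A minimal sketch of that reproduction; the deployment/service names mirror the iis-1-svc/iis-2-svc names used in the comments above, and the IIS image tag is an assumption:

kubectl run iis-1 --image=microsoft/iis:windowsservercore-1709 --port=80
kubectl run iis-2 --image=microsoft/iis:windowsservercore-1709 --port=80
# note: in a hybrid cluster these deployments also need a nodeSelector (beta.kubernetes.io/os: windows) so they land on the Windows node
kubectl expose deployment iis-1 --name=iis-1-svc --port=80
kubectl expose deployment iis-2 --name=iis-2-svc --port=80
kubectl exec -it <iis-1-pod-name> -- nslookup iis-2-svc.default.svc.cluster.local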
Anything else we need to know:
Can you explain to me why this might happen? I presume I should be able to spin down the VMs and bring them back up later, i.e. the cluster doesn't have to be always up.
Can you help me get the DNS working again and troubleshoot? It doesn't help if I stagger the re-launch order of the VMs.
Thanks
Doug
kubernetes-hybrid.json below, created in Azure NorthEurope:
{
"apiVersion": "vlabs",
"properties": {
"orchestratorProfile": {
"orchestratorType": "Kubernetes",
"kubernetesConfig": {
"addons": [
{
"name": "tiller",
"enabled": true
},
{
"name": "kubernetes-dashboard",
"enabled": true
}
],
"enableRbac": true
},
"orchestratorRelease":"1.8"
},
"masterProfile": {
"count": 1,
"dnsPrefix": "sdmhybridk8s",
"vmSize": "Standard_D2_v2"
},
"agentPoolProfiles": [
{
"name": "linuxpool1",
"count": 1,
"vmSize": "Standard_D2_v2",
"availabilityProfile": "AvailabilitySet"
},
{
"name": "windowspool2",
"count": 1,
"vmSize": "Standard_D2_v2",
"availabilityProfile": "AvailabilitySet",
"osType": "Windows"
}
],
"windowsProfile": {
"adminUsername": "sdm",
"adminPassword": "redacted"
},
"linuxProfile": {
"adminUsername": "azureuser",
"ssh": {
"publicKeys": [
{
"keyData": "redacted"
}
]
}
},
"servicePrincipalProfile": {
"clientId": "redacted",
"secret": "redacted"
}
}
}