Regression with acs 0.14.X when using Calico #2607
Comments
Thank you for opening the issue @khaldoune. I've deployed a Calico cluster successfully (with both versions 0.14.5 and 0.15.0), so it's most likely not a general Calico issue; more likely something else in your apimodel isn't compatible with Calico. Let me try to find out what it is and get back to you. Let me know if you make any progress/discoveries on your side. Thank you! |
@dtzar fyi |
Thanks for your responsiveness. Do you have a custom DNS in your case? If not, I will try tomorrow without it and get back to you. |
I do not. So maybe that'd be a good test: try to deploy a Calico cluster (https://github.com/Azure/acs-engine/blob/master/examples/networkpolicy/kubernetes-calico.json) with your custom dns and let me know what the outcome is. |
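For reference, a minimal sketch of the flow I'd use to test that example (assuming the standard acs-engine generate + ARM deployment workflow; the resource group, location and output paths below are placeholders):

```bash
# Sketch only: generate ARM templates from the example apimodel and deploy them.
# Names and paths are placeholders; adjust for your subscription/VNet/DNS setup.
acs-engine generate examples/networkpolicy/kubernetes-calico.json

az group create --name calico-test-rg --location westeurope
az group deployment create \
  --resource-group calico-test-rg \
  --template-file _output/<dnsPrefix>/azuredeploy.json \
  --parameters @_output/<dnsPrefix>/azuredeploy.parameters.json
```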
@CecileRobertMichon @jackfrancis @dtzar Hi, I've found the parameter that introduces this regression: enablePodSecurityPolicy. Now we can start trying to find out how this parameter is incompatible with Calico and acs 0.14.x. Furthermore, I could not find enablePodSecurityPolicy in the cluster definition documentation: https://github.com/Azure/acs-engine/blob/master/docs/clusterdefinition.md Thanks for your assistance. |
@khaldoune thanks for catching that! The documentation gap is definitely a miss, I will fix it today. Let's investigate to find out why it is incompatible. For reference here are the two relevant PRs: #2048, #2125 |
fyi @pidah who implemented enablePodSecurityPolicy and might have more insight on why it's not compatible with calico |
@pidah @CecileRobertMichon @jackfrancis kubectl get psp and kubectl get clusterrole | grep privileged give no results, which is very strange. Also, the folder /etc/cni is still empty and the kubelet complains about it... Are you able to reproduce? This issue does not seem to be related to the custom DNS anymore. |
I am not sure why this is happening, but I suspect the default restricted policy applied might need tweaking. The nodes are not ready, but is the kubelet container running on the nodes, and do you see anything in its logs? If the API server is running you might see some clues in its logs too. @khaldoune @CecileRobertMichon @jackfrancis |
@pidah @jackfrancis @CecileRobertMichon The kubelet starts but complains about the missing /etc/cni content (no net.d folder), so no pod can start without CNI, including the API server and the other master components. There is no pod running on my cluster. Where in the code does the CNI folder get populated? Add to that the fact that there is no PSP and no clusterrole, so I don't think the PSP itself is the problem. I'm confused. |
I'm not in front of my laptop, but I'm thinking about what I've said: how can I get responses from kubectl if the API server is down? So the API server is not down... then why does kubectl get pods --all-namespaces say there are no resources? :( Very, very confused. |
From what I have understood, the net.d directory can be created by docker if a privileged container tries to mount it. |
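A quick way to see that behaviour, for what it's worth (illustrative paths; Docker creates a missing host directory when it is used as a bind-mount source):

```bash
# Illustration only: run on a node where /etc/cni/net.d does not exist yet.
sudo ls /etc/cni/net.d                                        # "No such file or directory"
sudo docker run --rm -v /etc/cni/net.d:/host/net.d alpine ls /host/net.d
sudo ls -ld /etc/cni/net.d                                    # the directory now exists (empty)
```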
@khaldoune the kubernetes API server (and other control plane components) are deployed as static pods managed by the kubelet on each master node. As long as the API dependencies like docker and etcd are running, the API server will start up on each master node and will respond to kubectl requests. You can ssh into a master node and look at the API server docker container logs; some other master components' docker containers may be running too, and you can look at their logs as well. |
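Concretely, something along these lines on a master node (a sketch only; container names, unit names and paths can differ per acs-engine version):

```bash
# Diagnostic sketch, run over ssh on a master node.

# Is the kubelet complaining about CNI?
sudo journalctl -u kubelet --no-pager | grep -i cni | tail -n 20

# Which control-plane containers are running?
sudo docker ps --format '{{.Names}}' | grep -E 'apiserver|controller-manager|scheduler'

# API server logs, if its container is up (the name pattern is an assumption).
APISERVER=$(sudo docker ps -q --filter name=k8s_kube-apiserver | head -n 1)
[ -n "$APISERVER" ] && sudo docker logs --tail 50 "$APISERVER"

# Does the API server answer health checks?
kubectl get --raw /healthz
```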
Thanks for your answer. As stated before, kubectl get psp returns no resources. If you kubectl apply -f /etc/kubernetes/manifests/pod-security-policy.yaml and wait about 10 minutes, the cluster becomes ready 🥇 I still don't know why this file does not get applied. Is it intended to be applied by the kubelet (via the --pod-manifest-path param)? |
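For anyone else hitting this, the workaround in full (the manifest path is the one acs-engine writes on the masters, as noted above; run it from a master node or anywhere kubectl is configured):

```bash
# Workaround sketch: apply the PSP manifest that was generated but never applied,
# then watch the nodes come up (took roughly 10 minutes here).
kubectl apply -f /etc/kubernetes/manifests/pod-security-policy.yaml
kubectl get psp        # should now return the policies from that manifest
kubectl get nodes -w   # wait until the nodes report Ready
```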
@pidah @CecileRobertMichon @jackfrancis I think that I have an explanation: Starting from 0.14, ensureApiserver has been replaced by ensureK8s that checks for nodes readiness. Nodes cannot be ready if there is no network installed and pods cannot be created if there is no PSP (since this admission controler is activated). If you create the PSPs in a failing cluster, it will be unlocked after 10 minutes and nodes will be ready. I've confirmed this reasoning by changing the core of ensureApiserver in acs 0.13.1 by the code of ensureK8s... The deployment has failed. Suggestion: I suggest to ensureApiServer as before, then ensure podsecuritypolicy, then wait a while and then ensure k8s. References:
master:
Could you please validate, putch and release a 0.15.2? Thanks. |
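To make the suggested ordering concrete, here is a rough bash sketch in the spirit of the master provisioning script; the function bodies are illustrative only (the real ensureApiserver/ensureK8s implementations differ), and the manifest path is the one mentioned above:

```bash
#!/bin/bash
# Illustrative only -- not the actual acs-engine provisioning code.

ensureApiserver() {
    # Pre-0.14 behaviour: just wait until the API server responds.
    for _ in $(seq 1 120); do
        kubectl get --raw /healthz >/dev/null 2>&1 && return 0
        sleep 5
    done
    return 1
}

ensurePodSecurityPolicy() {
    # Proposed extra step: apply the default PSPs as soon as the API server is
    # up, otherwise no pod (including calico-node) can be admitted.
    kubectl apply -f /etc/kubernetes/manifests/pod-security-policy.yaml
}

ensureK8s() {
    # Only now wait for node readiness; nodes cannot go Ready before the CNI
    # pods are allowed to run.
    for _ in $(seq 1 120); do
        kubectl get nodes --no-headers 2>/dev/null | awk '{print $2}' | grep -qx Ready && return 0
        sleep 10
    done
    return 1
}

ensureApiserver && ensurePodSecurityPolicy && ensureK8s
```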
@khaldoune ah nice catch...agreed – psp needs to be applied immediately after the API server is up, but before node healthchecks. @CecileRobertMichon @jackfrancis |
@khaldoune Thanks! I'll have a PR today that we can test for a possible inclusion in a patch release. |
Is this a request for help?:
YES
Is this an ISSUE or FEATURE REQUEST? (choose one):
ISSUE
What version of acs-engine?:
0.14.X
Orchestrator and version (e.g. Kubernetes, DC/OS, Swarm)
Kubernetes 1.9.x
What happened:
Nodes are not ready
What you expected to happen:
Nodes to be ready
How to reproduce it (as minimally and precisely as possible):
@CecileRobertMichon @jackfrancis
Use the following network layout:
vnet_prefix: 198.18.184.0/21
vnet_master_subnet: 198.18.190.0/24
vnet_worker_subnet: 198.18.189.0/24
vnet_master_first_ip: 198.18.190.50
k8s_pod_cidr: 198.18.184.0/22
k8s_service_cidr: 198.18.188.0/23
k8s_dns_service: 198.18.188.10
Here is the cluster definition:
{
"apiVersion": "vlabs",
"properties": {
"orchestratorProfile": {
"orchestratorType": "Kubernetes",
"orchestratorRelease": "1.9",
"orchestratorVersion": "1.9.6",
"kubernetesConfig": {
"networkPolicy": "calico",
"etcdDiskSizeGB": "16",
"enableAggregatedAPIs": true,
"enablePodSecurityPolicy": true,
"EnableRbac": true,
"clusterSubnet": "198.18.184.0/22",
"serviceCidr": "198.18.188.0/23",
"dnsServiceIP": "198.18.188.10",
"kubeletConfig": {
"--event-qps": "0",
"--non-masquerade-cidr": "198.18.184.0/22",
"--authentication-token-webhook": "true"
},
"controllerManagerConfig": {
"--address": "0.0.0.0",
"--profiling": "false",
"--terminated-pod-gc-threshold": "100",
"--node-cidr-mask-size": "27",
"--node-monitor-grace-period": "40s",
"--pod-eviction-timeout": "60s",
"--horizontal-pod-autoscaler-use-rest-clients": "true"
},
"cloudControllerManagerConfig": {
"--profiling": "false"
},
"apiServerConfig": {
"--profiling": "false",
"--repair-malformed-updates": "false",
"--endpoint-reconciler-type": "lease"
},
"addons": [
{
"name": "tiller",
"enabled": false
}
]
}
},
"masterProfile": {
"dnsPrefix": "k8s-noprd",
"vnetCidr": "198.18.190.0/24",
"count": 5,
"vmSize": "Standard_D2_v2",
"OSDiskSizeGB": 80,
"vnetSubnetId": "/subscriptions/xxxxxxxxxxxxxxxxxx/resourceGroups/vpod0a-apps-prd-rg/providers/Microsoft.Network/virtualNetworks/vpod0a-k8s-prd-1-vnet/subnets/master_subnet",
"firstConsecutiveStaticIP": "198.18.190.50",
"preProvisionExtension": {
"name": "setup"
}
},
"agentPoolProfiles": [
{
"name": "wbronze",
"count": 1,
"vmSize": "Standard_D2_v2",
"OSDiskSizeGB": 80,
"availabilityProfile": "AvailabilitySet",
"vnetSubnetId": "/subscriptions/xxxxxxxxxxxxxxxxxx/resourceGroups/vpod0a-apps-prd-rg/providers/Microsoft.Network/virtualNetworks/vpod0a-k8s-prd-1-vnet/subnets/worker_subnet",
"diskSizesGB": [ 50 ],
"StorageProfile": "ManagedDisks",
"preProvisionExtension": {
"name": "setup_node"
}
},
{
"name": "wsilver",
"count": 1,
"vmSize": "Standard_D2_v2",
"OSDiskSizeGB": 80,
"availabilityProfile": "AvailabilitySet",
"vnetSubnetId": "/subscriptions/xxxxxxxxxxxxxxxxxx/resourceGroups/vpod0a-apps-prd-rg/providers/Microsoft.Network/virtualNetworks/vpod0a-k8s-prd-1-vnet/subnets/worker_subnet",
"diskSizesGB": [ 50 ],
"StorageProfile": "ManagedDisks",
"preProvisionExtension": {
"name": "setup_node"
}
},
{
"name": "wgold",
"count": 1,
"vmSize": "Standard_D2_v2",
"OSDiskSizeGB": 80,
"availabilityProfile": "AvailabilitySet",
"vnetSubnetId": "/subscriptions/xxxxxxxxxxxxxxxxxx/resourceGroups/vpod0a-apps-prd-rg/providers/Microsoft.Network/virtualNetworks/vpod0a-k8s-prd-1-vnet/subnets/worker_subnet",
"diskSizesGB": [ 50 ],
"StorageProfile": "ManagedDisks",
"preProvisionExtension": {
"name": "setup_node"
}
},
{
"name": "wplatin",
"count": 1,
"vmSize": "Standard_D2_v2",
"OSDiskSizeGB": 80,
"availabilityProfile": "AvailabilitySet",
"vnetSubnetId": "/subscriptions/xxxxxxxxxxxxxxxxxx/resourceGroups/vpod0a-apps-prd-rg/providers/Microsoft.Network/virtualNetworks/vpod0a-k8s-prd-1-vnet/subnets/worker_subnet",
"diskSizesGB": [ 50 ],
"StorageProfile": "ManagedDisks",
"preProvisionExtension": {
"name": "setup_node"
}
},
{
"name": "wdiamond",
"count": 1,
"vmSize": "Standard_D2_v2",
"OSDiskSizeGB": 80,
"availabilityProfile": "AvailabilitySet",
"vnetSubnetId": "/subscriptions/xxxxxxxxxxxxxxxxxx/resourceGroups/vpod0a-apps-prd-rg/providers/Microsoft.Network/virtualNetworks/vpod0a-k8s-prd-1-vnet/subnets/worker_subnet",
"diskSizesGB": [ 50 ],
"StorageProfile": "ManagedDisks",
"preProvisionExtension": {
"name": "setup_node"
}
}
],
"linuxProfile": {
"adminUsername": "k8s",
"ssh": {
"publicKeys": [
{
"keyData": "ssh-rsa xxxxxxxxxxxxxxxxxx"
}
]
}
},
"servicePrincipalProfile": {
"clientId": "xxxxxxxxxxxxxxxxxx",
"secret": "xxxxxxxxxxxxxxxxxx"
},
"extensionProfiles": [
{
"name": "setup_node",
"version": "v1",
"script": "setup.sh",
"rootURL": "https://gitlab.com/octo-carrefour-k8s/acs-extensions/raw/master/",
"extensionParameters": "198.18.192.4 k8s-noprd.xpod.carrefour.com"
},
{
"name": "setup",
"version": "v1",
"script": "setup.sh",
"rootURL": "https://gitlab.com/octo-carrefour-k8s/acs-extensions/raw/master/",
"extensionParameters": "198.18.192.4 k8s-noprd.xpod.carrefour.com"
}
]
}
}
Anything else we need to know:
Deployment is successful with Azure CNI
Deployment fails with one subnet for both masters and workers
I'm using 2 custom DNS servers in my VNet, plus extensions to register the VMs in the DNS: https://gitlab.com/octo-carrefour-k8s/acs-extensions