This repository was archived by the owner on Jan 11, 2023. It is now read-only.

Regression with acs 0.14.X when using Calico #2607

Closed
khaldoune opened this issue Apr 5, 2018 · 18 comments · Fixed by #2633
@khaldoune

Is this a request for help?:
YES

Is this an ISSUE or FEATURE REQUEST? (choose one):
ISSUE

What version of acs-engine?:
0.14.X

Orchestrator and version (e.g. Kubernetes, DC/OS, Swarm)
Kubernetes 1.9.x

What happened:
Nodes are not ready

What you expected to happen:
Nodes to be ready

How to reproduce it (as minimally and precisely as possible):

@CecileRobertMichon @jackfrancis

Use the following network layout:

vnet_prefix: 198.18.184.0/21
vnet_master_subnet: 198.18.190.0/24
vnet_worker_subnet: 198.18.189.0/24
vnet_master_first_ip: 198.18.190.50
k8s_pod_cidr: 198.18.184.0/22
k8s_service_cidr: 198.18.188.0/23
k8s_dns_service: 198.18.188.10

Here is the cluster definition:

{
  "apiVersion": "vlabs",
  "properties": {
    "orchestratorProfile": {
      "orchestratorType": "Kubernetes",
      "orchestratorRelease": "1.9",
      "orchestratorVersion": "1.9.6",
      "kubernetesConfig": {
        "networkPolicy": "calico",
        "etcdDiskSizeGB": "16",
        "enableAggregatedAPIs": true,
        "enablePodSecurityPolicy": true,
        "EnableRbac": true,
        "clusterSubnet": "198.18.184.0/22",
        "serviceCidr": "198.18.188.0/23",
        "dnsServiceIP": "198.18.188.10",
        "kubeletConfig": {
          "--event-qps": "0",
          "--non-masquerade-cidr": "198.18.184.0/22",
          "--authentication-token-webhook": "true"
        },
        "controllerManagerConfig": {
          "--address": "0.0.0.0",
          "--profiling": "false",
          "--terminated-pod-gc-threshold": "100",
          "--node-cidr-mask-size": "27",
          "--node-monitor-grace-period": "40s",
          "--pod-eviction-timeout": "60s",
          "--horizontal-pod-autoscaler-use-rest-clients": "true"
        },
        "cloudControllerManagerConfig": {
          "--profiling": "false"
        },
        "apiServerConfig": {
          "--profiling": "false",
          "--repair-malformed-updates": "false",
          "--endpoint-reconciler-type": "lease"
        },
        "addons": [
          {
            "name": "tiller",
            "enabled": false
          }
        ]
      }
    },
    "masterProfile": {
      "dnsPrefix": "k8s-noprd",
      "vnetCidr": "198.18.190.0/24",
      "count": 5,
      "vmSize": "Standard_D2_v2",
      "OSDiskSizeGB": 80,
      "vnetSubnetId": "/subscriptions/xxxxxxxxxxxxxxxxxx/resourceGroups/vpod0a-apps-prd-rg/providers/Microsoft.Network/virtualNetworks/vpod0a-k8s-prd-1-vnet/subnets/master_subnet",
      "firstConsecutiveStaticIP": "198.18.190.50",
      "preProvisionExtension": {
        "name": "setup"
      }
    },
    "agentPoolProfiles": [
      {
        "name": "wbronze",
        "count": 1,
        "vmSize": "Standard_D2_v2",
        "OSDiskSizeGB": 80,
        "availabilityProfile": "AvailabilitySet",
        "vnetSubnetId": "/subscriptions/xxxxxxxxxxxxxxxxxx/resourceGroups/vpod0a-apps-prd-rg/providers/Microsoft.Network/virtualNetworks/vpod0a-k8s-prd-1-vnet/subnets/worker_subnet",
        "diskSizesGB": [ 50 ],
        "StorageProfile": "ManagedDisks",
        "preProvisionExtension": {
          "name": "setup_node"
        }
      },
      {
        "name": "wsilver",
        "count": 1,
        "vmSize": "Standard_D2_v2",
        "OSDiskSizeGB": 80,
        "availabilityProfile": "AvailabilitySet",
        "vnetSubnetId": "/subscriptions/xxxxxxxxxxxxxxxxxx/resourceGroups/vpod0a-apps-prd-rg/providers/Microsoft.Network/virtualNetworks/vpod0a-k8s-prd-1-vnet/subnets/worker_subnet",
        "diskSizesGB": [ 50 ],
        "StorageProfile": "ManagedDisks",
        "preProvisionExtension": {
          "name": "setup_node"
        }
      },
      {
        "name": "wgold",
        "count": 1,
        "vmSize": "Standard_D2_v2",
        "OSDiskSizeGB": 80,
        "availabilityProfile": "AvailabilitySet",
        "vnetSubnetId": "/subscriptions/xxxxxxxxxxxxxxxxxx/resourceGroups/vpod0a-apps-prd-rg/providers/Microsoft.Network/virtualNetworks/vpod0a-k8s-prd-1-vnet/subnets/worker_subnet",
        "diskSizesGB": [ 50 ],
        "StorageProfile": "ManagedDisks",
        "preProvisionExtension": {
          "name": "setup_node"
        }
      },
      {
        "name": "wplatin",
        "count": 1,
        "vmSize": "Standard_D2_v2",
        "OSDiskSizeGB": 80,
        "availabilityProfile": "AvailabilitySet",
        "vnetSubnetId": "/subscriptions/xxxxxxxxxxxxxxxxxx/resourceGroups/vpod0a-apps-prd-rg/providers/Microsoft.Network/virtualNetworks/vpod0a-k8s-prd-1-vnet/subnets/worker_subnet",
        "diskSizesGB": [ 50 ],
        "StorageProfile": "ManagedDisks",
        "preProvisionExtension": {
          "name": "setup_node"
        }
      },
      {
        "name": "wdiamond",
        "count": 1,
        "vmSize": "Standard_D2_v2",
        "OSDiskSizeGB": 80,
        "availabilityProfile": "AvailabilitySet",
        "vnetSubnetId": "/subscriptions/xxxxxxxxxxxxxxxxxx/resourceGroups/vpod0a-apps-prd-rg/providers/Microsoft.Network/virtualNetworks/vpod0a-k8s-prd-1-vnet/subnets/worker_subnet",
        "diskSizesGB": [ 50 ],
        "StorageProfile": "ManagedDisks",
        "preProvisionExtension": {
          "name": "setup_node"
        }
      }
    ],
    "linuxProfile": {
      "adminUsername": "k8s",
      "ssh": {
        "publicKeys": [
          {
            "keyData": "ssh-rsa xxxxxxxxxxxxxxxxxx"
          }
        ]
      }
    },
    "servicePrincipalProfile": {
      "clientId": "xxxxxxxxxxxxxxxxxx",
      "secret": "xxxxxxxxxxxxxxxxxx"
    },
    "extensionProfiles": [
      {
        "name": "setup_node",
        "version": "v1",
        "script": "setup.sh",
        "rootURL": "https://gitlab.com/octo-carrefour-k8s/acs-extensions/raw/master/",
        "extensionParameters": "198.18.192.4 k8s-noprd.xpod.carrefour.com"
      },
      {
        "name": "setup",
        "version": "v1",
        "script": "setup.sh",
        "rootURL": "https://gitlab.com/octo-carrefour-k8s/acs-extensions/raw/master/",
        "extensionParameters": "198.18.192.4 k8s-noprd.xpod.carrefour.com"
      }
    ]
  }
}

Anything else we need to know:
Deployment is successful with Azure CNI

Deployment also fails when using a single subnet for both masters and workers

I'm using two custom DNS servers in my VNET, and extensions to register the VMs in the DNS: https://gitlab.com/octo-carrefour-k8s/acs-extensions

@CecileRobertMichon
Contributor

Thank you for opening the issue @khaldoune. I've deployed a Calico cluster successfully (with both version 0.14.5 and 0.15.0), so it's most likely not a general Calico issue; something else in your apimodel may not be compatible with Calico. Let me try to find out what it is and get back to you. Let me know if you make any progress/discoveries on your side. Merci!

@CecileRobertMichon
Contributor

@dtzar fyi

@khaldoune
Author

khaldoune commented Apr 5, 2018

@CecileRobertMichon

Thanks for your responsiveness.

Do you have a custom dns in your case? If not, I will try tomorrow without it and come back to you.

@CecileRobertMichon
Contributor

I do not. So maybe that'd be a good test: try to deploy a Calico cluster (https://github.com/Azure/acs-engine/blob/master/examples/networkpolicy/kubernetes-calico.json) with your custom dns and let me know what the outcome is.

@khaldoune
Author

@CecileRobertMichon @jackfrancis @dtzar

Hi,

I've found the parameter that introduces this regression: enablePodSecurityPolicy

Now we can try to find out why this parameter is incompatible with Calico in acs-engine 0.14.x.

Furthermore, I could not find enablePodSecurityPolicy in the cluster-definition documentation: https://github.com/Azure/acs-engine/blob/master/docs/clusterdefinition.md

Thanks for your assistance.

@CecileRobertMichon
Contributor

@khaldoune thanks for catching that! The documentation gap is definitely a miss, I will fix it today. Let's investigate to find out why it is incompatible. For reference here are the two relevant PRs: #2048, #2125

@CecileRobertMichon
Contributor

fyi @pidah who implemented enablePodSecurityPolicy and might have more insight on why it's not compatible with calico

@khaldoune
Author

@pidah @CecileRobertMichon @jackfrancis

kubectl get psp and kubectl get clusterrole | grep privileged give no results, which is very strange.

Also, the /etc/cni directory is still empty, and the kubelet complains about it...

@CecileRobertMichon

Are you able to reproduce? This issue no longer seems to be related to the custom DNS.
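The checks discussed so far can be collected into one diagnostic script. This is a sketch: the kubectl commands are guarded so the script is harmless on a machine where kubectl is not configured, and check_cni is a hypothetical helper, not acs-engine code.

```shell
#!/usr/bin/env bash
# Sketch of the diagnostics reported in this thread.

check_cni() {
    # Report whether a CNI config directory contains any files.
    local dir=$1
    if [ -d "$dir" ] && [ -n "$(ls -A "$dir" 2>/dev/null)" ]; then
        echo "cni config present in $dir"
    else
        echo "cni config missing in $dir"
    fi
}

if command -v kubectl >/dev/null 2>&1; then
    kubectl get psp                            # reported above: no resources
    kubectl get clusterrole | grep privileged  # reported above: no output
fi

# The directory the kubelet complains about:
check_cni /etc/cni/net.d
```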

@pidah
Contributor

pidah commented Apr 6, 2018

I am not sure why this is happening, but I suspect the default restricted policy that is applied might need tweaking. The nodes are not ready, but is the kubelet container running on the nodes, and do you see anything in its logs? If the API server is running, you might find some clues in its logs too. @khaldoune @CecileRobertMichon @jackfrancis

@kezzamiti

@pidah @jackfrancis @CecileRobertMichon

The kubelet starts but complains about the missing /etc/cni content (no net.d folder), so no pod can start without a CNI, including the API server and the other master components.

There is no pod running on my cluster.

Where in the code does the cni folder get populated?

Add to that the fact that there is no PSP and no clusterrole, so I don't think the PSP is the problem. I'm confused.

@kezzamiti

@CecileRobertMichon

@kezzamiti

@CecileRobertMichon @pidah

I'm not in front of my laptop, but I'm thinking about what I've said: how could I get responses from kubectl if the API server were down? So the API server is not down... then why does kubectl get pods --all-namespaces say there are no resources? :(

Very, very confused.

@kezzamiti

From what I have understood, the net.d directory can be created by Docker when a privileged container tries to mount it.
In this case, the directory would be created by the calico daemonset, which means that Calico could not be started. I'll try to kubectl apply the Calico manifests once I'm in front of my laptop and update this issue.

@CecileRobertMichon @pidah

@pidah
Contributor

pidah commented Apr 6, 2018

@khaldoune the Kubernetes API server (and the other control plane components) are deployed as static pods managed by the kubelet on each master node. As long as the API server's dependencies, like docker and etcd, are running, the API server will start on each master node and respond to kubectl requests. You can ssh into a master node and look at the API server's docker container logs; some of the other master components' docker containers may be running too, and you can look at their logs as well.
However, because of the calico/network-related failure, the API server is in an inconsistent state and reports that worker nodes (which may include master nodes) are not yet ready, which implies no pods will be bound to those nodes. Note that it fails to report that it is itself running as a pod.
Hopefully the reason for the failure will show up in some of the logs. @CecileRobertMichon @kezzamiti @jackfrancis

@khaldoune
Author

@pidah

Thanks for your answer.

As stated before, kubectl get psp returns no resources.

If you kubectl apply -f /etc/kubernetes/manifests/pod-security-policy.yaml and wait about 10 minutes, the cluster becomes ready 🥇

I still don't know why this file does not get applied. Is it intended to be applied by the kubelet (via the --pod-manifest-path parameter)?

@CecileRobertMichon @jackfrancis
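The manual workaround can be sketched as a script. The manifest path is the one reported in this thread; wait_ready is a hypothetical helper, and its interval and attempt count are assumptions sized for the ~10 minute delay observed.

```shell
#!/usr/bin/env bash
# Sketch of the workaround: apply the PSP manifest by hand, then wait.

wait_ready() {
    # Poll a command every $1 seconds, up to $2 attempts.
    local interval=$1 attempts=$2; shift 2
    local i
    for ((i = 0; i < attempts; i++)); do
        "$@" >/dev/null 2>&1 && return 0
        sleep "$interval"
    done
    return 1
}

if command -v kubectl >/dev/null 2>&1; then
    kubectl apply -f /etc/kubernetes/manifests/pod-security-policy.yaml
    # Nodes were observed to become Ready roughly 10 minutes later.
    wait_ready 30 30 sh -c 'kubectl get nodes --no-headers | grep -q " Ready"'
fi
```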

@khaldoune
Author

@pidah @CecileRobertMichon @jackfrancis

I think I have an explanation: starting from 0.14, ensureApiserver has been replaced by ensureK8s, which checks for node readiness.

Nodes cannot be ready if no network is installed, and pods cannot be created if there is no PSP (since this admission controller is activated).

If you create the PSPs in a failing cluster, it gets unlocked after 10 minutes and the nodes become ready.

I've confirmed this reasoning by replacing the body of ensureApiserver in acs 0.13.1 with the code of ensureK8s... The deployment failed.

Suggestion: run ensureApiserver as before, then ensure the PodSecurityPolicy, then wait a while, and then run ensureK8s.

References:
acs 0.13.1:

function ensureApiserver() {

master:
function ensureK8s() {

Could you please validate, patch, and release a 0.15.2? Thanks.
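The suggested ordering could be sketched as follows. The function names mirror the acs-engine provisioning script, but the bodies here are simplified illustrations, not the real implementation; retrycmd and its retry counts are assumptions.

```shell
#!/usr/bin/env bash
# Sketch of the suggested provisioning sequence.

retrycmd() {
    # Run "$@" up to $1 times, sleeping $2 seconds between attempts.
    local retries=$1 wait=$2; shift 2
    local i
    for ((i = 1; i <= retries; i++)); do
        "$@" >/dev/null 2>&1 && return 0
        sleep "$wait"
    done
    return 1
}

ensureApiserver() {
    # 1. Wait only for the API server to answer, as in acs-engine 0.13.1.
    retrycmd 120 5 kubectl get nodes
}

ensurePodSecurityPolicy() {
    # 2. Apply the PSP manifest so the admission controller lets pods
    #    (including calico-node) be created.
    retrycmd 12 5 kubectl apply -f /etc/kubernetes/manifests/pod-security-policy.yaml
}

ensureK8s() {
    # 3. Only now wait for node readiness (simplified check: at least one
    #    node line that is not NotReady).
    retrycmd 120 5 sh -c 'kubectl get nodes --no-headers | grep -qv NotReady'
}

if command -v kubectl >/dev/null 2>&1; then
    ensureApiserver && ensurePodSecurityPolicy && ensureK8s
fi
```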

@pidah
Contributor

pidah commented Apr 8, 2018

@khaldoune ah, nice catch... agreed – the PSP needs to be applied immediately after the API server is up, but before the node health checks. @CecileRobertMichon @jackfrancis

@jackfrancis
Member

@khaldoune Thanks! I'll have a PR today that we can test for a possible inclusion in a patch release.
