# Troubleshooting
This file documents some of the common things that can go wrong when deploying OpenWhisk on Kubernetes and how to correct them.

### No invokers are deployed with DockerContainerFactory

Verify that you actually have at least one node with the label `openwhisk-role=invoker`.
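
A quick way to check is to list the node labels and, if necessary, add the missing label. In this sketch, `<worker-node-name>` is a placeholder for one of your worker nodes:

```shell
# List nodes with their labels and look for openwhisk-role=invoker
kubectl get nodes --show-labels | grep openwhisk-role=invoker

# If nothing is returned, label a worker node so invoker pods can be scheduled onto it
kubectl label node <worker-node-name> openwhisk-role=invoker
```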

### Invoker pods fail to start with volume mounting problems

To execute the containers for user actions, OpenWhisk relies on parts of the underlying infrastructure that Kubernetes is running on. When the OpenWhisk invoker is deployed, it mounts the host's Docker socket and several other system-specific directories related to Docker. This enables efficient container management, but it also means that the default `hostPath` volume values assume that the Kubernetes worker node image is Ubuntu. If containers fail to start with errors related to mounting `/sys/fs/cgroup`, `/run/runc`, `/var/lib/docker/containers`, or `/var/run/docker.sock`, then you will need to change the corresponding values in `helm/openwhisk/templates/_invoker-helpers.yaml` to match the host operating system running on your Kubernetes worker nodes.
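
To see which mount is failing, inspect the invoker pod's events. The `openwhisk` namespace below is an assumption; substitute whatever namespace you deployed the chart into:

```shell
# Find the invoker pod(s); the exact name depends on your Helm release name
kubectl -n openwhisk get pods | grep invoker

# The Events section at the bottom of the output shows which hostPath mount failed
kubectl -n openwhisk describe pod <invoker-pod-name>
```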

### Kafka, Redis, CouchDB, and Zookeeper pods stuck in Pending

These pods all mount volumes via PersistentVolumeClaims. If there is a misconfiguration related to the dynamic provisioning of PersistentVolumes, these pods will not be scheduled. See the Persistence section of the configuration choices documentation for more details.
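
To diagnose this, look at the pending claims and at the storage classes your cluster offers (the `openwhisk` namespace is again an assumption):

```shell
# Claims stuck in Pending indicate that no PersistentVolume could be provisioned or bound
kubectl -n openwhisk get pvc

# The Events section explains why provisioning failed (e.g., no default StorageClass)
kubectl -n openwhisk describe pvc <claim-name>

# Verify that the cluster has a usable (ideally default) StorageClass
kubectl get storageclass
```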

### Controller and Invoker cannot connect to Kafka

If services are having trouble connecting to Kafka, it may be that the Kafka service didn't actually come up successfully. One reason Kafka can fail to fully come up is that it cannot connect to itself. On minikube, fix this with `minikube ssh -- sudo ip link set docker0 promisc on`. If using kubeadm-dind-cluster, set `USE_HAIRPIN=true` in your environment before running `dind-cluster.sh up`. On full-scale Kubernetes clusters, make sure that your kubelet's `hairpin-mode` is not `none`.

The usual symptom of this network misconfiguration is the controller pod entering a CrashLoopBackOff because it exits before reporting the successful creation of its completed topic.
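
You can confirm this by checking the controller's status and logs and comparing them with the examples below; the exact pod name depends on your Helm release name, and the `openwhisk` namespace is an assumption:

```shell
# Find the controller pod
kubectl -n openwhisk get pods | grep controller

# Compare the startup log against the successful and failing examples below
kubectl -n openwhisk logs <controller-pod-name>
```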

Here's an example controller log of a successful startup:

```
[2018-10-18T17:53:48.129Z] [INFO] [#tid_sid_unknown] [Config] environment set value for kafka.hosts
[2018-10-18T17:53:48.130Z] [INFO] [#tid_sid_unknown] [Config] environment set value for port
[2018-10-18T17:53:49.360Z] [INFO] [#tid_sid_unknown] [KafkaMessagingProvider] created topic completed0
[2018-10-18T17:53:49.685Z] [INFO] [#tid_sid_unknown] [KafkaMessagingProvider] created topic health
[2018-10-18T17:53:49.929Z] [INFO] [#tid_sid_unknown] [KafkaMessagingProvider] created topic cacheInvalidation
[2018-10-18T17:53:50.151Z] [INFO] [#tid_sid_unknown] [KafkaMessagingProvider] created topic events
```

Here's what it looks like when the network is misconfigured and Kafka is not really working:

```
[2018-10-18T17:30:37.309Z] [INFO] [#tid_sid_unknown] [Config] environment set value for kafka.hosts
[2018-10-18T17:30:37.310Z] [INFO] [#tid_sid_unknown] [Config] environment set value for port
[2018-10-18T17:30:53.433Z] [INFO] [#tid_sid_unknown] [Controller] Shutting down Kamon with coordinated shutdown
```

### wsk cannot validate certificates error

If you installed self-signed certificates, which is the default for the OpenWhisk Helm chart, you will need to use `wsk -i` to suppress certificate checking. This works around `cannot validate certificate` errors from the `wsk` CLI.
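
For example, assuming `<your-api-host>` is the ingress address you configured for the deployment:

```shell
# Point the CLI at your deployment; the apihost value is a placeholder for your cluster's ingress
wsk property set --apihost <your-api-host>

# The -i flag tells wsk to skip certificate validation for the self-signed certificate
wsk -i list
```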

### nginx pod fails with host not found in resolver error

The nginx config map specifies a resolver that is used to resolve references to Kubernetes services like the controller and apigateway into IP addresses. By default, it uses `kube-dns.kube-system`. If your cluster instead uses CoreDNS (or some other DNS subsystem), you will need to edit the `k8s.dns` entry in `values.yaml` to an appropriate value for your cluster. A misconfigured resolver will result in the nginx pod entering a CrashLoopBackOff with an error message like the one below:

```
2018/09/27 23:33:48 [emerg] 1#1: host not found in resolver "kube-dns.kube-system" in /etc/nginx/nginx.conf:41
nginx: [emerg] host not found in resolver "kube-dns.kube-system" in /etc/nginx/nginx.conf:41
```
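
To find the right resolver value, check which DNS service your cluster is running and then override `k8s.dns`. The release name `owdev`, the `openwhisk` namespace, and the example resolver value `coredns.kube-system` below are assumptions; use the service name and namespace actually reported by your cluster:

```shell
# The cluster DNS Service usually lives in kube-system; its name is what the nginx resolver needs
kubectl -n kube-system get svc -l k8s-app=kube-dns

# Redeploy with a resolver value that matches your cluster's DNS service
helm upgrade owdev ./helm/openwhisk -n openwhisk --reuse-values --set k8s.dns=coredns.kube-system
```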