
Two alerts failing out of the box: K8SControllerManagerDown and K8SSchedulerDown #23

Closed
dmcnaught opened this issue Dec 20, 2016 · 32 comments


@dmcnaught
Contributor

Installed with KOPS 1.4.1, K8s 1.4.6 on AWS.

It looks to me like the query is set to alert when there is one kube-scheduler (or kube-controller-manager), which I don't understand.

ALERT K8SSchedulerDown
  IF absent(up{job="kube-scheduler"}) or (count(up{job="kube-scheduler"} == 1) BY (cluster) == 0)

I'm pretty new to Prometheus queries and I'm not really sure how the BY (cluster) == 0 part relates.
Any pointers appreciated.
Thanks for the great project!
--Duncan

@brancz
Collaborator

brancz commented Dec 20, 2016

The BY (cluster) == 0 part is unlikely to matter in most setups; it just allows this alert to be used to monitor/alert for more than one Kubernetes cluster (e.g. when using Federation). I'm guessing your alert is triggering because of the absence of the kube-scheduler/kube-controller-manager jobs. Could you make sure you can find them on the /targets page? If they are not there, what is likely happening is that there is no Endpoints object listing the kube-scheduler and kube-controller-manager. That means either you don't have the Services created from manifests/k8s/, or your kube-scheduler and kube-controller-manager are not discoverable via those Services, in which case the output of the following would help:

$ kubectl get pods --all-namespaces

Or, if you cannot disclose all of that information, these should give us everything applicable:

$ kubectl -n monitoring get pods
$ kubectl -n kube-system get pods

We typically test the content of this repository with clusters created with bootkube, but it would be great if we could get a section/guide for kops, as it's pretty widely adopted as well.
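For reference, the alert expression decomposes like this, and you can evaluate the counting piece directly against the Prometheus HTTP API (the in-cluster service URL below is an assumption about this setup):

#   up{job="kube-scheduler"} == 1      -> scheduler targets that are currently up
#   count(...) BY (cluster)            -> how many of those exist per cluster label
#   ... == 0                           -> true for any cluster where that count is zero
#   absent(up{job="kube-scheduler"})   -> covers the case where the job has no targets at all
$ curl -s 'http://prometheus-k8s.monitoring.svc:9090/api/v1/query' \
    --data-urlencode 'query=count(up{job="kube-scheduler"} == 1) BY (cluster)'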

@dmcnaught
Contributor Author

Thanks for the quick response.
I don't see them on the /targets page.
Here is the info requested:

NAMESPACE            NAME                                                     READY     STATUS             RESTARTS   AGE
athena-graphql       athena-graphql-cmd-3150689734-xqkkq                      1/1       Running            0          4d
deis                 deis-builder-2759337600-en6fr                            1/1       Running            1          17d
deis                 deis-controller-873470834-re8zj                          1/1       Running            0          17d
deis                 deis-database-1712966127-plp30                           1/1       Running            0          17d
deis                 deis-logger-9212198-xjpsj                                1/1       Running            3          17d
deis                 deis-logger-fluentd-0ppwl                                1/1       Running            0          17d
deis                 deis-logger-fluentd-a8zly                                1/1       Running            0          17d
deis                 deis-logger-redis-663064164-3wq80                        1/1       Running            0          17d
deis                 deis-monitor-grafana-432364990-joygg                     1/1       Running            0          17d
deis                 deis-monitor-influxdb-2729526471-7npow                   1/1       Running            0          17d
deis                 deis-monitor-telegraf-2poea                              0/1       CrashLoopBackOff   112        17d
deis                 deis-monitor-telegraf-cy218                              1/1       Running            0          17d
deis                 deis-nsqd-3264449345-c1t0c                               1/1       Running            0          17d
deis                 deis-registry-680832981-64b4s                            1/1       Running            0          17d
deis                 deis-registry-proxy-94w6s                                1/1       Running            0          17d
deis                 deis-registry-proxy-p46bn                                1/1       Running            0          17d
deis                 deis-router-2457652422-sx6c7                             1/1       Running            0          17d
deis                 deis-workflow-manager-2210821749-ggzm3                   1/1       Running            0          17d
hades-graphql        hades-graphql-cmd-1371319327-6sqb6                       1/1       Running            0          8m
kube-system          dns-controller-2613152787-l8bj4                          1/1       Running            0          17d
kube-system          etcd-server-events-ip-10-101-115-158.ec2.internal        1/1       Running            0          17d
kube-system          etcd-server-ip-10-101-115-158.ec2.internal               1/1       Running            0          17d
kube-system          kube-apiserver-ip-10-101-115-158.ec2.internal            1/1       Running            2          17d
kube-system          kube-controller-manager-ip-10-101-115-158.ec2.internal   1/1       Running            0          17d
kube-system          kube-dns-v20-3531996453-ban95                            3/3       Running            0          17d
kube-system          kube-dns-v20-3531996453-v66h9                            3/3       Running            0          17d
kube-system          kube-proxy-ip-10-101-115-158.ec2.internal                1/1       Running            0          17d
kube-system          kube-proxy-ip-10-101-175-18.ec2.internal                 1/1       Running            0          17d
kube-system          kube-scheduler-ip-10-101-115-158.ec2.internal            1/1       Running            0          17d
monitoring           alertmanager-main-0                                      2/2       Running            0          4d
monitoring           alertmanager-main-1                                      2/2       Running            0          4d
monitoring           alertmanager-main-2                                      2/2       Running            0          4d
monitoring           grafana-874468113-0atmz                                  2/2       Running            0          4d
monitoring           kube-state-metrics-3229993571-aqo7z                      1/1       Running            0          4d
monitoring           node-exporter-5xyxb                                      1/1       Running            0          4d
monitoring           node-exporter-xwgn6                                      1/1       Running            0          4d
monitoring           prometheus-k8s-0                                         3/3       Running            0          4d
monitoring           prometheus-operator-479044303-ris0n                      1/1       Running            0          4d
splunkspout          k8ssplunkspout-nonprod-2ykwk                             1/1       Running            0          17d
splunkspout          k8ssplunkspout-nonprod-xdp7d                             1/1       Running            0          17d
styleguide           styleguide-cmd-685725177-pwl1c                           1/1       Running            0          4d
styleguide-staging   styleguide-staging-cmd-2993321210-1cgmo                  1/1       Running            0          2h
wellbot              wellbot-web-3878855632-34s4e                             1/1       Running            0          15d
welltok-arch-k8s     welltok-arch-k8s-1857575956-ky06l                        1/1       Running            0          14d

@brancz
Collaborator

brancz commented Dec 21, 2016

I believe I have seen this before; the problem, I think, is that kops doesn't label the Kubernetes component pods correctly with k8s-app=<component-name>. To confirm, can you give me the output of kubectl -n kube-system get pod kube-controller-manager-ip-10-101-115-158.ec2.internal -oyaml and kubectl -n kube-system get pod kube-scheduler-ip-10-101-115-158.ec2.internal -oyaml? (In case they got rescheduled, use the pods whose names now start with kube-controller-manager or kube-scheduler respectively.)

If what I am guessing is correct, then we should push on the kops side to use upstream manifests like bootkube does.
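A quick way to confirm that guess without pasting full pod YAML (standard kubectl; the grep pattern just matches the pod names from the listing above):

$ kubectl -n kube-system get pods --show-labels | grep -E 'kube-(scheduler|controller-manager)'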

@dmcnaught
Contributor Author

Here is a similar issue that added that label to kube-proxy: kubernetes/kops#617

@brancz
Collaborator

brancz commented Dec 21, 2016

I didn't see that one, thanks for pointing it out! I opened kubernetes/kops#1226 to start a discussion on it. Hopefully we will get those labels soon then. In the meantime I think you'll have to either ssh onto those servers, change the templates, and restart them (which, IIRC, recreates the objects from the templates; disclaimer: not super familiar with kops), or comment out/remove those alerts for now.

(Also remember that changes to individual machines will disappear when the ASG recreates them, unless you make the changes at the ASG level.)

@dmcnaught
Contributor Author

I just noticed etcd is not appearing in my prometheus targets either.

@dmcnaught
Contributor Author

Oh, and kube-dns. Should we update kubernetes/kops#1226?

@brancz
Collaborator

brancz commented Dec 23, 2016

It seems like we won't have an answer before the holidays, so I'll keep pushing in the new year. But yes, I will keep pushing for a consistent labelling strategy, and once we have it we'll add the respective manifests here for Prometheus to properly discover the components. I don't mind maintaining a set of manifests for kops, bootkube, etc. as long as each of those labelling strategies makes sense and exists.

So far so good :) Happy holidays!

@dmcnaught
Contributor Author

I added the labels on the master (/etc/kubernetes/manifests/kube-controller-manager, kube-scheduler) and then ran kubectl create -f manifests/k8s/self-hosted.
I ran hack/cluster-monitoring/teardown and then hack/cluster-monitoring/deploy, and that fixed the alerts for kube-scheduler and kube-controller-manager.
kube-dns now has four endpoints listed under /targets, and they are all failing with getsockopt: connection refused.
I'd also like to add etcd, but I can't find any explicit instructions on that. It would be great to add a kops default setup in manifests/k8s/kops/.
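For anyone repeating this on a kops master: the edit amounts to adding a k8s-app label under metadata.labels in each static pod manifest. A rough sketch, assuming kops's file layout and GNU sed (the exact file names are an assumption; back up the files and check the resulting YAML before relying on it):

# Insert "k8s-app: <component>" beneath the existing labels: key of each
# static pod manifest; the kubelet notices the file change and recreates
# the pod.
for c in kube-scheduler kube-controller-manager; do
  sudo sed -i.bak "s/^\( *\)labels:/\1labels:\n\1  k8s-app: ${c}/" \
    "/etc/kubernetes/manifests/${c}.manifest"
done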

@brancz
Collaborator

brancz commented Dec 29, 2016

Yep, that's the plan as soon as we have consistent labelling in upstream kops.

@dmcnaught
Contributor Author

Labelling added to kops: kubernetes/kops#1314

@brancz
Collaborator

brancz commented Jan 2, 2017

I don't have a v1.5.x kops cluster handy, but I'll create the manifests on a best-effort basis, and then it would be great if you could test them.

@dmcnaught
Contributor Author

With pleasure. Thanks

@brancz
Collaborator

brancz commented Jan 2, 2017

In fact, I think the manifests from manifests/k8s/self-hosted are suitable for kops when using the latest master, unless I'm mistaken. The exception is kube-dns, as that manifest is slightly outdated compared to upstream, but there are no alerts for kube-dns metrics yet, so that wouldn't be a problem for now. Can you confirm?

@brancz
Collaborator

brancz commented Jan 2, 2017

Actually, it seems that the kube-dns manifest for v1.5.0 has been updated, so in that case it should appear as well.

@dmcnaught
Contributor Author

I'll create a 1.5.x K8s cluster with the latest KOPS soon to test, thanks.

For now, I've updated the labels on my 1.4.6 master and it looks good except for the following:

  • kube-dns now has four failing endpoints listed under /targets:

    Endpoint                          State  Labels                       Last Scrape  Error
    http://100.96.1.2:10054/metrics   DOWN   instance="100.96.1.2:10054"  198ms ago    Get http://100.96.1.2:10054/metrics: dial tcp 100.96.1.2:10054: getsockopt: connection refused
    http://100.96.1.2:10055/metrics   DOWN   instance="100.96.1.2:10055"  14.087s ago  Get http://100.96.1.2:10055/metrics: dial tcp 100.96.1.2:10055: getsockopt: connection refused
    http://100.96.1.3:10054/metrics   DOWN   instance="100.96.1.3:10054"  9.708s ago   Get http://100.96.1.3:10054/metrics: dial tcp 100.96.1.3:10054: getsockopt: connection refused
    http://100.96.1.3:10055/metrics   DOWN   instance="100.96.1.3:10055"  327ms ago    Get http://100.96.1.3:10055/metrics: dial tcp 100.96.1.3:10055: getsockopt: connection refused

  • The kubernetes target is also failing (and this is causing the K8SApiserverDown alert):

    Endpoint                            State  Labels                         Last Scrape  Error
    https://10.101.115.158:443/metrics  DOWN   instance="10.101.115.158:443"  11.904s ago  Get https://10.101.115.158:443/metrics: x509: certificate is valid for 100.64.0.1, not 10.101.115.158

@brancz
Collaborator

brancz commented Jan 2, 2017

The kube-dns failure is likely due to an old kube-dns manifest; I can see that the 1.4.x manifest in the upstream kops repo has not been updated to expose metrics.

The kubernetes target failure is a bit trickier. The best fix would be for the certificate to be created with the requested IP in the subject alternative names section (inspect your cert with openssl x509 -text -in your.crt). The other option is to "manually" maintain an Endpoints object through a headless Service, as done for etcd (see manifests/etcd); that way the correct IP will be used to perform the request.
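A minimal sketch of that second option, modeled on manifests/etcd (the Service name, label, and port name are illustrative; the master IP is the one from the failing target above):

$ cat <<'EOF' | kubectl -n kube-system create -f -
apiVersion: v1
kind: Service
metadata:
  name: apiserver-prometheus-discovery   # hypothetical name
  labels:
    k8s-app: apiserver
spec:
  clusterIP: None          # headless; no selector, so endpoints are maintained by hand below
  ports:
  - name: https
    port: 443
---
apiVersion: v1
kind: Endpoints
metadata:
  name: apiserver-prometheus-discovery   # must match the Service name
  labels:
    k8s-app: apiserver
subsets:
- addresses:
  - ip: 10.101.115.158     # the master's IP from the failing target
  ports:
  - name: https
    port: 443
EOF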

@dmcnaught
Contributor Author

@brancz I don't know whether we want to continue this thread on kops/kube-prometheus work - let me know if there's a better place. Maybe we should open a new issue.
I used kops latest (master: Version git-8b21ace) and Kubernetes 1.5.1 to create a new cluster in AWS.
Running hack/cluster-monitoring/deploy

--- github/kube-prometheus ‹master› » ./hack/cluster-monitoring/deploy
namespace "monitoring" created
deployment "prometheus-operator" created
the server doesn't have a resource type "servicemonitor"
the server doesn't have a resource type "servicemonitor"
the server doesn't have a resource type "servicemonitor"
No resources found.
No resources found.
No resources found.
deployment "kube-state-metrics" created
service "kube-state-metrics" created
daemonset "node-exporter" created
service "node-exporter" created
configmap "grafana-dashboards" created
deployment "grafana" created
service "grafana" created
configmap "prometheus-k8s" created
configmap "prometheus-k8s-rules" created
service "prometheus-k8s" created
prometheus "prometheus-k8s" created
configmap "alertmanager-main" created
service "alertmanager-main" created
alertmanager "alertmanager-main" created
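(A note on the servicemonitor lines above: in this 1.5-era setup the Prometheus Operator registers its custom kinds as ThirdPartyResources, and the deploy script can race that registration. A sketch of how to check, assuming the TPR mechanism of the time:)

# List the registered ThirdPartyResources (the pre-CRD extension mechanism);
# once the operator's service-monitor entry shows up, re-creating the
# servicemonitor manifests should succeed.
$ kubectl get thirdpartyresources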

@brancz
Collaborator

brancz commented Jan 24, 2017

Has this been solved in upstream kops? @dmcnaught

@dmcnaught
Contributor Author

I'm going to start on the kops kube-prometheus config when kops 1.5 has been released.

@brancz
Collaborator

brancz commented Jan 24, 2017

Great thanks for the update! Are you aware of an ETA?

@dmcnaught
Contributor Author

I've heard "soon" - it's currently in alpha4: https://github.com/kubernetes/kops/releases

@brancz
Collaborator

brancz commented Jan 24, 2017

Great! Looking forward to "soon" 🙂

@dmcnaught
Contributor Author

Me too. I thought it would be "sooner" 😉

@dmcnaught
Contributor Author

Getting close with kops 1.5.0-alpha2 and k8s 1.5.2 ^ Just the api cert issue to go. 😄

@rocketraman

Looks like this is also the case with clusters created via acs-engine on Azure. The labels on the controller-manager pod are:

  labels:
    component: kube-controller-manager
    tier: control-plane

@yann-soubeyrand

Same with a cluster created using Kubeadm.

@brancz
Collaborator

brancz commented Apr 23, 2018

@yann-soubeyrand for kubeadm clusters you need to configure the controller manager and scheduler to listen on all interfaces, or at least on the pod network interface/IP.
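A sketch of that change on a kubeadm cluster of this era (paths are kubeadm's defaults; the flag was --address in these releases and became --bind-address later, so verify against your version):

# kubeadm writes the control-plane static pod manifests here; the kubelet
# restarts the pods automatically when the files change.
sudo sed -i 's/--address=127.0.0.1/--address=0.0.0.0/' \
  /etc/kubernetes/manifests/kube-controller-manager.yaml \
  /etc/kubernetes/manifests/kube-scheduler.yaml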

@yann-soubeyrand

@brancz Thanks for the tip on modifying the listening addresses, which saved me some time ;-) However, I was pointing out that the labelling done by kubeadm is like rocketraman wrote above, and therefore kube-prometheus was not able to discover the controller manager, the scheduler, or etcd.

@rawkode

rawkode commented Jul 2, 2018

@brancz Can confirm what @yann-soubeyrand and @rocketraman have said: kubeadm and GKE use component: kube-scheduler, not k8s-app.
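One workaround when the pods carry component= instead of k8s-app= is a headless Service whose selector matches the labels that actually exist, so an Endpoints object gets populated for Prometheus to discover. A sketch (the Service name is made up, and 10251 was the scheduler's insecure metrics port at the time):

$ cat <<'EOF' | kubectl -n kube-system create -f -
apiVersion: v1
kind: Service
metadata:
  name: kube-scheduler-prometheus-discovery   # hypothetical name
  labels:
    k8s-app: kube-scheduler    # what the kube-prometheus config selects on
spec:
  clusterIP: None
  selector:
    component: kube-scheduler  # what kubeadm/GKE actually set on the pod
  ports:
  - name: http-metrics
    port: 10251
EOF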

@KeithTt

KeithTt commented Dec 9, 2020

> @yann-soubeyrand for kubeadm clusters you need to configure the controller manager and scheduler to listen on all interfaces, or at least on the pod network interface/IP.

I changed the bind address of the controller manager and scheduler to 0.0.0.0, but they are still not up in Prometheus.


Also, there is no data in Grafana.
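(If anyone else lands here: on newer releases, binding to 0.0.0.0 alone may not be enough, because the insecure metrics ports are disabled in favor of HTTPS ones. A quick check from a control-plane node; <node-ip> is a placeholder:)

# Insecure ports (older releases):
$ curl -s http://<node-ip>:10251/metrics | head     # kube-scheduler
$ curl -s http://<node-ip>:10252/metrics | head     # kube-controller-manager
# HTTPS ports (newer releases; authentication may be required):
$ curl -sk https://<node-ip>:10259/metrics | head   # kube-scheduler
$ curl -sk https://<node-ip>:10257/metrics | head   # kube-controller-manager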
