
Two alerts failing out of the box: K8SControllerManagerDown and K8SSchedulerDown #23

Closed
dmcnaught opened this issue Dec 20, 2016 · 32 comments


@dmcnaught
Contributor

Installed with KOPS 1.4.1, K8s 1.4.6 on AWS.

It looks to me like the query is set to alert when there is one kube-scheduler (or kube-controller-manager), which I don't understand.

ALERT K8SSchedulerDown
  IF absent(up{job="kube-scheduler"}) or (count(up{job="kube-scheduler"} == 1) BY (cluster) == 0)

I'm pretty new to Prometheus queries and I'm not really sure how the BY (cluster) == 0 part relates.
Any pointers appreciated.
Thanks for the great project!
--Duncan

@brancz
Collaborator

brancz commented Dec 20, 2016

The BY (cluster) == 0 part is unlikely to matter in most setups; it just allows this alert to be used to monitor/alert for more than one Kubernetes cluster (e.g. when using Federation). I'm guessing your alert is triggering because of the absence of the kube-scheduler/kube-controller-manager jobs. Could you make sure you can find them on the /targets page? If they are not there, what is likely happening is that there is no Endpoints object listing the kube-scheduler and kube-controller-manager. That means either you don't have the Services created from manifests/k8s/, or your kube-scheduler and kube-controller-manager are not discoverable via those Services, in which case the output of the following would help:

$ kubectl get pods --all-namespaces

Or, if you cannot disclose all of that information, these should give us everything applicable:

$ kubectl -n monitoring get pods
$ kubectl -n kube-system get pods

We typically test the content of this repository with clusters created with bootkube, but it would be great if we could get a section/guide for kops, as it's pretty widely adopted as well.
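For reference, the alert expression decomposes like this, and you can evaluate the counting piece directly against the Prometheus HTTP API (the in-cluster service URL below is an assumption about this setup):

#   up{job="kube-scheduler"} == 1      -> scheduler targets that are currently up
#   count(...) BY (cluster)            -> how many of those exist per cluster label
#   ... == 0                           -> true for any cluster where that count is zero
#   absent(up{job="kube-scheduler"})   -> covers the case where the job has no targets at all
$ curl -s 'http://prometheus-k8s.monitoring.svc:9090/api/v1/query' \
    --data-urlencode 'query=count(up{job="kube-scheduler"} == 1) BY (cluster)'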

@dmcnaught
Contributor Author

Thanks for the quick response.
I don't see them on the /targets page.
Here is the info requested:

NAMESPACE            NAME                                                     READY     STATUS             RESTARTS   AGE
athena-graphql       athena-graphql-cmd-3150689734-xqkkq                      1/1       Running            0          4d
deis                 deis-builder-2759337600-en6fr                            1/1       Running            1          17d
deis                 deis-controller-873470834-re8zj                          1/1       Running            0          17d
deis                 deis-database-1712966127-plp30                           1/1       Running            0          17d
deis                 deis-logger-9212198-xjpsj                                1/1       Running            3          17d
deis                 deis-logger-fluentd-0ppwl                                1/1       Running            0          17d
deis                 deis-logger-fluentd-a8zly                                1/1       Running            0          17d
deis                 deis-logger-redis-663064164-3wq80                        1/1       Running            0          17d
deis                 deis-monitor-grafana-432364990-joygg                     1/1       Running            0          17d
deis                 deis-monitor-influxdb-2729526471-7npow                   1/1       Running            0          17d
deis                 deis-monitor-telegraf-2poea                              0/1       CrashLoopBackOff   112        17d
deis                 deis-monitor-telegraf-cy218                              1/1       Running            0          17d
deis                 deis-nsqd-3264449345-c1t0c                               1/1       Running            0          17d
deis                 deis-registry-680832981-64b4s                            1/1       Running            0          17d
deis                 deis-registry-proxy-94w6s                                1/1       Running            0          17d
deis                 deis-registry-proxy-p46bn                                1/1       Running            0          17d
deis                 deis-router-2457652422-sx6c7                             1/1       Running            0          17d
deis                 deis-workflow-manager-2210821749-ggzm3                   1/1       Running            0          17d
hades-graphql        hades-graphql-cmd-1371319327-6sqb6                       1/1       Running            0          8m
kube-system          dns-controller-2613152787-l8bj4                          1/1       Running            0          17d
kube-system          etcd-server-events-ip-10-101-115-158.ec2.internal        1/1       Running            0          17d
kube-system          etcd-server-ip-10-101-115-158.ec2.internal               1/1       Running            0          17d
kube-system          kube-apiserver-ip-10-101-115-158.ec2.internal            1/1       Running            2          17d
kube-system          kube-controller-manager-ip-10-101-115-158.ec2.internal   1/1       Running            0          17d
kube-system          kube-dns-v20-3531996453-ban95                            3/3       Running            0          17d
kube-system          kube-dns-v20-3531996453-v66h9                            3/3       Running            0          17d
kube-system          kube-proxy-ip-10-101-115-158.ec2.internal                1/1       Running            0          17d
kube-system          kube-proxy-ip-10-101-175-18.ec2.internal                 1/1       Running            0          17d
kube-system          kube-scheduler-ip-10-101-115-158.ec2.internal            1/1       Running            0          17d
monitoring           alertmanager-main-0                                      2/2       Running            0          4d
monitoring           alertmanager-main-1                                      2/2       Running            0          4d
monitoring           alertmanager-main-2                                      2/2       Running            0          4d
monitoring           grafana-874468113-0atmz                                  2/2       Running            0          4d
monitoring           kube-state-metrics-3229993571-aqo7z                      1/1       Running            0          4d
monitoring           node-exporter-5xyxb                                      1/1       Running            0          4d
monitoring           node-exporter-xwgn6                                      1/1       Running            0          4d
monitoring           prometheus-k8s-0                                         3/3       Running            0          4d
monitoring           prometheus-operator-479044303-ris0n                      1/1       Running            0          4d
splunkspout          k8ssplunkspout-nonprod-2ykwk                             1/1       Running            0          17d
splunkspout          k8ssplunkspout-nonprod-xdp7d                             1/1       Running            0          17d
styleguide           styleguide-cmd-685725177-pwl1c                           1/1       Running            0          4d
styleguide-staging   styleguide-staging-cmd-2993321210-1cgmo                  1/1       Running            0          2h
wellbot              wellbot-web-3878855632-34s4e                             1/1       Running            0          15d
welltok-arch-k8s     welltok-arch-k8s-1857575956-ky06l                        1/1       Running            0          14d

@brancz
Collaborator

brancz commented Dec 21, 2016

I believe I have seen this before; the problem, I think, is that kops doesn't label the Kubernetes component pods correctly with k8s-app=<component-name>. To confirm, can you give me the output of kubectl -n kube-system get pod kube-controller-manager-ip-10-101-115-158.ec2.internal -oyaml and kubectl -n kube-system get pod kube-scheduler-ip-10-101-115-158.ec2.internal -oyaml? (In case they got rescheduled, use the pods whose names now start with kube-controller-manager or kube-scheduler respectively.)

If what I am guessing is correct, then we should push on the kops side to use upstream manifests like bootkube does.
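A quick way to confirm that guess without pasting full pod YAML (standard kubectl; the grep pattern just matches the pod names from the listing above):

$ kubectl -n kube-system get pods --show-labels | grep -E 'kube-(scheduler|controller-manager)'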

@dmcnaught
Contributor Author

Here is a similar issue that added that label to kube-proxy: kubernetes/kops#617

@brancz
Collaborator

brancz commented Dec 21, 2016

I didn't see that one, thanks for pointing it out! I opened kubernetes/kops#1226 to start a discussion on it. Hopefully we will get those labels soon then. In the meantime I think you'll have to either ssh onto those servers, change the templates, and restart them (which, IIRC, recreates the objects from the templates; disclaimer: not super familiar with kops), or comment out/remove those alerts for now.

(Also remember that changes to individual machines will disappear when the ASG recreates them, unless you make the changes at the ASG level.)

@dmcnaught
Contributor Author

I just noticed etcd is not appearing in my prometheus targets either.

@dmcnaught
Contributor Author

Oh, and kube-dns. Should we update kubernetes/kops#1226?

@brancz
Collaborator

brancz commented Dec 23, 2016

It seems like we won't have an answer before the holidays, so I'll keep pushing in the new year. But yes, I will keep pushing for a consistent labelling strategy, and once we have it we'll add the respective manifests here for Prometheus to properly discover the components. I don't mind maintaining a set of manifests for kops, bootkube, etc. as long as each of those labelling strategies makes sense and exists.

So far so good :) Happy holidays!

@dmcnaught
Contributor Author

I added the labels on the master (/etc/kubernetes/manifests/kube-controller-manager, kube-scheduler) and then ran kubectl create -f manifests/k8s/self-hosted.
I ran hack/cluster-monitoring/teardown and then hack/cluster-monitoring/deploy, and that fixed the alerts for kube-scheduler and kube-controller-manager.
kube-dns now has four endpoints listed under /targets, and they are all failing with getsockopt: connection refused.
I'd also like to add etcd, but I can't find any explicit instructions on that. It would be great to add a kops default setup in manifests/k8s/kops/.
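For anyone repeating this on a kops master: the edit amounts to adding a k8s-app label under metadata.labels in each static pod manifest. A rough sketch, assuming kops's file layout and GNU sed (the exact file names are an assumption; back up the files and check the resulting YAML before relying on it):

# Insert "k8s-app: <component>" beneath the existing labels: key of each
# static pod manifest; the kubelet notices the file change and recreates
# the pod.
for c in kube-scheduler kube-controller-manager; do
  sudo sed -i.bak "s/^\( *\)labels:/\1labels:\n\1  k8s-app: ${c}/" \
    "/etc/kubernetes/manifests/${c}.manifest"
done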

@brancz
Collaborator

brancz commented Dec 29, 2016

Yep, that's the plan as soon as we have consistent labelling in upstream kops.

@dmcnaught
Contributor Author

Labelling added to kops: kubernetes/kops#1314

@brancz
Collaborator

brancz commented Jan 2, 2017

I don't have a v1.5.x kops cluster handy, but I'll create the manifests on a best-effort basis, and then it would be great if you could test them.

@dmcnaught
Contributor Author

With pleasure. Thanks

@brancz
Collaborator

brancz commented Jan 2, 2017

In fact, I think the manifests from manifests/k8s/self-hosted are suitable for kops when using the latest master, unless I'm mistaken. The exception is kube-dns, as that manifest is slightly outdated compared to upstream, but there are no alerts for kube-dns metrics yet, so that wouldn't be a problem for now. Can you confirm?

@brancz
Collaborator

brancz commented Jan 2, 2017

Actually, it seems that the kube-dns manifest for v1.5.0 has been updated, so in that case it should appear as well.

@dmcnaught
Contributor Author

I'll create a 1.5.x K8s cluster with the latest KOPS soon to test, thanks.

For now, I've updated the labels on my 1.4.6 master and it looks good except for the following:

  • kube-dns now has four failing endpoints listed under /targets:

    Endpoint                          State  Labels                       Last Scrape  Error
    http://100.96.1.2:10054/metrics   DOWN   instance="100.96.1.2:10054"  198ms ago    Get http://100.96.1.2:10054/metrics: dial tcp 100.96.1.2:10054: getsockopt: connection refused
    http://100.96.1.2:10055/metrics   DOWN   instance="100.96.1.2:10055"  14.087s ago  Get http://100.96.1.2:10055/metrics: dial tcp 100.96.1.2:10055: getsockopt: connection refused
    http://100.96.1.3:10054/metrics   DOWN   instance="100.96.1.3:10054"  9.708s ago   Get http://100.96.1.3:10054/metrics: dial tcp 100.96.1.3:10054: getsockopt: connection refused
    http://100.96.1.3:10055/metrics   DOWN   instance="100.96.1.3:10055"  327ms ago    Get http://100.96.1.3:10055/metrics: dial tcp 100.96.1.3:10055: getsockopt: connection refused

  • The kubernetes target is also failing (and this is causing the K8SApiserverDown alert):

    Endpoint                            State  Labels                         Last Scrape  Error
    https://10.101.115.158:443/metrics  DOWN   instance="10.101.115.158:443"  11.904s ago  Get https://10.101.115.158:443/metrics: x509: certificate is valid for 100.64.0.1, not 10.101.115.158

@brancz
Collaborator

brancz commented Jan 2, 2017

The kube-dns failure is likely due to an old kube-dns manifest; I can see that the 1.4.x manifest in the upstream kops repo has not been updated to expose metrics.

The kubernetes target failure is a bit trickier. The best fix would be for the certificate to be created with the requested IP in the subject alternative names section (inspect your cert with openssl x509 -text -in your.crt). The other option is to "manually" maintain an Endpoints object through a headless Service, as done for etcd (see manifests/etcd); that way the correct IP will be used to perform the request.
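A minimal sketch of that second option, modeled on manifests/etcd (the Service name, label, and port name are illustrative; the master IP is the one from the failing target above):

$ cat <<'EOF' | kubectl -n kube-system create -f -
apiVersion: v1
kind: Service
metadata:
  name: apiserver-prometheus-discovery   # hypothetical name
  labels:
    k8s-app: apiserver
spec:
  clusterIP: None          # headless; no selector, so endpoints are maintained by hand below
  ports:
  - name: https
    port: 443
---
apiVersion: v1
kind: Endpoints
metadata:
  name: apiserver-prometheus-discovery   # must match the Service name
  labels:
    k8s-app: apiserver
subsets:
- addresses:
  - ip: 10.101.115.158     # the master's IP from the failing target
  ports:
  - name: https
    port: 443
EOF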

@dmcnaught
Contributor Author

@brancz I don't know whether we want to continue this thread on kops/kube-prometheus work - let me know if there's a better place. Maybe we should open a new issue.
I used kops latest (master: Version git-8b21ace) and Kubernetes 1.5.1 to create a new cluster in AWS.
Running hack/cluster-monitoring/deploy

--- github/kube-prometheus ‹master› » ./hack/cluster-monitoring/deploy
namespace "monitoring" created
deployment "prometheus-operator" created
the server doesn't have a resource type "servicemonitor"
the server doesn't have a resource type "servicemonitor"
the server doesn't have a resource type "servicemonitor"
No resources found.
No resources found.
No resources found.
deployment "kube-state-metrics" created
service "kube-state-metrics" created
daemonset "node-exporter" created
service "node-exporter" created
configmap "grafana-dashboards" created
deployment "grafana" created
service "grafana" created
configmap "prometheus-k8s" created
configmap "prometheus-k8s-rules" created
service "prometheus-k8s" created
prometheus "prometheus-k8s" created
configmap "alertmanager-main" created
service "alertmanager-main" created
alertmanager "alertmanager-main" created
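(A note on the servicemonitor lines above: in this 1.5-era setup the Prometheus Operator registers its custom kinds as ThirdPartyResources, and the deploy script can race that registration. A sketch of how to check, assuming the TPR mechanism of the time:)

# List the registered ThirdPartyResources (the pre-CRD extension mechanism);
# once the operator's service-monitor entry shows up, re-creating the
# servicemonitor manifests should succeed.
$ kubectl get thirdpartyresources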

@brancz
Collaborator

brancz commented Jan 24, 2017

Has this been solved in upstream kops? @dmcnaught

@dmcnaught
Contributor Author

I'm going to start on the kops kube-prometheus config when kops 1.5 has been released.

@brancz
Collaborator

brancz commented Jan 24, 2017

Great thanks for the update! Are you aware of an ETA?

@dmcnaught
Contributor Author

I've heard "soon" - it's currently in alpha4: https://github.com/kubernetes/kops/releases

@brancz
Collaborator

brancz commented Jan 24, 2017

Great! Looking forward to "soon" 🙂

@dmcnaught
Contributor Author

Me too. I thought it would be "sooner" 😉

@dmcnaught
Contributor Author

Getting close with kops 1.5.0-alpha2 and k8s 1.5.2 ^ Just the api cert issue to go. 😄

@rocketraman

Looks like this is also the case with clusters created via acs-engine on Azure. The labels on the controller-manager pod are:

  labels:
    component: kube-controller-manager
    tier: control-plane

@yann-soubeyrand

Same with a cluster created using Kubeadm.

@brancz
Collaborator

brancz commented Apr 23, 2018

@yann-soubeyrand for kubeadm clusters you need to configure the controller manager and scheduler to listen on all interfaces, or at least on the pod network interface/IP.
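A sketch of that change on a kubeadm cluster of this era (paths are kubeadm's defaults; the flag was --address in these releases and became --bind-address later, so verify against your version):

# kubeadm writes the control-plane static pod manifests here; the kubelet
# restarts the pods automatically when the files change.
sudo sed -i 's/--address=127.0.0.1/--address=0.0.0.0/' \
  /etc/kubernetes/manifests/kube-controller-manager.yaml \
  /etc/kubernetes/manifests/kube-scheduler.yaml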

@yann-soubeyrand

@brancz Thanks for the tip on modifying the listening addresses, which saved me some time ;-) However, I was pointing out that the labelling done by kubeadm is like rocketraman wrote above, and therefore kube-prometheus was not able to discover the controller manager, the scheduler, or etcd.

@rawkode

rawkode commented Jul 2, 2018

@brancz Can confirm what @yann-soubeyrand and @rocketraman have said: kubeadm and GKE use component: kube-scheduler, not k8s-app.
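One workaround when the pods carry component= instead of k8s-app= is a headless Service whose selector matches the labels that actually exist, so an Endpoints object gets populated for Prometheus to discover. A sketch (the Service name is made up, and 10251 was the scheduler's insecure metrics port at the time):

$ cat <<'EOF' | kubectl -n kube-system create -f -
apiVersion: v1
kind: Service
metadata:
  name: kube-scheduler-prometheus-discovery   # hypothetical name
  labels:
    k8s-app: kube-scheduler    # what the kube-prometheus config selects on
spec:
  clusterIP: None
  selector:
    component: kube-scheduler  # what kubeadm/GKE actually set on the pod
  ports:
  - name: http-metrics
    port: 10251
EOF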

@KeithTt

KeithTt commented Dec 9, 2020

> @yann-soubeyrand for kubeadm clusters you need to configure the controller manager and scheduler to listen on all interfaces, or at least on the pod network interface/IP.

I changed the bind address of the controller manager and scheduler to 0.0.0.0, but they are still not up in Prometheus.


Also, there is no data in Grafana.
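(If anyone else lands here: on newer releases, binding to 0.0.0.0 alone may not be enough, because the insecure metrics ports are disabled in favor of HTTPS ones. A quick check from a control-plane node; <node-ip> is a placeholder:)

# Insecure ports (older releases):
$ curl -s http://<node-ip>:10251/metrics | head     # kube-scheduler
$ curl -s http://<node-ip>:10252/metrics | head     # kube-controller-manager
# HTTPS ports (newer releases; authentication may be required):
$ curl -sk https://<node-ip>:10259/metrics | head   # kube-scheduler
$ curl -sk https://<node-ip>:10257/metrics | head   # kube-controller-manager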
