Fix metrics server addon #6201

Merged: 2 commits merged into kubernetes:master on Mar 17, 2019

Conversation

@itskingori (Member) commented Dec 12, 2018

All the details are in the commit messages, but in summary:

This fixes issues caused by changes in metrics-server v0.3.x; see the release notes linked in the commit message below.

The main issue is that in 0.3.x the secure kubelet port with auth is enabled by default. Use of the insecure port is deprecated and may be removed as soon as the next release.

That said, metrics-server now uses webhook authentication, so the kubelet would need webhook authentication enabled (which we don't have). This flag enables service account tokens to be used to authenticate against the kubelet (see the links in the commit message below).

Currently, even if we were to use certificates, this presents a challenge since metrics-server will generate its own self-signed certificates because we don't set `--tls-cert-file` and `--tls-private-key-file` (related issues are linked in the commit message below).

That said, based on the comment in 8ba93ee, do you think it's worth enabling webhook authentication on the kubelet instead of doing it this way?

v0.3.0 changed how the options are specified, so it makes sense that
we needed to make this change. See:
* kubernetes/kubernetes#44540 (comment)
* kubernetes/kubernetes#44540 (comment)

The preferred address types come from checking the address preferences
configured on our API server and mirroring those options in
metrics-server. See:
* kubernetes-sigs/metrics-server#67 (comment)

In 0.3.x, the secure kubelet port with auth is enabled by default. Use
of the insecure port is deprecated and may be removed as soon as
the next release. See:
* https://github.com/kubernetes-incubator/metrics-server/releases/tag/v0.3.0
* https://github.com/kubernetes-incubator/metrics-server/releases/tag/v0.3.1

That said, metrics-server now uses webhook authentication, so the kubelet
would need webhook authentication enabled (which we don't have). This flag
enables service account tokens to be used to authenticate against
the kubelet. See:
* https://kubernetes.io/docs/reference/access-authn-authz/webhook/
* #5508
* #5176 (comment)
* kubernetes-sigs/metrics-server#175

Currently, even if we were to use certificates, this presents a
challenge since metrics-server will generate its own self-signed
certificates because we don't set `--tls-cert-file` and
`--tls-private-key-file`. Related:
* kubernetes-sigs/metrics-server#25
* kubernetes-sigs/metrics-server#146
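
As a rough illustration of the flags described above, this is roughly what the metrics-server container ends up running with. This is a sketch only; the exact values and manifest layout in the kops addon may differ, and the image tag and address-type order shown here are taken from later comments in this thread.

# metrics-server deployment excerpt (illustrative, not the exact addon diff)
containers:
- name: metrics-server
  image: k8s.gcr.io/metrics-server-amd64:v0.3.1
  command:
  - /metrics-server
  # skip TLS verification of the kubelet's self-signed serving certificate
  - --kubelet-insecure-tls
  # mirror the API server's kubelet address preferences
  - --kubelet-preferred-address-types=InternalIP,Hostname,ExternalIP
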
@k8s-ci-robot added labels: cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA), size/XS (Denotes a PR that changes 0-9 lines, ignoring generated files) on Dec 12, 2018
@itskingori changed the title from "Fix metrics server" to "Fix metrics server addon" on Dec 12, 2018
@k8s-ci-robot (Contributor) commented:

@prageethw: GitHub didn't allow me to assign the following users: justins.

Note that only kubernetes members and repo collaborators can be assigned.
For more information please see the contributor guide

In response to this:

/assign @Justins

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@prageethw commented:

/assign @justinsb

@itskingori (Member, Author) commented:

I've added more information to the PR description for a better understanding of the issue and the solution therein. Also, kubernetes-sigs/metrics-server#131 (comment) is a question I posed to one of the main developers of metrics-server and I got a response with some more information.

@Cryptophobia (Contributor) left a comment:

LGTM.

This will fix problems users are experiencing with HPA and metrics-server right now.

@jmthvt (Contributor) left a comment:

LGTM

@prageethw commented Jan 26, 2019

@itskingori I don't think this fix works anymore after kops 1.11.0; it seems to work on kops 1.10.0 with metrics-server 0.3.1.

@Cryptophobia (Contributor) commented:

@prageethw, can you give us more detail on what problem you are experiencing? This fix works for me on k8s 1.11 and k8s 1.12. Might be something related to your specific environment.

@prageethw commented:

(in reply to @Cryptophobia's question above)

It works fine on kops 1.10.0 (in fact it works with just --kubelet-insecure-tls), but it does not work on kops 1.11.0; the same unauthorized error appears. I think the issue is not with k8s but with kops. Can you confirm you are running kops 1.11.0?

@itskingori (Member, Author) commented:

@prageethw Hmmmm. Used this in v1.10.11, just migrated to v1.11.6 and metrics server is still working (I think, will have to check). What error are you getting?

@Cryptophobia (Contributor) commented:

@prageethw, when you run the kubectl logs command for metrics-server, does it fail too? There was another separate issue about a missing kube-api-server clusterrolebinding when upgrading with kops.

Link to issue: #5706

I can confirm I am running 7 clusters on k8s 1.11.x and they all work with this fix.

@prageethw commented Jan 28, 2019

(in reply to @itskingori's question above)

@itskingori @Cryptophobia
I think you are talking about k8s versions. Yes, with kops 1.10.0 and k8s v1.11.7 I can also confirm it works as expected with just --kubelet-insecure-tls.

My helm chart, kops, and k8s versions are below, and this works like a charm:

k8s version: GitVersion:"v1.11.7"

kops version: Version 1.10.0

helm install stable/metrics-server \
    --name metrics-server \
    --version 2.0.4 \
    --set replicas=2 \
    --namespace metrics \
    --set args={"--kubelet-insecure-tls=true"} \
    --set resources.limits.cpu="100m",resources.limits.memory="50Mi"

Here is the case where it fails, with the metrics-server logs:

unable to fully collect metrics: [unable to fully scrape metrics from source kubelet_summary:ip-172-20-67-59.us-east-2.compute.internal: unable to fetch metrics from Kubelet ip-172-20-67-59.us-east-2.compute.internal (ip-172-20-67-59.us-east-2.compute.internal): request failed - "401 Unauthorized", response: "Unauthorized", unable to fully scrape metrics from source kubelet_summary:ip-172-20-89-40.us-east-2.compute.internal: unable to fetch metrics from Kubelet ip-172-20-89-40.us-east-2.compute.internal (ip-172-20-89-40.us-east-2.compute.internal): request failed - "401 Unauthorized", response: "Unauthorized", unable to fully scrape metrics from source kubelet_summary:ip-172-20-109-206.us-east-2.compute.internal: unable to fetch metrics from Kubelet ip-172-20-109-206.us-east-2.compute.internal (ip-172-20-109-206.us-east-2.compute.internal): request failed - "401 Unauthorized", response: "Unauthorized", unable to fully scrape metrics from source kubelet_summary:ip-172-20-43-140.us-east-2.compute.internal: unable to fetch metrics from Kubelet ip-172-20-43-140.us-east-2.compute.internal (ip-172-20-43-140.us-east-2.compute.internal): request failed - "401 Unauthorized", response: "Unauthorized", unable to fully scrape metrics from source kubelet_summary:ip-172-20-112-175.us-east-2.compute.internal: unable to fetch metrics from Kubelet ip-172-20-112-175.us-east-2.compute.internal (ip-172-20-112-175.us-east-2.compute.internal): request failed - "401 Unauthorized", response: "Unauthorized"]

k8s version: GitVersion:"v1.11.7"

kops version: Version 1.11.0

helm install stable/metrics-server \
    --name metrics-server \
    --version 2.0.4 \
    --set replicas=2 \
    --namespace metrics \
    --set args={"--kubelet-insecure-tls=true"} \
    --set resources.limits.cpu="100m",resources.limits.memory="50Mi"

With the exact config that the fix suggested, it produces a different error. Metrics-server logs:

unable to fully collect metrics: [unable to fully scrape metrics from source kubelet_summary:ip-172-20-112-175.us-east-2.compute.internal: unable to fetch metrics from Kubelet ip-172-20-112-175.us-east-2.compute.internal (172.20.112.175): Get https://172.20.112.175:10250/stats/summary/: x509: cannot validate certificate for 172.20.112.175 because it doesn't contain any IP SANs, unable to fully scrape metrics from source kubelet_summary:ip-172-20-67-59.us-east-2.compute.internal: unable to fetch metrics from Kubelet ip-172-20-67-59.us-east-2.compute.internal (172.20.67.59): Get https://172.20.67.59:10250/stats/summary/: x509: cannot validate certificate for 172.20.67.59 because it doesn't contain any IP SANs, unable to fully scrape metrics from source kubelet_summary:ip-172-20-43-140.us-east-2.compute.internal: unable to fetch metrics from Kubelet ip-172-20-43-140.us-east-2.compute.internal (172.20.43.140): Get https://172.20.43.140:10250/stats/summary/: x509: cannot validate certificate for 172.20.43.140 because it doesn't contain any IP SANs, unable to fully scrape metrics from source kubelet_summary:ip-172-20-109-206.us-east-2.compute.internal: unable to fetch metrics from Kubelet ip-172-20-109-206.us-east-2.compute.internal (172.20.109.206): Get https://172.20.109.206:10250/stats/summary/: x509: cannot validate certificate for 172.20.109.206 because it doesn't contain any IP SANs, unable to fully scrape metrics from source kubelet_summary:ip-172-20-89-40.us-east-2.compute.internal: unable to fetch metrics from Kubelet ip-172-20-89-40.us-east-2.compute.internal (172.20.89.40): Get https://172.20.89.40:10250/stats/summary/: x509: cannot validate certificate for 172.20.89.40 because it doesn't contain any IP SANs]

helm

helm install stable/metrics-server \
    --name metrics-server \
    --version 2.0.4 \
    --set replicas=2 \
    --namespace metrics \
    --set args={"--kubelet-insecure-tls=true,--kubelet-preferred-address-types=InternalIP\,Hostname\,ExternalIP"} \
    --set resources.limits.cpu="100m",resources.limits.memory="50Mi"

Hence my personal take is that the issue is in kops, not in k8s. When kops bumped the version from 1.10.0 to 1.11.0, something changed in the way kubelets communicate with the API server, which I could not find in the release notes either.

@prageethw commented Jan 28, 2019

(in reply to @Cryptophobia's comment above)

@Cryptophobia
Thanks for the confirmation. What kops version are you using? It seems to me the issue is in kops, not in k8s. I have attached logs above.

@itskingori (Member, Author) commented:

@prageethw I see. I'm conflating kops and kubernetes 🤦‍♂️ ... we also upgraded using kops 1.11.0 to kubernetes 1.11.x and it does seem broken. I've been able to find similar errors in my logs too.

(quoting @prageethw's take above that the issue is in kops, not k8s)

Agreed!

@Cryptophobia (Contributor) commented:

(in reply to @prageethw's question above about the kops version)

I am running kops Version 1.11.0 (git-2c2042465).

The kops version should be paired with the k8s version. For example, when you are upgrading from 1.10.x to 1.11.x you should get the newest kops version, which would be 1.11.x. Not sure if that is always the requirement, but it was a requirement before.

The first version of kops to support k8s 1.11.x is kops 1.11.0: https://github.com/kubernetes/kops/releases/tag/1.11.0

@prageethw commented Jan 28, 2019

(in reply to @Cryptophobia's comment above)

@Cryptophobia @itskingori
Thanks for the info. Yes, I understand that by default kops installs the most stable k8s version, which seems to be v1.11.6, and I can confirm the same issue exists with that k8s version too.

To give you some context, I run cluster creation and destruction in CI (with the above fix, since it was already flagged that metrics-server no longer allows self-signed certs) to catch breakage as soon as possible. That is how I found out it broke the moment I bumped kops from 1.10.0 to 1.11.0.

@hadrianvalentine approved these changes Jan 28, 2019
@Bragegs commented Feb 4, 2019

I'll just leave my experience here:

k8s version: 1.11.6
Kops version: 1.11.0

  • Installed metrics-server
  • kubectl get hpa my-hpa
NAME      REFERENCE                             TARGETS  MINPODS   MAXPODS   REPLICAS   AGE
my-hpa   Deployment/server-deployment   <unknown>/80%   1         5         2          2h

Had to enable webhook authentication:

  • kops edit cluster
  • Change following:
 kubelet:
    anonymousAuth: false
    authenticationTokenWebhook: true
    authorizationMode: Webhook

Update cluster:

  • kops update cluster my-cluster
  • kops rolling-update cluster

I then had to add this role binding:

kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: kubelet-api-admin
subjects:
- kind: User
  name: kubelet-api
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: system:kubelet-api-admin
  apiGroup: rbac.authorization.k8s.io

I then had to change metrics-server commands:

containers:
- name: metrics-server
image: k8s.gcr.io/metrics-server-amd64:v0.3.1
command:
  - /metrics-server
  - --kubelet-insecure-tls
  - --kubelet-preferred-address-types=InternalIP
  • kubectl get hpa my-hpa
NAME      REFERENCE                     TARGETS  MINPODS   MAXPODS   REPLICAS   AGE
my-hpa   Deployment/server-deployment   20%/80%   1         5         2          2h

HPA is now working as expected, but I'm still getting some metrics-server warnings:

E0204 09:01:22.062520       1 reststorage.go:144] unable to fetch pod metrics for pod namesp/server-deployment-5334b7499d-wsi26: no metrics known for pod

@itskingori (Member, Author) commented:

@prageethw While I was surprised to see the log lines you mentioned also showing up in my setup, it seems like this still works (or else I'd have had bigger problems). Not sure how to proceed.

@prageethw commented Feb 5, 2019

(in reply to @itskingori's comment above)

It works fine for me on kops 1.10.0.

@Bragegs
I tried the above on kops 1.11.0 with the exact configs you described, but I am still unable to get it working. It would be good if someone from the kops team could describe what the difference is between kops 1.10.0 and 1.11.0, and why metrics-server fails on kops 1.11.0 with --kubelet-insecure-tls while it works on kops 1.10.0.

Note: the versions above are kops versions, not k8s. I think the issue is in kops, not k8s.

kops version

Version 1.11.0

kubectl version

Client Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.3", GitCommit:"721bfa751924da8d1680787490c54b9179b1fed0", GitTreeState:"clean", BuildDate:"2019-02-04T04:48:03Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.7", GitCommit:"65ecaf0671341311ce6aea0edab46ee69f65d59e", GitTreeState:"clean", BuildDate:"2019-01-24T19:22:45Z", GoVersion:"go1.10.7", Compiler:"gc", Platform:"linux/amd64"}

metrics-server logs

 unable to fully collect metrics: [unable to fully scrape metrics from source kubelet_summary:ip-172-20-48-65.us-east-2.compute.internal: unable to fetch metrics from Kubelet ip-172-20-48-65.us-east-2.compute.internal (172.20.48.65): Get https://172.20.48.65:10250/stats/summary/: x509: cannot validate certificate for 172.20.48.65 because it doesn't contain any IP SANs, unable to fully scrape metrics from source kubelet_summary:ip-172-20-118-10.us-east-2.compute.internal: unable to fetch metrics from Kubelet ip-172-20-118-10.us-east-2.compute.internal (172.20.118.10): Get https://172.20.118.10:10250/stats/summary/: x509: cannot validate certificate for 172.20.118.10 because it doesn't contain any IP SANs, unable to fully scrape metrics from source kubelet_summary:ip-172-20-120-50.us-east-2.compute.internal: unable to fetch metrics from Kubelet ip-172-20-120-50.us-east-2.compute.internal (172.20.120.50): Get https://172.20.120.50:10250/stats/summary/: x509: cannot validate certificate for 172.20.120.50 because it doesn't contain any IP SANs, unable to fully scrape metrics from source kubelet_summary:ip-172-20-65-192.us-east-2.compute.internal: unable to fetch metrics from Kubelet ip-172-20-65-192.us-east-2.compute.internal (172.20.65.192): Get https://172.20.65.192:10250/stats/summary/: x509: cannot validate certificate for 172.20.65.192 because it doesn't contain any IP SANs, unable to fully scrape metrics from source kubelet_summary:ip-172-20-77-20.us-east-2.compute.internal: unable to fetch metrics from Kubelet ip-172-20-77-20.us-east-2.compute.internal (172.20.77.20): Get https://172.20.77.20:10250/stats/summary/: x509: cannot validate certificate for 172.20.77.20 because it doesn't contain any IP SANs]

I will need to stick with kops 1.10.0 with the helm args below until I know exactly what is required to get 1.11.0 working and what the difference is.

....
    --set args={"--kubelet-insecure-tls=true"} \
....

@bensussman commented:

I am on kops Version 1.11.0, kubectl version 1.11.6, and kubernetes version 1.11.3, and without this PR my metrics-server was erroring with:

unable to fetch metrics from Kubelet ip-*.us-west-2.compute.internal (ip-*.us-west-2.compute.internal): Get https://ip-*us-west-2.compute.internal:10250/stats/summary/: x509: certificate signed by unknown authority]

This PR fixed it for me, thanks @itskingori !

However, I would now like to promote this to production (the goal is to get Horizontal Pod Autoscaling working in production in a future-proof way, i.e. using metrics-server and not Heapster) and I am concerned that turning off TLS verification is a security risk. I don't fully understand what would be involved in providing my own signed certificates (where in kops would I provide them so that they would be distributed and served to solve this problem?) or in using webhook authentication.

Is this the preferred solution for metrics-server currently, even in production? Or is this a stopgap that is not secure enough for production use? Is there a guide or PR available for alternative solutions for kops to get metrics-server working? If not, I will go back to heapster and come back when metrics-server is fully supported.

@prageethw commented Feb 7, 2019

Update: after a few attempts, I got the webhook method specified by @Bragegs above working too.
So it seems both the webhook config and this PR are required to get it working after kops 1.11.0.
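
For reference, a minimal sketch of the cluster-spec half of that combination, based on @Bragegs's steps above (apply via kops edit cluster, then kops update cluster and kops rolling-update cluster; the field values are the ones quoted in that comment, so treat this as illustrative rather than authoritative):

# kops cluster spec excerpt (kops edit cluster)
spec:
  kubelet:
    anonymousAuth: false
    authenticationTokenWebhook: true
    authorizationMode: Webhook
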

@sudermanjr commented Feb 16, 2019

Implementing the method in the above comment #6201 (comment) seems to have worked for me as well.

@bensussman commented:

@itskingori Any chance you could answer my questions? In general, the big idea is: "is this production ready"? Can / should you use the changes in this PR (which are necessary for it to work in our cluster) in production? If not, can you point me to a guide for what to do instead?

@Cryptophobia (Contributor) commented:

@bensussman As far as I know, turning off TLS verification on the metrics-server means that it will NOT verify the TLS certificates when communicating with the kubelets on the nodes. This is not very secure, but if your subnets are configured to be private, then it shouldn't be too much of a concern. If you are hosting other people's code or containers and you cannot trust your own network, then maybe --kubelet-insecure-tls is not a good option. However, the kubelet/control plane should still be secured using the webhook auth mode option.

In the future, this PR will allow us to switch out the CA that metrics-server uses to verify the TLS certs on the kubelets: kubernetes-sigs/metrics-server#183

@itskingori (Member, Author) commented:

@bensussman Sure.

Any chance you could answer my questions?

Sure. Must have missed your comment.

Is this the preferred solution for metrics-server currently, even in production? Or is this a stopgap that is not secure enough for production use?

It's a stopgap for me, i.e. an acceptable level of technical debt. The value of having metrics-server far outweighs not having it because of the lack of TLS verification. I can afford the risk because my clusters are not multi-tenant. Sometimes you just have to take what you have now and make a note of what needs to be fixed later once the fix is available (I'm tracking all the relevant issues).

#6201 (comment) explains the situation in a little more detail.

And I didn't want to enable webhook authentication.

Is there a guide or PR available for alternative solutions for kops to get metrics-server working?

Not that I'm aware of.

If not, I will go back to heapster and come back when metrics-server is fully supported.

AFAIK, Heapster is deprecated. I chose not to invest in it. See:
https://github.com/kubernetes-retired/heapster/blob/master/docs/deprecation.md

@justinsb added this to the 1.12 milestone on Mar 14, 2019
@chrisz100 (Contributor) commented:

/lgtm
/approve

@k8s-ci-robot added the lgtm label ("Looks good to me", indicates that a PR is ready to be merged) on Mar 17, 2019
@k8s-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: chrisz100, itskingori

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files) on Mar 17, 2019
@k8s-ci-robot merged commit 728e3b5 into kubernetes:master on Mar 17, 2019
@itskingori deleted the fix_metrics_server branch on March 18, 2019 10:23
@drewwells commented:

This was merged without the required cluster_spec changes. Should those become the default? At the very least, we need documentation to support metrics-server. metrics-server is checked in as an addon but has no documentation on how to install it.

@drewwells commented:

Kubelet surprisingly has 'AlwaysAllow' authentication on, so the minimum things I had to change to enable metrics-server were:

kubectl --namespace kube-system apply -f https://raw.githubusercontent.com/kubernetes/kops/master/addons/metrics-server/v1.8.x.yaml

cluster yaml

   kubelet:
     authenticationTokenWebhook: true

@luarx commented May 6, 2019

metrics-server v0.3.2 now supports the --kubelet-certificate-authority flag: https://github.com/kubernetes-incubator/metrics-server/releases

How should I configure it so that I don't need the --kubelet-insecure-tls flag?
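
One possible shape for that, sketched under assumptions rather than tested against kops: the ConfigMap name kubelet-ca and the mount path below are hypothetical, and the CA you mount must be the one that actually signed the kubelets' serving certificates, which is not the case for the self-signed kubelet certs discussed earlier in this thread.

# illustrative metrics-server deployment excerpt for v0.3.2+ (untested sketch)
containers:
- name: metrics-server
  image: k8s.gcr.io/metrics-server-amd64:v0.3.2
  command:
  - /metrics-server
  - --kubelet-preferred-address-types=InternalIP,Hostname,ExternalIP
  # verify kubelet serving certs against this CA instead of using --kubelet-insecure-tls
  - --kubelet-certificate-authority=/etc/kubelet-ca/ca.crt
  volumeMounts:
  - name: kubelet-ca
    mountPath: /etc/kubelet-ca
    readOnly: true
volumes:
- name: kubelet-ca
  configMap:
    name: kubelet-ca   # hypothetical ConfigMap holding the kubelet CA bundle
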

Labels: approved (Indicates a PR has been approved by an approver from all required OWNERS files), area/addon-manager, cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA), lgtm ("Looks good to me", indicates that a PR is ready to be merged), size/XS (Denotes a PR that changes 0-9 lines, ignoring generated files)

Successfully merging this pull request may close these issues:
* Kubectl top nodes not working with the metrics server
* Support metrics-server 0.3