
feat: upgrade metrics server to v0.3.4 #1109

Merged
merged 5 commits into from
Sep 16, 2019

Conversation

andyzhangx
Contributor

@andyzhangx andyzhangx commented Apr 22, 2019

Reason for Change:

This PR upgrades the metrics server from v0.2.1 to v0.3.4.

BTW, we cannot disable --read-only-port yet, since AKS also depends on this port and the metrics server on AKS is still v0.2.x.

The metrics server (v0.2.1) still uses KubeletReadOnlyPort (10255), so we cannot switch to KubeletPort (10250) yet. After the upgrade to metrics server v0.3.4, it uses port 10250 instead, and then we can move on to merging the PR "fix: disable ReadOnlyPort".
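For context, here is a rough sketch of the metrics-server container args before and after the upgrade; the v0.2.x flag is the one removed in this PR's diff, while the v0.3.x flags are taken from upstream metrics-server, so the exact args in the aks-engine manifests may differ:

# metrics-server v0.2.x: scrapes the kubelet read-only port (10255) by default
command:
- /metrics-server
- --source=kubernetes.summary_api:''

# metrics-server v0.3.x: talks to the kubelet on port 10250 instead
command:
- /metrics-server
- --kubelet-port=10250
- --kubelet-preferred-address-types=InternalIP
- --kubelet-insecure-tls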

This PR only applies to k8s v1.16.0:

  • upgrade metrics server to v0.3.4

Tested on k8s v1.16.0-beta.1; test result:

$ kubectl top no
NAME                        CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
6133k8s010                  56m          2%     445Mi           8%
k8s-linuxpool1-61339693-0   54m          2%     692Mi           11%
k8s-master-61339693-0       92m          4%     1065Mi          17%

Issue Fixed:

Fixes #1172

Requirements:

Notes:

cc @feiskyer
/azp run pr-e2e

@codecov

codecov bot commented Apr 22, 2019

Codecov Report

Merging #1109 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #1109      +/-   ##
==========================================
+ Coverage   76.67%   76.67%   +<.01%     
==========================================
  Files         135      135              
  Lines       20536    20544       +8     
==========================================
+ Hits        15745    15753       +8     
  Misses       3873     3873              
  Partials      918      918

@andyzhangx andyzhangx changed the title chore(CIS): upgrade metrics server to v0.3.0 [WIP] chore(CIS): upgrade metrics server to v0.3.0 Apr 22, 2019
@andyzhangx
Contributor Author

/hold
Need to adjust the metrics-server parameters to make it fully work.

@zhiweiv
Contributor

zhiweiv commented Apr 30, 2019

Metrics server 0.3.x doesn’t work with Windows nodes yet,
kubernetes/kubernetes#75934
kubernetes/kubernetes#76740

@andyzhangx
Contributor Author

Metrics server 0.3.x doesn’t work with Windows nodes yet,
kubernetes/kubernetes#75934
kubernetes/kubernetes#76740

Thanks for the reminder. I think we can upgrade to metrics server 0.3.0 on Linux nodes first.

@jackfrancis
Member

I would suggest we introduce this as a 1.15 feature and not backport it to pre-existing versions (and thus, pre-existing clusters built on pre-1.15).

If we do want to introduce this to older versions, we'll need thorough upgrade tests.

@@ -126,7 +126,6 @@ spec:
imagePullPolicy: IfNotPresent
command:
- /metrics-server
- --source=kubernetes.summary_api:''
Contributor Author


@jackfrancis how do we handle a command parameter change in a kubernetesmasteraddons_xxx.yaml file in aks-engine?

I am planning to upgrade metrics-server from v0.2.0 to v0.3.0 in k8s v1.15.0, but that requires a command parameter change; do you know the correct way to do this? One way could be to move the original kubernetesmasteraddons-metrics-server-deployment.yaml file into new version folders like 1.14, 1.13, 1.12, ... , but it looks like I would need to create a lot of such folders. Is that the correct way?

Member


Correct, the current way to define k8s version-specific changes is to create a new directory named with the major.minor version of the first version that these changes work for. If there is only a 1.15 directory, for example (and not a 1.16 directory), then the logic assumes the spec is valid for 1.15 and above. So every time there is a forward-looking version-breaking change, we create a new directory, named after the version that introduces the change, to store those specs.

Hope that makes sense!
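For illustration, a hypothetical layout along those lines (paths are illustrative rather than the exact aks-engine ones):

parts/k8s/kubernetesmasteraddons-metrics-server-deployment.yaml        # applies to versions before 1.15 (v0.2.x args)
parts/k8s/1.15/kubernetesmasteraddons-metrics-server-deployment.yaml   # applies to 1.15 and above (v0.3.x args)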

@andyzhangx andyzhangx changed the title [WIP] chore(CIS): upgrade metrics server to v0.3.0 [WIP] chore(CIS): upgrade metrics server to v0.3.1 May 30, 2019
@andyzhangx andyzhangx force-pushed the upgrade-metrics-server branch from 44a84df to cd703c7 Compare May 31, 2019 04:49
@acs-bot acs-bot added size/XXL and removed size/XS labels May 31, 2019
@andyzhangx andyzhangx changed the title [WIP] chore(CIS): upgrade metrics server to v0.3.1 chore(CIS): upgrade metrics server to v0.3.1 May 31, 2019
@andyzhangx andyzhangx changed the title chore(CIS): upgrade metrics server to v0.3.1 chore(CIS): upgrade metrics server to v0.3.1 and disable ReadOnlyPort May 31, 2019
@andyzhangx
Contributor Author

Finally I have time to complete this PR. It only applies to k8s v1.15.0:

  • upgrade metrics server to v0.3.1
  • disable ReadOnlyPort in kubelet (see the sketch below)
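A minimal sketch of the kubelet side of that change, assuming the flag is set directly (in aks-engine it is wired through the generated kubelet configuration rather than edited by hand):

# kubelet flag: 0 disables the read-only port (10255)
--read-only-port=0

# or the equivalent KubeletConfiguration field
readOnlyPort: 0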

@andyzhangx
Contributor Author

andyzhangx commented May 31, 2019

/hold
Let's hold for now, since metrics server 0.3.x cannot get CPU metrics due to issue kubernetes/kubernetes#75934:

$ kubectl top nodes
NAME                    CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
8538k8s000              0m           0%     399Mi           7%
8538k8s001              0m           0%     421Mi           8%
k8s-master-85384734-0   112m         5%     1243Mi          20%

manager.go:102] unable to fully collect metrics: [unable to fully scrape metrics from source kubelet_summary:8538k8s001: unable to get CPU for node "8538k8s001", discarding data: missing cpu usage metric, unable to fully scrape metrics from source kubelet_summary:8538k8s000: unable to get CPU for node "8538k8s000", discarding data: missing cpu usage metric

@mboersma mboersma added the needs-rebase Changes in the target branch require a `git rebase` and `git push -f` label Jul 11, 2019
@stale

stale bot commented Jul 11, 2019

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@PatrickLang
Contributor

Windows 1.16 tests failed the first attempt for unrelated reasons. Restarted it.
"Allocation failed. We do not have sufficient capacity for the requested VM size in this region. Read more about improving likelihood of allocation success at http://aka.ms/allocation-guidance"

@jackfrancis
Member

Re-kicked the tests, though the HPA failure in the Linux test is almost certainly related to metrics-server non-functionality.

@jackfrancis
Member

Peeking at the E2E tests in progress: it looks like the failure is that the HPA doesn't scale pods back down.

@PatrickLang
Contributor

Autoscaler tests are failing on Linux:

    Expected error:
        <*errors.fundamental | 0xc0003fc060>: {
            msg: "Timeout exceeded (20m0s) while waiting for minimum -1 and maximum 1 Pod replicas from Deployment php-apache-long-running",
            stack: [0x8e64a9, 0x45e861],
        }
        Timeout exceeded (20m0s) while waiting for minimum -1 and maximum 1 Pod replicas from Deployment php-apache-long-running
    not to have occurred

    /__w/1/s/gopath/src/github.com/Azure/aks-engine/test/e2e/kubernetes/kubernetes_test.go:1128
------------------------------
SSSSSSSSSSSS

Summarizing 1 Failure:

[Fail] Azure Container Cluster using the Kubernetes Orchestrator with a linux agent pool [It] should be able to autoscale
/__w/1/s/gopath/src/github.com/Azure/aks-engine/test/e2e/kubernetes/kubernetes_test.go:1128

Ran 32 of 47 Specs in 1995.981 seconds
FAIL! -- 31 Passed | 1 Failed | 0 Pending | 15 Skipped
--- FAIL: TestKubernetes (1995.98s)
FAIL

@PatrickLang
Contributor

Windows nodes are autoscaling successfully:

$ kubectl describe hpa iis-2019
Name:                                                  iis-2019
Namespace:                                             default
Labels:                                                <none>
Annotations:                                           <none>
CreationTimestamp:                                     Wed, 11 Sep 2019 22:41:30 +0000
Reference:                                             Deployment/iis-2019
Metrics:                                               ( current / target )
  resource cpu on pods  (as a percentage of request):  0% (0) / 10%
Min replicas:                                          2
Max replicas:                                          4
Deployment pods:                                       2 current / 2 desired
Conditions:
  Type            Status  Reason               Message
  ----            ------  ------               -------
  AbleToScale     True    ScaleDownStabilized  recent recommendations were higher than current one, applying the highest recent recommendation
  ScalingActive   True    ValidMetricFound     the HPA was able to successfully calculate a replica count from cpu resource utilization (percentage of request)
  ScalingLimited  False   DesiredWithinRange   the desired count is within the acceptable range
Events:
  Type     Reason                        Age                 From                       Message
  ----     ------                        ----                ----                       -------
  Normal   SuccessfulRescale             43m                 horizontal-pod-autoscaler  New size: 2; reason: Current number of replicas below Spec.MinReplicas
  Warning  FailedGetResourceMetric       40m (x12 over 43m)  horizontal-pod-autoscaler  unable to get metrics for resource cpu: no metrics returned from resource metrics API
  Warning  FailedComputeMetricsReplicas  40m (x12 over 43m)  horizontal-pod-autoscaler  invalid metrics (1 invalid out of 1), first error is: failed to get cpu utilization: unable to get metrics for resource cpu: no metrics returned from resource metrics API
  Normal   SuccessfulRescale             10m (x2 over 37m)   horizontal-pod-autoscaler  New size: 4; reason: cpu resource utilization (percentage of request) above target
  Normal   SuccessfulRescale             25s (x2 over 27m)   horizontal-pod-autoscaler  New size: 2; reason: All metrics below target

Linux is failing in my manual tests:

$ kubectl describe hpa php-apache
Name:                                                  php-apache
Namespace:                                             default
Labels:                                                <none>
Annotations:                                           <none>
CreationTimestamp:                                     Wed, 11 Sep 2019 22:59:53 +0000
Reference:                                             Deployment/php-apache
Metrics:                                               ( current / target )
  resource cpu on pods  (as a percentage of request):  <unknown> / 30%
Min replicas:                                          2
Max replicas:                                          4
Deployment pods:                                       2 current / 2 desired
Conditions:
  Type           Status  Reason                   Message
  ----           ------  ------                   -------
  AbleToScale    True    SucceededGetScale        the HPA controller was able to get the target's current scale
  ScalingActive  False   FailedGetResourceMetric  the HPA was unable to compute the replica count: missing request for cpu
Events:
  Type     Reason                        Age                 From                       Message
  ----     ------                        ----                ----                       -------
  Normal   SuccessfulRescale             25m                 horizontal-pod-autoscaler  New size: 2; reason: Current number of replicas below Spec.MinReplicas
  Warning  FailedGetResourceMetric       23m (x11 over 25m)  horizontal-pod-autoscaler  unable to get metrics for resource cpu: no metrics returned from resource metrics API
  Warning  FailedComputeMetricsReplicas  23m (x11 over 25m)  horizontal-pod-autoscaler  invalid metrics (1 invalid out of 1), first error is: failed to get cpu utilization: unable to get metrics for resource cpu: no metrics returned from resource metrics API
  Warning  FailedComputeMetricsReplicas  22m                 horizontal-pod-autoscaler  invalid metrics (1 invalid out of 1), first error is: failed to get cpu utilization: missing request for cpu
  Warning  FailedGetResourceMetric       38s (x89 over 22m)  horizontal-pod-autoscaler  missing request for cpu

@PatrickLang
Contributor

PatrickLang commented Sep 11, 2019

Actually, I had a mistake in my Linux deployment: I didn't have a CPU request set, which needs to be there for CPU metrics to be served, per kubernetes/kubernetes#79365 (comment).
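For reference, the missing piece is just a CPU request on the deployment's container, along these lines (image and values are illustrative, borrowed from the standard HPA walkthrough rather than the exact test manifest):

    spec:
      containers:
      - name: php-apache
        image: k8s.gcr.io/hpa-example
        resources:
          requests:
            cpu: 200m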

After that, it scales up and back down based on CPU:

$ kubectl describe hpa php-apache
Name:                                                  php-apache
Namespace:                                             default
Labels:                                                <none>
Annotations:                                           <none>
CreationTimestamp:                                     Wed, 11 Sep 2019 22:59:53 +0000
Reference:                                             Deployment/php-apache
Metrics:                                               ( current / target )
  resource cpu on pods  (as a percentage of request):  1000% (1) / 30%
Min replicas:                                          2
Max replicas:                                          4
Deployment pods:                                       4 current / 4 desired
Conditions:
  Type            Status  Reason            Message
  ----            ------  ------            -------
  AbleToScale     True    ReadyForNewScale  recommended size matches current size
  ScalingActive   True    ValidMetricFound  the HPA was able to successfully calculate a replica count from cpu resource utilization (percentage of request)
  ScalingLimited  True    TooManyReplicas   the desired replica count is more than the maximum replica count
Events:
  Type     Reason                        Age                  From                       Message
  ----     ------                        ----                 ----                       -------
  Normal   SuccessfulRescale             59m                  horizontal-pod-autoscaler  New size: 2; reason: Current number of replicas below Spec.MinReplicas
  Warning  FailedComputeMetricsReplicas  56m (x11 over 59m)   horizontal-pod-autoscaler  invalid metrics (1 invalid out of 1), first error is: failed to get cpu utilization: unable to get metrics for resource cpu: no metrics returned from resource metrics API
  Warning  FailedComputeMetricsReplicas  56m                  horizontal-pod-autoscaler  invalid metrics (1 invalid out of 1), first error is: failed to get cpu utilization: missing request for cpu
  Warning  FailedGetResourceMetric       29m (x109 over 56m)  horizontal-pod-autoscaler  missing request for cpu
  Warning  FailedGetResourceMetric       24m (x12 over 59m)   horizontal-pod-autoscaler  unable to get metrics for resource cpu: no metrics returned from resource metrics API
  Normal   SuccessfulRescale             16m                  horizontal-pod-autoscaler  New size: 2; reason: All metrics below target
  Normal   SuccessfulRescale             18s (x2 over 23m)    horizontal-pod-autoscaler  New size: 4; reason: cpu resource utilization (percentage of request) above target

Seems like a test bug

@PatrickLang
Contributor

At this point I need to stop work on this. Can someone else pick it up? I have higher-priority work waiting on me, and the other maintainers should be able to handle it from here. I see no blockers from a Windows standpoint, and my manual tests look good.

@jackfrancis jackfrancis changed the title feat: upgrade metrics server to v0.3.1 feat: upgrade metrics server to v0.3.4 Sep 12, 2019
@jackfrancis jackfrancis force-pushed the upgrade-metrics-server branch from 6854051 to 43a2450 Compare September 13, 2019 16:17
Member

@jackfrancis jackfrancis left a comment


/lgtm

@acs-bot

acs-bot commented Sep 16, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andyzhangx, jackfrancis

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@PatrickLang
Contributor

/hold cancel
Windows updates are in; no hold needed.


Successfully merging this pull request may close these issues.

Upgrade metrics-server to v0.3.x
7 participants