
feat: upgrade metrics server to v0.3.4 #1109

Merged
merged 5 commits into from
Sep 16, 2019

Conversation

andyzhangx
Contributor

@andyzhangx andyzhangx commented Apr 22, 2019

Reason for Change:

This PR upgrades the metrics server from v0.2.1 to v0.3.4.

BTW, we cannot disable --read-only-port yet, since AKS also depends on this port and the metrics server on AKS is still v0.2.x.

The metrics server (v0.2.1) still uses KubeletReadOnlyPort (10255), so we cannot switch to KubeletPort (10250) yet. After the upgrade to metrics server v0.3.4, it uses port 10250 instead, and then we can move on to merging the PR "fix: disable ReadOnlyPort".
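For context, here is a rough sketch of the metrics-server container args before and after the upgrade; the v0.2.x flag is the one removed in this PR's diff, while the v0.3.x flags are taken from upstream metrics-server, so the exact args in the aks-engine manifests may differ:

# metrics-server v0.2.x: scrapes the kubelet read-only port (10255) by default
command:
- /metrics-server
- --source=kubernetes.summary_api:''

# metrics-server v0.3.x: talks to the kubelet on port 10250 instead
command:
- /metrics-server
- --kubelet-port=10250
- --kubelet-preferred-address-types=InternalIP
- --kubelet-insecure-tls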

This PR only applies to k8s v1.16.0:

  • upgrade metrics server to v0.3.4

Tested on k8s v1.16.0-beta.1; test result:

$ kubectl top no
NAME                        CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
6133k8s010                  56m          2%     445Mi           8%
k8s-linuxpool1-61339693-0   54m          2%     692Mi           11%
k8s-master-61339693-0       92m          4%     1065Mi          17%

Issue Fixed:

Fixes #1172

Requirements:

Notes:

cc @feiskyer
/azp run pr-e2e

@codecov

codecov bot commented Apr 22, 2019

Codecov Report

Merging #1109 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #1109      +/-   ##
==========================================
+ Coverage   76.67%   76.67%   +<.01%     
==========================================
  Files         135      135              
  Lines       20536    20544       +8     
==========================================
+ Hits        15745    15753       +8     
  Misses       3873     3873              
  Partials      918      918

@andyzhangx andyzhangx changed the title chore(CIS): upgrade metrics server to v0.3.0 [WIP] chore(CIS): upgrade metrics server to v0.3.0 Apr 22, 2019
@andyzhangx
Contributor Author

/hold
Need to adjust the metrics-server parameters to make it fully work.

@zhiweiv
Contributor

zhiweiv commented Apr 30, 2019

Metrics server 0.3.x doesn’t work with Windows nodes yet,
kubernetes/kubernetes#75934
kubernetes/kubernetes#76740

@andyzhangx
Contributor Author

Metrics server 0.3.x doesn’t work with Windows nodes yet,
kubernetes/kubernetes#75934
kubernetes/kubernetes#76740

Thanks for the reminder. I think we can upgrade to metrics server 0.3.0 on Linux nodes first.

@jackfrancis
Member

I would suggest we introduce this as a 1.15 feature and not backport it to pre-existing versions (and thus, pre-existing clusters built on pre-1.15).

If we do want to introduce this to older versions, we'll need thorough upgrade tests.

@@ -126,7 +126,6 @@ spec:
imagePullPolicy: IfNotPresent
command:
- /metrics-server
- --source=kubernetes.summary_api:''
Contributor Author


@jackfrancis how do we handle a command parameter change in a kubernetesmasteraddons_xxx.yaml file in aks-engine?

I am planning to upgrade metrics-server from v0.2.0 to v0.3.0 in k8s v1.15.0, but that requires a command parameter change; do you know the correct way to do this? One way could be to move the original kubernetesmasteraddons-metrics-server-deployment.yaml file into new version folders like 1.14, 1.13, 1.12, ... , but it looks like I would need to create a lot of such folders. Is that the correct way?

Member


Correct, the current way to define k8s version-specific changes is to create a new directory named with the major.minor version of the first version that these changes work for. If there is only a 1.15 directory, for example (and not a 1.16 directory), then the logic assumes the spec is valid for 1.15 and above. So every time there is a forward-looking version-breaking change, we create a new directory, named after the version that introduces the change, to store those specs.

Hope that makes sense!
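For illustration, a hypothetical layout along those lines (paths are illustrative rather than the exact aks-engine ones):

parts/k8s/kubernetesmasteraddons-metrics-server-deployment.yaml        # applies to versions before 1.15 (v0.2.x args)
parts/k8s/1.15/kubernetesmasteraddons-metrics-server-deployment.yaml   # applies to 1.15 and above (v0.3.x args)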

@andyzhangx andyzhangx changed the title [WIP] chore(CIS): upgrade metrics server to v0.3.0 [WIP] chore(CIS): upgrade metrics server to v0.3.1 May 30, 2019
@andyzhangx andyzhangx force-pushed the upgrade-metrics-server branch from 44a84df to cd703c7 Compare May 31, 2019 04:49
@acs-bot acs-bot added size/XXL and removed size/XS labels May 31, 2019
@andyzhangx andyzhangx changed the title [WIP] chore(CIS): upgrade metrics server to v0.3.1 chore(CIS): upgrade metrics server to v0.3.1 May 31, 2019
@andyzhangx andyzhangx changed the title chore(CIS): upgrade metrics server to v0.3.1 chore(CIS): upgrade metrics server to v0.3.1 and disable ReadOnlyPort May 31, 2019
@andyzhangx
Contributor Author

Finally I have time to complete this PR. It only applies to k8s v1.15.0:

  • upgrade metrics server to v0.3.1
  • disable ReadOnlyPort in kubelet (see the sketch below)
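A minimal sketch of the kubelet side of that change, assuming the flag is set directly (in aks-engine it is wired through the generated kubelet configuration rather than edited by hand):

# kubelet flag: 0 disables the read-only port (10255)
--read-only-port=0

# or the equivalent KubeletConfiguration field
readOnlyPort: 0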

@andyzhangx
Contributor Author

andyzhangx commented May 31, 2019

/hold
Let's hold for now, since metrics server 0.3.x cannot get CPU metrics due to issue kubernetes/kubernetes#75934:

$ kubectl top nodes
NAME                    CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
8538k8s000              0m           0%     399Mi           7%
8538k8s001              0m           0%     421Mi           8%
k8s-master-85384734-0   112m         5%     1243Mi          20%

manager.go:102] unable to fully collect metrics: [unable to fully scrape metrics from source kubelet_summary:8538k8s001: unable to get CPU for node "8538k8s001", discarding data: missing cpu usage metric, unable to fully scrape metrics from source kubelet_summary:8538k8s000: unable to get CPU for node "8538k8s000", discarding data: missing cpu usage metric

@mboersma mboersma added the needs-rebase Changes in the target branch require a `git rebase` and `git push -f` label Jul 11, 2019
@stale

stale bot commented Jul 11, 2019

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@PatrickLang
Contributor

Windows 1.16 tests failed the first attempt for unrelated reasons. Restarted it.
"Allocation failed. We do not have sufficient capacity for the requested VM size in this region. Read more about improving likelihood of allocation success at http://aka.ms/allocation-guidance"

@jackfrancis
Member

Re-kicked the tests, though the HPA failure in the Linux test is almost certainly related to metrics-server non-functionality.

@jackfrancis
Member

Peeking at the E2E tests in progress: it looks like the failure is that the HPA doesn't scale pods back down.

@PatrickLang
Contributor

Autoscaler tests are failing on Linux:

    Expected error:
        <*errors.fundamental | 0xc0003fc060>: {
            msg: "Timeout exceeded (20m0s) while waiting for minimum -1 and maximum 1 Pod replicas from Deployment php-apache-long-running",
            stack: [0x8e64a9, 0x45e861],
        }
        Timeout exceeded (20m0s) while waiting for minimum -1 and maximum 1 Pod replicas from Deployment php-apache-long-running
    not to have occurred

    /__w/1/s/gopath/src/github.com/Azure/aks-engine/test/e2e/kubernetes/kubernetes_test.go:1128
------------------------------
SSSSSSSSSSSS

Summarizing 1 Failure:

[Fail] Azure Container Cluster using the Kubernetes Orchestrator with a linux agent pool [It] should be able to autoscale
/__w/1/s/gopath/src/github.com/Azure/aks-engine/test/e2e/kubernetes/kubernetes_test.go:1128

Ran 32 of 47 Specs in 1995.981 seconds
FAIL! -- 31 Passed | 1 Failed | 0 Pending | 15 Skipped
--- FAIL: TestKubernetes (1995.98s)
FAIL

@PatrickLang
Contributor

Windows nodes are autoscaling successfully:

$ kubectl describe hpa iis-2019
Name:                                                  iis-2019
Namespace:                                             default
Labels:                                                <none>
Annotations:                                           <none>
CreationTimestamp:                                     Wed, 11 Sep 2019 22:41:30 +0000
Reference:                                             Deployment/iis-2019
Metrics:                                               ( current / target )
  resource cpu on pods  (as a percentage of request):  0% (0) / 10%
Min replicas:                                          2
Max replicas:                                          4
Deployment pods:                                       2 current / 2 desired
Conditions:
  Type            Status  Reason               Message
  ----            ------  ------               -------
  AbleToScale     True    ScaleDownStabilized  recent recommendations were higher than current one, applying the highest recent recommendation
  ScalingActive   True    ValidMetricFound     the HPA was able to successfully calculate a replica count from cpu resource utilization (percentage of request)
  ScalingLimited  False   DesiredWithinRange   the desired count is within the acceptable range
Events:
  Type     Reason                        Age                 From                       Message
  ----     ------                        ----                ----                       -------
  Normal   SuccessfulRescale             43m                 horizontal-pod-autoscaler  New size: 2; reason: Current number of replicas below Spec.MinReplicas
  Warning  FailedGetResourceMetric       40m (x12 over 43m)  horizontal-pod-autoscaler  unable to get metrics for resource cpu: no metrics returned from resource metrics API
  Warning  FailedComputeMetricsReplicas  40m (x12 over 43m)  horizontal-pod-autoscaler  invalid metrics (1 invalid out of 1), first error is: failed to get cpu utilization: unable to get metrics for resource cpu: no metrics returned from resource metrics API
  Normal   SuccessfulRescale             10m (x2 over 37m)   horizontal-pod-autoscaler  New size: 4; reason: cpu resource utilization (percentage of request) above target
  Normal   SuccessfulRescale             25s (x2 over 27m)   horizontal-pod-autoscaler  New size: 2; reason: All metrics below target

Linux is failing in my manual tests:

$ kubectl describe hpa php-apache
Name:                                                  php-apache
Namespace:                                             default
Labels:                                                <none>
Annotations:                                           <none>
CreationTimestamp:                                     Wed, 11 Sep 2019 22:59:53 +0000
Reference:                                             Deployment/php-apache
Metrics:                                               ( current / target )
  resource cpu on pods  (as a percentage of request):  <unknown> / 30%
Min replicas:                                          2
Max replicas:                                          4
Deployment pods:                                       2 current / 2 desired
Conditions:
  Type           Status  Reason                   Message
  ----           ------  ------                   -------
  AbleToScale    True    SucceededGetScale        the HPA controller was able to get the target's current scale
  ScalingActive  False   FailedGetResourceMetric  the HPA was unable to compute the replica count: missing request for cpu
Events:
  Type     Reason                        Age                 From                       Message
  ----     ------                        ----                ----                       -------
  Normal   SuccessfulRescale             25m                 horizontal-pod-autoscaler  New size: 2; reason: Current number of replicas below Spec.MinReplicas
  Warning  FailedGetResourceMetric       23m (x11 over 25m)  horizontal-pod-autoscaler  unable to get metrics for resource cpu: no metrics returned from resource metrics API
  Warning  FailedComputeMetricsReplicas  23m (x11 over 25m)  horizontal-pod-autoscaler  invalid metrics (1 invalid out of 1), first error is: failed to get cpu utilization: unable to get metrics for resource cpu: no metrics returned from resource metrics API
  Warning  FailedComputeMetricsReplicas  22m                 horizontal-pod-autoscaler  invalid metrics (1 invalid out of 1), first error is: failed to get cpu utilization: missing request for cpu
  Warning  FailedGetResourceMetric       38s (x89 over 22m)  horizontal-pod-autoscaler  missing request for cpu

@PatrickLang
Contributor

PatrickLang commented Sep 11, 2019

Actually, I had a mistake in my Linux deployment: I didn't have a CPU request set, which needs to be there for CPU metrics to be served, per kubernetes/kubernetes#79365 (comment).
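For reference, the missing piece is just a CPU request on the deployment's container, along these lines (image and values are illustrative, borrowed from the standard HPA walkthrough rather than the exact test manifest):

    spec:
      containers:
      - name: php-apache
        image: k8s.gcr.io/hpa-example
        resources:
          requests:
            cpu: 200m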

After that, it scales up and back down based on CPU:

$ kubectl describe hpa php-apache
Name:                                                  php-apache
Namespace:                                             default
Labels:                                                <none>
Annotations:                                           <none>
CreationTimestamp:                                     Wed, 11 Sep 2019 22:59:53 +0000
Reference:                                             Deployment/php-apache
Metrics:                                               ( current / target )
  resource cpu on pods  (as a percentage of request):  1000% (1) / 30%
Min replicas:                                          2
Max replicas:                                          4
Deployment pods:                                       4 current / 4 desired
Conditions:
  Type            Status  Reason            Message
  ----            ------  ------            -------
  AbleToScale     True    ReadyForNewScale  recommended size matches current size
  ScalingActive   True    ValidMetricFound  the HPA was able to successfully calculate a replica count from cpu resource utilization (percentage of request)
  ScalingLimited  True    TooManyReplicas   the desired replica count is more than the maximum replica count
Events:
  Type     Reason                        Age                  From                       Message
  ----     ------                        ----                 ----                       -------
  Normal   SuccessfulRescale             59m                  horizontal-pod-autoscaler  New size: 2; reason: Current number of replicas below Spec.MinReplicas
  Warning  FailedComputeMetricsReplicas  56m (x11 over 59m)   horizontal-pod-autoscaler  invalid metrics (1 invalid out of 1), first error is: failed to get cpu utilization: unable to get metrics for resource cpu: no metrics returned from resource metrics API
  Warning  FailedComputeMetricsReplicas  56m                  horizontal-pod-autoscaler  invalid metrics (1 invalid out of 1), first error is: failed to get cpu utilization: missing request for cpu
  Warning  FailedGetResourceMetric       29m (x109 over 56m)  horizontal-pod-autoscaler  missing request for cpu
  Warning  FailedGetResourceMetric       24m (x12 over 59m)   horizontal-pod-autoscaler  unable to get metrics for resource cpu: no metrics returned from resource metrics API
  Normal   SuccessfulRescale             16m                  horizontal-pod-autoscaler  New size: 2; reason: All metrics below target
  Normal   SuccessfulRescale             18s (x2 over 23m)    horizontal-pod-autoscaler  New size: 4; reason: cpu resource utilization (percentage of request) above target

Seems like a test bug

@PatrickLang
Contributor

At this point I need to stop work on this. Can someone else pick it up? I have higher-priority work waiting on me, and the other maintainers should be able to handle it from here. I see no blockers from a Windows standpoint, and my manual tests look good.

@jackfrancis jackfrancis changed the title feat: upgrade metrics server to v0.3.1 feat: upgrade metrics server to v0.3.4 Sep 12, 2019
@jackfrancis jackfrancis force-pushed the upgrade-metrics-server branch from 6854051 to 43a2450 Compare September 13, 2019 16:17
Member

@jackfrancis jackfrancis left a comment


/lgtm

@acs-bot

acs-bot commented Sep 16, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andyzhangx, jackfrancis

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@PatrickLang
Contributor

/hold cancel
Windows updates are in; no hold needed.


Successfully merging this pull request may close these issues.

Upgrade metrics-server to v0.3.x
7 participants