Skip to content

Commit

Permalink
Merge branch 'kubernetes:master' into fix-rc-ca-sched
Browse files Browse the repository at this point in the history
  • Loading branch information
damikag authored Aug 29, 2023
2 parents 949ded5 + c6d72f2 commit 83ef332
Show file tree
Hide file tree
Showing 237 changed files with 17,726 additions and 3,913 deletions.
Original file line number Diff line number Diff line change
@@ -0,0 +1,61 @@
# KEP-5546: Automatic reload of nanny configuration when updated

<!-- toc -->
- [Summary](#summary)
- [Goals](#goals)
- [Non-Goals](#non-goals)
- [Proposal](#proposal)
- [Notes](#notes)
- [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
- [Test Plan](#test-plan)
<!-- /toc -->

Sure, here's the enhancement proposal in the requested format:

## Summary
- **Goals:** The goal of this enhancement is to improve the user experience for applying nanny configuration changes in the addon-resizer 1.8 when used with the metrics server. The proposed solution involves automatically reloading the nanny configuration whenever changes occur, eliminating the need for manual intervention and sidecar containers.
- **Non-Goals:** This proposal does not aim to update the functional behavior of the addon-resizer.

## Proposal
The proposed solution involves updating the addon-resizer with the following steps:
- Create a file system watcher using `fsnotify` under `utils/fswatcher` to watch nanny configurations' changes. It should run as a goroutine in the background.
- Detect changes of the nanny configurations' file using the created `fswatcher` trigger the reloading process when configuration changes are detected. Events should be sent in a channel.
- Re-execute the method responsible for building the NannyConfiguration `loadNannyConfiguration` to apply the updated configuration to the addon-resizer.
- Proper error handling should be implemented to manage scenarios where the configuration file is temporarily inaccessible or if there are parsing errors in the configuration file.

### Risks and Mitigations
- There is a potential risk of filesystem-related issues causing the file watcher to malfunction. Proper testing and error handling should be implemented to handle such scenarios gracefully.
- Errors in the configuration file could lead to unexpected behavior or crashes. The addon-resizer should handle parsing errors and fall back to the previous working configuration if necessary.

## Design Details
- Create a new package for the `fswatcher` under `utils/fswatcher`. It would contain the `fswatcher` struct and methods and unit-tests.
- `FsWatcher` struct would look similar to this:
```go
type FsWatcher struct {
*fsnotify.Watcher

Events chan struct{}
ratelimit time.Duration
names []string
paths map[string]struct{}
}
```
- Implement the following functions:
- `CreateFsWatcher`: Instantiates a new `FsWatcher` and start watching on file system.
- `initWatcher`: Initializes the `fsnotify` watcher and initialize the `paths` that would be watched.
- `add`: Adds a new file to watch.
- `reset`: Re-initializes the `FsWatcher`.
- `watch`: watches for the configured files.
- In the main function, we create a new `FsWatcher` and then we wait in an infinite loop to receive events indicating
filesystem changes. Based on these changes, we re-execute `loadNannyConfiguration` function.

> **Note:** The expected configuration file format is YAML. It has the same structure as the NannyConfiguration CRD.

### Test Plan
To ensure the proper functioning of the enhanced addon-resizer, the following test plan should be executed:
1. **Unit Tests:** Write unit tests to validate the file watcher's functionality and ensure it triggers events when the configuration file changes.
2. **Manual e2e Tests:** Deploy the addon-resizer with `BaseMemory` of `300Mi` and then we change the `BaseMemory` to `100Mi`. We should observer changes in the behavior of watched pod.
[fsnotify]: https://github.com/fsnotify/fsnotify
6 changes: 6 additions & 0 deletions balancer/deploy/controller.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -20,6 +20,12 @@ rules:
- watch
- patch
- update
- apiGroups:
- balancer.x-k8s.io
resources:
- balancers/status
verbs:
- update
- apiGroups:
- ""
resources:
Expand Down
2 changes: 1 addition & 1 deletion charts/cluster-autoscaler/Chart.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -11,4 +11,4 @@ name: cluster-autoscaler
sources:
- https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler
type: application
version: 9.29.1
version: 9.29.3
1 change: 1 addition & 0 deletions charts/cluster-autoscaler/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -411,6 +411,7 @@ vpa:
| rbac.serviceAccount.name | string | `""` | The name of the ServiceAccount to use. If not set and create is `true`, a name is generated using the fullname template. |
| replicaCount | int | `1` | Desired number of pods |
| resources | object | `{}` | Pod resource requests and limits. |
| secretKeyRefNameOverride | string | `""` | Overrides the name of the Secret to use when loading the secretKeyRef for AWS and Azure env variables |
| securityContext | object | `{}` | [Security context for pod](https://kubernetes.io/docs/tasks/configure-pod-container/security-context/) |
| service.annotations | object | `{}` | Annotations to add to service |
| service.create | bool | `true` | If `true`, a Service will be created. |
Expand Down
11 changes: 9 additions & 2 deletions charts/cluster-autoscaler/templates/clusterrole.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -151,15 +151,22 @@ rules:
- cluster.x-k8s.io
resources:
- machinedeployments
- machinedeployments/scale
- machinepools
- machinepools/scale
- machines
- machinesets
verbs:
- get
- list
- update
- watch
- apiGroups:
- cluster.x-k8s.io
resources:
- machinedeployments/scale
- machinepools/scale
verbs:
- get
- patch
- update
{{- end }}
{{- end -}}
22 changes: 11 additions & 11 deletions charts/cluster-autoscaler/templates/deployment.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -80,7 +80,7 @@ spec:
- --node-group-auto-discovery=mig:namePrefix={{ .name }},min={{ .minSize }},max={{ .maxSize }}
{{- end }}
{{- end }}
{{- if eq .Values.cloudProvider "oci-oke" }}
{{- if eq .Values.cloudProvider "oci" }}
{{- if .Values.cloudConfigPath }}
- --nodes={{ .minSize }}:{{ .maxSize }}:{{ .name }}
- --balance-similar-node-groups
Expand Down Expand Up @@ -132,36 +132,36 @@ spec:
valueFrom:
secretKeyRef:
key: AwsAccessKeyId
name: {{ template "cluster-autoscaler.fullname" . }}
name: {{ default (include "cluster-autoscaler.fullname" .) .Values.secretKeyRefNameOverride }}
{{- end }}
{{- if .Values.awsSecretAccessKey }}
- name: AWS_SECRET_ACCESS_KEY
valueFrom:
secretKeyRef:
key: AwsSecretAccessKey
name: {{ template "cluster-autoscaler.fullname" . }}
name: {{ default (include "cluster-autoscaler.fullname" .) .Values.secretKeyRefNameOverride }}
{{- end }}
{{- else if eq .Values.cloudProvider "azure" }}
- name: ARM_SUBSCRIPTION_ID
valueFrom:
secretKeyRef:
key: SubscriptionID
name: {{ template "cluster-autoscaler.fullname" . }}
name: {{ default (include "cluster-autoscaler.fullname" .) .Values.secretKeyRefNameOverride }}
- name: ARM_RESOURCE_GROUP
valueFrom:
secretKeyRef:
key: ResourceGroup
name: {{ template "cluster-autoscaler.fullname" . }}
name: {{ default (include "cluster-autoscaler.fullname" .) .Values.secretKeyRefNameOverride }}
- name: ARM_VM_TYPE
valueFrom:
secretKeyRef:
key: VMType
name: {{ template "cluster-autoscaler.fullname" . }}
name: {{ default (include "cluster-autoscaler.fullname" .) .Values.secretKeyRefNameOverride }}
- name: AZURE_CLUSTER_NAME
valueFrom:
secretKeyRef:
key: ClusterName
name: {{ template "cluster-autoscaler.fullname" . }}
name: {{ default (include "cluster-autoscaler.fullname" .) .Values.secretKeyRefNameOverride }}
{{- if .Values.azureUseWorkloadIdentityExtension }}
- name: ARM_USE_WORKLOAD_IDENTITY_EXTENSION
value: "true"
Expand All @@ -173,22 +173,22 @@ spec:
valueFrom:
secretKeyRef:
key: TenantID
name: {{ template "cluster-autoscaler.fullname" . }}
name: {{ default (include "cluster-autoscaler.fullname" .) .Values.secretKeyRefNameOverride }}
- name: ARM_CLIENT_ID
valueFrom:
secretKeyRef:
key: ClientID
name: {{ template "cluster-autoscaler.fullname" . }}
name: {{ default (include "cluster-autoscaler.fullname" .) .Values.secretKeyRefNameOverride }}
- name: ARM_CLIENT_SECRET
valueFrom:
secretKeyRef:
key: ClientSecret
name: {{ template "cluster-autoscaler.fullname" . }}
name: {{ default (include "cluster-autoscaler.fullname" .) .Values.secretKeyRefNameOverride }}
- name: AZURE_NODE_RESOURCE_GROUP
valueFrom:
secretKeyRef:
key: NodeResourceGroup
name: {{ template "cluster-autoscaler.fullname" . }}
name: {{ default (include "cluster-autoscaler.fullname" .) .Values.secretKeyRefNameOverride }}
{{- end }}
{{- end }}
{{- range $key, $value := .Values.extraEnv }}
Expand Down
11 changes: 9 additions & 2 deletions charts/cluster-autoscaler/templates/role.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -49,16 +49,23 @@ rules:
- cluster.x-k8s.io
resources:
- machinedeployments
- machinedeployments/scale
- machinepools
- machinepools/scale
- machines
- machinesets
verbs:
- get
- list
- update
- watch
- apiGroups:
- cluster.x-k8s.io
resources:
- machinedeployments/scale
- machinepools/scale
verbs:
- get
- patch
- update
{{- end }}
{{- if ( not .Values.rbac.clusterScoped ) }}
- apiGroups:
Expand Down
5 changes: 4 additions & 1 deletion charts/cluster-autoscaler/values.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,7 @@ affinity: {}
additionalLabels: {}

autoDiscovery:
# cloudProviders `aws`, `gce`, `azure`, `magnum` and `clusterapi` `oci-oke` are supported by auto-discovery at this time
# cloudProviders `aws`, `gce`, `azure`, `magnum`, `clusterapi` and `oci` are supported by auto-discovery at this time
# AWS: Set tags as described in https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/aws/README.md#auto-discovery-setup

# autoDiscovery.clusterName -- Enable autodiscovery for `cloudProvider=aws`, for groups matching `autoDiscovery.tags`.
Expand Down Expand Up @@ -396,3 +396,6 @@ vpa:
updateMode: "Auto"
# vpa.containerPolicy -- [ContainerResourcePolicy](https://github.com/kubernetes/autoscaler/blob/vertical-pod-autoscaler/v0.13.0/vertical-pod-autoscaler/pkg/apis/autoscaling.k8s.io/v1/types.go#L159). The containerName is always et to the deployment's container name. This value is required if VPA is enabled.
containerPolicy: {}

# secretKeyRefNameOverride -- Overrides the name of the Secret to use when loading the secretKeyRef for AWS and Azure env variables
secretKeyRefNameOverride: ""
1 change: 1 addition & 0 deletions cluster-autoscaler/.gitignore
Original file line number Diff line number Diff line change
@@ -1,6 +1,7 @@
cluster-autoscaler
cluster-autoscaler-amd64
cluster-autoscaler-arm64
cluster-autoscaler-s390x
cluster_autoscaler
.cover

Expand Down
20 changes: 20 additions & 0 deletions cluster-autoscaler/Dockerfile.s390x
Original file line number Diff line number Diff line change
@@ -0,0 +1,20 @@
# Copyright 2016 The Kubernetes Authors. All rights reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
ARG BASEIMAGE=gcr.io/distroless/static:nonroot-s390x
FROM $BASEIMAGE
LABEL maintainer="Marcin Wielgus <[email protected]>"

COPY cluster-autoscaler-s390x /cluster-autoscaler
WORKDIR /
CMD ["/cluster-autoscaler"]
21 changes: 18 additions & 3 deletions cluster-autoscaler/FAQ.md
Original file line number Diff line number Diff line change
Expand Up @@ -29,6 +29,7 @@ this document:
* [I'm running cluster with nodes in multiple zones for HA purposes. Is that supported by Cluster Autoscaler?](#im-running-cluster-with-nodes-in-multiple-zones-for-ha-purposes-is-that-supported-by-cluster-autoscaler)
* [How can I monitor Cluster Autoscaler?](#how-can-i-monitor-cluster-autoscaler)
* [How can I increase the information that the CA is logging?](#how-can-i-increase-the-information-that-the-ca-is-logging)
* [How can I change the log format that the CA outputs?](#how-can-i-change-the-log-format-that-the-ca-outputs)
* [How can I see all the events from Cluster Autoscaler?](#how-can-i-see-all-events-from-cluster-autoscaler)
* [How can I scale my cluster to just 1 node?](#how-can-i-scale-my-cluster-to-just-1-node)
* [How can I scale a node group to 0?](#how-can-i-scale-a-node-group-to-0)
Expand Down Expand Up @@ -125,8 +126,8 @@ Since version 1.0.0 we consider CA as GA. It means that:
* We have enough confidence that it does what it is expected to do. Each commit goes through a big suite of unit tests
with more than 75% coverage (on average). We have a series of e2e tests that validate that CA works well on
[GCE](https://k8s-testgrid.appspot.com/sig-autoscaling#gce-autoscaling)
and [GKE](https://k8s-testgrid.appspot.com/sig-autoscaling#gke-autoscaling).
[GCE](https://testgrid.k8s.io/sig-autoscaling#gce-autoscaling)
and [GKE](https://testgrid.k8s.io/sig-autoscaling#gke-autoscaling).
Due to the missing testing infrastructure, AWS (or any other cloud provider) compatibility
tests are not the part of the standard development or release procedure.
However there is a number of AWS users who run CA in their production environment and submit new code, patches and bug reports.
Expand Down Expand Up @@ -862,7 +863,7 @@ Events:
```

This limitation was solved with
[volume topological scheduling](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/storage/volume-topology-scheduling.md)
[volume topological scheduling](https://github.com/kubernetes/design-proposals-archive/blob/main/storage/volume-topology-scheduling.md)
introduced as beta in Kubernetes 1.11 and planned for GA in 1.13.
To allow CA to take advantage of topological scheduling, use separate node groups per zone.
This way CA knows exactly which node group will create nodes in the required zone rather than relying on the cloud provider choosing a zone for a new node in a multi-zone node group.
Expand Down Expand Up @@ -923,6 +924,20 @@ or infrastructure endpoints, then setting a value of `--v=9` will show all the i
HTTP calls made. Be aware that using verbosity levels higher than `--v=1` will generate
an increased amount of logs, prepare your deployments and storage accordingly.

### How Can I change the log format that the CA outputs?

There are 2 log format options, `text` and `json`. By default (`text`), the Cluster Autoscaler will output
logs in the [klog native format](https://kubernetes.io/docs/concepts/cluster-administration/system-logs/#klog-output).
```
I0823 17:15:11.472183 29944 main.go:569] Cluster Autoscaler 1.28.0-beta.0
```

Alternatively, adding the flag `--logging-format=json` changes the
[log output to json](https://kubernetes.io/docs/concepts/cluster-administration/system-logs/#klog-output).
```
{"ts":1692825334994.433,"caller":"cluster-autoscaler/main.go:569","msg":"Cluster Autoscaler 1.28.0-beta.0\n","v":1}
```

### What events are emitted by CA?

Whenever Cluster Autoscaler adds or removes nodes it will create events
Expand Down
2 changes: 1 addition & 1 deletion cluster-autoscaler/Makefile
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
ALL_ARCH = amd64 arm64
ALL_ARCH = amd64 arm64 s390x
all: $(addprefix build-arch-,$(ALL_ARCH))

TAG?=dev
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -51,7 +51,7 @@ rules:
resources: ["statefulsets", "replicasets", "daemonsets"]
verbs: ["watch", "list", "get"]
- apiGroups: ["storage.k8s.io"]
resources: ["storageclasses", "csinodes"]
resources: ["storageclasses", "csinodes", "csidrivers", "csistoragecapacities"]
verbs: ["watch", "list", "get"]
- apiGroups: ["batch", "extensions"]
resources: ["jobs"]
Expand Down
2 changes: 1 addition & 1 deletion cluster-autoscaler/cloudprovider/externalgrpc/README.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
# External gRPC Cloud Provider

The Exteral gRPC Cloud Provider provides a plugin system to support out-of-tree cloud provider implementations.
The External gRPC Cloud Provider provides a plugin system to support out-of-tree cloud provider implementations.

Cluster Autoscaler adds or removes nodes from the cluster by creating or deleting VMs. To separate the autoscaling logic (the same for all clouds) from the API calls required to execute it (different for each cloud), the latter are hidden behind an interface, `CloudProvider`. Each supported cloud has its own implementation in this repository and `--cloud-provider` flag determines which one will be used.

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -87,6 +87,7 @@ type AutoscalingGceClient interface {
FetchZones(region string) ([]string, error)
FetchAvailableCpuPlatforms() (map[string][]string, error)
FetchReservations() ([]*gce.Reservation, error)
FetchReservationsInProject(projectId string) ([]*gce.Reservation, error)

// modifying resources
ResizeMig(GceRef, int64) error
Expand Down Expand Up @@ -550,8 +551,12 @@ func (client *autoscalingGceClientV1) FetchMigsWithName(zone string, name *regex
}

func (client *autoscalingGceClientV1) FetchReservations() ([]*gce.Reservation, error) {
return client.FetchReservationsInProject(client.projectId)
}

func (client *autoscalingGceClientV1) FetchReservationsInProject(projectId string) ([]*gce.Reservation, error) {
reservations := make([]*gce.Reservation, 0)
call := client.gceService.Reservations.AggregatedList(client.projectId)
call := client.gceService.Reservations.AggregatedList(projectId)
err := call.Pages(context.TODO(), func(ls *gce.ReservationAggregatedList) error {
for _, items := range ls.Items {
reservations = append(reservations, items.Reservations...)
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -105,6 +105,10 @@ func (client *mockAutoscalingGceClient) FetchReservations() ([]*gce.Reservation,
return nil, nil
}

func (client *mockAutoscalingGceClient) FetchReservationsInProject(_ string) ([]*gce.Reservation, error) {
return nil, nil
}

func (client *mockAutoscalingGceClient) ResizeMig(_ GceRef, _ int64) error {
return nil
}
Expand Down
Loading

0 comments on commit 83ef332

Please sign in to comment.