Skip to content

Commit

Permalink
feat: implement kwok cloudprovider
Browse files Browse the repository at this point in the history
feat: wip implement `CloudProvider` interface boilerplate for `kwok` provider
Signed-off-by: vadasambar <[email protected]>

feat: add builder for `kwok`
- add logic to scale up and scale down nodes in `kwok` provider
Signed-off-by: vadasambar <[email protected]>

feat: wip parse node templates from file
Signed-off-by: vadasambar <[email protected]>

docs: add short README
Signed-off-by: vadasambar <[email protected]>

feat: implement remaining things
- to get the provider in a somewhat working state
Signed-off-by: vadasambar <[email protected]>

docs: add in-cluster `kwok` as pre-requisite in the README
Signed-off-by: vadasambar <[email protected]>

fix: templates file not correctly marshalling into node list
Signed-off-by: vadasambar <[email protected]>

fix: `invalid leading UTF-8 octet` error during template parsing
- remove encoding using `gob`
- not required
Signed-off-by: vadasambar <[email protected]>

fix: use lister to get and list
- instead of uncached kube client
- add lister as a field on the provider and nodegroup struct
Signed-off-by: vadasambar <[email protected]>

fix: `did not find nodegroup annotation` error
- CA was thinking the annotation is not present even though it is
- fix a bug with parsing annotation
Signed-off-by: vadasambar <[email protected]>

fix: CA node recognizing fake nodegroups
- add provider ID to nodes in the format `kwok:<node-name>`
- fix invalid `KwokManagedAnnotation`
- sanitize template nodes (remove `resourceVersion` etc.,)
- not sanitizing the node leads to error during creation of new nodes
- abstract code to get NG name into a separate function `getNGNameFromAnnotation`
Signed-off-by: vadasambar <[email protected]>

fix: node not getting deleted
Signed-off-by: vadasambar <[email protected]>

test: add empty test file
Signed-off-by: vadasambar <[email protected]>

chore: add OWNERS file
Signed-off-by: vadasambar <[email protected]>

feat: wip kwok provider config
- add samples for static and dynamic template nodes
Signed-off-by: vadasambar <[email protected]>

feat: wip implement pulling node templates from cluster
- add status field to kwok provider config
- this is to capture how the nodes would be grouped by (can be annotation or label)
- use kwok provider config status to get ng name from the node template
Signed-off-by: vadasambar <[email protected]>

fix: syntax error in calling `loadNodeTemplatesFromCluster`
Signed-off-by: vadasambar <[email protected]>

feat: first draft of dynamic node templates
- this allows node templates to be pulled from the cluster
- instead of having to specify static templates manually
Signed-off-by: vadasambar <[email protected]>

fix: syntax error
Signed-off-by: vadasambar <[email protected]>

refactor: abstract out related code into separate files
- use named constants instead of hardcoded values
Signed-off-by: vadasambar <[email protected]>

feat: cleanup kwok nodes when CA is exiting
- so that the user doesn't have to cleanup the fake nodes themselves
Signed-off-by: vadasambar <[email protected]>

refactor: return `nil` instead of err for `HasInstance`
- because there is no underlying cloud provider (hence no reason to return `cloudprovider.ErrNotImplemented`
Signed-off-by: vadasambar <[email protected]>

test: start working on tests for kwok provider config
Signed-off-by: vadasambar <[email protected]>

feat: add `gpuLabelKey` under `nodes` field in kwok provider config
- fix validation for kwok provider config
Signed-off-by: vadasambar <[email protected]>

docs: add motivation doc
- update README with more details
Signed-off-by: vadasambar <[email protected]>

feat: update kwok provider config example to support pulling gpu labels and types from existing providers
- still needs to be implemented in the code
Signed-off-by: vadasambar <[email protected]>

feat: wip update kwok provider config to get gpu label and available types
Signed-off-by: vadasambar <[email protected]>

feat: wip read gpu label and available types from specified provider
- add available gpu types in kwok provider config status
Signed-off-by: vadasambar <[email protected]>

feat: add validation for gpu fields in kwok provider config
- load gpu related fields in kwok provider config status
Signed-off-by: vadasambar <[email protected]>

feat: implement `GetAvailableGPUTypes`
Signed-off-by: vadasambar <[email protected]>

feat: add support to install and uninstall kwok
- add option to disable installation
- add option to manually specify kwok release tag
- add future scope in readme
Signed-off-by: vadasambar <[email protected]>

docs: add future scope 'evaluate adding support to check if kwok controller already exists'
Signed-off-by: vadasambar <[email protected]>

fix: vendor conflict and cyclic import
- remove support to get gpu config from the specified provider (can't be used because leads to cyclic import)
Signed-off-by: vadasambar <[email protected]>

docs: add a TODO 'get gpu config from other providers'
Signed-off-by: vadasambar <[email protected]>

refactor: rename `file` -> `configmap`
- load config and templates from configmap instead of file
- move `nodes` and `nodegroups` config to top level
- add helper to encode configmap data into `[]bytes`
- add helper to get current pod namespace
Signed-off-by: vadasambar <[email protected]>

feat: add new options to the kwok provider config
- auto install kwok only if the version is >= v0.4.0
- add test for `GPULabel()`
- use `kubectl apply` way of installing kwok instead of kustomize
- add test for kwok helpers
- add test for kwok config
- inject service account name in CA deployment
- add example configmap for node templates and kwok provider config in CA helm chart
- add permission to create `clusterrolebinding` (so that kwok provider can create a clusterrolebinding with `cluster-admin` role and create/delete upstream manifests)
- update kwok provider sample configs
- update `README`
Signed-off-by: vadasambar <[email protected]>

chore: update go.mod to use v1.28 packages
Signed-off-by: vadasambar <[email protected]>

chore: `go mod tidy` and `go mod vendor` (again)
Signed-off-by: vadasambar <[email protected]>

refactor: kwok installation code
- add functions to create and delete clusterrolebinding to create kwok resources
- refactor kwok install and uninstall fns
- delete manifests in the opposite order of install ]
- add cleaning up left-over kwok installation to future scope
Signed-off-by: vadasambar <[email protected]>

fix: nil ptr error
- add `TODO` in README for adding docs around kwok config fields
Signed-off-by: vadasambar <[email protected]>

refactor: remove code to automatically install and uninstall `kwok`
- installing/uninstalling requires strong permissions to be granted to `kwok`
- granting strong permissions to `kwok` means granting strong permissions to the entire CA codebase
- this can pose a security risk
- I have removed the code related to install and uninstall for now
- will proceed after discussion with the community
Signed-off-by: vadasambar <[email protected]>

chore: run `go mod tidy` and `go mod vendor`
Signed-off-by: vadasambar <[email protected]>

fix: add permission to create nodes
- to fix permissions error for kwok provider
Signed-off-by: vadasambar <[email protected]>

test: add more unit tests
- add tests for kwok helpers
- fix and update kwok config tests
- fix a bug where gpu label was getting assigned to `kwokConfig.status.key`
- expose `loadConfigFile` -> `LoadConfigFile`
- throw error if templates configmap does not have `templates` key (value of which is node templates)
- finish test for `GPULabel()`
- add tests for `NodeGroupForNode()`
- expose `loadNodeTemplatesFromConfigMap` -> `LoadNodeTemplatesFromConfigMap`
- fix `KwokCloudProvider`'s kwok config was empty (this caused `GPULabel()` to return empty)
Signed-off-by: vadasambar <[email protected]>

refactor: abstract provider ID code into `getProviderID` fn
- fix provider name in test `kwok` -> `kwok:kind-worker-xxx`
Signed-off-by: vadasambar <[email protected]>

chore: run `go mod vendor` and `go mod tidy
Signed-off-by: vadasambar <[email protected]>

docs(cloudprovider/kwok): update info on creating nodegroups based on `hostname/label`
Signed-off-by: vadasambar <[email protected]>

refactor(charts): replace fromLabelKey value `"kubernetes.io/hostname"` -> `"kwok-nodegroup"`
- `"kubernetes.io/hostname"` leads to infinite scale-up
Signed-off-by: vadasambar <[email protected]>

feat: support running CA with kwok provider locally
Signed-off-by: vadasambar <[email protected]>

refactor: use global informer factory
Signed-off-by: vadasambar <[email protected]>

refactor: use `fromNodeLabelKey: "kwok-nodegroup"` in test templates
Signed-off-by: vadasambar <[email protected]>

refactor: `Cleanup()` logic
- clean up only nodes managed by the kwok provider
Signed-off-by: vadasambar <[email protected]>

fix/refactor: nodegroup creation logic
- fix issue where fake node was getting created which caused fatal error
- use ng annotation to keep track of nodegroups
- (when creating nodegroups) don't process nodes which don't have the right ng nabel
- suffix ng name with unix timestamp
Signed-off-by: vadasambar <[email protected]>

refactor/test(cloudprovider/kwok): write tests for `BuildKwokProvider` and `Cleanup`
- pass only the required node lister to cloud provider instead of the entire informer factory
- pass the required configmap name to `LoadNodeTemplatesFromConfigMap` instead of passing the entire kwok provider config
- implement fake node lister for testing
Signed-off-by: vadasambar <[email protected]>

test: add test case for dynamic templates in `TestNodeGroupForNode`
- remove non-required fields from template node
Signed-off-by: vadasambar <[email protected]>

test: add tests for `NodeGroups()`
- add extra node template without ng selector label to add more variability in the test
Signed-off-by: vadasambar <[email protected]>

test: write tests for `GetNodeGpuConfig()`
Signed-off-by: vadasambar <[email protected]>

test: add test for `GetAvailableGPUTypes`
Signed-off-by: vadasambar <[email protected]>

test: add test for `GetResourceLimiter()`
Signed-off-by: vadasambar <[email protected]>

test(cloudprovider/kwok): add tests for nodegroup's `IncreaseSize()`
- abstract error msgs into variables to use them in tests
Signed-off-by: vadasambar <[email protected]>

test(cloudprovider/kwok): add test for ng `DeleteNodes()` fn
- add check for deleting too many nodes
- rename err msg var names to make them consistent
Signed-off-by: vadasambar <[email protected]>

test(cloudprovider/kwok): add tests for ng `DecreaseTargetSize()`
- abstract error msgs into variables (for easy use in tests)
Signed-off-by: vadasambar <[email protected]>

test(cloudprovider/kwok): add test for ng `Nodes()`
- add extra test case for `DecreaseTargetSize()` to check lister error
Signed-off-by: vadasambar <[email protected]>

test(cloudprovider/kwok): add test for ng `TemplateNodeInfo`
Signed-off-by: vadasambar <[email protected]>

test(cloudprovider/kwok): improve tests for `BuildKwokProvider()`
- add more test cases
- refactor lister for `TestBuildKwokProvider()` and `TestCleanUp()`
Signed-off-by: vadasambar <[email protected]>

test(cloudprovider/kwok): add test for ng `GetOptions`
Signed-off-by: vadasambar <[email protected]>

test(cloudprovider/kwok): unset `KWOK_CONFIG_MAP_NAME` at the end of the test
- not doing so leads to failure in other tests
- remove `kwokRelease` field from kwok config (not used anymore) - this was causing the tests to fail
Signed-off-by: vadasambar <[email protected]>

chore: bump CA chart version
- this is because of changes made related to kwok
- fix type `everwhere` -> `everywhere`
Signed-off-by: vadasambar <[email protected]>

chore: fix linting checks
Signed-off-by: vadasambar <[email protected]>

chore: address CI lint errors
Signed-off-by: vadasambar <[email protected]>

chore: generate helm docs for `kwokConfigMapName`
- remove `KWOK_CONFIG_MAP_KEY` (not being used in the code)
- bump helm chart version
Signed-off-by: vadasambar <[email protected]>

docs: revise the outline for README
- add AEP link to the motivation doc
Signed-off-by: vadasambar <[email protected]>

docs: wip create an outline for the README
- remove `kwok` field from examples (not needed right now)
Signed-off-by: vadasambar <[email protected]>

docs: add outline for ascii gifs
Signed-off-by: vadasambar <[email protected]>

refactor: rename env variable `KWOK_CONFIG_MAP_NAME` -> `KWOK_PROVIDER_CONFIGMAP`
Signed-off-by: vadasambar <[email protected]>

docs: update README with info around installation and benefits of using kwok provider
- add `Kwok` as a provider in main CA README
Signed-off-by: vadasambar <[email protected]>

chore: run `go mod vendor`
- remove TODOs that are not needed anymore
Signed-off-by: vadasambar <[email protected]>

docs: finish first draft of README
Signed-off-by: vadasambar <[email protected]>

fix: env variable in chart `KWOK_CONFIG_MAP_NAME` -> `KWOK_PROVIDER_CONFIGMAP`
Signed-off-by: vadasambar <[email protected]>

refactor: remove redundant/deprecated code
Signed-off-by: vadasambar <[email protected]>

chore: bump chart version `9.30.1` -> `9.30.2`
- because of kwok provider related changes
Signed-off-by: vadasambar <[email protected]>

chore: fix typo `offical` -> `official`
Signed-off-by: vadasambar <[email protected]>

chore: remove debug log msg
Signed-off-by: vadasambar <[email protected]>

docs: add links for getting help
Signed-off-by: vadasambar <[email protected]>

refactor: fix type in log `external cluster` -> `cluster`
Signed-off-by: vadasambar <[email protected]>

chore: add newline in chart.yaml to fix CI lint
Signed-off-by: vadasambar <[email protected]>

docs: fix mistake `sig-kwok` -> `sig-scheduling`
- kwok is a part if sig-scheduling (there is no sig-kwok)
Signed-off-by: vadasambar <[email protected]>

docs: fix type `release"` -> `"release"`
Signed-off-by: vadasambar <[email protected]>

refactor: pass informer instead of lister to cloud provider builder fn
Signed-off-by: vadasambar <[email protected]>
  • Loading branch information
vadasambar committed Nov 24, 2023
1 parent 8de60c9 commit cfbee9a
Show file tree
Hide file tree
Showing 43 changed files with 4,757 additions and 14 deletions.
2 changes: 1 addition & 1 deletion charts/cluster-autoscaler/Chart.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -11,4 +11,4 @@ name: cluster-autoscaler
sources:
- https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler
type: application
version: 9.32.0
version: 9.32.1
1 change: 1 addition & 0 deletions charts/cluster-autoscaler/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -419,6 +419,7 @@ vpa:
| image.repository | string | `"registry.k8s.io/autoscaling/cluster-autoscaler"` | Image repository |
| image.tag | string | `"v1.27.2"` | Image tag |
| kubeTargetVersionOverride | string | `""` | Allow overriding the `.Capabilities.KubeVersion.GitVersion` check. Useful for `helm template` commands. |
| kwokConfigMapName | string | `"kwok-provider-config"` | configmap for configuring kwok provider |
| magnumCABundlePath | string | `"/etc/kubernetes/ca-bundle.crt"` | Path to the host's CA bundle, from `ca-file` in the cloud-config file. |
| magnumClusterName | string | `""` | Cluster name or ID in Magnum. Required if `cloudProvider=magnum` and not setting `autoDiscovery.clusterName`. |
| nameOverride | string | `""` | String to partially override `cluster-autoscaler.fullname` template (will maintain the release name) |
Expand Down
3 changes: 3 additions & 0 deletions charts/cluster-autoscaler/templates/clusterrole.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -42,6 +42,8 @@ rules:
verbs:
- watch
- list
- create
- delete
- get
- update
- apiGroups:
Expand Down Expand Up @@ -120,6 +122,7 @@ rules:
verbs:
- list
- watch
- get
- apiGroups:
- coordination.k8s.io
resources:
Expand Down
Loading

0 comments on commit cfbee9a

Please sign in to comment.