
CA expander plugin proposal #4134

Merged: 2 commits merged on Jul 10, 2022


evansheng
Contributor

Proposal for an expander plugin that lets users iterate on custom expander strategies out of band from Cluster Autoscaler.

Heavily based on https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/proposals/plugable-provider-grpc.md, the proposal for a pluggable cloud provider.

@k8s-ci-robot
Contributor

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.


  • If you've already signed a CLA, it's possible we don't have your GitHub username or you're using a different email address. Check your existing CLA data and verify that your email is set on your git commits.
  • If you signed the CLA as a corporation, please sign in with your organization's credentials at https://identity.linuxfoundation.org/projects/cncf to be authorized.
  • If you have done the above and are still having issues with the CLA being reported as unsigned, please log a ticket with the Linux Foundation Helpdesk: https://support.linuxfoundation.org/
  • Should you encounter any issues with the Linux Foundation Helpdesk, send a message to the backup e-mail support address at: [email protected]

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

k8s-ci-robot added the "cncf-cla: no" label (indicates the PR's author has not signed the CNCF CLA) on Jun 10, 2021
@k8s-ci-robot
Contributor

Welcome @evansheng!

It looks like this is your first PR to kubernetes/autoscaler 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/autoscaler has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

k8s-ci-robot added the "size/M" label (denotes a PR that changes 30-99 lines, ignoring generated files) on Jun 10, 2021
@evansheng
Contributor Author

I signed it

@marwanad
Member

@evansheng you probably want to squash your commits. I think the CLA bot is complaining because some of them are still tied to another email address.


To communicate with the external gRPC server, CA needs new flags to expose details about the server.
Member

How are we planning on handling errors? Just fall back to the random expander on failures?

Contributor Author

yep, I was planning on handling errors the same way as the other current expanders, by falling back to the random strategy:
https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/expander/priority/priority.go#L59

evansheng force-pushed the es--expander-plugin-proposal branch from 79d6001 to a956b9c on June 14, 2021 18:47
k8s-ci-robot added the "cncf-cla: yes" label (indicates the PR's author has signed the CNCF CLA) and removed "cncf-cla: no" on Jun 14, 2021
@evansheng
Contributor Author

/assign @MaciekPytel

@elmiko
Contributor

elmiko left a comment

i think the external grpc expander plugin sounds really cool. i'm curious though, given this discussion about a new node processor, whether we would want to follow a similar pattern with expanders (assuming adding more is acceptable)?

if we had something similar that allowed for custom expanders to be added and maintained in repo, then adding the grpc-based expander could follow that pattern. (similar in some respects to this proposal for a grpc provider)

curious to hear what others think
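One way to read this suggestion is a name-keyed registry that in-repo expanders add themselves to, with the gRPC-backed expander as just another entry. The sketch below is a hypothetical illustration: `Register`, `Build`, `Factory`, and the simplified `Strategy` interface are names made up here, not Cluster Autoscaler's code.

```go
package main

import (
	"fmt"
	"sort"
)

// Strategy is a simplified, hypothetical version of the expander strategy
// interface: pick one node group name from the candidates.
type Strategy interface {
	BestOption(nodeGroups []string) string
}

// Factory builds a Strategy; hypothetical, for this sketch only.
type Factory func() Strategy

// registry maps an --expander flag value to the factory that builds it.
var registry = map[string]Factory{}

// Register adds an in-repo expander under name; duplicates are a programming error.
func Register(name string, f Factory) {
	if _, dup := registry[name]; dup {
		panic("expander registered twice: " + name)
	}
	registry[name] = f
}

// Names returns the registered expander names in sorted order.
func Names() []string {
	names := make([]string, 0, len(registry))
	for n := range registry {
		names = append(names, n)
	}
	sort.Strings(names)
	return names
}

// Build returns the strategy registered under name, or an error listing the known names.
func Build(name string) (Strategy, error) {
	f, ok := registry[name]
	if !ok {
		return nil, fmt.Errorf("unknown expander %q (known: %v)", name, Names())
	}
	return f(), nil
}

// firstStrategy is a toy expander that always picks the first candidate.
type firstStrategy struct{}

func (firstStrategy) BestOption(nodeGroups []string) string {
	if len(nodeGroups) == 0 {
		return ""
	}
	return nodeGroups[0]
}

// Each expander would register itself, for example from an init function.
func init() {
	Register("first", func() Strategy { return firstStrategy{} })
}

func main() {
	s, err := Build("first")
	if err != nil {
		panic(err)
	}
	fmt.Println(s.BestOption([]string{"ng-a", "ng-b"}))
}
```

With this shape, adding a new expander touches only its own file plus one `Register` call, rather than a central switch statement.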

@elmiko
Contributor

elmiko commented Jun 24, 2021

hey, i talked with @evansheng on slack, just wanted to clarify my comment.

i like the idea for the grpc expander, no major objection to the proposal. when it comes to implementation details though, i think we should be following the patterns in the cloud providers and node processors to ensure we have a nice way of doing this in the future if/when others wish to contribute in this area.

@evansheng
Contributor Author

@MaciekPytel bump! any updates on this proposal review?
I've started to implement this proposal internally and will put up a PR soon(ish) as well, once we finish an internal end-to-end test.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-ci-robot added the "lifecycle/stale" label (denotes an issue or PR that has remained open with no activity and has become stale) on Nov 17, 2021
@elmiko
Contributor

elmiko left a comment

this generally makes sense to me and i can see the value in adding it. i have a few suggestions for the text, and i am also curious about alternatives. has there been any discussion about approaches that don't use grpc? if so, i think it would be nice to enumerate other ideas that might have come up when thinking about this.


## Proposal

We will extend CA to utilize a pluggable external expander. The design for this expander plugin is heavily based off of this [proposal](https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/proposals/plugable-provider-grpc.md) to CA, for a pluggable cloud provider interface.
Contributor

i would depersonalize this a little, perhaps something like this:

Suggested change
We will extend CA to utilize a pluggable external expander. The design for this expander plugin is heavily based off of this [proposal](https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/proposals/plugable-provider-grpc.md) to CA, for a pluggable cloud provider interface.
Extend the cluster autoscaler to utilize a pluggable external expander. The design for this expander plugin is heavily based off of this [proposal](https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/proposals/plugable-provider-grpc.md) to CA, for a pluggable cloud provider interface.

![Figure](./images/external-expander-plugin-grpc.jpg)

The gRPC server must implement the API of the [expander strategy interface](https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/expander/expander.go#L50) in CA, which only has one method.

Contributor

i think it would be really cool to see a small example of the main function for an expander server
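A minimal sketch of the decision logic such a server could run, assuming a custom "cheapest scale-up" policy over a hard-coded price table (the `Option` fields, `pricePerNode`, and `bestOption` are all hypothetical). In a real plugin this function would sit behind the service generated from a protobuf definition of the expander API, registered on a `grpc.NewServer()` and served on a `net.Listener`; that wiring is omitted so the sketch runs with only the standard library.

```go
package main

import (
	"errors"
	"fmt"
)

// Option mirrors, in simplified form, what CA might send for each candidate
// node group; the field names are assumptions for this sketch.
type Option struct {
	NodeGroupID string
	NodeCount   int
}

// pricePerNode is a hypothetical, hard-coded price table; a real expander
// might query a pricing API or read a ConfigMap instead.
var pricePerNode = map[string]float64{
	"spot-pool":      0.03,
	"on-demand-pool": 0.10,
}

// bestOption implements a custom "cheapest scale-up" strategy: it returns the
// option whose total cost (price per node * nodes added) is lowest, breaking
// ties by node group name so the answer is deterministic.
func bestOption(options []Option) (Option, error) {
	var best Option
	bestCost := -1.0
	for _, o := range options {
		price, known := pricePerNode[o.NodeGroupID]
		if !known {
			continue // skip node groups this server cannot price
		}
		cost := price * float64(o.NodeCount)
		if bestCost < 0 || cost < bestCost || (cost == bestCost && o.NodeGroupID < best.NodeGroupID) {
			best, bestCost = o, cost
		}
	}
	if bestCost < 0 {
		// An error lets the client (CA) apply its own fallback.
		return Option{}, errors.New("no priceable option")
	}
	return best, nil
}

func main() {
	opts := []Option{{"on-demand-pool", 1}, {"spot-pool", 2}}
	best, err := bestOption(opts)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("chosen:", best.NodeGroupID)
}
```

Returning an error instead of an arbitrary option leaves the final decision to CA, which per the discussion above falls back to the random strategy.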


To communicate with the external gRPC server, CA needs new flags to expose details about the server.

We’ll add a new option to the expander flag: `--expander=externalgrpc`, and inntroduce a new flag `--expander-plugin-url=https://external-grpc-url/server` to reach the gRPC server.
Contributor

i would depersonalize here as well, and also a small spelling nit

Suggested change
We’ll add a new option to the expander flag: `--expander=externalgrpc`, and inntroduce a new flag `--expander-plugin-url=https://external-grpc-url/server` to reach the gRPC server.
Add a new option to the expander flag: `--expander=externalgrpc`, and introduce a new flag `--expander-plugin-url=https://external-grpc-url/server` to reach the gRPC server.
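A sketch of how the proposed flags could be parsed and cross-checked with Go's standard `flag` package (which accepts both `-name` and `--name`). The flag names are the ones proposed here; the `expanderConfig` type, the `parseExpanderFlags` helper, and the `random` default are assumptions for illustration, not CA's actual option wiring.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
	"os"
)

// expanderConfig holds the values of the flags named in the proposal.
type expanderConfig struct {
	Expander  string
	PluginURL string
	CertPath  string
}

// parseExpanderFlags is a hypothetical helper: it parses the proposed flags
// from args and requires a URL whenever externalgrpc is selected.
func parseExpanderFlags(args []string) (expanderConfig, error) {
	var cfg expanderConfig
	fs := flag.NewFlagSet("cluster-autoscaler", flag.ContinueOnError)
	fs.StringVar(&cfg.Expander, "expander", "random", "Type of node group expander to use")
	fs.StringVar(&cfg.PluginURL, "expander-plugin-url", "", "URL of the external gRPC expander server")
	fs.StringVar(&cfg.CertPath, "external-expander-cert", "", "Path to the CA bundle used to verify the server's TLS certificate")
	if err := fs.Parse(args); err != nil {
		return cfg, err
	}
	if cfg.Expander == "externalgrpc" && cfg.PluginURL == "" {
		return cfg, errors.New("--expander=externalgrpc requires --expander-plugin-url")
	}
	return cfg, nil
}

func main() {
	cfg, err := parseExpanderFlags(os.Args[1:])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
	fmt.Printf("%+v\n", cfg)
}
```

Validating the flag combination at startup surfaces a missing URL immediately, instead of on the first scale-up attempt.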


Additionally, we’ll need to use TLS for secure communication. An additional flag `--external-expander-cert=~/path/to/cert` will be exposed to specify the path to the certificate authority bundle used to validate the TLS cert used by the server.
Contributor

i would clean this up a little too

Suggested change
Additionally, we’ll need to use TLS for secure communication. An additional flag `--external-expander-cert=~/path/to/cert` will be exposed to specify the path to the certificate authority bundle used to validate the TLS cert used by the server.
When needing to use TLS for secure communications, an additional flag `--external-expander-cert=~/path/to/cert` will be exposed to specify the path to the certificate authority bundle used to validate the TLS cert used by the server.

@evansheng
Contributor Author

Update on this proposal: I have a working implementation in PR #4452, along with a corresponding internal gRPC expander running as a separate application.

@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-ci-robot added the "lifecycle/rotten" label (denotes an issue or PR that has aged beyond stale and will be auto-closed) and removed "lifecycle/stale" on Jan 13, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@evansheng
Contributor Author

@elmiko the implementation is finished and merged in #4452.

@elmiko
Contributor

elmiko commented Apr 14, 2022

thanks for the reminder @evansheng. i think we should merge the enhancement, if only to add the documentation, as @gjtempleton mentioned.

@evansheng
Contributor Author

/reopen

@k8s-ci-robot
Contributor

@evansheng: Reopened this PR.


k8s-ci-robot reopened this on Apr 22, 2022
@evansheng
Contributor Author

@gjtempleton @elmiko reopened; this should be good to merge once approved.

@elmiko
Contributor

elmiko commented Apr 27, 2022

adding lgtm to get this merged as we already have the implementation. i think it would be nice to clean this up a little, but having it is better than not having it.

thanks again @evansheng

/lgtm

k8s-ci-robot added the "lgtm" label ("Looks good to me", indicates that a PR is ready to be merged) on Apr 27, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closed this PR.


@evansheng
Contributor Author

/reopen

k8s-ci-robot reopened this on Jun 8, 2022
@k8s-ci-robot
Contributor

@evansheng: Reopened this PR.


@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closed this PR.


@elmiko
Contributor

elmiko commented Jul 8, 2022

/reopen

@gjtempleton @mwielgus could we merge this so that we can capture the documentation?

k8s-ci-robot reopened this on Jul 8, 2022
@k8s-ci-robot
Contributor

@elmiko: Reopened this PR.


@elmiko
Contributor

elmiko commented Jul 8, 2022

/remove-lifecycle rotten

k8s-ci-robot removed the "lifecycle/rotten" label on Jul 8, 2022
@gjtempleton
Member

gjtempleton commented Jul 10, 2022

I agree with most of elmiko's suggestions/nits; however, given this is already implemented, let's get it merged and do any clean-up in a follow-up PR.

/lgtm
/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: evansheng, gjtempleton

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot added the "approved" label (indicates a PR has been approved by an approver from all required OWNERS files) on Jul 10, 2022
k8s-ci-robot merged commit f990344 into kubernetes:master on Jul 10, 2022
navinjoy pushed a commit to navinjoy/autoscaler that referenced this pull request Oct 26, 2022
Labels
approved, area/cluster-autoscaler, cncf-cla: yes, lgtm, size/M

8 participants