-
Notifications
You must be signed in to change notification settings - Fork 4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add gRPC expander plugin #4452
Add gRPC expander plugin #4452
Conversation
I've built this and deployed it locally, alongside an internal gRPC server - verified that this is working e2e successfully. I'm able to return options successfully via |
@@ -48,6 +49,7 @@ type AutoscalerOptions struct { | |||
EstimatorBuilder estimator.EstimatorBuilder | |||
Processors *ca_processors.AutoscalingProcessors | |||
Backoff backoff.Backoff | |||
GRPCOpts grpcplugin.GRPCOpts |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is needed here. It's not used by anything except the grpc plugin and I think it's just leaking implementation details of this specific expander.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@@ -101,9 +103,12 @@ func initializeDefaultOptions(opts *AutoscalerOptions) error { | |||
if opts.CloudProvider == nil { | |||
opts.CloudProvider = cloudBuilder.NewCloudProvider(opts.AutoscalingOptions) | |||
} | |||
if strings.Contains(opts.ExpanderNames, expander.GRPCExpanderName) && opts.GRPCExpanderURL != "" && opts.GRPCExpanderCert != "" { | |||
opts.GRPCOpts = grpcplugin.GRPCOpts{opts.GRPCExpanderCert, opts.GRPCExpanderURL} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to the comment above - I don't think code in core/ should contain any logic specific to a particular expander. I'd much rather pass AutoscalingOptions to expander factory (so it has access to all flags) or just define expander-specific flags directly in the expander.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Opted for passing the grpc expander flags as two more args, instead of either creating an import cycle (core & expander factory), or moving the AutoscalerOptions struct to config
cluster-autoscaler/go.mod
Outdated
@@ -25,6 +25,8 @@ require ( | |||
golang.org/x/crypto v0.0.0-20210513164829-c07d793c2f9a | |||
golang.org/x/oauth2 v0.0.0-20210514164344-f6687ab2804c | |||
google.golang.org/api v0.46.0 | |||
google.golang.org/grpc v1.38.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Kubernetes already uses grpc in a different version: https://github.com/kubernetes/kubernetes/blob/master/go.mod#L102, which makes it a transitive dependency of CA. In my experience is that different grpc versions don't always play well together. Right now the difference in version number is pretty small, but is there some way in which we can guarantee the same grpc version is used by CA and kubernetes going forward?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can use the same version that that kubernetes repo is using pretty easily, but do you have suggestions on how to keep it synced? I did a cursory search across other grpc uses in the k8s repo and found many repos seem to use many different versions:
- https://github.com/kubernetes/node-problem-detector/blob/fc0edbd222e16b775ec58a253366a41822ecb3d5/vendor/modules.txt#L348
- https://github.com/kubernetes/ingress-gce/blob/b1b62298cfc7adb08154402f2b26607d1a94d65e/vendor/modules.txt#L191
- https://github.com/kubernetes/kops/blob/e40fd308b26676ef3f054e702640d32fecaa9271/vendor/modules.txt#L946
- https://github.com/kubernetes/cloud-provider-openstack/blob/197604cb32c5d1850aeb8c903b7ea280f9bc0e1e/go.mod#L32
etc...
schedulerframework "k8s.io/kubernetes/pkg/scheduler/framework" | ||
|
||
"google.golang.org/grpc" | ||
"google.golang.org/grpc/credentials" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is a big dependency to push on anyone forking CA (which is how a lot of cloudproviders work), especially given that they're quite likely to already use grpc, possibly in some other way. Can we try and make it easy to get rid of this dependency by tightly controlling where we import this expander from (ideally try and avoid importing it directly anywhere outside of expander/)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good call - removing the grpcOpts struct and passing flags directly to expander factory directly fixes this
added the autogenerated protobuf code to exclusion from golint |
/assign @aleksandra-malinowska |
@MaciekPytel I believe this PR is ready for another round of review when you get the chance |
269d9d0
to
d821603
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hey @evansheng , i'm still reviewing this code, but one thing i am wondering about is documenting how to use this expander. is the enhancement meant to be the main documentation or have you considered adding detailed usage instructions to one of our READMEs?
i would like to try this out if possible, but i am still understanding how to setup the grpc expander. perhaps a small "fake" expander would be nice to demonstrate how a user could build one. or is the mock expander meant to be an example?
@elmiko added a README on how to use this new expander, and added some starter code |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @evansheng , that readme and example are exactly what i was looking for. i'd like to test drive this a little more, but i'm +1 for merging.
@MaciekPytel friendly ping for another review on this pr |
message BestOptionsRequest { | ||
repeated Option options = 1; | ||
// key is node id from options | ||
map<string, k8s.io.api.core.v1.Node> nodeInfoMap = 2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So over gRPC nodeInfoMap
is really a Node map: is that required to save rpc calls throughput and simplify the proto?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a very good point. If it's a node map we should name it nodeMap (or maybe templateNodeMap?) and not nodeInfoMap. NodeInfo is a well defined concept that represent a Node + pods running on that node.
The limitation of passing just the nodes is that any information about daemonsets and static pods is no longer available to expander. It's probably not a blocker (most expander implementations can probably live with that), but it's probably worth calling out in the README.
Alternatively you could consider actually making it an actual nodeInfoMap by including both the node and pods running on it in the value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
called out in readme, and renamed variable to be more appropriate
@@ -67,6 +70,9 @@ func ExpanderStrategyFromStrings(expanderFlags []string, cloudProvider cloudprov | |||
stopChannel := make(chan struct{}) | |||
lister := kubernetes.NewConfigMapListerForNamespace(kubeClient, stopChannel, configNamespace) | |||
filters = append(filters, priority.NewFilter(lister.ConfigMaps(configNamespace), autoscalingKubeClients.Recorder)) | |||
case expander.GRPCExpanderName: | |||
klog.V(1).Info("GRPC expander chosen") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: that feels somewhat redundant, we already log all flag values by default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
// copy the protos/expander.pb.go to your other application's repo, so it has access to the protobuf definitions | ||
|
||
// Serve should be called by the main() function in main.go of the Expander Server repo to start serving | ||
func Serve(certPath string, keyPath string, port uint) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer to move this file to a separate package (grpcplugin/example?) to make it clear it's an example implementation and not part of CA integration.
Also this is already really helpful, but if you already have all the logic maybe also consider adding the main function and a simple Makefile so it's a working example?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
including a simple main function but leaving out the makefile, as it is up to the user how they would like to structure their project repo
message BestOptionsRequest { | ||
repeated Option options = 1; | ||
// key is node id from options | ||
map<string, k8s.io.api.core.v1.Node> nodeInfoMap = 2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a very good point. If it's a node map we should name it nodeMap (or maybe templateNodeMap?) and not nodeInfoMap. NodeInfo is a well defined concept that represent a Node + pods running on that node.
The limitation of passing just the nodes is that any information about daemonsets and static pods is no longer available to expander. It's probably not a blocker (most expander implementations can probably live with that), but it's probably worth calling out in the README.
Alternatively you could consider actually making it an actual nodeInfoMap by including both the node and pods running on it in the value.
|
||
package grpcplugin; | ||
import "k8s.io/api/core/v1/generated.proto"; | ||
//import "google/protobuf/struct.proto"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove if unused.
|
||
// if no Cert file specified, use insecure | ||
if expanderCert == "" { | ||
dialOpt = grpc.WithInsecure() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That seems like an unnecessary risk. Kubernetes itself moved away from allowing insecure connections and I think we also shouldn't allow them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed this option
|
||
// call gRPC server to get BestOption | ||
klog.V(2).Info("GPRC call of best options to server with ", len(nodeGroupIDOptionMap), " options") | ||
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider making 5s a const
} | ||
} | ||
|
||
// populateNodeInfoForGRPC modifies the nodeInfo object, and replaces it with the v1.Node to pass through grpc |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That comment is misleading, original nodeInfos are not modified (and they must not be!).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rewrote comment so it is more accurate
for _, option := range bestOptionsResponseOptions { | ||
if option == nil { | ||
klog.Errorf("gRPC server returned nil Option") | ||
return nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't continue make more sense here? Other options may not be nil.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch, fixed
options = append(options, nodeGroupIDOptionMap[option.NodeGroupId]) | ||
} else { | ||
klog.Errorf("gRPC server returned invalid nodeGroup ID: ", option.NodeGroupId) | ||
return nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above, continue seems more natural.
} | ||
) | ||
|
||
func TestPopulateOptionsForGrpc(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please consider a table-based test to cover various inputs, including edge cases (ex. empty list of options).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added some tests, as requested
4a607d9
to
9855a94
Compare
@MaciekPytel addressed your comments, PTAL! sorry for the delay |
we talked about this pr at the sig meeting today, i think things are mostly looking good, but there was a suggestion from the sig that these commits should be squashed before merging. ideally to the minimum number of logical, atomic, commits as possible. |
9855a94
to
a2b24e0
Compare
@elmiko @MaciekPytel - addressed all comments, and squashed into two commits. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: evansheng, mwielgus The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Add gRPC expander plugin
Add gRPC expander plugin
Add gRPC expander plugin
Add gRPC expander plugin
Add gRPC expander plugin
Adds a new expander
grpc
, implementing the gRPC plugin outlined in: #4134This PR adds a gRPC client as an expander in CA, and calls out to a gRPC server with config specified by two new flags
--grpcExpanderCert
and--grpcExpanderUrl
. Should initialization of the client fail, or any returned response be invalid or malformed, the expander fails open, and doesn't filter any Options.