
Support autodetection of GCE managed instance groups by name prefix #462

Merged: 4 commits, merged Dec 18, 2017

Conversation

@negz (Contributor) commented Nov 11, 2017

This commit adds a new usage of the --node-group-auto-discovery flag, intended for use with the GCE cloud provider. GCE managed instance groups (MIGs) can be discovered automatically based on a prefix of their group name. Example usage:

--node-group-auto-discovery=mig:prefix=k8s-mig,minNodes=0,maxNodes=10

Note that, unlike the existing AWS ASG autodetection functionality, we must specify the min and max node counts in the flag. This is because MIGs store only a target size in the GCE API - they do not have a min and max size we can infer via the API.

In order to alleviate this limitation a little, we allow multiple uses of the autodiscovery flag. For example, to discover two classes of instance groups (big and small) with different size limits:

./cluster-autoscaler \
  --node-group-auto-discovery=mig:prefix=k8s-a-small,minNodes=1,maxNodes=10 \
  --node-group-auto-discovery=mig:prefix=k8s-a-big,minNodes=1,maxNodes=100

Zonal clusters (i.e. multizone = false in the cloud config) will detect all managed instance groups within the cluster's zone. Regional clusters will detect all matching (zonal) managed instance groups in any of the zones of the cluster's region.
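
Purely as illustration (not code from this PR), here is a minimal Go sketch of how a prefix spec might be compiled into the anchored regexp that later shows up in the logs (^prefix.+) and used to filter MIG names; all type and function names below are hypothetical:

package main

import (
	"fmt"
	"regexp"
)

// migAutoDiscoveryConfig is an illustrative stand-in for the real config struct.
type migAutoDiscoveryConfig struct {
	re       *regexp.Regexp
	minNodes int
	maxNodes int
}

func newMIGAutoDiscoveryConfig(prefix string, min, max int) (migAutoDiscoveryConfig, error) {
	// Anchor the prefix so only group names starting with it match.
	re, err := regexp.Compile("^" + prefix + ".+")
	if err != nil {
		return migAutoDiscoveryConfig{}, fmt.Errorf("invalid prefix %q: %v", prefix, err)
	}
	return migAutoDiscoveryConfig{re: re, minNodes: min, maxNodes: max}, nil
}

func main() {
	cfg, _ := newMIGAutoDiscoveryConfig("k8s-mig", 0, 10)
	for _, name := range []string{"k8s-mig-a", "k8s-mig-b", "other-mig"} {
		if cfg.re.MatchString(name) {
			fmt.Printf("would register MIG %s (min %d, max %d)\n", name, cfg.minNodes, cfg.maxNodes)
		}
	}
}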

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Nov 11, 2017
@k8s-ci-robot (Contributor) commented:

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://github.com/kubernetes/kubernetes/wiki/CLA-FAQ to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.


  • If you've already signed a CLA, it's possible we don't have your GitHub username or you're using a different email address. Check your existing CLA data and verify that your email is set on your git commits.
  • If you signed the CLA as a corporation, please sign in with your organization's credentials at https://identity.linuxfoundation.org/projects/cncf to be authorized.
  • If you have done the above and are still having issues with the CLA being reported as unsigned, please email the CNCF helpdesk: [email protected]

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. label Nov 11, 2017
@negz (Contributor Author) commented Nov 11, 2017

@k8s-ci-robot Signed.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Nov 11, 2017
@negz (Contributor Author) commented Nov 11, 2017

Some elaboration on the use case for this PR: at Planet Labs we deploy a cluster on GCE as a regional managed instance group of N master nodes, with one or more zonal worker managed instance groups per zone in the region. We've previously used the AWS implementation of --node-group-auto-discovery to let us add and remove worker pools without having to update the autoscaler configuration. We'd like an approximation of that functionality in GCE.

I've tested this by deploying the following:

---
apiVersion: apps/v1beta2
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
  labels:
    component: cluster-autoscaler
spec:
  replicas: 1
  selector:
    matchLabels:
      component: cluster-autoscaler
  template:
    metadata:
      labels:
        component: cluster-autoscaler
    spec:
      containers:
        - name: cluster-autoscaler
          image: gcr.io/.../cluster-autoscaler-planet:81b6010e
          command:
            - /cluster-autoscaler
            - --v=3
            - --stderrthreshold=info
            - --cloud-provider=gce
            - --cloud-config=/etc/kubernetes/gce.conf
            - --skip-nodes-with-local-storage=false
            - --expander=price
            - --balance-similar-node-groups=true
            - --node-group-auto-discovery=mig:prefix=tfk-negz-,min=0,max=200
          readinessProbe:
            httpGet:
              path: /health-check
              port: 8085
              scheme: HTTP
            initialDelaySeconds: 3
            timeoutSeconds: 5
          volumeMounts:
            - name: etc-kubernetes
              mountPath: /etc/kubernetes
              readOnly: true
            - name: ssl-certs
              mountPath: /etc/ssl/certs/ca-certificates.crt
              readOnly: true
      volumes:
        - name: ssl-certs
          hostPath:
            path: "/etc/ssl/certs/ca-certificates.crt"
        - name: etc-kubernetes
          hostPath:
            path: /etc/kubernetes

I see the following in the logs:

I1111 07:24:16.237101       1 gce_manager.go:952] autodiscovered managed instance group tfk-negz-wrk-duw3 using regexp ^tfk-negz-.+
I1111 07:24:16.363479       1 gce_manager.go:952] autodiscovered managed instance group tfk-negz-wrk-9dff using regexp ^tfk-negz-.+
I1111 07:24:16.460511       1 gce_manager.go:952] autodiscovered managed instance group tfk-negz-wrk-maor using regexp ^tfk-negz-.+
I1111 07:24:16.730295       1 gce_manager.go:952] autodiscovered managed instance group tfk-negz-wrk-cpuy using regexp ^tfk-negz-.+
I1111 07:24:16.730346       1 gce_manager.go:516] Registering planet-k8s-staging/us-central1-a/tfk-negz-wrk-duw3
W1111 07:24:17.016510       1 templates.go:202] could not extract kube-reserved from kubeEnv for mig "tfk-negz-wrk-duw3", setting allocatable to capacity.
I1111 07:24:17.016605       1 gce_manager.go:516] Registering planet-k8s-staging/us-central1-b/tfk-negz-wrk-9dff
W1111 07:24:17.407610       1 templates.go:202] could not extract kube-reserved from kubeEnv for mig "tfk-negz-wrk-9dff", setting allocatable to capacity.
I1111 07:24:17.407670       1 gce_manager.go:516] Registering planet-k8s-staging/us-central1-c/tfk-negz-wrk-maor
W1111 07:24:17.631241       1 templates.go:202] could not extract kube-reserved from kubeEnv for mig "tfk-negz-wrk-maor", setting allocatable to capacity.
I1111 07:24:17.631304       1 gce_manager.go:516] Registering planet-k8s-staging/us-central1-f/tfk-negz-wrk-cpuy

The autoscaler then proceeded to scale my idle node pools down to zero as expected.

@MaciekPytel (Contributor) commented:

Hi @negz,
I'm very much +1 for adding this functionality to GCE, and I like the idea of allowing multiple prefixes with min/max sizes. Implementation-wise, though, I'd rather tackle it somewhat differently. We (the CA maintainers) don't own PollingAutoscaler and I've seen a lot of bugs related to it, so I generally advise against using it. Ideally we want to get rid of it completely, once someone volunteers to migrate --node-group-auto-discovery for AWS to our new approach.

This alternative approach, which we think is the way to go, is to use the CloudProvider.Refresh() method to poll the cloud provider and update the list of NodeGroups. We already use this approach for GKE, so we know the CA can handle NodeGroups changing dynamically this way (this was not true at the time PollingAutoscaler was implemented).

We believe this is a better approach than re-creating StaticAutoscaler every loop, which is expensive and potentially has hard-to-predict side effects. It also avoids losing all internal state when the config actually changes (think resetting unneeded-node timers, but also losing internal caches, which impacts performance).

Implementing this feature would require changing your code so that it doesn't use PollingAutoscaler, storing autoDiscovererConfig in GceCloudProvider (or some other place in the gce cloudprovider code), and calling a method from Refresh() that runs a loop similar to what you have in buildAutoDiscoveringProvider (preferably not on every loop, to limit the performance impact, like https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/gce/gce_manager.go#L815).

I'm happy to have a more detailed discussion, or to help you in any way required. Feel free to ping me on Slack if you want to have a chat (warning: I and all the other CA maintainers are in the CET timezone, so we may have limited overlap).
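
For readers following along, a minimal sketch of the Refresh()-based pattern being suggested, assuming a manager that tracks a lastRefresh timestamp; the types and helper below are illustrative stand-ins, not the actual cluster-autoscaler code:

package gcesketch

import "time"

type manager struct {
	lastRefresh     time.Time
	refreshInterval time.Duration
}

// Refresh re-discovers node groups, but only if enough time has passed since
// the previous refresh, limiting how often the cloud provider is polled.
func (m *manager) Refresh() error {
	if m.lastRefresh.Add(m.refreshInterval).After(time.Now()) {
		return nil
	}
	if err := m.rediscoverNodeGroups(); err != nil {
		return err
	}
	m.lastRefresh = time.Now()
	return nil
}

// rediscoverNodeGroups would poll the cloud provider and register or
// unregister node groups; it is a placeholder here.
func (m *manager) rediscoverNodeGroups() error {
	return nil
}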

@negz (Contributor Author) commented Nov 14, 2017

@MaciekPytel That sounds like a better implementation. I'm happy to take on migrating both the existing AWS and proposed GCE autodetection code to your proposed pattern. I'll do so in a separate commit in this PR for now.

@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Nov 18, 2017
@negz negz force-pushed the gcedisco branch 2 times, most recently from e7e4a11 to 45317a6 on November 18, 2017 04:32
@negz (Contributor Author) commented Nov 19, 2017

@MaciekPytel I've removed the PollingAutoscaler and moved both the existing AWS and proposed GCE node group autodiscovery code into the Refresh() loop. Apologies for the PR going from L to XXL in the process.

All node group autodiscovery (--node-group-auto-discovery) and 'explicit' discovery (--nodes) now happens at the cloud provider manager level. The CA can now run with both discovery types at the same time, i.e. you can (with the exception of GKE) explicitly configure some node groups using --nodes and also discover further node groups with --node-group-auto-discovery. Automatically discovered node groups will be unregistered at Refresh() time if they no longer exist according to AWS/GCE. Explicitly configured node groups will never be unregistered, even if they no longer exist.

I've tested this on both AWS and GCE. I see that node groups are autodiscovered and/or explicitly registered as configured, and are successfully scaled up/down as needed.
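
To illustrate the unregistration behaviour described above, here is a hedged sketch of a Refresh-time reconciliation; the types and helpers are hypothetical stand-ins for whatever the provider manager actually uses:

package sketch

type nodeGroup struct {
	name     string
	explicit bool // configured via --nodes; never unregistered
}

type registry struct {
	groups map[string]nodeGroup
}

// reconcile removes auto-discovered groups that the cloud provider no longer
// reports, while leaving explicitly configured groups registered.
func (r *registry) reconcile(discovered map[string]bool) {
	for name, g := range r.groups {
		if g.explicit {
			continue
		}
		if !discovered[name] {
			delete(r.groups, name)
		}
	}
}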

@MaciekPytel (Contributor) commented:

@negz I've started reading this PR, but it will take me a few days given its size. Also, can you describe your testing in some more detail? (An example of what I mean is in the FAQ: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#how-should-i-test-my-code-before-submitting-pr.)

@mumoshu Do you want to review this as well?

@negz (Contributor Author) commented Nov 20, 2017

@MaciekPytel Understood.

Testing-wise, I've completed steps 2 and 3 of the process you linked. In more detail:

  1. I built a Docker image of this branch of the CA.
  2. I spun up two clusters - one in GCE and one in AWS. Each cluster had one node group per zone in the region.
  3. I confirmed that the CA could successfully register all relevant node groups in both cloud providers when:
    a. Using only the --node-group-auto-discovery flag
    b. Using only the --nodes flag.
    c. Using a combination of the two flags.
  4. I created a dummy deployment in each cloud provider, scaled it up past what would fit on the clusters, and confirmed that an autoscaling event occurred. In the case of GCE I also confirmed that the node groups scaled down to zero when unutilised.

In GCE I ran the CA with these flags:

            - /cluster-autoscaler
            - -v=4
            - --stderrthreshold=info
            - --cloud-provider=gce
            - --cloud-config=/etc/kubernetes/gce.conf
            - --skip-nodes-with-local-storage=false
            - --expander=price
            - --balance-similar-node-groups=true
            - --node-group-auto-discovery=mig:prefix=tfk-ca-,min=0,max=200

In AWS I used:

            - ./cluster-autoscaler
            - --v=4
            - --stderrthreshold=info
            - --cloud-provider=aws
            - --skip-nodes-with-local-storage=false
            - --scale-down-enabled=false
            - --expander=least-waste
            - --balance-similar-node-groups=true
            - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,kubernetes.io/cluster/kk-c69d
            - --nodes=1:20:kk-c69d-pool-worker-d0f2-AutoScalingGroup-WJFONSVW6JS8

I used the following dummy deployment to trigger scale up:

apiVersion: apps/v1beta2
kind: Deployment
metadata:
  name: test
  labels:
    component: test
spec:
  replicas: 0
  selector:
    matchLabels:
      component: test
  template:
    metadata:
      labels:
        component: test
    spec:
      containers:
      - name: test
        image: gcr.io/google-containers/toolbox:latest
        command: ['sh', '-c', 'while true; do sleep 10; done']
        # The autoscaler is rated for 1,000 nodes with 30 pods each
        # https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/proposals/scalability_tests.md
        # Aiming for 30 pods per n1-standard-1.
        resources:
          limits:
            cpu: 40m
            memory: 150Mi
          requests:
            cpu: 40m
            memory: 150Mi

@negz (Contributor Author) commented Nov 20, 2017

Here are some logs from AWS demonstrating the node groups being discovered using a combination of --node-group-auto-discovery and --nodes:

$ kubectl --context snorlax-admin -n kube-system logs -f cluster-autoscaler-3787437471-khbr2                                                           
I1119 22:38:18.664860       1 flags.go:52] FLAG: --address=":8085"                                                                                                         
I1119 22:38:18.665604       1 flags.go:52] FLAG: --alsologtostderr="false"                                                 
I1119 22:38:18.665621       1 flags.go:52] FLAG: --azure-container-registry-config=""                                                                                      
I1119 22:38:18.665633       1 flags.go:52] FLAG: --balance-similar-node-groups="true"                                           
I1119 22:38:18.665672       1 flags.go:52] FLAG: --cloud-config=""                                                                                                         
I1119 22:38:18.665694       1 flags.go:52] FLAG: --cloud-provider="aws"
I1119 22:38:18.665705       1 flags.go:52] FLAG: --cloud-provider-gce-lb-src-cidrs="35.191.0.0/16,209.85.152.0/22,209.85.204.0/22,130.211.0.0/22"
I1119 22:38:18.665730       1 flags.go:52] FLAG: --cluster-name=""                                                                              
I1119 22:38:18.665738       1 flags.go:52] FLAG: --configmap=""                                                                                                            
I1119 22:38:18.665744       1 flags.go:52] FLAG: --cores-total="0:320000"                                     
I1119 22:38:18.665768       1 flags.go:52] FLAG: --estimator="binpacking"                                                                                                  
I1119 22:38:18.666820       1 flags.go:52] FLAG: --expander="least-waste"                                     
I1119 22:38:18.666839       1 flags.go:52] FLAG: --expendable-pods-priority-cutoff="0"                                                                                     
I1119 22:38:18.666847       1 flags.go:52] FLAG: --gke-api-endpoint=""                                        
I1119 22:38:18.666854       1 flags.go:52] FLAG: --google-json-key=""                                                                                                      
I1119 22:38:18.666860       1 flags.go:52] FLAG: --httptest.serve=""
I1119 22:38:18.666866       1 flags.go:52] FLAG: --kubeconfig=""
I1119 22:38:18.666871       1 flags.go:52] FLAG: --kubernetes=""
I1119 22:38:18.666878       1 flags.go:52] FLAG: --leader-elect="true"
I1119 22:38:18.666888       1 flags.go:52] FLAG: --leader-elect-lease-duration="15s"
I1119 22:38:18.666899       1 flags.go:52] FLAG: --leader-elect-renew-deadline="10s"
I1119 22:38:18.666905       1 flags.go:52] FLAG: --leader-elect-resource-lock="endpoints"
I1119 22:38:18.666912       1 flags.go:52] FLAG: --leader-elect-retry-period="2s"
I1119 22:38:18.666918       1 flags.go:52] FLAG: --log-backtrace-at=":0"                                                                                               
I1119 22:38:18.666929       1 flags.go:52] FLAG: --log-dir=""                                                                                                             
I1119 22:38:18.666935       1 flags.go:52] FLAG: --logtostderr="false"                                                                                        
I1119 22:38:18.666942       1 flags.go:52] FLAG: --max-autoprovisioned-node-group-count="15"                                                                               
I1119 22:38:18.666948       1 flags.go:52] FLAG: --max-empty-bulk-delete="10"                                                                                              
I1119 22:38:18.666954       1 flags.go:52] FLAG: --max-failing-time="15m0s"                                   
I1119 22:38:18.666961       1 flags.go:52] FLAG: --max-graceful-termination-sec="600"                                                                                      
I1119 22:38:18.666968       1 flags.go:52] FLAG: --max-inactivity="10m0s"                                                                                    
I1119 22:38:18.666974       1 flags.go:52] FLAG: --max-node-provision-time="15m0s"                                                                    
I1119 22:38:18.666981       1 flags.go:52] FLAG: --max-nodes-total="0"                                                                                                  
I1119 22:38:18.666987       1 flags.go:52] FLAG: --max-total-unready-percentage="33"                                                                             
I1119 22:38:18.666996       1 flags.go:52] FLAG: --memory-total="0:6400000"                                                                                  
I1119 22:38:18.667002       1 flags.go:52] FLAG: --min-replica-count="0"                                                                              
I1119 22:38:18.667008       1 flags.go:52] FLAG: --namespace="kube-system"
I1119 22:38:18.667015       1 flags.go:52] FLAG: --node-autoprovisioning-enabled="false"
I1119 22:38:18.667021       1 flags.go:52] FLAG: --node-group-auto-discovery="[asg:tag=k8s.io/cluster-autoscaler/enabled,kubernetes.io/cluster/kk-c69d]"
I1119 22:38:18.667037       1 flags.go:52] FLAG: --nodes="[1:20:kk-c69d-pool-worker-d0f2-AutoScalingGroup-WJFONSVW6JS8]"
I1119 22:38:18.667046       1 flags.go:52] FLAG: --ok-total-unready-count="3"
I1119 22:38:18.667052       1 flags.go:52] FLAG: --scale-down-candidates-pool-min-count="50"
I1119 22:38:18.667059       1 flags.go:52] FLAG: --scale-down-candidates-pool-ratio="0.1"
I1119 22:38:18.667066       1 flags.go:52] FLAG: --scale-down-delay-after-add="10m0s"                                                                                      
I1119 22:38:18.667074       1 flags.go:52] FLAG: --scale-down-delay-after-delete="10s"                                                                                     
I1119 22:38:18.667080       1 flags.go:52] FLAG: --scale-down-delay-after-failure="3m0s"                                                                                   
I1119 22:38:18.667086       1 flags.go:52] FLAG: --scale-down-enabled="false"                                                                                              
I1119 22:38:18.667093       1 flags.go:52] FLAG: --scale-down-non-empty-candidates-count="30"                              
I1119 22:38:18.667099       1 flags.go:52] FLAG: --scale-down-unneeded-time="10m0s"                                                                                        
I1119 22:38:18.667106       1 flags.go:52] FLAG: --scale-down-unready-time="20m0s"                                              
I1119 22:38:18.667112       1 flags.go:52] FLAG: --scale-down-utilization-threshold="0.5"                                                                                  
I1119 22:38:18.667125       1 flags.go:52] FLAG: --scan-interval="10s"
I1119 22:38:18.667131       1 flags.go:52] FLAG: --skip-nodes-with-local-storage="false"
I1119 22:38:18.667138       1 flags.go:52] FLAG: --skip-nodes-with-system-pods="true"                                                           
I1119 22:38:18.667143       1 flags.go:52] FLAG: --stderrthreshold="0"                                                                                                     
I1119 22:38:18.667150       1 flags.go:52] FLAG: --v="4"                                                      
I1119 22:38:18.667156       1 flags.go:52] FLAG: --version="false"                                                                                                         
I1119 22:38:18.667168       1 flags.go:52] FLAG: --vmodule=""                                                 
I1119 22:38:18.667176       1 flags.go:52] FLAG: --write-status-configmap="true"                                                                                           
I1119 22:38:18.667184       1 main.go:295] Cluster Autoscaler 1.1.0-alpha1                                    
I1119 22:38:18.878668       1 leaderelection.go:174] attempting to acquire leader lease...                                                                                 
I1119 22:38:18.963510       1 leaderelection.go:243] lock is held by cluster-autoscaler-891337387-wqxz7 and has not yet expired
I1119 22:38:18.963538       1 leaderelection.go:180] failed to acquire lease kube-system/cluster-autoscaler
I1119 22:38:22.418253       1 leaderelection.go:243] lock is held by cluster-autoscaler-891337387-wqxz7 and has not yet expired
I1119 22:38:22.418278       1 leaderelection.go:180] failed to acquire lease kube-system/cluster-autoscaler
I1119 22:38:26.678929       1 leaderelection.go:243] lock is held by cluster-autoscaler-891337387-wqxz7 and has not yet expired
I1119 22:38:26.678955       1 leaderelection.go:180] failed to acquire lease kube-system/cluster-autoscaler
I1119 22:38:30.277331       1 leaderelection.go:243] lock is held by cluster-autoscaler-891337387-wqxz7 and has not yet expired
I1119 22:38:30.277355       1 leaderelection.go:180] failed to acquire lease kube-system/cluster-autoscaler
I1119 22:38:33.331305       1 leaderelection.go:243] lock is held by cluster-autoscaler-891337387-wqxz7 and has not yet expired                                        
I1119 22:38:33.331329       1 leaderelection.go:180] failed to acquire lease kube-system/cluster-autoscaler                                                               
I1119 22:38:36.372209       1 leaderelection.go:184] successfully acquired lease kube-system/cluster-autoscaler                                               
I1119 22:38:36.372788       1 factory.go:33] Event(v1.ObjectReference{Kind:"Endpoints", Namespace:"kube-system", Name:"cluster-autoscaler", UID:"d217abdf-cc41-11e7-9786-0ac479f185f8", APIVersion:"v1", ResourceVersion:"5568853", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' cluster-autoscaler-3787437471-khbr2 became leader
I1119 22:38:36.374420       1 predicates.go:123] Using predicate PodFitsResources                             
I1119 22:38:36.374457       1 predicates.go:123] Using predicate GeneralPredicates                                                                                         
I1119 22:38:36.374476       1 predicates.go:123] Using predicate PodToleratesNodeTaints                                                                      
I1119 22:38:36.374497       1 predicates.go:123] Using predicate CheckNodeMemoryPressure                                                              
I1119 22:38:36.374517       1 predicates.go:123] Using predicate NoVolumeNodeConflict                                                                                   
I1119 22:38:36.374537       1 predicates.go:123] Using predicate CheckNodeCondition                                                                              
I1119 22:38:36.374560       1 predicates.go:123] Using predicate MaxGCEPDVolumeCount                                                                         
I1119 22:38:36.374576       1 predicates.go:123] Using predicate NoDiskConflict                                                                       
I1119 22:38:36.374591       1 predicates.go:123] Using predicate NoVolumeZoneConflict
I1119 22:38:36.374612       1 predicates.go:123] Using predicate CheckNodeDiskPressure  
I1119 22:38:36.374633       1 predicates.go:123] Using predicate MatchInterPodAffinity                                                                  
I1119 22:38:36.374649       1 predicates.go:123] Using predicate MaxAzureDiskVolumeCount                                
I1119 22:38:36.374665       1 predicates.go:123] Using predicate MaxEBSVolumeCount
I1119 22:38:36.374681       1 predicates.go:123] Using predicate ready
I1119 22:38:36.462846       1 reflector.go:202] Starting reflector *v1.Node (1h0m0s) from k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:212
I1119 22:38:36.462891       1 reflector.go:240] Listing and watching *v1.Node from k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:212
I1119 22:38:36.463230       1 reflector.go:202] Starting reflector *v1.ReplicationController (0s) from k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/informers/factory.go:73
I1119 22:38:36.463253       1 reflector.go:240] Listing and watching *v1.ReplicationController from k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/informers/factory.go:73
I1119 22:38:36.463687       1 reflector.go:202] Starting reflector *v1.PersistentVolume (0s) from k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/informers/factory.go:73
I1119 22:38:36.463695       1 reflector.go:202] Starting reflector *v1beta1.ReplicaSet (0s) from k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/informers/factory.go:73
I1119 22:38:36.463700       1 reflector.go:240] Listing and watching *v1.PersistentVolume from k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/informers/factory.go:73
I1119 22:38:36.463707       1 reflector.go:240] Listing and watching *v1beta1.ReplicaSet from k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/informers/factory.go:73
I1119 22:38:36.464061       1 reflector.go:202] Starting reflector *v1beta1.StatefulSet (0s) from k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/informers/factory.go:73
I1119 22:38:36.464075       1 reflector.go:240] Listing and watching *v1beta1.StatefulSet from k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/informers/factory.go:73
I1119 22:38:36.464205       1 reflector.go:202] Starting reflector *v1.PersistentVolumeClaim (0s) from k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/informers/factory.go:73
I1119 22:38:36.464219       1 reflector.go:240] Listing and watching *v1.PersistentVolumeClaim from k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/informers/factory.go:73
I1119 22:38:36.464448       1 reflector.go:202] Starting reflector *v1.Pod (0s) from k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/informers/factory.go:73
I1119 22:38:36.464500       1 reflector.go:240] Listing and watching *v1.Pod from k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/informers/factory.go:73
I1119 22:38:36.464822       1 reflector.go:202] Starting reflector *v1.Node (0s) from k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/informers/factory.go:73
I1119 22:38:36.464822       1 reflector.go:202] Starting reflector *v1.Service (0s) from k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/informers/factory.go:73
I1119 22:38:36.464836       1 reflector.go:240] Listing and watching *v1.Node from k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/informers/factory.go:73
I1119 22:38:36.464842       1 reflector.go:240] Listing and watching *v1.Service from k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/informers/factory.go:73
I1119 22:38:36.464945       1 reflector.go:202] Starting reflector *v1.Node (1h0m0s) from k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:239
I1119 22:38:36.464957       1 reflector.go:240] Listing and watching *v1.Node from k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:239
I1119 22:38:36.465064       1 reflector.go:202] Starting reflector *v1beta1.PodDisruptionBudget (1h0m0s) from k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:266
I1119 22:38:36.465076       1 reflector.go:240] Listing and watching *v1beta1.PodDisruptionBudget from k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:266
I1119 22:38:36.465141       1 reflector.go:202] Starting reflector *v1.Pod (1h0m0s) from k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:149
I1119 22:38:36.465153       1 reflector.go:240] Listing and watching *v1.Pod from k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:149
I1119 22:38:36.465233       1 reflector.go:202] Starting reflector *v1beta1.DaemonSet (1h0m0s) from k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:293
I1119 22:38:36.465245       1 reflector.go:240] Listing and watching *v1beta1.DaemonSet from k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:293
I1119 22:38:36.465348       1 reflector.go:202] Starting reflector *v1.Pod (1h0m0s) from k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:174
I1119 22:38:36.465365       1 reflector.go:240] Listing and watching *v1.Pod from k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:174
I1119 22:38:36.662888       1 request.go:462] Throttling request took 197.392431ms, request: GET:https://172.31.0.1:443/api/v1/pods?fieldSelector=spec.nodeName%21%3D%2Cstatus.phase%21%3DFailed%2Cstatus.phase%21%3DSucceeded&resourceVersion=0
I1119 22:38:37.165149       1 request.go:462] Throttling request took 601.470308ms, request: POST:https://172.31.0.1:443/api/v1/namespaces/kube-system/configmaps
I1119 22:38:38.367988       1 cloud_provider_builder.go:68] Building aws cloud provider.
I1119 22:38:38.368094       1 auto_scaling_groups.go:77] Registering ASG kk-c69d-pool-worker-d0f2-AutoScalingGroup-WJFONSVW6JS8
I1119 22:38:38.368112       1 auto_scaling_groups.go:138] Invalidating unowned instance cache
I1119 22:38:38.368123       1 auto_scaling_groups.go:166] Regenerating instance to ASG map for ASGs: [kk-c69d-pool-worker-d0f2-AutoScalingGroup-WJFONSVW6JS8]
I1119 22:38:38.563216       1 leaderelection.go:199] successfully renewed lease kube-system/cluster-autoscaler
I1119 22:38:39.610660       1 auto_scaling_groups.go:166] Regenerating instance to ASG map for ASGs: [kk-c69d-pool-worker-d0f2-AutoScalingGroup-WJFONSVW6JS8]
I1119 22:38:39.717627       1 auto_scaling_groups.go:77] Registering ASG kk-c69d-pool-worker-41d6-AutoScalingGroup-1SMZC0JBB9Q4A
I1119 22:38:39.717655       1 auto_scaling_groups.go:138] Invalidating unowned instance cache
I1119 22:38:39.717671       1 auto_scaling_groups.go:71] Updated ASG kk-c69d-pool-worker-d0f2-AutoScalingGroup-WJFONSVW6JS8
I1119 22:38:39.717678       1 auto_scaling_groups.go:138] Invalidating unowned instance cache
I1119 22:38:39.717684       1 auto_scaling_groups.go:77] Registering ASG kk-c69d-pool-worker-d705-AutoScalingGroup-11IL1BXLL1XO0
I1119 22:38:39.717690       1 auto_scaling_groups.go:138] Invalidating unowned instance cache
I1119 22:38:39.717707       1 auto_scaling_groups.go:166] Regenerating instance to ASG map for ASGs: [kk-c69d-pool-worker-d0f2-AutoScalingGroup-WJFONSVW6JS8 kk-c69d-pool-worker-41d6-AutoScalingGroup-1SMZC0JBB9Q4A kk-c69d-pool-worker-d705-AutoScalingGroup-11IL1BXLL1XO0]
I1119 22:38:39.791514       1 aws_manager.go:232] Refreshed ASG list, next refresh after 2017-11-19 22:39:39.610417957 +0000 UTC m=+82.047601960
I1119 22:38:39.791651       1 main.go:226] Registered cleanup signal handler
I1119 22:38:40.575258       1 leaderelection.go:199] successfully renewed lease kube-system/cluster-autoscaler

Here's an AWS scale-up using just --node-group-auto-discovery:

I1119 22:24:22.832985       1 flags.go:52] FLAG: --node-group-auto-discovery="[asg:tag=k8s.io/cluster-autoscaler/enabled,kubernetes.io/cluster/kk-c69d]"                   
I1119 22:24:22.833006       1 flags.go:52] FLAG: --nodes="[]" 
...
I1119 22:26:17.996400       1 static_autoscaler.go:97] Starting main loop                                                                                                  
I1119 22:26:18.238965       1 utils.go:444] No pod using affinity / antiaffinity found in cluster, disabling affinity predicate for this loop                              
I1119 22:26:18.238995       1 static_autoscaler.go:230] Filtering out schedulables                                                                                         
I1119 22:26:18.240791       1 static_autoscaler.go:240] No schedulable pods                                                                                                
I1119 22:26:18.240826       1 scale_up.go:54] Pod default/test-2473652692-k5xv2 is unschedulable
I1119 22:26:18.240840       1 scale_up.go:54] Pod default/test-2473652692-87zr7 is unschedulable
I1119 22:26:18.240852       1 scale_up.go:54] Pod default/test-2473652692-5h152 is unschedulable
I1119 22:26:18.240882       1 scale_up.go:54] Pod default/test-2473652692-2qj7z is unschedulable
I1119 22:26:18.240896       1 scale_up.go:54] Pod default/test-2473652692-lm6r4 is unschedulable
I1119 22:26:18.240903       1 scale_up.go:54] Pod default/test-2473652692-zhwbd is unschedulable
I1119 22:26:18.240910       1 scale_up.go:54] Pod default/test-2473652692-1gq6p is unschedulable
I1119 22:26:18.240917       1 scale_up.go:54] Pod default/test-2473652692-6j5nz is unschedulable
I1119 22:26:18.240923       1 scale_up.go:54] Pod default/test-2473652692-z4hf0 is unschedulable                                                                           
I1119 22:26:18.240930       1 scale_up.go:54] Pod default/test-2473652692-f6gdf is unschedulable                                                                           
I1119 22:26:18.698999       1 scale_up.go:86] Upcoming 0 nodes                                                                                                             
I1119 22:26:18.817036       1 waste.go:57] Expanding Node Group kk-c69d-pool-worker-41d6-AutoScalingGroup-1SMZC0JBB9Q4A would waste 80.00% CPU, 81.22% Memory, 80.61% Blended
I1119 22:26:18.817071       1 waste.go:57] Expanding Node Group kk-c69d-pool-worker-d0f2-AutoScalingGroup-WJFONSVW6JS8 would waste 80.00% CPU, 81.22% Memory, 80.61% Blended
I1119 22:26:18.817084       1 waste.go:57] Expanding Node Group kk-c69d-pool-worker-d705-AutoScalingGroup-11IL1BXLL1XO0 would waste 80.00% CPU, 81.22% Memory, 80.61% Blended
I1119 22:26:18.817099       1 scale_up.go:193] Best option to resize: kk-c69d-pool-worker-d0f2-AutoScalingGroup-WJFONSVW6JS8                                               
I1119 22:26:18.817115       1 scale_up.go:197] Estimated 1 nodes needed in kk-c69d-pool-worker-d0f2-AutoScalingGroup-WJFONSVW6JS8                                          
I1119 22:26:18.884765       1 scale_up.go:286] Final scale-up plan: [{kk-c69d-pool-worker-d0f2-AutoScalingGroup-WJFONSVW6JS8 1->2 (max: 64)}]                              
I1119 22:26:18.884808       1 scale_up.go:338] Scale-up: setting group kk-c69d-pool-worker-d0f2-AutoScalingGroup-WJFONSVW6JS8 size to 2                                    
I1119 22:26:18.923197       1 aws_manager.go:294] Setting asg kk-c69d-pool-worker-d0f2-AutoScalingGroup-WJFONSVW6JS8 size to 2                                             
I1119 22:26:18.978124       1 factory.go:33] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"test-2473652692-2qj7z", UID:"a45ee01b-cd78-11e7-9786-0ac479f185f8", APIVersion:"v1", ResourceVersion:"5564844", FieldPath:""}): type: 'Normal' reason: 'TriggeredScaleUp' pod triggered scale-up: [{kk-c69d-pool-worker-d0f2-AutoScalingGroup-WJFONSVW6JS8 1->2 (max: 64)}]

I'm afraid I don't have the logs from my GCE tests in my scrollback, but it's not too difficult to run them again if they'd be useful.

@mumoshu (Contributor) left a comment:

Thanks for your efforts 👍
Several comments (nits?) regarding naming and locking, but LGTM overall.
Would you mind addressing those if you're OK with it?

return nil, err
}

cfgs, err := discoveryOpts.ParseASGAutoDiscoverySpecs()
Contributor

nit: specs rather than cfgs for consistency?

Contributor Author

Sure. I was looking for something to differentiate from the --nodes 'spec', which is a string, but they're both specs in a way. I don't feel strongly - will change.

interrupt: make(chan struct{}),
service: *service,
asgs: asgs,
autoASGs: cfgs,
Contributor

nit: asgAutoDiscoverySpecs or autoDiscoverySpecs rather than autoASGs for consistency?

@@ -80,41 +87,179 @@ func createAWSManagerInternal(configReader io.Reader, service *autoScalingWrappe
}
}

asgs, err := newAutoScalingGroups(*service)
Contributor

Perhaps autoScalingGroups and newAutoScalingGroups should be renamed to cachedAutoScalingGroups and newCachedAutoScalingGroups, respectively?
I ask because I occasionally mistake autoScalingGroups, which is a cache, for a plain list of ASGs 😉

m.notInRegisteredAsg = make(map[AwsRef]bool)
m.instanceToAsgMutex.Unlock()
}

func (m *autoScalingGroups) regenerateCache() error {
Contributor

If you agree with renaming autoScalingGroups to cachedAutoScalingGroups, I guess we can rename this func to just regenerate, which reads as cachedAsgs.regenerate() when called.

Contributor Author

I like this - will do.


func (m *autoScalingGroups) get() []*asgInformation {
m.registeredAsgsMutex.RLock()
defer m.registeredAsgsMutex.RUnlock()
Contributor

Just curious, but what does this RLock prevent?
At a glance, this lock seems to do almost nothing, as we don't mutate registeredAsgs in this func.

@negz (Contributor Author) commented Nov 21, 2017:

You're correct that we don't mutate registeredAsgs here, but we do elsewhere (i.e. in Register and Unregister). We need to take a lock when reading memory - not only when writing it. This RLock is intended to prevent us reading registeredAsgs while another goroutine is writing to it.

For example if play.go were as follows:

package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	s := make([]int, 0)
	m := sync.RWMutex{}
	go func() {
		for i := 0; ; i++ {
			m.Lock()
			s = append(s, i)
			m.Unlock()
			time.Sleep(1 * time.Second)
		}
	}()
	for {
		fmt.Println(s) // racy read: no RLock held while the goroutine above appends
		time.Sleep(1 * time.Second)
	}
}
$ go run -race play.go
[]           
==================                                
WARNING: DATA RACE
Write at 0x00c420092020 by goroutine 6:
  main.main.func1()
      /Users/negz/control/go/src/play.go:15 +0xf4
         
Previous read at 0x00c420092020 by main goroutine:
  main.main()
      /Users/negz/control/go/src/play.go:21 +0x158
                 
Goroutine 6 (running) created at:
  main.main()        
      /Users/negz/control/go/src/play.go:12 +0x147
==================  

Adding an m.RLock() and m.RUnlock() around fmt.Println(s) prevents the data race.
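
For completeness, a sketch of that fix: the final loop of the play.go example above becomes

	for {
		m.RLock()
		fmt.Println(s) // read is now protected against the concurrent append
		m.RUnlock()
		time.Sleep(1 * time.Second)
	}

after which go run -race no longer reports a race.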

Note that maybe I'm missing something and these methods are in fact only ever called by a single goroutine, in which case all this locking is pointless. :)

Contributor

It seems I hadn't understood the use case of RLock/RUnlock.
Thanks for the clear explanation!

// Register ASG. Returns true if the ASG was registered.
func (m *autoScalingGroups) Register(asg *Asg) bool {
m.registeredAsgsMutex.Lock()
defer m.registeredAsgsMutex.Unlock()
Contributor

After reading through the code, I began to wonder whether what we really want is a single mutex for the whole set of internal state, including registeredAsgs, instanceToAsg, and notInRegisteredAsgs.
For me, migrating to a single mutex would result in cleaner code around locking and fewer functions (regenerateCacheWithoutLock would become unnecessary)? 🤔
I suppose the single mutex could be named stateMutex.

Contributor Author

I agree a single mutex would make the code cleaner, but on the surface it seems that less granular locking could result in situations where a caller waits on a lock unnecessarily, for example not being able to find an ASG for an instance while another goroutine is registering an ASG.

That said, I mostly added a second mutex in imitation of the two-mutex pattern in the GCE manager. It's quite possible this is premature optimisation.

Currently we use the following mutexes in the 'autoscaling group cache':

registeredAsgsMutex:

  • Taken by a single goroutine when registering an ASG
  • Taken by a single goroutine when unregistering an ASG
  • Taken by one or more goroutines when listing ASGs
  • Taken by one or more goroutines when regenerating the instance to ASG cache

instanceToAsgMutex:

  • Taken by a single goroutine when finding the ASG for an instance
  • Taken by a single goroutine when invalidating the instance to ASG cache
  • Taken by a single goroutine when regenerating the instance to ASG cache

As far as I know there are only two goroutines competing for these mutexes: the one used by the RunOnce loop, and the one we spawn to regenerate the instance-to-ASG cache hourly. I'm new to the codebase, so it's not unlikely that there are other goroutines I'm unaware of.

At the end of the day this is not something I feel strongly about. I'm curious if anyone else has an opinion?

Contributor Author

Note that I don't think using a single mutex would remove the need for regenerateCacheWithoutLock. Even with a single mutex we'd still have two cases in which we need to regenerate the cache:

  1. From inside FindForInstance, where we have already taken a lock and thus can't take it again.
  2. From everywhere else (i.e. registering an instance, unregistering an instance, regenerating the cache hourly).

Previously we dealt with this by having all the 'everywhere else' cases take the mutex explicitly, i.e. the callers would do:

m.registeredAsgsMutex.Lock()
defer m.registeredAsgsMutex.Unlock()
m.regenerateCache()

This puts the responsibility of dealing with locking on the callers rather than the method that actually needs the lock. This is especially messy when the caller is in the AWS manager, not the autoscaling group cache struct - it leaks implementation details of the autoscaling group cache up to the AWS manager to the point that it's almost pointless for them to be two different things.
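
For reference, a minimal sketch of the locked/unlocked pair being discussed (field names assumed, not the exact ones in this PR): the exported method owns the mutex, while the WithoutLock variant assumes the caller, e.g. FindForInstance, already holds it:

func (m *autoScalingGroups) regenerateCache() error {
	m.cacheMutex.Lock()
	defer m.cacheMutex.Unlock()
	return m.regenerateCacheWithoutLock()
}

// regenerateCacheWithoutLock rebuilds the instance-to-ASG map. Callers must
// already hold cacheMutex; taking it again here would deadlock, since
// sync.Mutex is not re-entrant.
func (m *autoScalingGroups) regenerateCacheWithoutLock() error {
	// ... rebuild the cache ...
	return nil
}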

Contributor

Thanks for clarifying again!

for example not being able to find an ASG for an instance while also registering an ASG.
Yes - indeed that sounds problematic in larger clusters.
I'm now inclined to have fine-grained locks as you've done 👍

Contributor Author

I ended up reverting to a single lock after all. :) I ran into a deadlock bug in the AWS provider and after spending some time trying to debug it I was convinced that a single mutex would make life easier.

glog.Errorf("Error while regenerating Asg cache: %v", err)
}
}, time.Hour, manager.interrupt)
if err := manager.fetchExplicitAsgs(discoveryOpts.NodeGroupSpecs); err != nil {
Contributor

For me, it wasn't very straightforward to understand why we have to pass auto-discovery specs and "explicit" discovery specs separately like this.
Can we pass discoveryOpts.NodeGroupSpecs while creating the AwsManager struct above, so that auto and explicit discovery specs are handled more consistently, given that both are about discovery?

@negz (Contributor Author) commented Nov 21, 2017:

We certainly could store discoveryOpts.NodeGroupSpecs in the AwsManager, but it would only be for consistency's sake, because we need to fetch 'explicit' node groups only once - at the creation of the AwsManager.

I think you're suggesting an alternative like this:

func createAWSManagerInternal(...) (*AwsManager, error) {
	manager := &AwsManager{
		service:         *service,
		asgs:            asgs,
		autoASGs:        cfgs,
		explicitASGs:    discoveryOpts.NodeGroupSpecs,
		neverUnregister: make(map[AwsRef]bool),
	}
	manager.fetchExplicitAsgs()
}

To me this alternative makes it less obvious that discoveryOpts.NodeGroupSpecs is only consumed once, by fetchExplicitAsgs().

In this PR we have three 'flavours' of spec:

  1. A regular --nodes 'min:max:name' style spec. NodeGroupSpecs is a []string slice of these specs. Each of these specs is passed to dynamic.SpecFromString when fetchExplicitAsgs is called (see the parsing sketch at the end of this comment).
  2. A --node-group-auto-discovery ASG spec, like asg:tag=coolTag. This is parsed into an []ASGAutoDiscoveryConfig slice by discoveryOpts.ParseASGAutoDiscoverySpecs().
  3. A --node-group-auto-discovery MIG spec, like mig:prefix=coolPrefix,minNodes=0,maxNodes=100. This is parsed into a []MigAutoDiscoveryConfig slice by discoveryOpts.ParseMIGAutoDiscoverySpecs().

These three can never be completely consistent, because they do slightly different things:

  1. Contains everything we need to know to register an ASG/MIG, and is only consumed once.
  2. and 3. instead contain inputs used to discover ASGs/MIGs to register, and are consumed repeatedly over the life of the CA.

Talking through this, I wonder whether dynamic.NodeGroupSpec should be merged with NodeGroupDiscoveryOptions. This would move all of the flag/'spec' parsing code up to NodeGroupDiscoveryOptions.

This would look something like:

func (o NodeGroupDiscoveryOptions) ParseExplicitSpecs() ([]NodeGroupSpec, error) {}

func (m *AwsManager) fetchExplicitAsgs(specs []cloudprovider.NodeGroupSpec) error {}

What do you think? Or is there some other approach we could take to make this clearer?
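
As a side note on the 'min:max:name' spec in point 1 of the list above, here is a purely illustrative parser sketch (the real parsing is done by dynamic.SpecFromString; parseNodeGroupSpec below is hypothetical and assumes the usual fmt, strconv and strings imports):

func parseNodeGroupSpec(spec string) (min, max int, name string, err error) {
	parts := strings.SplitN(spec, ":", 3)
	if len(parts) != 3 {
		return 0, 0, "", fmt.Errorf("spec %q must be of the form min:max:name", spec)
	}
	if min, err = strconv.Atoi(parts[0]); err != nil {
		return 0, 0, "", fmt.Errorf("invalid min in %q: %v", spec, err)
	}
	if max, err = strconv.Atoi(parts[1]); err != nil {
		return 0, 0, "", fmt.Errorf("invalid max in %q: %v", spec, err)
	}
	return min, max, parts[2], nil
}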

flag.Var(&nodeGroupAutoDiscoveryFlag, "node-group-auto-discovery", "One or more definition(s) of node group auto-discovery. "+
"A definition is expressed `<name of discoverer>:[<key>[=<value>]]`. "+
"The `aws` and `gce` cloud providers are currently supported. AWS matches by ASG tags, e.g. `asg:tag=tagKey,anotherTagKey`. "+
"GCE matches by IG prefix, and requires you to specify min and max nodes per IG, e.g. `mig:prefix=pfx,min=0,max=10` "+
Contributor

Could we make it look like mig:namePrefix=... rather than mig:prefix=...?
The context is that I initially made it asg:tag=... because that is clearly about the ASGs' tags.
On the other hand, to me, mig:prefix=... is a bit unclear about what the prefix refers to.
Of course it is the prefix of the MIG name, but if you'd like to make it even clearer, I think it would be a good idea to name it something like namePrefix. Just my two cents.

Contributor Author

Good idea - clearer is better.

(Except for all the Java style LongVariableNames in this codebase, which make baby Gophers cry. :trollface:)

@negz (Contributor Author) commented Nov 21, 2017

@MaciekPytel FYI, I'm going to hold off on addressing @mumoshu's comments until I hear from the Google side of the world, so that I can put some time aside to address all comments at once.

@MaciekPytel (Contributor) commented:

@negz Sure, makes sense. I'm halfway through reading, I'll try to finish by tomorrow end of day. Sorry for the delay.

@negz (Contributor Author) commented Nov 28, 2017

Just rebased to account for the reintroduced Azure provider.

glog.Fatalf("Failed to create GCE Manager: %v", err)
}

p, err := gce.BuildGceCloudProvider(m, rl)
Contributor

I'd rather avoid one-letter variable names. manager is not very long and makes it obvious what is getting passed where.

Contributor Author

Done.

I tend to follow https://github.com/golang/go/wiki/CodeReviewComments#variable-names, but I realise it was a bit cheeky to stray from the established pattern in this codebase. :)

Contributor

You make a good point, but, as you also pointed out, I'd rather keep the codebase consistent.

And, personally, it's one of the things in golang that I can't get myself to accept. And I've never been a Java developer :)

// Technically we're both ProviderNameGCE and ProviderNameGKE...
// Perhaps we should return a different name depending on
// gce.gceManager.getMode()?
return ProviderNameGCE
Contributor

I don't think we even use this anywhere. That being said, your comment sounds like the right thing to do.

Contributor Author

Ack. I'll leave it as is for now, given it's unused, unless you feel strongly.

Contributor

Sounds good to me.

if mig, found := m.migCache[*instance]; found {
return mig, nil
}
return nil, fmt.Errorf("Instance %+v does not belong to any configured MIG", *instance)
Contributor

There is no difference in what this function does before and after your changes. While I fully agree that your version is more readable, I'd rather avoid drive-by fixes (especially in a PR that is already as big as this one).

Contributor Author

Reverted.

return m.regenerateCacheWithoutLock()
}

func (m *gceManagerImpl) regenerateCacheWithoutLock() error {
Contributor

Same comment as above regarding drive-by changes, except in this case I'm also less in favor of the change itself. Arguably, if you have two versions of a method that differ by whether or not they take a lock, you're not actually hiding the implementation: the user still needs to think about the lock, and most likely look into the implementation to figure out which lock it is and whether they're holding it already (also, the fact that the cache even exists is an implementation detail, so regenerateCache shouldn't need to be called from outside this file).

I feel sort of neutral regarding where the mutex is handled and I wouldn't mind it in new code. But I don't feel it's worth the increase in PR size to change it.

Contributor Author

Fair enough - will revert.

Contributor Author

Reverted.

@@ -809,62 +819,169 @@ func (m *gceManagerImpl) getTemplates() *templateBuilder {
}

func (m *gceManagerImpl) Refresh() error {
if m.mode == ModeGCE {
if m.lastRefresh.Add(refreshInterval).After(time.Now()) {
Contributor

This is purely out of curiosity, no change required. Previously it was:

if condition {
  return action()
}
return nil

Your version is:

if !condition {
  return nil
}
return action()

Any particular reason for changing it? As I said I'm fine with either version, just curious.

Contributor Author

I'm afraid I don't recall. 🤔 It's likely I switched it around while experimenting with a different implementation and it ended up getting left like that.


for _, arg := range strings.Split(tokens[1], ",") {
kv := strings.Split(arg, "=")
k, v := kv[0], kv[1]
Contributor

Check kv length?
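
For what it's worth, a sketch of how that check might look, based on the snippet above (not necessarily the exact final code):

for _, arg := range strings.Split(tokens[1], ",") {
	kv := strings.SplitN(arg, "=", 2)
	if len(kv) != 2 {
		return cfg, fmt.Errorf("invalid key=value pair %q in node group auto discovery spec", arg)
	}
	k, v := kv[0], kv[1]
	// ...handle each supported key as before...
}

Using strings.SplitN with a limit of 2 also tolerates values that themselves contain an "=".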

name: "PrefixDoesNotCompileToRegexp",
specs: []string{"mig:prefix=a),min=1"},
wantErr: true,
},
Contributor

Add an error case for missing fields (e.g. "mig:min=3,max=8") and maybe one for min > max?
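
Extra table entries along those lines might look like this (the field names follow the existing test case; the exact spec strings are only illustrative of the flag syntax):

{
	name:    "MissingPrefix",
	specs:   []string{"mig:min=3,max=8"},
	wantErr: true,
},
{
	name:    "MinGreaterThanMax",
	specs:   []string{"mig:prefix=k8s-,min=8,max=3"},
	wantErr: true,
},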

return cfg, fmt.Errorf("unsupported key \"%s\" is specified for discoverer \"%s\". Supported keys are \"%s\"", k, discoverer, validMIGAutoDiscovererKeys)
}
}
return cfg, nil
Contributor

I think we should do a few more checks on the input:

  • make sure prefix is actually specified
  • make sure max > 0 and max >= min (I'm mainly thinking about the case of someone not providing max) - see the sketch below
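
For illustration, such validation could be as simple as the following (the config type and field names here are assumptions, not the PR's actual identifiers):

func validateMIGAutoDiscoveryConfig(cfg migAutoDiscoveryConfig) error {
	if cfg.Prefix == "" {
		return fmt.Errorf("mig auto discovery: a name prefix must be specified")
	}
	if cfg.MaxSize < 1 {
		return fmt.Errorf("mig auto discovery: max size must be at least 1, got %d", cfg.MaxSize)
	}
	if cfg.MinSize > cfg.MaxSize {
		return fmt.Errorf("mig auto discovery: min size (%d) must not exceed max size (%d)", cfg.MinSize, cfg.MaxSize)
	}
	return nil
}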

return cfg, fmt.Errorf("Unsupported discoverer specified: %s", discoverer)
}
param := tokens[1]
paramTokens := strings.Split(param, "=")
Contributor

check len(paramTokens)

map[string]int64{cloudprovider.ResourceNameCores: options.MaxCoresTotal, cloudprovider.ResourceNameMemory: options.MaxMemoryTotal},
),
)
expanderStrategy, err := factory.ExpanderStrategyFromString(options.ExpanderName, cloudProvider, listerRegistry.AllNodeLister())
Contributor

Why change this formatting?

Contributor Author

Unintentional, I think. Pretty sure I had split it out a bit so I could better follow what it was doing and forgot to put it back. Happy to revert.

@MaciekPytel
Contributor

@negz Really sorry it took me that long :(

Overall it looks good, though I'd prefer to avoid (or at least significantly limit) code-style drive-by fixes in a PR that is already so large and complex (as much as I agree with most of the clean-ups).

}
glog.V(2).Infof("Refreshed NodePool list, next refresh after %v", nextRefreshAfter)
}
m.lastRefresh = time.Now()
Contributor

I don't like logging one timestamp and then actually using a different one - it will be really confusing to someone debugging this in the future. We're doing multiple API calls in the meantime, so the difference is likely to be noticeable.

Maybe we can just log something like "Finished refreshing config, next refresh after" at the end? We can still keep old log messages, just without the "next refresh after" part.
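
A rough sketch of that ordering, reusing the names visible in the diff (the actual refresh work is elided):

// ...fetch explicitly configured and autodiscovered MIGs here...
m.lastRefresh = time.Now()
glog.V(2).Infof("Refreshed MIG list, next refresh after %v", m.lastRefresh.Add(refreshInterval))
return nil

This way the timestamp that is logged is the same one stored in lastRefresh.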

Contributor Author

I figured this wasn't too bad since technically it said the next refresh would happen "after" the time, not "at" the time. ;) Fixed anyhow.

@negz
Contributor Author

negz commented Dec 5, 2017

Just gave this a final test on AWS and GCE. As best I can tell, everything is working as expected.

I set up CA on the AWS and GCE cloud providers with node group autodiscovery. I also explicitly configured (via --nodes) one of the node groups that was a candidate for autodiscovery, using a lower max node count than the autodiscovery setup. I then observed the following on both providers:

  • The explicitly configured node group was registered.
  • The autodiscovered node groups were registered.
  • The explicitly configured node group maintained its explicitly configured min/max nodes - i.e. they did not get updated when the node group was also autodiscovered.
  • Scale up and scale down worked as expected.
  • Deleting an autodiscovered node group causes it to be unregistered.
  • Deleting an explicitly configured node group does not cause it to be unregistered.

AWS config:

            - ./cluster-autoscaler
            - --v=4
            - --stderrthreshold=info
            - --cloud-provider=aws
            - --skip-nodes-with-local-storage=false
            - --expander=least-waste
            - --balance-similar-node-groups=true
            - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,kubernetes.io/cluster/$(CLUSTER_NAME)
            - --nodes=1:10:kk-c8a9-pool-worker-f821-AutoScalingGroup-8AQYAA49ML9X

GCE config:

        - /cluster-autoscaler
        - --v=4           
        - --stderrthreshold=info           
        - --cloud-provider=gce
        - --cloud-config=/etc/kubernetes/gce.conf
        - --skip-nodes-with-local-storage=false
        - --expander=price
        - --balance-similar-node-groups=true      
        - --node-group-auto-discovery=mig:namePrefix=tfk-negz-,min=0,max=200
        - --nodes=0:10:https://www.googleapis.com/compute/v1/projects/planet-k8s-staging/zones/us-central1-c/instanceGroups/tfk-negz-wrk-488y

AWS autodiscovering stuff:

$ kubectl --context snorlax-admin -n kube-system logs cluster-autoscaler-3311709055-ckg9m|grep Ignoring|tail -n1
I1205 00:57:39.996656       1 aws_manager.go:180] Ignoring explicitly configured ASG kk-c8a9-pool-worker-f821-AutoScalingGroup-8AQYAA49ML9X for autodiscovery.
$ kubectl --context snorlax-admin -n kube-system logs cluster-autoscaler-3311709055-ckg9m|grep Autodiscovered
I1205 00:15:13.589503       1 aws_manager.go:184] Autodiscovered ASG kk-c8a9-pool-worker-144d-AutoScalingGroup-2HRAH8Q67KSL using tags [k8s.io/cluster-autoscaler/enabled kubernetes.io/cluster/kk-c8a9]
I1205 00:15:13.589549       1 aws_manager.go:184] Autodiscovered ASG kk-c8a9-pool-worker-dc8d-AutoScalingGroup-YADRB18VYXG1 using tags [k8s.io/cluster-autoscaler/enabled kubernetes.io/cluster/kk-c8a9]

GCE autodiscovering stuff:

$ kubectl --kubeconfig tfk-negz.kubecfg -n kube-system logs cluster-autoscaler-7f87d48fd7-z6k5x|grep "Ignoring explicitly"|head -n1
I1204 22:01:43.664112       1 gce_manager.go:913] Ignoring explicitly configured MIG tfk-negz-wrk-488y for autodiscovery.
$ kubectl --kubeconfig tfk-negz.kubecfg -n kube-system logs cluster-autoscaler-7f87d48fd7-z6k5x|grep "Autodiscovered MIG"
I1204 22:01:43.268494       1 gce_manager.go:917] Autodiscovered MIG tfk-negz-wrk-ab1d using regexp ^tfk-negz-.+
I1204 22:01:43.663375       1 gce_manager.go:917] Autodiscovered MIG tfk-negz-wrk-k6vc using regexp ^tfk-negz-.+
I1204 22:01:43.987635       1 gce_manager.go:917] Autodiscovered MIG tfk-negz-wrk-yoz1 using regexp ^tfk-negz-.+

AWS autoscaler status:

$ kubectl --context snorlax-admin -n kube-system get -o yaml configmap cluster-autoscaler-status|grep -E '(Name|Health)'
      Health:      Healthy (ready=6 unready=0 notStarted=0 longNotStarted=0 registered=6 longUnregistered=0)
      Name:        kk-c8a9-pool-worker-f821-AutoScalingGroup-8AQYAA49ML9X
      Health:      Healthy (ready=1 unready=0 notStarted=0 longNotStarted=0 registered=1 longUnregistered=0 cloudProviderTarget=1 (minSize=1, maxSize=10))
      Name:        kk-c8a9-pool-worker-144d-AutoScalingGroup-2HRAH8Q67KSL
      Health:      Healthy (ready=1 unready=0 notStarted=0 longNotStarted=0 registered=1 longUnregistered=0 cloudProviderTarget=1 (minSize=1, maxSize=64))
      Name:        kk-c8a9-pool-worker-dc8d-AutoScalingGroup-YADRB18VYXG1
      Health:      Healthy (ready=1 unready=0 notStarted=0 longNotStarted=0 registered=1 longUnregistered=0 cloudProviderTarget=1 (minSize=1, maxSize=64))

GCE autoscaler status:

      Health:      Healthy (ready=4 unready=0 notStarted=0 longNotStarted=0 registered=4 longUnregistered=0)
      Name:        https://content.googleapis.com/compute/v1/projects/planet-k8s-staging/zones/us-central1-c/instanceGroups/tfk-negz-wrk-488y
      Health:      Healthy (ready=0 unready=0 notStarted=0 longNotStarted=0 registered=0 longUnregistered=0 cloudProviderTarget=0 (minSize=0, maxSize=10))
      Name:        https://content.googleapis.com/compute/v1/projects/planet-k8s-staging/zones/us-central1-a/instanceGroups/tfk-negz-wrk-ab1d
      Health:      Healthy (ready=0 unready=0 notStarted=0 longNotStarted=0 registered=0 longUnregistered=0 cloudProviderTarget=0 (minSize=0, maxSize=200))
      Name:        https://content.googleapis.com/compute/v1/projects/planet-k8s-staging/zones/us-central1-b/instanceGroups/tfk-negz-wrk-k6vc
      Health:      Healthy (ready=0 unready=0 notStarted=0 longNotStarted=0 registered=0 longUnregistered=0 cloudProviderTarget=0 (minSize=0, maxSize=200))
      Name:        https://content.googleapis.com/compute/v1/projects/planet-k8s-staging/zones/us-central1-f/instanceGroups/tfk-negz-wrk-yoz1
      Health:      Healthy (ready=0 unready=0 notStarted=0 longNotStarted=0 registered=0 longUnregistered=0 cloudProviderTarget=0 (minSize=0, maxSize=200))

AWS scaling up:

$ kubectl --context snorlax-admin -n kube-system logs cluster-autoscaler-3311709055-ckg9m|grep "Scale up in group"
I1205 00:21:55.422266       1 clusterstate.go:191] Scale up in group kk-c8a9-pool-worker-144d-AutoScalingGroup-2HRAH8Q67KSL finished successfully in 3m30.95869351s

GCE scaling up:

$ kubectl --kubeconfig tfk-negz.kubecfg -n kube-system logs cluster-autoscaler-7f87d48fd7-z6k5x|grep "Scale up in group"
I1204 22:08:36.078984       1 clusterstate.go:191] Scale up in group https://content.googleapis.com/compute/v1/projects/planet-k8s-staging/zones/us-central1-f/instanceGroups/tfk-negz-wrk-yoz1 finished successfully in 2m53.986810183s
I1204 22:08:47.914760       1 clusterstate.go:191] Scale up in group https://content.googleapis.com/compute/v1/projects/planet-k8s-staging/zones/us-central1-c/instanceGroups/tfk-negz-wrk-488y finished successfully in 4m16.810625391s
I1204 22:08:47.914789       1 clusterstate.go:191] Scale up in group https://content.googleapis.com/compute/v1/projects/planet-k8s-staging/zones/us-central1-b/instanceGroups/tfk-negz-wrk-k6vc finished successfully in 3m6.554820039s
I1204 22:08:47.914795       1 clusterstate.go:191] Scale up in group https://content.googleapis.com/compute/v1/projects/planet-k8s-staging/zones/us-central1-c/instanceGroups/tfk-negz-wrk-488y finished successfully in 2m49.240558728s
I1204 22:08:47.914799       1 clusterstate.go:191] Scale up in group https://content.googleapis.com/compute/v1/projects/planet-k8s-staging/zones/us-central1-b/instanceGroups/tfk-negz-wrk-k6vc finished successfully in 2m48.513106358s

AWS scaling down:

$ kubectl --context snorlax-admin -n kube-system logs cluster-autoscaler-3311709055-ckg9m|grep "scale_down.go:594"
I1205 00:36:21.490272       1 scale_down.go:594] Scale-down: removing empty node REDACTED.us-west-2.compute.internal
I1205 00:36:21.490339       1 scale_down.go:594] Scale-down: removing empty node REDACTED.us-west-2.compute.internal
I1205 00:36:21.490365       1 scale_down.go:594] Scale-down: removing empty node REDACTED.us-west-2.compute.internal

GCE scaling down:

kubectl --kubeconfig tfk-negz.kubecfg -n kube-system logs cluster-autoscaler-7f87d48fd7-z6k5x|grep "scale_down.go:594"
I1204 22:22:08.250324       1 scale_down.go:594] Scale-down: removing empty node tfk-negz-wrk-k6vc-psjg.c.planet-k8s-staging.internal
I1204 22:22:08.250406       1 scale_down.go:594] Scale-down: removing empty node tfk-negz-wrk-488y-lvw5.c.planet-k8s-staging.internal
I1204 22:22:23.967015       1 scale_down.go:594] Scale-down: removing empty node tfk-negz-wrk-yoz1-2wlv.c.planet-k8s-staging.internal
I1204 22:22:23.967110       1 scale_down.go:594] Scale-down: removing empty node tfk-negz-wrk-k6vc-3431.c.planet-k8s-staging.internal
I1204 22:22:36.992188       1 scale_down.go:594] Scale-down: removing empty node tfk-negz-wrk-488y-517t.c.planet-k8s-staging.internal
I1204 22:32:28.530537       1 scale_down.go:594] Scale-down: removing empty node tfk-negz-wrk-ab1d-6qw6.c.planet-k8s-staging.internal

AWS unregistering a deleted ASG:

$ kubectl --context snorlax-admin -n kube-system logs cluster-autoscaler-3311709055-ckg9m|grep "Unregistered"
I1205 01:14:12.329412       1 auto_scaling_groups.go:95] Unregistered ASG kk-c8a9-pool-worker-144d-AutoScalingGroup-2HRAH8Q67KSL

GCE unregistering a deleted MIG:

kubectl --kubeconfig tfk-negz.kubecfg -n kube-system logs cluster-autoscaler-7f87d48fd7-z6k5x|grep "Unregistered"
I1205 01:14:26.757269       1 gce_manager.go:546] Unregistered Mig planet-k8s-staging/us-central1-a/tfk-negz-wrk-ab1d

Note that when testing unregistration I deleted the node group explicitly configured using --nodes (which was also a candidate for autodiscovery) along with one node group that was autodiscovered. I confirmed that only the autodiscovered group was unregistered.

continue
}
updated = append(updated, existing)
changed = true
Contributor

Just noticed this during a final read: shouldn't changed = true happen inside the if, just before the continue?

Contributor Author

Good catch - fixed.
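
For reference, the corrected shape of that loop would be roughly as follows (the loop variable and the predicate are placeholders inferred from the visible fragment, not the PR's real names):

for _, existing := range registered {
	if shouldUnregister(existing) {
		// Only dropping a group changes the set of registered node groups.
		changed = true
		continue
	}
	updated = append(updated, existing)
}

i.e. changed is set when a group is skipped, not every time one is kept.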

@MaciekPytel
Contributor

4 commits is fine. Also, I wish all our contributors provided such a good description of the tests they performed. I added one more comment on something I noticed during a final re-read, though - can you take a look and check whether I'm right?

@MaciekPytel
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 6, 2017
@negz
Contributor Author

negz commented Dec 7, 2017

@MaciekPytel Thanks for the LGTM! What's the next step to get this merged?

@MaciekPytel
Contributor

Sorry to leave you without a comment like that. Normally I just merge PRs after lgtm-ing them, but in this case we're already very late into 1.1.0 testing and this is a very large PR. After discussing with @mwielgus, we'd prefer to merge this after we release 1.1.0 and instead include it in 1.2.0-beta1, which we're planning to release next week. Is that OK with you?

@mwielgus
Contributor

mwielgus commented Dec 8, 2017

We decided there would be too little time to properly test the PR after merging it, so we couldn't approve it for the final K8S 1.9 version (CA 1.1.0), which was cut today. Anyway, we really like your PR and it will be released soon.

@negz
Contributor Author

negz commented Dec 8, 2017

Sounds good! I'm not really in a hurry to be honest, just itching to close out my internal issue tracking this work. ;)

@negz force-pushed the gcedisco branch 2 times, most recently from d70ecbb to ef05fe9 on December 11, 2017 21:05
@negz
Contributor Author

negz commented Dec 11, 2017

Just rebased on master to resolve some merge conflicts.

Nic Cope added 4 commits December 11, 2017 13:09
This commit adds a new usage of the --node-group-auto-discovery flag intended
for use with the GCE cloud provider. GCE instance groups can be automatically
discovered based on a prefix of their group name. Example usage:

--node-group-auto-discovery=mig:prefix=k8s-mig,minNodes=0,maxNodes=10

Note that unlike the existing AWS ASG autodetection functionality we must
specify the min and max nodes in the flag. This is because MIGs store only
a target size in the GCE API - they do not have a min and max size we can
infer via the API.

In order to alleviate this limitation a little we allow multiple uses of the
autodiscovery flag. For example to discover two classes (big and small) of
instance groups with different size limits:

./cluster-autoscaler \
  --node-group-auto-discovery=mig:prefix=k8s-a-small,minNodes=1,maxNodes=10 \
  --node-group-auto-discovery=mig:prefix=k8s-a-big,minNodes=1,maxNodes=100

Zonal clusters (i.e. multizone = false in the cloud config) will detect all
managed instance groups within the cluster's zone. Regional clusters will
detect all matching (zonal) managed instance groups within any of that region's
zones.
The Build method was getting pretty big; this hopefully makes it a little
more readable. It also fixes a few minor error-shadowing bugs.
Node group discovery is now handled by cloudprovider.Refresh() in all cases.
Additionally, explicit node groups can now be used alongside autodiscovery.
@negz
Contributor Author

negz commented Dec 18, 2017

Checking in now that 1.9 is out. :) Again, no huge pressure from my end. I mostly just want to get this in before I have to rebase again.

@mwielgus
Contributor

/lgtm

cacheMutex sync.Mutex
instancesNotInManagedAsg map[AwsRef]struct{}
service autoScalingWrapper
const scaleToZeroSupported = false
Contributor

Why was scale to zero disabled for AWS?

Contributor Author

Capturing discussion we had elsewhere: this was totally accidental - sorry!
