Incorrect deployment command format for cluster-autoscaler on GCE causing "Back-off restarting failed container" loop #10938

Closed
MoSheikh opened this issue Feb 26, 2021 · 0 comments · Fixed by #10944


MoSheikh commented Feb 26, 2021

1. What kops version are you running? The command kops version will display
this information.

1.19.1
1.20.0-beta.1

Bug occurs in both versions.

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

1.20.2

3. What cloud provider are you using?

Google Compute Engine

4. What commands did you run? What is the simplest way to reproduce this issue?

I created a basic GCE cluster following the documentation. In my configuration, api.dev.k8s.local was the cluster name and us-east1-b was the zone I selected. After creating the cluster, I ran the following:

export KOPS_FEATURE_FLAGS=AlphaAllowGCE,SpecOverrideFlag
kops set cluster spec.clusterAutoscaler.enabled=true 
kops update cluster --admin=87600h -y

5. What happened after the commands executed?

The cluster-autoscaler pod created by kops gets stuck with the following event: Back-off restarting failed container

Kops creates a cluster-autoscaler Deployment whose spec.template.spec.containers[0].command array includes the following line:

--nodes=1:1:nodes-us-east1-b.api.example.k8s.local

This follows the format:

--nodes=<minNodes>:<maxNodes>:<instanceGroupName>.<clusterName>
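To see which part of the generated flag is wrong, the value can be split on its first two colons. The sketch below is a hypothetical POSIX shell helper (split_nodes_flag is not part of kops or cluster-autoscaler) and assumes the <minNodes>:<maxNodes>:<name> layout described above. Everything after the second colon is treated as the name, so the dotted cluster-name suffix stays attached to the instance group name.

```sh
#!/bin/sh
# Hypothetical helper: split a cluster-autoscaler flag of the form
# --nodes=<minNodes>:<maxNodes>:<name> into its three fields.
# The body runs in a subshell ( ... ) so its variables do not leak.
split_nodes_flag() (
  value=${1#--nodes=}   # drop the "--nodes=" prefix
  min=${value%%:*}      # text before the first colon
  rest=${value#*:}      # text after the first colon
  max=${rest%%:*}       # text before the second colon
  name=${rest#*:}       # everything after the second colon (may contain dots)
  printf '%s %s %s\n' "$min" "$max" "$name"
)

# The value kops generated in this report:
split_nodes_flag "--nodes=1:1:nodes-us-east1-b.api.example.k8s.local"
# prints: 1 1 nodes-us-east1-b.api.example.k8s.local
```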

6. What did you expect to happen?

The flag above should instead be:

--nodes=1:1:nodes-us-east1-b

This follows the format:

--nodes=<minNodes>:<maxNodes>:<instanceGroupName>

Manually replacing the value of that flag in the Deployment spec resolves the issue.
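The manual workaround amounts to trimming the trailing .<clusterName> from the generated value. A minimal sketch, assuming the suffix is exactly a dot followed by the cluster name; fix_nodes_flag is a hypothetical helper, not a kops command:

```sh
#!/bin/sh
# Hypothetical helper: remove a trailing ".<clusterName>" from a kops-generated
# --nodes flag. If the suffix is not present, the flag is printed unchanged.
fix_nodes_flag() (
  flag=$1
  cluster=$2
  # The quoted pattern makes ".$cluster" match literally (dots included).
  printf '%s\n' "${flag%".$cluster"}"
)

fix_nodes_flag "--nodes=1:1:nodes-us-east1-b.api.example.k8s.local" "api.example.k8s.local"
# prints: --nodes=1:1:nodes-us-east1-b
```

The corrected string would still need to be written back into the Deployment, for example with kubectl -n kube-system edit deployment cluster-autoscaler (the namespace and Deployment name here are assumptions).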

7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.

apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: "2021-02-26T13:57:13Z"
  generation: 5
  name: api.dev.k8s.local
spec:
  api:
    loadBalancer:
      type: Public
  authorization:
    rbac: {}
  channel: stable
  cloudConfig:
    gceServiceAccount: [email protected]
  cloudProvider: gce
  clusterAutoscaler:
    cpuRequest: 100m
    enabled: true
    memoryRequest: 300Mi
    skipNodesWithLocalStorage: true
    skipNodesWithSystemPods: true
  configBase: gs://devops.example.app/kops/state/api.dev.k8s.local
  containerRuntime: docker
  etcdClusters:
  - cpuRequest: 200m
    etcdMembers:
    - instanceGroup: master-us-east1-b
      name: b
      volumeType: pd-standard
    memoryRequest: 100Mi
    name: main
  - cpuRequest: 100m
    etcdMembers:
    - instanceGroup: master-us-east1-b
      name: b
      volumeType: pd-standard
    memoryRequest: 100Mi
    name: events
  iam:
    allowContainerRegistry: true
    legacy: false
  kubelet:
    anonymousAuth: false
  kubernetesApiAccess:
  - 0.0.0.0/0
  kubernetesVersion: 1.19.7
  masterInternalName: api.internal.api.dev.k8s.local
  masterPublicName: api.api.dev.k8s.local
  metricsServer: {}
  networking:
    kubenet: {}
  nonMasqueradeCIDR: 100.64.0.0/10
  project: dojolaunch-302702
  sshAccess:
  - 0.0.0.0/0
  subnets:
  - name: us-east1
    region: us-east1
    type: Public
  topology:
    dns:
      type: Public
    masters: public
    nodes: public

8. Please run the commands with most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or in a gist and provide the gist link here.

9. Anything else we need to know?

MoSheikh changed the title from "Incorrect deployment command syntax for cluster-autoscaler on GCE" to "Incorrect deployment command format for cluster-autoscaler on GCE causing "Back-off restarting failed container" loop" on Feb 26, 2021