
Ubuntu 20.04 - Nodes sometimes fail to come up due to package install issues? #9180

Closed
andersosthus opened this issue May 26, 2020 · 6 comments

@andersosthus (Contributor) commented:

1. What kops version are you running? The command kops version will display this information.

1.17-beta2

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

1.17.5

3. What cloud provider are you using?
AWS

On Monday, the 25th of May, we started noticing problems with new nodes added to our clusters. When adding nodes, some come up just fine, while others don't come up at all.

When investigating the failed nodes, we see what appears to be the kops-configuration.service being restarted in the middle of an apt-get install command, resulting in a bad dpkg state that needs to be resolved with dpkg --configure -a.
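
When a node ends up in that state, getting it going again manually looks roughly like this (a minimal sketch of our manual recovery, not something kops does itself; the unit name is taken from the logs below):

# check whether dpkg was left half-way through an install
sudo dpkg --audit

# finish configuring any half-installed packages
sudo dpkg --configure -a

# let nodeup run again and pick up where it left off
sudo systemctl restart kops-configuration.service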

The timeline on a nodeup looks like this:

  1. Nodeup starts
  2. When starting to install the required packages, nodeup gets restarted (kops-configuration.service: Main process exited, code=killed, status=15/TERM)
  3. Nodeup then starts up again. Whether the restart occurred during a package install or just in between two package installs determines whether the nodeup command completes successfully. Note also that it's not the same package this occurs on every time, but it's always one of the initial packages.
  4. If the restart of nodeup occurred during a package install, nodeup will halt when it tries to install packages again due to the bad dpkg state

Right now it seems random whether a node comes up or not.
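
The excerpt below comes straight from the unit's journal; to reproduce it on an affected node, plain journalctl/systemctl is enough (nothing kops-specific):

# show the bootstrap logs for the nodeup unit
sudo journalctl -u kops-configuration.service --no-pager

# confirm the main process was killed with SIGTERM and restarted
sudo systemctl status kops-configuration.service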

Excerpt of the kops-configuration.service around the time the restart happens:

May 26 12:10:20 ip-10-9-155-142.eu-west-1.compute.internal nodeup[1992]: I0526 12:10:20.562898    1992 executor.go:176] Executing task "Package/nfs-common": Package: nfs-common
May 26 12:10:20 ip-10-9-155-142.eu-west-1.compute.internal nodeup[1992]: I0526 12:10:20.562926    1992 package.go:142] Listing installed packages: dpkg-query -f ${db:Status-Abbrev}${Version}\n -W nfs-common
May 26 12:10:20 ip-10-9-155-142.eu-west-1.compute.internal nodeup[1992]: I0526 12:10:20.562981    1992 executor.go:176] Executing task "File//var/lib/kubelet/kubeconfig": File: "/var/lib/kubelet/kubeconfig"
May 26 12:10:20 ip-10-9-155-142.eu-west-1.compute.internal nodeup[1992]: I0526 12:10:20.563068    1992 files.go:50] Writing file "/var/lib/kubelet/kubeconfig"
May 26 12:10:20 ip-10-9-155-142.eu-west-1.compute.internal nodeup[1992]: I0526 12:10:20.563140    1992 executor.go:176] Executing task "Package/perl": Package: perl
May 26 12:10:20 ip-10-9-155-142.eu-west-1.compute.internal nodeup[1992]: I0526 12:10:20.563166    1992 package.go:142] Listing installed packages: dpkg-query -f ${db:Status-Abbrev}${Version}\n -W perl
May 26 12:10:20 ip-10-9-155-142.eu-west-1.compute.internal nodeup[1992]: I0526 12:10:20.573181    1992 package.go:267] Installing package "python-apt" (dependencies: [])
May 26 12:10:20 ip-10-9-155-142.eu-west-1.compute.internal nodeup[1992]: I0526 12:10:20.573222    1992 package.go:350] running command [apt-get install --yes --no-install-recommends python-apt]
May 26 12:10:26 ip-10-9-155-142.eu-west-1.compute.internal nodeup[1992]: I0526 12:10:26.243773    1992 package.go:267] Installing package "bridge-utils" (dependencies: [])
May 26 12:10:26 ip-10-9-155-142.eu-west-1.compute.internal nodeup[1992]: I0526 12:10:26.243815    1992 package.go:350] running command [apt-get install --yes --no-install-recommends bridge-utils]
May 26 12:10:28 ip-10-9-155-142.eu-west-1.compute.internal nodeup[1992]: I0526 12:10:28.960278    1992 package.go:267] Installing package "apt-transport-https" (dependencies: [])
May 26 12:10:28 ip-10-9-155-142.eu-west-1.compute.internal nodeup[1992]: I0526 12:10:28.960315    1992 package.go:350] running command [apt-get install --yes --no-install-recommends apt-transport-https]
May 26 12:10:31 ip-10-9-155-142.eu-west-1.compute.internal nodeup[1992]: I0526 12:10:31.373998    1992 package.go:267] Installing package "ebtables" (dependencies: [])
May 26 12:10:31 ip-10-9-155-142.eu-west-1.compute.internal nodeup[1992]: I0526 12:10:31.374035    1992 package.go:350] running command [apt-get install --yes --no-install-recommends ebtables]
May 26 12:10:34 ip-10-9-155-142.eu-west-1.compute.internal nodeup[1992]: I0526 12:10:34.281194    1992 package.go:267] Installing package "conntrack" (dependencies: [])
May 26 12:10:34 ip-10-9-155-142.eu-west-1.compute.internal nodeup[1992]: I0526 12:10:34.281223    1992 package.go:350] running command [apt-get install --yes --no-install-recommends conntrack]
May 26 12:10:36 ip-10-9-155-142.eu-west-1.compute.internal nodeup[1992]: I0526 12:10:36.836464    1992 package.go:267] Installing package "socat" (dependencies: [])
May 26 12:10:36 ip-10-9-155-142.eu-west-1.compute.internal nodeup[1992]: I0526 12:10:36.836499    1992 package.go:350] running command [apt-get install --yes --no-install-recommends socat]
May 26 12:10:39 ip-10-9-155-142.eu-west-1.compute.internal nodeup[1992]: I0526 12:10:39.538337    1992 package.go:267] Installing package "netcat-traditional" (dependencies: [])
May 26 12:10:39 ip-10-9-155-142.eu-west-1.compute.internal nodeup[1992]: I0526 12:10:39.538367    1992 package.go:350] running command [apt-get install --yes --no-install-recommends netcat-traditional]
May 26 12:10:42 ip-10-9-155-142.eu-west-1.compute.internal nodeup[1992]: I0526 12:10:42.193969    1992 package.go:267] Installing package "nfs-common" (dependencies: [])
May 26 12:10:42 ip-10-9-155-142.eu-west-1.compute.internal nodeup[1992]: I0526 12:10:42.193998    1992 package.go:350] running command [apt-get install --yes --no-install-recommends nfs-common]
May 26 12:10:43 ip-10-9-155-142.eu-west-1.compute.internal useradd[3590]: new user: name=_rpc, UID=113, GID=65534, home=/run/rpcbind, shell=/usr/sbin/nologin, from=none
May 26 12:10:43 ip-10-9-155-142.eu-west-1.compute.internal usermod[3598]: change user '_rpc' password
May 26 12:10:43 ip-10-9-155-142.eu-west-1.compute.internal chage[3605]: changed password expiry for _rpc
May 26 12:10:44 ip-10-9-155-142.eu-west-1.compute.internal useradd[3969]: new user: name=statd, UID=114, GID=65534, home=/var/lib/nfs, shell=/usr/sbin/nologin, from=none
May 26 12:10:44 ip-10-9-155-142.eu-west-1.compute.internal usermod[3977]: change user 'statd' password
May 26 12:10:44 ip-10-9-155-142.eu-west-1.compute.internal chage[3984]: changed password expiry for statd
May 26 12:10:48 ip-10-9-155-142.eu-west-1.compute.internal nodeup[1992]: I0526 12:10:48.764090    1992 package.go:267] Installing package "ntp" (dependencies: [])
May 26 12:10:48 ip-10-9-155-142.eu-west-1.compute.internal nodeup[1992]: I0526 12:10:48.764120    1992 package.go:350] running command [apt-get install --yes --no-install-recommends ntp]
May 26 12:10:51 ip-10-9-155-142.eu-west-1.compute.internal groupadd[4685]: group added to /etc/group: name=ntp, GID=119
May 26 12:10:51 ip-10-9-155-142.eu-west-1.compute.internal groupadd[4685]: group added to /etc/gshadow: name=ntp
May 26 12:10:51 ip-10-9-155-142.eu-west-1.compute.internal groupadd[4685]: new group: name=ntp, GID=119
May 26 12:10:51 ip-10-9-155-142.eu-west-1.compute.internal useradd[4692]: new user: name=ntp, UID=115, GID=119, home=/nonexistent, shell=/usr/sbin/nologin, from=none
May 26 12:10:51 ip-10-9-155-142.eu-west-1.compute.internal usermod[4700]: change user 'ntp' password
May 26 12:10:51 ip-10-9-155-142.eu-west-1.compute.internal chage[4707]: changed password expiry for ntp
May 26 12:10:57 ip-10-9-155-142.eu-west-1.compute.internal nodeup[1992]: I0526 12:10:57.307743    1992 executor.go:103] Tasks: 51 done / 64 total; 1 can run
May 26 12:10:57 ip-10-9-155-142.eu-west-1.compute.internal nodeup[1992]: I0526 12:10:57.307782    1992 executor.go:176] Executing task "Package/docker-ce": Package: docker-ce
May 26 12:10:57 ip-10-9-155-142.eu-west-1.compute.internal nodeup[1992]: I0526 12:10:57.307826    1992 package.go:142] Listing installed packages: dpkg-query -f ${db:Status-Abbrev}${Version}\n -W docker-ce
May 26 12:10:57 ip-10-9-155-142.eu-west-1.compute.internal nodeup[1992]: I0526 12:10:57.316511    1992 package.go:267] Installing package "docker-ce" (dependencies: [Package: docker-ce-cli Package: containerd.io])
May 26 12:10:57 ip-10-9-155-142.eu-west-1.compute.internal nodeup[1992]: I0526 12:10:57.316611    1992 http.go:77] Downloading "https://download.docker.com/linux/ubuntu/dists/bionic/pool/stable/amd64/docker-ce_19>
May 26 12:10:57 ip-10-9-155-142.eu-west-1.compute.internal nodeup[1992]: I0526 12:10:57.658060    1992 files.go:100] Hash matched for "/var/cache/nodeup/packages/docker-ce.deb": sha1:ee640d9258fd4d3f4c7017ab2a71d>
May 26 12:10:57 ip-10-9-155-142.eu-west-1.compute.internal nodeup[1992]: I0526 12:10:57.658167    1992 http.go:77] Downloading "https://download.docker.com/linux/ubuntu/dists/bionic/pool/stable/amd64/docker-ce-cl>
May 26 12:10:58 ip-10-9-155-142.eu-west-1.compute.internal nodeup[1992]: I0526 12:10:58.120070    1992 files.go:100] Hash matched for "/var/cache/nodeup/packages/docker-ce-cli.deb": sha1:09402bf5dac40f0c50f1071b1>
May 26 12:10:58 ip-10-9-155-142.eu-west-1.compute.internal nodeup[1992]: I0526 12:10:58.120171    1992 http.go:77] Downloading "https://download.docker.com/linux/ubuntu/dists/bionic/pool/stable/amd64/containerd.i>
May 26 12:10:58 ip-10-9-155-142.eu-west-1.compute.internal nodeup[1992]: I0526 12:10:58.336682    1992 files.go:100] Hash matched for "/var/cache/nodeup/packages/containerd.io.deb": sha1:f4c941807310e3fa470dddfb0>
May 26 12:10:58 ip-10-9-155-142.eu-west-1.compute.internal nodeup[1992]: I0526 12:10:58.336719    1992 package.go:327] running command [apt-get install --yes --no-install-recommends /var/cache/nodeup/packages/doc>
May 26 12:11:03 ip-10-9-155-142.eu-west-1.compute.internal systemd[1]: kops-configuration.service: Main process exited, code=killed, status=15/TERM
May 26 12:11:03 ip-10-9-155-142.eu-west-1.compute.internal systemd[1]: kops-configuration.service: Failed with result 'signal'.
May 26 12:11:03 ip-10-9-155-142.eu-west-1.compute.internal systemd[1]: Stopped Run kops bootstrap (nodeup).
May 26 12:11:03 ip-10-9-155-142.eu-west-1.compute.internal systemd[1]: Starting Run kops bootstrap (nodeup)...
May 26 12:11:03 ip-10-9-155-142.eu-west-1.compute.internal nodeup[5879]: nodeup version 1.17.0-beta.2 (git-e0d2809d0)

We've had some discussions on Slack with @hakman about this, see this thread: https://kubernetes.slack.com/archives/C3QUFP0QM/p1590443302247200

If needed, we can provide full logs, and we can also test fixes on this cluster.

@hakman (Member) commented May 26, 2020:

/assign

@paalkr commented May 26, 2020:

This is a slightly truncated version of the cluster spec:

apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  name: xxx
spec:

### General config ###
  channel: stable
  cloudProvider: aws
  configBase: xxx
  dnsZone: xxx
  kubernetesVersion: 1.17.5
  encryptionConfig: true
  docker:
    version: 19.03.4

### Security and access config ###
  sshKeyName: xxx
  authorization:
    rbac: {}
  api:
    loadBalancer:
      type: Internal
      crossZoneLoadBalancing: true
      useForInternalApi: true
  masterInternalName: xxx
  masterPublicName: xxx
  topology:
    dns:
      type: Private
    masters: private
    nodes: private
  kubernetesApiAccess:
  - 0.0.0.0/0
  sshAccess:
  - 0.0.0.0/0
  iam:
    allowContainerRegistry: true
    legacy: false
  additionalPolicies:
    master: |
      [
        {
          "Action": "sts:AssumeRole",
          "Resource": "*",
          "Effect": "Allow"
        },
        {
            "Action": "autoscaling:Describe*",
            "Resource": [
                "*"
            ],
            "Effect": "Allow"
        },
        {
            "Action": [
                "autoscaling:CompleteLifecycleAction",
                "autoscaling:PutLifecycleHook"
            ],
            "Resource": "*",
            "Effect": "Allow"
        },
        {
          "Effect": "Allow",
          "Action": [
            "kms:Decrypt",
            "kms:Encrypt"
          ],
          "Resource": [
            "arn:aws:kms:eu-west-1:xxx:key/xxx"
          ]
        }         
      ]
    node: |
      [
        {
          "Effect": "Allow",
          "Action": ["ec2:DescribeTags"],
          "Resource": ["*"]
        },
        {
          "Effect": "Allow",
          "Action": ["s3:*"],
          "Resource": [
            "arn:aws:s3:::xxx/*"
          ]
        },
        {
            "Action": "autoscaling:Describe*",
            "Resource": [
                "*"
            ],
            "Effect": "Allow"
        },
        {
            "Action": [
                "autoscaling:CompleteLifecycleAction",
                "autoscaling:PutLifecycleHook"
            ],
            "Resource": "*",
            "Effect": "Allow"
        }        
      ]

### Cluster config ###
  networking:
    flannel:
      backend: vxlan
  nonMasqueradeCIDR: 100.64.0.0/10
  externalDns:
    watchIngress: false
  kubeAPIServer:
    appendAdmissionPlugins:
    - PodPreset
    runtimeConfig:
      settings.k8s.io/v1alpha1: "true"
  kubelet:
    anonymousAuth: false
    authorizationMode: Webhook
    authenticationTokenWebhook: true
    featureGates:
      ExpandInUsePersistentVolumes: "true"
      ExpandCSIVolumes: "true"
      VolumeSnapshotDataSource: "true"
    allowedUnsafeSysctls:
    - "net.ipv4.*"
  kubeProxy:
    metricsBindAddress: 0.0.0.0
  kubeDNS:
    provider: CoreDNS
  kubeControllerManager:
    horizontalPodAutoscalerDownscaleStabilization: 10m
    horizontalPodAutoscalerUpscaleDelay: 4m
    horizontalPodAutoscalerDownscaleDelay: 10m
  fileAssets:
  - name: aws-encryption-provider.yaml
    ## Note: if no path is specified, the default path is /srv/kubernetes/assets/<name>
    path: /etc/kubernetes/manifests/aws-encryption-provider.yaml
    roles:
    - Master
    content: |
      apiVersion: v1
      kind: Pod
      metadata:
        annotations:
          scheduler.alpha.kubernetes.io/critical-pod: ""
        labels:
          k8s-app: aws-encryption-provider
        name: aws-encryption-provider
        namespace: kube-system
      spec:
        containers:
        - image: xxx
          name: aws-encryption-provider
          command:
          - /aws-encryption-provider
          - --key=arn:aws:kms:eu-west-1:xxx:key/xxx
          - --region=eu-west-1
          - --listen=/srv/kubernetes/socket.sock
          - --health-port=:8083
          ports:
          - containerPort: 8083
            protocol: TCP
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8083
          volumeMounts:
          - mountPath: /srv/kubernetes
            name: kmsplugin
        hostNetwork: true
        tolerations:
        - key: CriticalAddonsOnly
          operator: Exists
        - effect: NoExecute
          operator: Exists       
        priorityClassName: system-cluster-critical
        volumes:
        - name: kmsplugin
          hostPath:
            path: /srv/kubernetes
            type: DirectoryOrCreate    
  hooks:
  - name: flannel-4096-tx-checksum-offload-disable.service
    # Flannel runs on a non-standard port due to Windows nodes
    roles:
    - Node
    - Master
    useRawManifest: true
    manifest: |
      [Unit]
      Description=Disable TX checksum offload on flannel.4096
      After=sys-devices-virtual-net-flannel.4096.device
      After=sys-subsystem-net-devices-flannel.4096.device
      After=docker.service
      [Service]
      Type=oneshot
      ExecStart=/sbin/ethtool -K flannel.4096 tx-checksum-ip-generic off

### AWS VPC config ###
  networkCIDR: 10.9.0.0/16
  networkID: xxx
  subnets:
  - cidr: xxx
    egress: External
    id: subnet-xxx
    name: private-a
    type: Private
    zone: eu-west-1a
  - cidr: 10.9.160.0/19
    egress: External
    id: subnet-xxx
    name: private-b
    type: Private
    zone: eu-west-1b
  - cidr: 10.9.192.0/19
    egress: External
    id: subnet-xxx
    name: private-c
    type: Private
    zone: eu-west-1c
  - cidr: 10.9.0.0/19
    egress: External
    id: subnet-xxx
    name: public-a
    type: Utility
    zone: eu-west-1a
  - cidr: 10.9.32.0/19
    egress: External
    id: subnet-xxx
    name: public-b
    type: Utility
    zone: eu-west-1b
  - cidr: 10.9.64.0/19
    egress: External
    id: subnet-xxx
    name: public-c
    type: Utility
    zone: eu-west-1c

### etcd config ###
  etcdClusters:
  - cpuRequest: 200m
    etcdMembers:
    - instanceGroup: master-a
      name: a
      volumeType: gp2
      volumeSize: 5
    - instanceGroup: master-b
      name: b
      volumeType: gp2
      volumeSize: 5
    - instanceGroup: master-c
      name: c
      volumeType: gp2
      volumeSize: 5
    memoryRequest: 100Mi
    name: main
  - cpuRequest: 100m
    etcdMembers:
    - instanceGroup: master-a
      name: a
      volumeType: gp2
      volumeSize: 5
    - instanceGroup: master-b
      name: b
      volumeType: gp2
      volumeSize: 5
    - instanceGroup: master-c
      name: c
      volumeType: gp2
      volumeSize: 5
    memoryRequest: 100Mi
    name: events

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: xxx
  name: master-a
spec:
  image: ubuntu/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20200423
  detailedInstanceMonitoring: true
  additionalSecurityGroups:
  - sg-xxx
  machineType: t3.large
  maxSize: 2
  minSize: 1
  rootVolumeSize: 30
  rootVolumeType: gp2
  nodeLabels:
    kops.k8s.io/instancegroup: master-a
    xxx/role: master
  role: Master
  additionalUserData:
  - name: hostnamefix.sh
    type: text/x-shellscript
    content: |
      #!/bin/sh
      hostnamectl set-hostname $(hostname -f)
  subnets:
  - private-a

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: xxx
  name: master-b
spec:
  image: ubuntu/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20200423
  detailedInstanceMonitoring: true
  additionalSecurityGroups:
  - sg-xxx
  machineType: t3.large
  maxSize: 2
  minSize: 1
  rootVolumeSize: 30
  rootVolumeType: gp2
  nodeLabels:
    kops.k8s.io/instancegroup: master-b
    xxx/role: master
  role: Master
  additionalUserData:
  - name: hostnamefix.sh
    type: text/x-shellscript
    content: |
      #!/bin/sh
      hostnamectl set-hostname $(hostname -f)
  subnets:
  - private-b

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: xxx
  name: master-c
spec:
  image: ubuntu/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20200423
  detailedInstanceMonitoring: true
  additionalSecurityGroups:
  - sg-xxx
  machineType: t3.large
  maxSize: 2
  minSize: 1
  rootVolumeSize: 30
  rootVolumeType: gp2
  nodeLabels:
    kops.k8s.io/instancegroup: master-c
    xxx/role: master
  role: Master
  additionalUserData:
  - name: hostnamefix.sh
    type: text/x-shellscript
    content: |
      #!/bin/sh
      hostnamectl set-hostname $(hostname -f)
  subnets:
  - private-c

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: xxx
  name: spot
spec:
  image: ubuntu/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20200423
  role: Node
  additionalUserData:
  - name: hostnamefix.sh
    type: text/x-shellscript
    content: |
      #!/bin/sh
      hostnamectl set-hostname $(hostname -f)
  maxSize: 10
  minSize: 3
  nodeLabels:
    kops.k8s.io/instancegroup: spot
    xxx/role: worker
    node-role.kubernetes.io/spot: ''
  cloudLabels:
    role: spot
  externalLoadBalancers:
  - targetGroupArn: arn:aws:elasticloadbalancing:eu-west-1:xxx
  - targetGroupArn: arn:aws:elasticloadbalancing:eu-west-1:xxx
  - targetGroupArn: arn:aws:elasticloadbalancing:eu-west-1:xxx
  - targetGroupArn: arn:aws:elasticloadbalancing:eu-west-1:xxx
  detailedInstanceMonitoring: true
  additionalSecurityGroups:
  - sg-xxx
  - sg-xxx
  - sg-xxx
  subnets:
  - private-a
  - private-b
  - private-c
  machineType: r5d.xlarge
  mixedInstancesPolicy:
    instances:
    - r4.xlarge
    - r5.xlarge
    # - r5d.xlarge
    # - i3.xlarge
    # - r5d.xlarge
    # - i3.xlarge   
    # - r5ad.xlarge
    onDemandAllocationStrategy: prioritized
    onDemandBase: 0
    onDemandAboveBase: 0
    spotInstancePools: 2
    spotAllocationStrategy: lowest-price
  rootVolumeSize: 75
  rootVolumeType: gp2
  # volumeMounts:
  # - device: /dev/nvme1n1
  #   filesystem: ext4
  #   path: /var/lib/docker

@tuapuikia commented:

Docker just added a new repository for Focal, so we can now use the official packages instead of the Bionic versions.
https://download.docker.com/linux/ubuntu/dists/focal/
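
For anyone who wants to try the Focal packages manually before a kops fix lands, pointing apt at the new repository looks roughly like this (a sketch only; kops normally manages the Docker install itself, and add-apt-repository assumes software-properties-common is present):

# add Docker's signing key and the Focal repository
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu focal stable"

# refresh and see which docker-ce versions the Focal repo offers
sudo apt-get update
apt-cache madison docker-ce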

@paalkr commented May 28, 2020:

@hakman, I can confirm that #9182 fixed the problem.
Thanks

@hakman (Member) commented May 29, 2020:

Thanks for the update and for all the effort on this, @paalkr @andersosthus.
/close

@k8s-ci-robot (Contributor) commented:

@hakman: Closing this issue.

In response to this:

Thanks for the update and for all the effort on this, @paalkr @andersosthus.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
