kubeadm-bootstrap fails on vSphere AirGapped Bottlerocket Deployment #9040

geoffo-dev opened this issue Dec 4, 2024 · 1 comment

What happened:

I have been trying to deploy EKS-A on an air-gapped vSphere deployment with little success. I am following the guides but have not been able to get past the etcd node build. It looks like the kubeadm-bootstrap is failing with the following error:

Dec 02 16:45:14 xxx-mgmt-etcd-7q68r host-containers@kubeadm-bootstrap[2774]: -----END RSA PRIVATE KEY-----
Dec 02 16:45:14 xxx-mgmt-etcd-7q68r host-containers@kubeadm-bootstrap[2774]: } {Path:/run/cluster-api/placeholder Owner:root:root Permissions:0640 Content:This placeholder file is used to create the /run/cluster-api sub directory in a way that is compatible with both Linux and Windows (mkdir -p /run/cluster-api does not work with Windows)}] RunCmd:EtcdadmInit public.ecr.aws/eks-distro/etcd-io/etcd 3.5.15-eks-1-31-7 TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256}
Dec 02 16:45:14 xxx-mgmt-etcd-7q68r host-containers@kubeadm-bootstrap[2774]: Using etcdadm support by CAPI
Dec 02 16:45:14 xxx-mgmt-etcd-7q68r host-containers@kubeadm-bootstrap[2774]: Writing userdata write files
Dec 02 16:45:14 xxx-mgmt-etcd-7q68r host-containers@kubeadm-bootstrap[2774]: Running etcdadm init phases
Dec 02 16:45:14 xxx-mgmt-etcd-7q68r host-containers@kubeadm-bootstrap[2774]: Running etcdadm init install phase
Dec 02 16:45:14 xxx-mgmt-etcd-7q68r host-containers@kubeadm-bootstrap[2774]: Phase command output:
Dec 02 16:45:14 xxx-mgmt-etcd-7q68r host-containers@kubeadm-bootstrap[2774]: --------
Dec 02 16:45:14 xxx-mgmt-etcd-7q68r host-containers@kubeadm-bootstrap[2774]: time="2024-12-02T16:45:14Z" level=info msg="[install] Removing existing data dir \"/var/lib/etcd/data\""
Dec 02 16:45:14 xxx-mgmt-etcd-7q68r host-containers@kubeadm-bootstrap[2774]: --------
Dec 02 16:45:14 xxx-mgmt-etcd-7q68r host-containers@kubeadm-bootstrap[2774]: Running etcdadm init certificates phase
Dec 02 16:45:14 xxx-mgmt-etcd-7q68r systemd[1]: Stopping Host container: kubeadm-bootstrap...
Dec 02 16:45:14 xxx-mgmt-etcd-7q68r host-containers@kubeadm-bootstrap[2774]: time="2024-12-02T16:45:14Z" level=info msg="received signal: terminated"
Dec 02 16:45:14 xxx-mgmt-etcd-7q68r host-containers@kubeadm-bootstrap[2774]: time="2024-12-02T16:45:14Z" level=info msg="container task exited" code=143
Dec 02 16:45:14 xxx-mgmt-etcd-7q68r host-containers@kubeadm-bootstrap[2774]: time="2024-12-02T16:45:14Z" level=fatal msg="Container kubeadm-bootstrap exited with non-zero status"
Dec 02 16:45:14 xxx-mgmt-etcd-7q68r systemd[1]: [email protected]: Main process exited, code=exited, status=1/FAILURE
Dec 02 16:45:14 xxx-mgmt-etcd-7q68r systemd[1]: [email protected]: Failed with result 'exit-code'.
Dec 02 16:45:14 xxx-mgmt-etcd-7q68r systemd[1]: Stopped Host container: kubeadm-bootstrap.

This appears to happen at the `etcdadm init certificates` phase, with the container failing. None of the other etcd nodes can then join (they cannot reach the certificate server), and cluster creation fails.
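For anyone trying to reproduce the diagnosis: the bootstrap host-container's journal can be read from the Bottlerocket admin container. This is a sketch; the unit name is taken from the log excerpt above, and `sheltie` is Bottlerocket's helper for getting a root shell on the host.

```shell
# From the Bottlerocket admin container, enter a root shell on the host.
sudo sheltie

# Dump the kubeadm-bootstrap host-container's journal and pull out
# the error lines (the same 'fatal' lines shown in the excerpt above).
journalctl -u host-containers@kubeadm-bootstrap --no-pager \
  | grep -iE 'fatal|error'
```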

I have tried this on a number of different versions of eks-anywhere:

  • v0.20.0 (bundle 68)
  • v0.20.9 (bundle 78)
  • v0.21.1 (bundle 83)

Sadly I don't know enough about how these certificates are generated or about the internals of the bootstrap container, but every creation attempt appears to fail at the same stage.

What you expected to happen:

The initialisation completes and the etcd nodes are healthy.

How to reproduce it (as minimally and precisely as possible):

To stick with the latest version of eks-a:

  • Download the version of eks-anywhere
  • Download the artifacts as described
  • Download the images as described
  • Download Bottlerocket OS and Upload to vSphere as a template

This is the mgmt cluster configuration:

apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: Cluster
metadata:
  name: xxxxx-mgmt
spec:
  registryMirrorConfiguration:
    endpoint: harbor.xxx.xxxxx.com
    port: 443
    authenticate: false
    caCertContent: |
      -----BEGIN CERTIFICATE-----
      xxxxxxx
      JsRPFd8GD+ZAEnOQ
      -----END CERTIFICATE-----
  clusterNetwork:
    cniConfig:
      cilium: {}
    pods:
      cidrBlocks:
      - 192.168.0.0/16
    services:
      cidrBlocks:
      - 10.96.0.0/12
  controlPlaneConfiguration:
    count: 2
    endpoint:
      host: "10.xxx.xxx.xxx"
    machineGroupRef:
      kind: VSphereMachineConfig
      name: xxxxx-mgmt-cp
  datacenterRef:
    kind: VSphereDatacenterConfig
    name: xxxxx-mgmt
  externalEtcdConfiguration:
    count: 3
    machineGroupRef:
      kind: VSphereMachineConfig
      name: xxxxx-mgmt-etcd
  kubernetesVersion: "1.31"
  managementCluster:
    name: xxxxx-mgmt
  workerNodeGroupConfigurations:
  - count: 2
    machineGroupRef:
      kind: VSphereMachineConfig
      name: xxxxx-mgmt
    name: md-0

---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: VSphereDatacenterConfig
metadata:
  name: xxxxx-mgmt
spec:
  datacenter: "xxx"
  insecure: true
  network: "Kubernetes"
  server: "xxx-xxx-vca-01.xxx.xxxxx.com"
  thumbprint: "xxxxxx"

---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: VSphereMachineConfig
metadata:
  name: xxxxx-mgmt-cp
spec:
  datastore: "Datastore_01_C1"
  diskGiB: 25
  folder: "eksa/xxxxx-mgmt"
  memoryMiB: 8192
  numCPUs: 2
  osFamily: bottlerocket
  template: "bottlerocket-vmware-k8s-1.31-x86_64-v1.26.2-xxxxx"
  resourcePool: "/xxx/host/xxx-xxx-vcl-01/Resources"
  users:
  - name: ec2-user
    sshAuthorizedKeys:
    - ssh-rsa xxx ec2-user  

---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: VSphereMachineConfig
metadata:
  name: xxxxx-mgmt
spec:
  datastore: "Datastore_01_C1"
  diskGiB: 150
  folder: "eksa/xxxxx-mgmt"
  memoryMiB: 8192
  numCPUs: 4
  osFamily: bottlerocket
  template: "bottlerocket-vmware-k8s-1.31-x86_64-v1.26.2-xxxxx"
  resourcePool: "/xxx/host/xxx-xxx-vcl-01/Resources"
  users:
  - name: ec2-user
    sshAuthorizedKeys:
    - ssh-rsa xxx ec2-user  
  
---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: VSphereMachineConfig
metadata:
  name: xxxxx-mgmt-etcd
spec:
  datastore: "Datastore_01_C1"
  diskGiB: 25
  folder: "eksa/xxxxx-mgmt"
  memoryMiB: 8192
  numCPUs: 2
  osFamily: bottlerocket
  template: "bottlerocket-vmware-k8s-1.31-x86_64-v1.26.2-xxxxx"
  resourcePool: "/xxx/host/xxx-xxx-vcl-01/Resources"
  users:
  - name: ec2-user
    sshAuthorizedKeys:
    - ssh-rsa xxx ec2-user  

---

Anything else we need to know?:

Environment:

  • EKS Anywhere Release: v0.21.1
  • EKS Distro Release: v1.31
  • Operating Systems: ubuntu 22.04 and Fedora 41
  • Bottlerocket Version: bottlerocket-vmware-k8s-1.31-x86_64-v1.26.2
@geoffo-dev (Author)
I eventually managed to track down the issue. Grepping through the logs turned up the following errors:

Dec 10 16:07:36 localhost systemd-tmpfiles[1652]: Reading config file "/x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/tmpfiles.d/release-ca-certificates.conf"…
Dec 10 16:08:00 xxx-mgmt-etcd-74dk8 host-containers@kubeadm-bootstrap[1838]: Running etcdadm init certificates phase
Dec 10 16:08:00 xxx-mgmt-etcd-74dk8 host-containers@kubeadm-bootstrap[1838]: time="2024-12-10T16:08:00Z" level=info msg="[certificates] creating PKI assets"
Dec 10 16:08:00 xxx-mgmt-etcd-74dk8 host-containers@kubeadm-bootstrap[1838]: time="2024-12-10T16:08:00Z" level=fatal msg="[certificates] failed creating PKI assets: failure loading ca certificate: the certificate is not valid yet"
Dec 10 16:08:00 xxx-mgmt-etcd-74dk8 host-containers@kubeadm-bootstrap[1838]: Error running bootstrapper cmd: error running etcdadm phase 'init certificates', out:
Dec 10 16:08:00 xxx-mgmt-etcd-74dk8 host-containers@kubeadm-bootstrap[1838]:  time="2024-12-10T16:08:00Z" level=info msg="[certificates] creating PKI assets"
Dec 10 16:08:00 xxx-mgmt-etcd-74dk8 host-containers@kubeadm-bootstrap[1838]: time="2024-12-10T16:08:00Z" level=fatal msg="[certificates] failed creating PKI assets: failure loading ca certificate: the certificate is not valid yet"
Dec 10 16:08:57 xxx-mgmt-etcd-74dk8 host-containers@kubeadm-bootstrap[2474]: Running etcdadm init certificates phase

As this was a disconnected cluster, this led me to check whether the time was out of sync: the admin machine I was using was 8 minutes ahead of the ESXi hosts, so the freshly generated certificate was not yet valid on the nodes, and the bootstrap container crashed.
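To illustrate the failure mode: a CA certificate's NotBefore timestamp comes from the issuing machine's clock, so a node whose clock lags the issuer sees a certificate that is "not valid yet". A minimal sketch of that validity-window check (the timestamps below are illustrative, not taken from the cluster):

```python
from datetime import datetime, timedelta, timezone

def cert_valid_now(not_before, not_after, now):
    """Mimic the validity-window check that produces
    'the certificate is not valid yet'."""
    if now < not_before:
        return "not valid yet"
    if now > not_after:
        return "expired"
    return "valid"

# The admin machine, 8 minutes ahead, stamps NotBefore from its own clock...
issuer_clock = datetime(2024, 12, 10, 16, 8, tzinfo=timezone.utc)
not_before = issuer_clock
not_after = issuer_clock + timedelta(days=365)

# ...but the etcd node's clock lags by those 8 minutes.
node_clock = issuer_clock - timedelta(minutes=8)
print(cert_valid_now(not_before, not_after, node_clock))  # → not valid yet
```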

It would be helpful to add a time-synchronisation note to the air-gapped requirements or troubleshooting documentation so that others can identify this in the future.
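A troubleshooting entry along those lines could suggest comparing the node's clock against the certificate's NotBefore. A sketch below; the CA path is an assumption and will vary by setup:

```shell
# Path to the CA certificate on the node (illustrative; adjust as needed).
CA_CRT="${CA_CRT:-/etc/etcd/pki/ca.crt}"

# Print the certificate's validity start alongside the node's UTC clock.
# If notBefore is later than the current time, the issuing machine's
# clock is ahead of this node and the certificate will be rejected.
openssl x509 -in "$CA_CRT" -noout -startdate || echo "CA not found at $CA_CRT"
date -u
```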
