node_groups never create successfully #1628

Closed

mhill-holoplot opened this issue Oct 7, 2021 · 9 comments

Comments

@mhill-holoplot

Description

node_groups never create successfully. Presumably this is a problem with communication between the workers and the control plane.

Versions

  • Terraform: v1.0.8
  • Provider(s):
    • registry.terraform.io/aiven/aiven v2.2.1
    • registry.terraform.io/hashicorp/aws v3.61.0
    • registry.terraform.io/hashicorp/cloudinit v2.2.0
    • registry.terraform.io/hashicorp/kubernetes v2.5.0
    • registry.terraform.io/hashicorp/local v2.1.0
    • registry.terraform.io/hashicorp/null v3.1.0
    • registry.terraform.io/hashicorp/tls v3.1.0
    • registry.terraform.io/terraform-aws-modules/http v2.4.1

Reproduction

terraform apply

Code Snippet to Reproduce

resource "aws_kms_key" "argo_eks" {                                                                                                                                                                                                    [7/1815]
  description             = "Argo EKS Secret Encryption Key"                                                                                                                                                                                   
  enable_key_rotation     = true                                                                                                                                                                                                               
  deletion_window_in_days = 7                                                                                                                                                                                                                  
}                                                                                                                                                                                                                                              
                                                                                                                                                                                                                                               
module "argo_eks" {                                                                                                                                                                                                                            
  source                    = "terraform-aws-modules/eks/aws"                                                                                                                                                                                  
  manage_aws_auth           = false
  cluster_name              = format("metrics_argo_eks_%s", terraform.workspace)
  cluster_version           = "1.21"
  subnets                   = var.metrics_az_ids
  vpc_id                    = var.vpc
  cluster_enabled_log_types = ["audit", "api"]

  cluster_encryption_config = [
    {
      provider_key_arn = aws_kms_key.argo_eks.arn
      resources        = ["secrets"]
    }
  ]

  cluster_endpoint_private_access                = true
  cluster_create_endpoint_private_access_sg_rule = true
  cluster_endpoint_private_access_cidrs          = var.metrics_az_cidrs

  node_groups_defaults = {
    ami_type  = "AL2_x86_64"
    disk_size = 50
  }

  node_groups = {
    default = {
      desired_capacity = 2
      max_capacity     = 10
      min_capacity     = 2

      instance_types = ["t3.small"]
      capacity_type  = "ON_DEMAND"
      update_config = {
        max_unavailable = 1
      }
    }
  }

  worker_create_cluster_primary_security_group_rules = true 

  tags = {
    Environment = terraform.workspace
  }
}

data "aws_eks_cluster" "argo_eks" {
  name = module.argo_eks.cluster_id
}

data "aws_eks_cluster_auth" "argo_eks" {
  name = module.argo_eks.cluster_id
}

provider "kubernetes" {
  host                   = data.aws_eks_cluster.argo_eks.endpoint
  cluster_ca_certificate = base64decode(data.aws_eks_cluster.argo_eks.certificate_authority[0].data)
  token                  = data.aws_eks_cluster_auth.argo_eks.token
}

Expected behavior

Applies successfully

Actual behavior

Times out and enters a failed state

Terminal Output Screenshot(s)

screenshot-2021-10-07T16:39:19+02:00
screenshot-2021-10-07T16:38:57+02:00

Additional context

screenshot-2021-10-07T16:40:53+02:00

@jaimehrubiks
Contributor

Are the instances created? If so, maybe you can SSH in and run "sudo journalctl -f | grep cloud-init" to see why they can't join. If the instances are not created at all, CloudTrail may help.

@mhill-holoplot
Author

The instances are created. I'll see if I can ssh into them.

@daroga0002
Contributor

You are using a private endpoint and manage_aws_auth = false, so I suggest checking the aws-auth ConfigMap.
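
For reference, a minimal sketch of what managing aws-auth yourself can look like with the kubernetes provider (the role ARN here is illustrative; with managed node groups, EKS normally adds this entry itself, but it is worth verifying it matches the node role):

resource "kubernetes_config_map" "aws_auth" {
  metadata {
    name      = "aws-auth"
    namespace = "kube-system"
  }

  data = {
    # Illustrative role ARN; substitute the actual node group role.
    mapRoles = yamlencode([
      {
        rolearn  = "arn:aws:iam::111111111111:role/my-node-role"
        username = "system:node:{{EC2PrivateDNSName}}"
        groups   = ["system:bootstrappers", "system:nodes"]
      }
    ])
  }
}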

@mhill-holoplot
Author

sudo journalctl -f | grep cloud-init

Oct 07 13:28:42 ip-10-1-6-110.eu-central-1.compute.internal cloud-init[2512]: Cloud-init v. 19.3-44.amzn2 running 'modules:final' at Thu, 07 Oct 2021 13:28:41 +0000. Up 51.58 seconds.
Oct 07 13:28:42 ip-10-1-6-110.eu-central-1.compute.internal cloud-init[2512]: + B64_CLUSTER_CA=LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUM1ekNDQWMrZ0F3SUJBZ0lCQURBTkJna3Foa2lHOXcwQkFRc0ZBREFWTVJNd0VRWURWUVFERXdwcmRXSmwKY201bGRHVnpNQjRYRFRJeE1UQXdOekV5TVRrMU4xb1hEVE14TVRBd05URXlNVGsxTjFvd0ZURVRNQkVHQTFVRQpBeE1LYTNWaVpYSnVaWFJsY3pDQ0FTSXdEUVlKS29aSWh2Y05BUUVCQlFBRGdnRVBBRENDQVFvQ2dnRUJBTTlCClFzYU4vRk1LMExBeVVrZzk4RnVDdnRCNjJ5RzNjRWNHeFhVejhzcnBFS2V2aU9iYWxwcm54bFJFeXB5OTdOYVUKNnZjYndoYVplSjVCTTBCRGRiYllvVTlVYU1qUFVBTnNuUnUzN0l5dG1hcmpJV21WVlJtZG5NRHJjYnNxZjIxbQpRVjNVTVp5SUFHakpLQTZ5dFMxcVhXRkhRZVk3M3dsaFZ3YzRCbFVVaGg4Zll5Y0RjTWJlQWN5UjB0allIWXhpCjBNdEdPdEx6RmNnenoxRGpLRFBENXhYWFJUSVQyS1pGU1ZnQVN1R0twZUhhbzhkeHFEQmd1eGtqbldRWlpQQzEKR3h1Zk1DN0F1M3BkUlBsb0hCVDF3S05CYTI3dGQwNGdjTGdIYUwvZGxDY1NtdWNxMVFYeWxnaUhwa3VoK21WTQpZTHlOSi95WEhhcXJJeFRrLzBFQ0F3RUFBYU5DTUVBd0RnWURWUjBQQVFIL0JBUURBZ0trTUE4R0ExVWRFd0VCCi93UUZNQU1CQWY4d0hRWURWUjBPQkJZRUZPOVBxaTBHWEg4ejJSUG5HNjVuYWF4SzhOTUJNQTBHQ1NxR1NJYjMKRFFFQkN3VUFBNElCQVFCaExrd2pvSklEUUhtNVpnemd6azdLUjM2cmhZeEFiZHZ1MFV0RlhjQThydW54SlpuQwpUK3FKS0hWZUkzNGR4RzRBTWNSRnBSUDBDSWIvbXpvYmQ5U2lWOTZwRVFleGZQQmRQYTNyNDFsanJnK0Vqb1JpClh5R0pUb0VtekJNOEIvWFBGWFhoWFkzRmlyR3BhczRBRkY5MFkyakdBNmE2dGZZcDBDTzlxek0xTU9JYzRwMTIKcHplSW13TFdBZWVrRmxIeWJjQlFUai8vRUFhZG9YMGo4T05DVm5WMHIvZ0lxYUV1aXpycDNySHc2ZFVHRmtLcwo3T0xBVmhlQWU2TUFDZysvbnZUZDA5bVVlL29yRms2ei90dnBpSFRPM1lMZWNIeHdpcWZBc1k1S0NoS3hIYnJPCmNhaTViQTVEQVRKWWJnVm50clB2eUc4M3pkc2RYOGI2bUxRYwotLS0tLUVORCBDRVJUSUZJQ0FURS0tLS0tCg==
Oct 07 13:28:42 ip-10-1-6-110.eu-central-1.compute.internal cloud-init[2512]: + API_SERVER_URL=https://1F1FABABCA51EC054EB52BEDA96605AD.gr7.eu-central-1.eks.amazonaws.com
Oct 07 13:28:42 ip-10-1-6-110.eu-central-1.compute.internal cloud-init[2512]: + K8S_CLUSTER_DNS_IP=172.20.0.10
Oct 07 13:28:42 ip-10-1-6-110.eu-central-1.compute.internal cloud-init[2512]: + /etc/eks/bootstrap.sh metrics_argo_eks_prod --kubelet-extra-args --node-labels=eks.amazonaws.com/nodegroup-image=ami-0703a89bcf1417d91,eks.amazonaws.com/capacityType=ON_DEMAND,eks.amazonaws.com/nodegroup=metrics_argo_eks_prod-default20211007132715842900000001 --b64-cluster-ca LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUM1ekNDQWMrZ0F3SUJBZ0lCQURBTkJna3Foa2lHOXcwQkFRc0ZBREFWTVJNd0VRWURWUVFERXdwcmRXSmwKY201bGRHVnpNQjRYRFRJeE1UQXdOekV5TVRrMU4xb1hEVE14TVRBd05URXlNVGsxTjFvd0ZURVRNQkVHQTFVRQpBeE1LYTNWaVpYSnVaWFJsY3pDQ0FTSXdEUVlKS29aSWh2Y05BUUVCQlFBRGdnRVBBRENDQVFvQ2dnRUJBTTlCClFzYU4vRk1LMExBeVVrZzk4RnVDdnRCNjJ5RzNjRWNHeFhVejhzcnBFS2V2aU9iYWxwcm54bFJFeXB5OTdOYVUKNnZjYndoYVplSjVCTTBCRGRiYllvVTlVYU1qUFVBTnNuUnUzN0l5dG1hcmpJV21WVlJtZG5NRHJjYnNxZjIxbQpRVjNVTVp5SUFHakpLQTZ5dFMxcVhXRkhRZVk3M3dsaFZ3YzRCbFVVaGg4Zll5Y0RjTWJlQWN5UjB0allIWXhpCjBNdEdPdEx6RmNnenoxRGpLRFBENXhYWFJUSVQyS1pGU1ZnQVN1R0twZUhhbzhkeHFEQmd1eGtqbldRWlpQQzEKR3h1Zk1DN0F1M3BkUlBsb0hCVDF3S05CYTI3dGQwNGdjTGdIYUwvZGxDY1NtdWNxMVFYeWxnaUhwa3VoK21WTQpZTHlOSi95WEhhcXJJeFRrLzBFQ0F3RUFBYU5DTUVBd0RnWURWUjBQQVFIL0JBUURBZ0trTUE4R0ExVWRFd0VCCi93UUZNQU1CQWY4d0hRWURWUjBPQkJZRUZPOVBxaTBHWEg4ejJSUG5HNjVuYWF4SzhOTUJNQTBHQ1NxR1NJYjMKRFFFQkN3VUFBNElCQVFCaExrd2pvSklEUUhtNVpnemd6azdLUjM2cmhZeEFiZHZ1MFV0RlhjQThydW54SlpuQwpUK3FKS0hWZUkzNGR4RzRBTWNSRnBSUDBDSWIvbXpvYmQ5U2lWOTZwRVFleGZQQmRQYTNyNDFsanJnK0Vqb1JpClh5R0pUb0VtekJNOEIvWFBGWFhoWFkzRmlyR3BhczRBRkY5MFkyakdBNmE2dGZZcDBDTzlxek0xTU9JYzRwMTIKcHplSW13TFdBZWVrRmxIeWJjQlFUai8vRUFhZG9YMGo4T05DVm5WMHIvZ0lxYUV1aXpycDNySHc2ZFVHRmtLcwo3T0xBVmhlQWU2TUFDZysvbnZUZDA5bVVlL29yRms2ei90dnBpSFRPM1lMZWNIeHdpcWZBc1k1S0NoS3hIYnJPCmNhaTViQTVEQVRKWWJnVm50clB2eUc4M3pkc2RYOGI2bUxRYwotLS0tLUVORCBDRVJUSUZJQ0FURS0tLS0tCg== --apiserver-endpoint https://1F1FABABCA51EC054EB52BEDA96605AD.gr7.eu-central-1.eks.amazonaws.com --dns-cluster-ip 172.20.0.10
Oct 07 13:28:42 ip-10-1-6-110.eu-central-1.compute.internal cloud-init[2512]: Created symlink from /etc/systemd/system/multi-user.target.wants/iptables-restore.service to /etc/systemd/system/iptables-restore.service.
Oct 07 13:28:43 ip-10-1-6-110.eu-central-1.compute.internal cloud-init[2512]: Created symlink from /etc/systemd/system/multi-user.target.wants/docker.service to /usr/lib/systemd/system/docker.service.
Oct 07 13:28:51 ip-10-1-6-110.eu-central-1.compute.internal cloud-init[2512]: Created symlink from /etc/systemd/system/multi-user.target.wants/kubelet.service to /etc/systemd/system/kubelet.service.
Oct 07 13:28:51 ip-10-1-6-110.eu-central-1.compute.internal cloud-init[2512]: nvidia-smi not found
Oct 07 13:28:51 ip-10-1-6-110.eu-central-1.compute.internal cloud-init[2512]: ci-info: no authorized ssh keys fingerprints found for user ec2-user.
Oct 07 13:28:51 ip-10-1-6-110.eu-central-1.compute.internal cloud-init[2512]: Cloud-init v. 19.3-44.amzn2 finished at Thu, 07 Oct 2021 13:28:51 +0000. Datasource DataSourceEc2.  Up 61.41 seconds

@mhill-holoplot
Author


You are using a private endpoint and manage_aws_auth = false, so I suggest checking the aws-auth ConfigMap.

Yep, I had to do that because of #1280.

$ kubectl describe configmap -n kube-system aws-auth
Name:         aws-auth
Namespace:    kube-system
Labels:       <none>
Annotations:  <none>

Data
====
mapRoles:
----
- groups:
  - system:bootstrappers
  - system:nodes
  rolearn: arn:aws:iam::760247569728:role/metrics_argo_eks_prod20211007122813721800000009
  username: system:node:{{EC2PrivateDNSName}}


BinaryData
====

Events:  <none>

@mhill-holoplot
Author

Hmm, looking at the cloud-config log, because that unit failed:

-- Logs begin at Thu 2021-10-07 13:27:51 UTC, end at Thu 2021-10-07 15:19:57 UTC. --
Oct 07 13:28:04 ip-10-1-6-110.eu-central-1.compute.internal systemd[1]: Starting Apply the settings specified in cloud-config...
Oct 07 13:28:04 ip-10-1-6-110.eu-central-1.compute.internal cloud-init[2338]: Cloud-init v. 19.3-44.amzn2 running 'modules:config' at Thu, 07 Oct 2021 13:28:04 +0000. Up 14.03 seconds.
Oct 07 13:28:05 ip-10-1-6-110.eu-central-1.compute.internal cloud-init[2338]: Loaded plugins: priorities, update-motd, versionlock
Oct 07 13:28:41 ip-10-1-6-110.eu-central-1.compute.internal cloud-init[2338]: One of the configured repositories failed (Unknown),
Oct 07 13:28:41 ip-10-1-6-110.eu-central-1.compute.internal cloud-init[2338]: and yum doesn't have enough cached data to continue. At this point the only
Oct 07 13:28:41 ip-10-1-6-110.eu-central-1.compute.internal cloud-init[2338]: safe thing yum can do is fail. There are a few ways to work "fix" this:
Oct 07 13:28:41 ip-10-1-6-110.eu-central-1.compute.internal cloud-init[2338]: 1. Contact the upstream for the repository and get them to fix the problem.
Oct 07 13:28:41 ip-10-1-6-110.eu-central-1.compute.internal cloud-init[2338]: 2. Reconfigure the baseurl/etc. for the repository, to point to a working
Oct 07 13:28:41 ip-10-1-6-110.eu-central-1.compute.internal cloud-init[2338]: upstream. This is most often useful if you are using a newer
Oct 07 13:28:41 ip-10-1-6-110.eu-central-1.compute.internal cloud-init[2338]: distribution release than is supported by the repository (and the
Oct 07 13:28:41 ip-10-1-6-110.eu-central-1.compute.internal cloud-init[2338]: packages for the previous distribution release still work).
Oct 07 13:28:41 ip-10-1-6-110.eu-central-1.compute.internal cloud-init[2338]: 3. Run the command with the repository temporarily disabled
Oct 07 13:28:41 ip-10-1-6-110.eu-central-1.compute.internal cloud-init[2338]: yum --disablerepo=<repoid> ...
Oct 07 13:28:41 ip-10-1-6-110.eu-central-1.compute.internal cloud-init[2338]: 4. Disable the repository permanently, so yum won't use it by default. Yum
Oct 07 13:28:41 ip-10-1-6-110.eu-central-1.compute.internal cloud-init[2338]: will then just ignore the repository until you permanently enable it
Oct 07 13:28:41 ip-10-1-6-110.eu-central-1.compute.internal cloud-init[2338]: again or use --enablerepo for temporary usage:
Oct 07 13:28:41 ip-10-1-6-110.eu-central-1.compute.internal cloud-init[2338]: yum-config-manager --disable <repoid>
Oct 07 13:28:41 ip-10-1-6-110.eu-central-1.compute.internal cloud-init[2338]: or
Oct 07 13:28:41 ip-10-1-6-110.eu-central-1.compute.internal cloud-init[2338]: subscription-manager repos --disable=<repoid>
Oct 07 13:28:41 ip-10-1-6-110.eu-central-1.compute.internal cloud-init[2338]: 5. Configure the failing repository to be skipped, if it is unavailable.
Oct 07 13:28:41 ip-10-1-6-110.eu-central-1.compute.internal cloud-init[2338]: Note that yum will try to contact the repo. when it runs most commands,
Oct 07 13:28:41 ip-10-1-6-110.eu-central-1.compute.internal cloud-init[2338]: so will have to try and fail each time (and thus. yum will be be much
Oct 07 13:28:41 ip-10-1-6-110.eu-central-1.compute.internal cloud-init[2338]: slower). If it is a very temporary problem though, this is often a nice
Oct 07 13:28:41 ip-10-1-6-110.eu-central-1.compute.internal cloud-init[2338]: compromise:
Oct 07 13:28:41 ip-10-1-6-110.eu-central-1.compute.internal cloud-init[2338]: yum-config-manager --save --setopt=<repoid>.skip_if_unavailable=true
Oct 07 13:28:41 ip-10-1-6-110.eu-central-1.compute.internal cloud-init[2338]: Cannot find a valid baseurl for repo: amzn2-core/2/x86_64
Oct 07 13:28:41 ip-10-1-6-110.eu-central-1.compute.internal cloud-init[2338]: Could not retrieve mirrorlist https://amazonlinux-2-repos-eu-central-1.s3.eu-central-1.amazonaws.com/2/core/latest/x86_64/mirror.list error was
Oct 07 13:28:41 ip-10-1-6-110.eu-central-1.compute.internal cloud-init[2338]: 12: Timeout on https://amazonlinux-2-repos-eu-central-1.s3.eu-central-1.amazonaws.com/2/core/latest/x86_64/mirror.list: (28, 'Connection timed out after 5001 milliseconds')
Oct 07 13:28:41 ip-10-1-6-110.eu-central-1.compute.internal cloud-init[2338]: Oct 07 13:28:41 cloud-init[2338]: util.py[WARNING]: Package upgrade failed
Oct 07 13:28:41 ip-10-1-6-110.eu-central-1.compute.internal cloud-init[2338]: Oct 07 13:28:41 cloud-init[2338]: cc_package_update_upgrade_install.py[WARNING]: 1 failed with exceptions, re-raising the last one
Oct 07 13:28:41 ip-10-1-6-110.eu-central-1.compute.internal cloud-init[2338]: Oct 07 13:28:41 cloud-init[2338]: util.py[WARNING]: Running module package-update-upgrade-install (<module 'cloudinit.config.cc_package_update_upgrade_install' from '/usr/lib/python2.7/site-packages/cloudinit/config/cc_package_update_upgrade_install.pyc'>) failed
Oct 07 13:28:41 ip-10-1-6-110.eu-central-1.compute.internal systemd[1]: cloud-config.service: main process exited, code=exited, status=1/FAILURE
Oct 07 13:28:41 ip-10-1-6-110.eu-central-1.compute.internal systemd[1]: Failed to start Apply the settings specified in cloud-config.
Oct 07 13:28:41 ip-10-1-6-110.eu-central-1.compute.internal systemd[1]: Unit cloud-config.service entered failed state.
Oct 07 13:28:41 ip-10-1-6-110.eu-central-1.compute.internal systemd[1]: cloud-config.service failed.

Is internet access a requirement for workers in EKS? It's not a problem to add a route to the internet gateway in these subnets, but I wasn't aware it was required.

@daroga0002
Contributor

Is internet access a requirement for workers in EKS? It's not a problem to add a route to the internet gateway in these subnets, but I wasn't aware it was required.

It looks so. If you want to avoid giving the nodes internet access, you probably need to prepare your own AMI and use your own launch_configuration. You can take a look at this example.
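
Otherwise, giving the worker subnets outbound access through a NAT gateway is the usual fix. A minimal sketch, assuming a public subnet and a private route table for the workers (all resource names here are illustrative):

resource "aws_eip" "nat" {
  vpc = true
}

resource "aws_nat_gateway" "workers" {
  allocation_id = aws_eip.nat.id
  # Illustrative: must be a public subnet with a route to an internet gateway.
  subnet_id     = aws_subnet.public.id
}

resource "aws_route" "workers_default" {
  # Illustrative: the route table associated with the private worker subnets.
  route_table_id         = aws_route_table.workers.id
  destination_cidr_block = "0.0.0.0/0"
  nat_gateway_id         = aws_nat_gateway.workers.id
}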

@mhill-holoplot
Author

I can confirm that internet access was the issue. Thanks a lot for your help.

@github-actions

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 18, 2022