node_groups never create successfully #1628

Closed

mhill-holoplot opened this issue Oct 7, 2021 · 9 comments

Comments

@mhill-holoplot

Description

node_groups never create successfully. Presumably this is a problem with communication between the workers and the control plane.

Versions

  • Terraform: v1.0.8
  • Provider(s):
    • registry.terraform.io/aiven/aiven v2.2.1
    • registry.terraform.io/hashicorp/aws v3.61.0
    • registry.terraform.io/hashicorp/cloudinit v2.2.0
    • registry.terraform.io/hashicorp/kubernetes v2.5.0
    • registry.terraform.io/hashicorp/local v2.1.0
    • registry.terraform.io/hashicorp/null v3.1.0
    • registry.terraform.io/hashicorp/tls v3.1.0
    • registry.terraform.io/terraform-aws-modules/http v2.4.1

Reproduction

terraform apply

Code Snippet to Reproduce

resource "aws_kms_key" "argo_eks" {                                                                                                                                                                                                    [7/1815]
  description             = "Argo EKS Secret Encryption Key"                                                                                                                                                                                   
  enable_key_rotation     = true                                                                                                                                                                                                               
  deletion_window_in_days = 7                                                                                                                                                                                                                  
}                                                                                                                                                                                                                                              
                                                                                                                                                                                                                                               
module "argo_eks" {                                                                                                                                                                                                                            
  source                    = "terraform-aws-modules/eks/aws"                                                                                                                                                                                  
  manage_aws_auth           = false
  cluster_name              = format("metrics_argo_eks_%s", terraform.workspace)
  cluster_version           = "1.21"
  subnets                   = var.metrics_az_ids
  vpc_id                    = var.vpc
  cluster_enabled_log_types = ["audit", "api"]

  cluster_encryption_config = [
    {
      provider_key_arn = aws_kms_key.argo_eks.arn
      resources        = ["secrets"]
    }
  ]

  cluster_endpoint_private_access                = true
  cluster_create_endpoint_private_access_sg_rule = true
  cluster_endpoint_private_access_cidrs          = var.metrics_az_cidrs

  node_groups_defaults = {
    ami_type  = "AL2_x86_64"
    disk_size = 50
  }

  node_groups = {
    default = {
      desired_capacity = 2
      max_capacity     = 10
      min_capacity     = 2

      instance_types = ["t3.small"]
      capacity_type  = "ON_DEMAND"
      update_config = {
        max_unavailable = 1
      }
    }
  }

  worker_create_cluster_primary_security_group_rules = true 

  tags = {
    Environment = terraform.workspace
  }
}

data "aws_eks_cluster" "argo_eks" {
  name = module.argo_eks.cluster_id
}

data "aws_eks_cluster_auth" "argo_eks" {
  name = module.argo_eks.cluster_id
}

provider "kubernetes" {
  host                   = data.aws_eks_cluster.argo_eks.endpoint
  cluster_ca_certificate = base64decode(data.aws_eks_cluster.argo_eks.certificate_authority[0].data)
  token                  = data.aws_eks_cluster_auth.argo_eks.token
}

Expected behavior

Applies successfully

Actual behavior

Times out and enters a failed state

Terminal Output Screenshot(s)

screenshot-2021-10-07T16:39:19+02:00
screenshot-2021-10-07T16:38:57+02:00

Additional context

screenshot-2021-10-07T16:40:53+02:00

@jaimehrubiks
Contributor

Are the instances created? If so, maybe you can SSH in and run "sudo journalctl -f | grep cloud-init" to see why they can't join. If the instances are not created at all, CloudTrail may help.

@mhill-holoplot
Author

The instances are created. I'll see if I can ssh into them.

@daroga0002
Contributor

You are using a private endpoint and manage_aws_auth = false, so I suggest checking the aws-auth ConfigMap.
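
For reference, a minimal sketch of what managing aws-auth yourself can look like with the kubernetes provider (the role ARN here is illustrative; with managed node groups, EKS normally adds this entry itself, but it is worth verifying it matches the node role):

resource "kubernetes_config_map" "aws_auth" {
  metadata {
    name      = "aws-auth"
    namespace = "kube-system"
  }

  data = {
    # Illustrative role ARN; substitute the actual node group role.
    mapRoles = yamlencode([
      {
        rolearn  = "arn:aws:iam::111111111111:role/my-node-role"
        username = "system:node:{{EC2PrivateDNSName}}"
        groups   = ["system:bootstrappers", "system:nodes"]
      }
    ])
  }
}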

@mhill-holoplot
Author

sudo journalctl -f | grep cloud-init

Oct 07 13:28:42 ip-10-1-6-110.eu-central-1.compute.internal cloud-init[2512]: Cloud-init v. 19.3-44.amzn2 running 'modules:final' at Thu, 07 Oct 2021 13:28:41 +0000. Up 51.58 seconds.
Oct 07 13:28:42 ip-10-1-6-110.eu-central-1.compute.internal cloud-init[2512]: + B64_CLUSTER_CA=LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUM1ekNDQWMrZ0F3SUJBZ0lCQURBTkJna3Foa2lHOXcwQkFRc0ZBREFWTVJNd0VRWURWUVFERXdwcmRXSmwKY201bGRHVnpNQjRYRFRJeE1UQXdOekV5TVRrMU4xb1hEVE14TVRBd05URXlNVGsxTjFvd0ZURVRNQkVHQTFVRQpBeE1LYTNWaVpYSnVaWFJsY3pDQ0FTSXdEUVlKS29aSWh2Y05BUUVCQlFBRGdnRVBBRENDQVFvQ2dnRUJBTTlCClFzYU4vRk1LMExBeVVrZzk4RnVDdnRCNjJ5RzNjRWNHeFhVejhzcnBFS2V2aU9iYWxwcm54bFJFeXB5OTdOYVUKNnZjYndoYVplSjVCTTBCRGRiYllvVTlVYU1qUFVBTnNuUnUzN0l5dG1hcmpJV21WVlJtZG5NRHJjYnNxZjIxbQpRVjNVTVp5SUFHakpLQTZ5dFMxcVhXRkhRZVk3M3dsaFZ3YzRCbFVVaGg4Zll5Y0RjTWJlQWN5UjB0allIWXhpCjBNdEdPdEx6RmNnenoxRGpLRFBENXhYWFJUSVQyS1pGU1ZnQVN1R0twZUhhbzhkeHFEQmd1eGtqbldRWlpQQzEKR3h1Zk1DN0F1M3BkUlBsb0hCVDF3S05CYTI3dGQwNGdjTGdIYUwvZGxDY1NtdWNxMVFYeWxnaUhwa3VoK21WTQpZTHlOSi95WEhhcXJJeFRrLzBFQ0F3RUFBYU5DTUVBd0RnWURWUjBQQVFIL0JBUURBZ0trTUE4R0ExVWRFd0VCCi93UUZNQU1CQWY4d0hRWURWUjBPQkJZRUZPOVBxaTBHWEg4ejJSUG5HNjVuYWF4SzhOTUJNQTBHQ1NxR1NJYjMKRFFFQkN3VUFBNElCQVFCaExrd2pvSklEUUhtNVpnemd6azdLUjM2cmhZeEFiZHZ1MFV0RlhjQThydW54SlpuQwpUK3FKS0hWZUkzNGR4RzRBTWNSRnBSUDBDSWIvbXpvYmQ5U2lWOTZwRVFleGZQQmRQYTNyNDFsanJnK0Vqb1JpClh5R0pUb0VtekJNOEIvWFBGWFhoWFkzRmlyR3BhczRBRkY5MFkyakdBNmE2dGZZcDBDTzlxek0xTU9JYzRwMTIKcHplSW13TFdBZWVrRmxIeWJjQlFUai8vRUFhZG9YMGo4T05DVm5WMHIvZ0lxYUV1aXpycDNySHc2ZFVHRmtLcwo3T0xBVmhlQWU2TUFDZysvbnZUZDA5bVVlL29yRms2ei90dnBpSFRPM1lMZWNIeHdpcWZBc1k1S0NoS3hIYnJPCmNhaTViQTVEQVRKWWJnVm50clB2eUc4M3pkc2RYOGI2bUxRYwotLS0tLUVORCBDRVJUSUZJQ0FURS0tLS0tCg==
Oct 07 13:28:42 ip-10-1-6-110.eu-central-1.compute.internal cloud-init[2512]: + API_SERVER_URL=https://1F1FABABCA51EC054EB52BEDA96605AD.gr7.eu-central-1.eks.amazonaws.com
Oct 07 13:28:42 ip-10-1-6-110.eu-central-1.compute.internal cloud-init[2512]: + K8S_CLUSTER_DNS_IP=172.20.0.10
Oct 07 13:28:42 ip-10-1-6-110.eu-central-1.compute.internal cloud-init[2512]: + /etc/eks/bootstrap.sh metrics_argo_eks_prod --kubelet-extra-args --node-labels=eks.amazonaws.com/nodegroup-image=ami-0703a89bcf1417d91,eks.amazonaws.com/capacityType=ON_DEMAND,eks.amazonaws.com/nodegroup=metrics_argo_eks_prod-default20211007132715842900000001 --b64-cluster-ca LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUM1ekNDQWMrZ0F3SUJBZ0lCQURBTkJna3Foa2lHOXcwQkFRc0ZBREFWTVJNd0VRWURWUVFERXdwcmRXSmwKY201bGRHVnpNQjRYRFRJeE1UQXdOekV5TVRrMU4xb1hEVE14TVRBd05URXlNVGsxTjFvd0ZURVRNQkVHQTFVRQpBeE1LYTNWaVpYSnVaWFJsY3pDQ0FTSXdEUVlKS29aSWh2Y05BUUVCQlFBRGdnRVBBRENDQVFvQ2dnRUJBTTlCClFzYU4vRk1LMExBeVVrZzk4RnVDdnRCNjJ5RzNjRWNHeFhVejhzcnBFS2V2aU9iYWxwcm54bFJFeXB5OTdOYVUKNnZjYndoYVplSjVCTTBCRGRiYllvVTlVYU1qUFVBTnNuUnUzN0l5dG1hcmpJV21WVlJtZG5NRHJjYnNxZjIxbQpRVjNVTVp5SUFHakpLQTZ5dFMxcVhXRkhRZVk3M3dsaFZ3YzRCbFVVaGg4Zll5Y0RjTWJlQWN5UjB0allIWXhpCjBNdEdPdEx6RmNnenoxRGpLRFBENXhYWFJUSVQyS1pGU1ZnQVN1R0twZUhhbzhkeHFEQmd1eGtqbldRWlpQQzEKR3h1Zk1DN0F1M3BkUlBsb0hCVDF3S05CYTI3dGQwNGdjTGdIYUwvZGxDY1NtdWNxMVFYeWxnaUhwa3VoK21WTQpZTHlOSi95WEhhcXJJeFRrLzBFQ0F3RUFBYU5DTUVBd0RnWURWUjBQQVFIL0JBUURBZ0trTUE4R0ExVWRFd0VCCi93UUZNQU1CQWY4d0hRWURWUjBPQkJZRUZPOVBxaTBHWEg4ejJSUG5HNjVuYWF4SzhOTUJNQTBHQ1NxR1NJYjMKRFFFQkN3VUFBNElCQVFCaExrd2pvSklEUUhtNVpnemd6azdLUjM2cmhZeEFiZHZ1MFV0RlhjQThydW54SlpuQwpUK3FKS0hWZUkzNGR4RzRBTWNSRnBSUDBDSWIvbXpvYmQ5U2lWOTZwRVFleGZQQmRQYTNyNDFsanJnK0Vqb1JpClh5R0pUb0VtekJNOEIvWFBGWFhoWFkzRmlyR3BhczRBRkY5MFkyakdBNmE2dGZZcDBDTzlxek0xTU9JYzRwMTIKcHplSW13TFdBZWVrRmxIeWJjQlFUai8vRUFhZG9YMGo4T05DVm5WMHIvZ0lxYUV1aXpycDNySHc2ZFVHRmtLcwo3T0xBVmhlQWU2TUFDZysvbnZUZDA5bVVlL29yRms2ei90dnBpSFRPM1lMZWNIeHdpcWZBc1k1S0NoS3hIYnJPCmNhaTViQTVEQVRKWWJnVm50clB2eUc4M3pkc2RYOGI2bUxRYwotLS0tLUVORCBDRVJUSUZJQ0FURS0tLS0tCg== --apiserver-endpoint https://1F1FABABCA51EC054EB52BEDA96605AD.gr7.eu-central-1.eks.amazonaws.com --dns-cluster-ip 172.20.0.10
Oct 07 13:28:42 ip-10-1-6-110.eu-central-1.compute.internal cloud-init[2512]: Created symlink from /etc/systemd/system/multi-user.target.wants/iptables-restore.service to /etc/systemd/system/iptables-restore.service.
Oct 07 13:28:43 ip-10-1-6-110.eu-central-1.compute.internal cloud-init[2512]: Created symlink from /etc/systemd/system/multi-user.target.wants/docker.service to /usr/lib/systemd/system/docker.service.
Oct 07 13:28:51 ip-10-1-6-110.eu-central-1.compute.internal cloud-init[2512]: Created symlink from /etc/systemd/system/multi-user.target.wants/kubelet.service to /etc/systemd/system/kubelet.service.
Oct 07 13:28:51 ip-10-1-6-110.eu-central-1.compute.internal cloud-init[2512]: nvidia-smi not found
Oct 07 13:28:51 ip-10-1-6-110.eu-central-1.compute.internal cloud-init[2512]: ci-info: no authorized ssh keys fingerprints found for user ec2-user.
Oct 07 13:28:51 ip-10-1-6-110.eu-central-1.compute.internal cloud-init[2512]: Cloud-init v. 19.3-44.amzn2 finished at Thu, 07 Oct 2021 13:28:51 +0000. Datasource DataSourceEc2.  Up 61.41 seconds

@mhill-holoplot
Author


You are using a private endpoint and manage_aws_auth = false, so I suggest checking the aws-auth ConfigMap.

Yep, I had to do that because of #1280.

$ kubectl describe configmap -n kube-system aws-auth
Name:         aws-auth
Namespace:    kube-system
Labels:       <none>
Annotations:  <none>

Data
====
mapRoles:
----
- groups:
  - system:bootstrappers
  - system:nodes
  rolearn: arn:aws:iam::760247569728:role/metrics_argo_eks_prod20211007122813721800000009
  username: system:node:{{EC2PrivateDNSName}}


BinaryData
====

Events:  <none>

@mhill-holoplot
Author

Hmm, looking at the cloud-config log, because that unit failed:

-- Logs begin at Thu 2021-10-07 13:27:51 UTC, end at Thu 2021-10-07 15:19:57 UTC. --
Oct 07 13:28:04 ip-10-1-6-110.eu-central-1.compute.internal systemd[1]: Starting Apply the settings specified in cloud-config...
Oct 07 13:28:04 ip-10-1-6-110.eu-central-1.compute.internal cloud-init[2338]: Cloud-init v. 19.3-44.amzn2 running 'modules:config' at Thu, 07 Oct 2021 13:28:04 +0000. Up 14.03 seconds.
Oct 07 13:28:05 ip-10-1-6-110.eu-central-1.compute.internal cloud-init[2338]: Loaded plugins: priorities, update-motd, versionlock
Oct 07 13:28:41 ip-10-1-6-110.eu-central-1.compute.internal cloud-init[2338]: One of the configured repositories failed (Unknown),
Oct 07 13:28:41 ip-10-1-6-110.eu-central-1.compute.internal cloud-init[2338]: and yum doesn't have enough cached data to continue. At this point the only
Oct 07 13:28:41 ip-10-1-6-110.eu-central-1.compute.internal cloud-init[2338]: safe thing yum can do is fail. There are a few ways to work "fix" this:
Oct 07 13:28:41 ip-10-1-6-110.eu-central-1.compute.internal cloud-init[2338]: 1. Contact the upstream for the repository and get them to fix the problem.
Oct 07 13:28:41 ip-10-1-6-110.eu-central-1.compute.internal cloud-init[2338]: 2. Reconfigure the baseurl/etc. for the repository, to point to a working
Oct 07 13:28:41 ip-10-1-6-110.eu-central-1.compute.internal cloud-init[2338]: upstream. This is most often useful if you are using a newer
Oct 07 13:28:41 ip-10-1-6-110.eu-central-1.compute.internal cloud-init[2338]: distribution release than is supported by the repository (and the
Oct 07 13:28:41 ip-10-1-6-110.eu-central-1.compute.internal cloud-init[2338]: packages for the previous distribution release still work).
Oct 07 13:28:41 ip-10-1-6-110.eu-central-1.compute.internal cloud-init[2338]: 3. Run the command with the repository temporarily disabled
Oct 07 13:28:41 ip-10-1-6-110.eu-central-1.compute.internal cloud-init[2338]: yum --disablerepo=<repoid> ...
Oct 07 13:28:41 ip-10-1-6-110.eu-central-1.compute.internal cloud-init[2338]: 4. Disable the repository permanently, so yum won't use it by default. Yum
Oct 07 13:28:41 ip-10-1-6-110.eu-central-1.compute.internal cloud-init[2338]: will then just ignore the repository until you permanently enable it
Oct 07 13:28:41 ip-10-1-6-110.eu-central-1.compute.internal cloud-init[2338]: again or use --enablerepo for temporary usage:
Oct 07 13:28:41 ip-10-1-6-110.eu-central-1.compute.internal cloud-init[2338]: yum-config-manager --disable <repoid>
Oct 07 13:28:41 ip-10-1-6-110.eu-central-1.compute.internal cloud-init[2338]: or
Oct 07 13:28:41 ip-10-1-6-110.eu-central-1.compute.internal cloud-init[2338]: subscription-manager repos --disable=<repoid>
Oct 07 13:28:41 ip-10-1-6-110.eu-central-1.compute.internal cloud-init[2338]: 5. Configure the failing repository to be skipped, if it is unavailable.
Oct 07 13:28:41 ip-10-1-6-110.eu-central-1.compute.internal cloud-init[2338]: Note that yum will try to contact the repo. when it runs most commands,
Oct 07 13:28:41 ip-10-1-6-110.eu-central-1.compute.internal cloud-init[2338]: so will have to try and fail each time (and thus. yum will be be much
Oct 07 13:28:41 ip-10-1-6-110.eu-central-1.compute.internal cloud-init[2338]: slower). If it is a very temporary problem though, this is often a nice
Oct 07 13:28:41 ip-10-1-6-110.eu-central-1.compute.internal cloud-init[2338]: compromise:
Oct 07 13:28:41 ip-10-1-6-110.eu-central-1.compute.internal cloud-init[2338]: yum-config-manager --save --setopt=<repoid>.skip_if_unavailable=true
Oct 07 13:28:41 ip-10-1-6-110.eu-central-1.compute.internal cloud-init[2338]: Cannot find a valid baseurl for repo: amzn2-core/2/x86_64
Oct 07 13:28:41 ip-10-1-6-110.eu-central-1.compute.internal cloud-init[2338]: Could not retrieve mirrorlist https://amazonlinux-2-repos-eu-central-1.s3.eu-central-1.amazonaws.com/2/core/latest/x86_64/mirror.list error was
Oct 07 13:28:41 ip-10-1-6-110.eu-central-1.compute.internal cloud-init[2338]: 12: Timeout on https://amazonlinux-2-repos-eu-central-1.s3.eu-central-1.amazonaws.com/2/core/latest/x86_64/mirror.list: (28, 'Connection timed out after 5001 milliseconds')
Oct 07 13:28:41 ip-10-1-6-110.eu-central-1.compute.internal cloud-init[2338]: Oct 07 13:28:41 cloud-init[2338]: util.py[WARNING]: Package upgrade failed
Oct 07 13:28:41 ip-10-1-6-110.eu-central-1.compute.internal cloud-init[2338]: Oct 07 13:28:41 cloud-init[2338]: cc_package_update_upgrade_install.py[WARNING]: 1 failed with exceptions, re-raising the last one
Oct 07 13:28:41 ip-10-1-6-110.eu-central-1.compute.internal cloud-init[2338]: Oct 07 13:28:41 cloud-init[2338]: util.py[WARNING]: Running module package-update-upgrade-install (<module 'cloudinit.config.cc_package_update_upgrade_install' from '/usr/lib/python2.7/site-packages/cloudinit/config/cc_package_update_upgrade_install.pyc'>) failed
Oct 07 13:28:41 ip-10-1-6-110.eu-central-1.compute.internal systemd[1]: cloud-config.service: main process exited, code=exited, status=1/FAILURE
Oct 07 13:28:41 ip-10-1-6-110.eu-central-1.compute.internal systemd[1]: Failed to start Apply the settings specified in cloud-config.
Oct 07 13:28:41 ip-10-1-6-110.eu-central-1.compute.internal systemd[1]: Unit cloud-config.service entered failed state.
Oct 07 13:28:41 ip-10-1-6-110.eu-central-1.compute.internal systemd[1]: cloud-config.service failed.

Is internet access a requirement for workers in EKS? It's not a problem to add a route to the internet gateway in these subnets, but I wasn't aware it was required.

@daroga0002
Contributor

Is internet access a requirement for workers in EKS? It's not a problem to add a route to the internet gateway in these subnets, but I wasn't aware it was required.

It looks so. If you want to avoid giving the nodes internet access, you probably need to prepare your own AMI and use your own launch_configuration. You can take a look at this example.
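
Otherwise, giving the worker subnets outbound access through a NAT gateway is the usual fix. A minimal sketch, assuming a public subnet and a private route table for the workers (all resource names here are illustrative):

resource "aws_eip" "nat" {
  vpc = true
}

resource "aws_nat_gateway" "workers" {
  allocation_id = aws_eip.nat.id
  # Illustrative: must be a public subnet with a route to an internet gateway.
  subnet_id     = aws_subnet.public.id
}

resource "aws_route" "workers_default" {
  # Illustrative: the route table associated with the private worker subnets.
  route_table_id         = aws_route_table.workers.id
  destination_cidr_block = "0.0.0.0/0"
  nat_gateway_id         = aws_nat_gateway.workers.id
}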

@mhill-holoplot
Author

I can confirm that internet access was the issue. Thanks a lot for your help.

@github-actions

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 18, 2022