
aws_eks: Cluster creation with AlbControllerOptions is running into error #22005

Closed

mrlikl opened this issue Sep 12, 2022 · 36 comments
Labels
@aws-cdk/aws-eks (Related to Amazon Elastic Kubernetes Service), bug (This issue is a bug.), closed-for-staleness (This issue was automatically closed because it hadn't received any attention in a while.), p2, response-requested (Waiting on additional info and feedback. Will move to "closing-soon" in 7 days.)

Comments

@mrlikl
Contributor

mrlikl commented Sep 12, 2022

Describe the bug

While creating an EKS cluster with eks.AlbControllerOptions, creation fails on the custom resource Custom::AWSCDK-EKS-HelmChart with the following error:

"Received response status [FAILED] from custom resource. Message returned: Error: b'Error: UPGRADE FAILED: another operation (install/upgrade/rollback) is in progress' "

Expected Behavior

Creation of the custom resource Custom::AWSCDK-EKS-HelmChart should be successful.

Current Behavior

Custom::AWSCDK-EKS-HelmChart fails with the error "Received response status [FAILED] from custom resource. Message returned: Error: b'Error: UPGRADE FAILED: another operation (install/upgrade/rollback) is in progress' "

Reproduction Steps

from aws_cdk import aws_eks as eks

# fragment from within a Stack's __init__
cluster = eks.Cluster(
    scope=self,
    id=construct_id,
    tags={"env": "production"},
    alb_controller=eks.AlbControllerOptions(
        version=eks.AlbControllerVersion.V2_4_1
    ),
    version=eks.KubernetesVersion.V1_21,
    cluster_logging=[
        eks.ClusterLoggingTypes.API,
        eks.ClusterLoggingTypes.AUTHENTICATOR,
        eks.ClusterLoggingTypes.SCHEDULER,
    ],
    endpoint_access=eks.EndpointAccess.PUBLIC,
    place_cluster_handler_in_vpc=True,
    cluster_name="basking-k8s",
    output_masters_role_arn=True,
    output_cluster_name=True,
    default_capacity=0,
    kubectl_environment={"MINIMUM_IP_TARGET": "100", "WARM_IP_TARGET": "100"},
)

Possible Solution

No response

Additional Information/Context

No response

CDK CLI Version

2.40.0

Framework Version

No response

Node.js Version

16.17.0

OS

macos 12.5.1

Language

Python

Language Version

3.10.6

Other information

No response

@mrlikl mrlikl added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Sep 12, 2022
@github-actions github-actions bot added the @aws-cdk/aws-eks Related to Amazon Elastic Kubernetes Service label Sep 12, 2022
@mrlikl mrlikl changed the title eks: Cluster creation with AlbControllerOptions is running into error aws_eks: Cluster creation with AlbControllerOptions is running into error Sep 12, 2022
@pahud
Contributor

pahud commented Oct 19, 2022

related to #19705

@pahud
Contributor

pahud commented Oct 20, 2022


@mrlikl I was able to deploy it with cdk 2.46.0, kubernetes 1.21 and alb controller 2.4.1. Are you still having the issue?

@mrlikl
Contributor Author

mrlikl commented Oct 21, 2022

I am still getting the same error when default_capacity=0; the code in the description reproduces the error now.

@pahud pahud added investigating This issue is being investigated and/or work is in progress to resolve the issue. and removed needs-triage This issue or PR still needs to be triaged. labels Nov 18, 2022
@pahud
Contributor

pahud commented Nov 18, 2022

@mrlikl I am running the following code to reproduce this error. I will let you know when the deploy completes.

import { KubectlV23Layer } from '@aws-cdk/lambda-layer-kubectl-v23';
import {
  App, Stack,
  aws_eks as eks,
  aws_ec2 as ec2,
} from 'aws-cdk-lib';

const devEnv = {
  account: process.env.CDK_DEFAULT_ACCOUNT,
  region: process.env.CDK_DEFAULT_REGION,
};

const app = new App();

const stack = new Stack(app, 'triage-dev5', { env: devEnv });

new eks.Cluster(stack, 'Cluster', {
  vpc: ec2.Vpc.fromLookup(stack, 'Vpc', { isDefault: true }),
  albController: {
    version: eks.AlbControllerVersion.V2_4_1,
  },
  version: eks.KubernetesVersion.V1_23,
  kubectlLayer: new KubectlV23Layer(stack, 'LayerVersion'),
  clusterLogging: [
    eks.ClusterLoggingTypes.API,
    eks.ClusterLoggingTypes.AUTHENTICATOR,
    eks.ClusterLoggingTypes.SCHEDULER,
  ],
  endpointAccess: eks.EndpointAccess.PUBLIC,
  placeClusterHandlerInVpc: true,
  clusterName: 'baking-k8s',
  outputClusterName: true,
  outputMastersRoleArn: true,
  defaultCapacity: 0,
  kubectlEnvironment: { MINIMUM_IP_TARGET: '100', WARM_IP_TARGET: '100' },
});

@pahud pahud assigned pahud and unassigned otaviomacedo Nov 18, 2022
@pahud
Contributor

pahud commented Nov 18, 2022

I am getting an error with the CDK code provided above:


Lambda Log:

[ERROR] Exception: b'Error: UPGRADE FAILED: another operation (install/upgrade/rollback) is in progress\n'
Traceback (most recent call last):
  File "/var/task/index.py", line 17, in handler
    return helm_handler(event, context)
  File "/var/task/helm/__init__.py", line 88, in helm_handler
    helm('upgrade', release, chart, repository, values_file, namespace, version, wait, timeout, create_namespace)
  File "/var/task/helm/__init__.py", line 186, in helm
    raise Exception(output)

I am making this a P2 for now and will investigate it a bit more next week. If you have a possible solution, please let me know. A pull request would be highly appreciated as well.

@pahud pahud added the p2 label Nov 18, 2022
@dimmyshu

dimmyshu commented Dec 30, 2022

I think this issue should be prioritized; a lot of other folks are running into trouble when developing in a sandbox.

I have seen a lot of issues in this repo that set default capacity to 0 without realizing it is a bug. It really impacts development productivity, since the CloudFormation stack can take hours to roll back and clean up the resources.

@m17kea

m17kea commented Jan 9, 2023

I have the same issue:

  • CDK: 2.59
  • KubernetesVersion.V1_24,
  • AlbControllerVersion.V2_4_1

The error from CloudFormation is:

Received response status [FAILED] from custom resource. Message returned: Error: b'Error: UPGRADE FAILED: another operation (install/upgrade/rollback) is in progress\n' Logs: /aws/lambda/TestingStage-Release-awscdkawseksK-Handler886CB40B-KG9T55a3ZdwW at invokeUserFunction (/var/task/framework.js:2:6) at processTicksAndRejections (internal/process/task_queues.js:95:5) at async onEvent (/var/task/framework.js:1:365) at async Runtime.handler (/var/task/cfn-response.js:1:1543) (RequestId: 16bb84de-c183-4e1c-9e4e-cc7ec0efc5b8)

@smislam

smislam commented Jun 22, 2023

Hey @pahud. Thank you so much for looking into this.
Were you able to make any progress? I've been struggling with this for a while. Here is my latest stack info:

    "aws-cdk-lib": "2.63.0",
    KubernetesVersion.V1_26
    AlbControllerVersion.V2_5_1

@YikaiHu

YikaiHu commented Jul 17, 2023

Hi @pahud, I still face the same issue.

I deployed the CDK stack in the cn-north-1 region.

@YikaiHu

YikaiHu commented Jul 17, 2023

Hi @pahud, I think I found the root cause in my scenario. It appears to be caused by the controller image failing to pull in the cn-north-1 region.

Please check:

Failed to pull image "602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon/aws-load-balancer-controller:v2.4.1": rpc error: code = Unknown desc = failed to pull and unpack image "602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon/aws-load-balancer-controller:v2.4.1": failed to resolve reference "602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon/aws-load-balancer-controller:v2.4.1": pulling from host 602401143452.dkr.ecr.us-west-2.amazonaws.com failed with status code [manifests v2.4.1]: 401 Unauthorized


k logs aws-load-balancer-controller-75c785bc8c-72zpg -n kube-system

Error from server (BadRequest): container "aws-load-balancer-controller" in pod "aws-load-balancer-controller-75c785bc8c-72zpg" is waiting to start: trying and failing to pull image

kubectl describe pod aws-load-balancer-controller-75c785bc8c-72zpg -n kube-system

Name:                 aws-load-balancer-controller-75c785bc8c-72zpg
Namespace:            kube-system
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Service Account:      aws-load-balancer-controller
Node:                 ip-10-0-3-136.cn-north-1.compute.internal/10.0.3.136
Start Time:           Mon, 17 Jul 2023 16:30:59 +0800
Labels:               app.kubernetes.io/instance=aws-load-balancer-controller
                      app.kubernetes.io/name=aws-load-balancer-controller
                      pod-template-hash=75c785bc8c
Annotations:          kubernetes.io/psp: eks.privileged
                      prometheus.io/port: 8080
                      prometheus.io/scrape: true
Status:               Pending
IP:                   10.0.3.160
IPs:
  IP:           10.0.3.160
Controlled By:  ReplicaSet/aws-load-balancer-controller-75c785bc8c
Containers:
  aws-load-balancer-controller:
    Container ID:  
    Image:         602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon/aws-load-balancer-controller:v2.4.1
    Image ID:      
    Ports:         9443/TCP, 8080/TCP
    Host Ports:    0/TCP, 0/TCP
    Command:
      /controller
    Args:
      --cluster-name=Workshop-Cluster
      --ingress-class=alb
      --aws-region=cn-north-1
      --aws-vpc-id=vpc-0e4a9201452c76b0e
    State:          Waiting
      Reason:       ErrImagePull
    Ready:          False
    Restart Count:  0
    Liveness:       http-get http://:61779/healthz delay=30s timeout=10s period=10s #success=1 #failure=2
    Environment:
      AWS_STS_REGIONAL_ENDPOINTS:   regional
      AWS_DEFAULT_REGION:           cn-north-1
      AWS_REGION:                   cn-north-1
      AWS_ROLE_ARN:                 arn:aws-cn:iam::743271379588:role/clo-workshop-07-CLWorkshopEC2AndEKSeksClusterStack-1XO6CGEC91JGY
      AWS_WEB_IDENTITY_TOKEN_FILE:  /var/run/secrets/eks.amazonaws.com/serviceaccount/token
    Mounts:
      /tmp/k8s-webhook-server/serving-certs from cert (ro)
      /var/run/secrets/eks.amazonaws.com/serviceaccount from aws-iam-token (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-jct6t (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  aws-iam-token:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  86400
  cert:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  aws-load-balancer-tls
    Optional:    false
  kube-api-access-jct6t:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                 From               Message
  ----     ------     ----                ----               -------
  Normal   Scheduled  16m                 default-scheduler  Successfully assigned kube-system/aws-load-balancer-controller-75c785bc8c-72zpg to ip-10-0-3-136.cn-north-1.compute.internal
  Normal   Pulling    14m (x4 over 16m)   kubelet            Pulling image "602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon/aws-load-balancer-controller:v2.4.1"
  Warning  Failed     14m (x4 over 16m)   kubelet            Failed to pull image "602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon/aws-load-balancer-controller:v2.4.1": rpc error: code = Unknown desc = failed to pull and unpack image "602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon/aws-load-balancer-controller:v2.4.1": failed to resolve reference "602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon/aws-load-balancer-controller:v2.4.1": pulling from host 602401143452.dkr.ecr.us-west-2.amazonaws.com failed with status code [manifests v2.4.1]: 401 Unauthorized
  Warning  Failed     14m (x4 over 16m)   kubelet            Error: ErrImagePull
  Warning  Failed     14m (x6 over 16m)   kubelet            Error: ImagePullBackOff
  Normal   BackOff    87s (x62 over 16m)  kubelet            Back-off pulling image "602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon/aws-load-balancer-controller:v2.4.1"

@YikaiHu

YikaiHu commented Jul 17, 2023

Seems to be related to #22520.

@YikaiHu

YikaiHu commented Jul 17, 2023

013241004608.dkr.ecr.us-gov-west-1.amazonaws.com/amazon/aws-load-balancer-controller:v2.4.1
151742754352.dkr.ecr.us-gov-east-1.amazonaws.com/amazon/aws-load-balancer-controller:v2.4.1
558608220178.dkr.ecr.me-south-1.amazonaws.com/amazon/aws-load-balancer-controller:v2.4.1
590381155156.dkr.ecr.eu-south-1.amazonaws.com/amazon/aws-load-balancer-controller:v2.4.1
602401143452.dkr.ecr.ap-northeast-1.amazonaws.com/amazon/aws-load-balancer-controller:v2.4.1
602401143452.dkr.ecr.ap-northeast-2.amazonaws.com/amazon/aws-load-balancer-controller:v2.4.1
602401143452.dkr.ecr.ap-northeast-3.amazonaws.com/amazon/aws-load-balancer-controller:v2.4.1
602401143452.dkr.ecr.ap-south-1.amazonaws.com/amazon/aws-load-balancer-controller:v2.4.1
602401143452.dkr.ecr.ap-southeast-1.amazonaws.com/amazon/aws-load-balancer-controller:v2.4.1
602401143452.dkr.ecr.ap-southeast-2.amazonaws.com/amazon/aws-load-balancer-controller:v2.4.1
602401143452.dkr.ecr.ca-central-1.amazonaws.com/amazon/aws-load-balancer-controller:v2.4.1
602401143452.dkr.ecr.eu-central-1.amazonaws.com/amazon/aws-load-balancer-controller:v2.4.1
602401143452.dkr.ecr.eu-north-1.amazonaws.com/amazon/aws-load-balancer-controller:v2.4.1
602401143452.dkr.ecr.eu-west-1.amazonaws.com/amazon/aws-load-balancer-controller:v2.4.1
602401143452.dkr.ecr.eu-west-2.amazonaws.com/amazon/aws-load-balancer-controller:v2.4.1
602401143452.dkr.ecr.eu-west-3.amazonaws.com/amazon/aws-load-balancer-controller:v2.4.1
602401143452.dkr.ecr.sa-east-1.amazonaws.com/amazon/aws-load-balancer-controller:v2.4.1
602401143452.dkr.ecr.us-east-1.amazonaws.com/amazon/aws-load-balancer-controller:v2.4.1
602401143452.dkr.ecr.us-east-2.amazonaws.com/amazon/aws-load-balancer-controller:v2.4.1
602401143452.dkr.ecr.us-west-1.amazonaws.com/amazon/aws-load-balancer-controller:v2.4.1
602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon/aws-load-balancer-controller:v2.4.1
800184023465.dkr.ecr.ap-east-1.amazonaws.com/amazon/aws-load-balancer-controller:v2.4.1
877085696533.dkr.ecr.af-south-1.amazonaws.com/amazon/aws-load-balancer-controller:v2.4.1
918309763551.dkr.ecr.cn-north-1.amazonaws.com.cn/amazon/aws-load-balancer-controller:v2.4.1
961992271922.dkr.ecr.cn-northwest-1.amazonaws.com.cn/amazon/aws-load-balancer-controller:v2.4.1

A solution is described in kubernetes-sigs/aws-load-balancer-controller#1694: you can manually replace the ECR repository URL in the CloudFormation template (or see the CDK-level sketch below).

https://github.com/kubernetes-sigs/aws-load-balancer-controller/releases?page=2
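
Rather than hand-editing the synthesized template, it may also be possible to point the construct at the regional mirror directly: if I read the AlbControllerOptions props correctly, they accept a repository override. A minimal Python sketch for cn-north-1, reusing the repository URI from the list above (treat the prop name as an assumption and check your CDK version):

from aws_cdk import aws_eks as eks

# fragment from within a Stack; override the default us-west-2 repository,
# which is unreachable from the China partition
cluster = eks.Cluster(
    self,
    "eks-cluster",
    version=eks.KubernetesVersion.V1_21,
    alb_controller=eks.AlbControllerOptions(
        version=eks.AlbControllerVersion.V2_4_1,
        repository="918309763551.dkr.ecr.cn-north-1.amazonaws.com.cn/amazon/aws-load-balancer-controller",
    ),
)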

@mrlikl
Contributor Author

mrlikl commented Oct 1, 2023

The issue is that when the cluster is deployed with default_capacity set to 0, there are no nodes attached to it. While installing the aws-load-balancer-controller via Helm, the release goes into pending-install and the pods stay Pending because there are no nodes to schedule them on. The handler Lambda eventually times out after 15 minutes, and the event handler Lambda then retries the installation. The retry runs helm upgrade, which errors with Error: UPGRADE FAILED: another operation (install/upgrade/rollback) is in progress.

While this is expected when there are no nodes, I tested adding a check to the kubectl-handler that looks for zero nodes when the error is thrown, and was able to handle the error that way. However, I am not sure this is the right approach to solve the issue (see also the construct-level sketch after the snippet).

# inside the kubectl-handler's helm() function, after the helm command fails:
if b'another operation (install/upgrade/rollback) is in progress' in output:
    # check whether the cluster simply has no nodes yet
    cmd_to_run = ["kubectl", "get", "nodes"]
    cmd_to_run.extend(['--kubeconfig', kubeconfig])
    get_nodes_output = subprocess.check_output(cmd_to_run, stderr=subprocess.STDOUT, cwd=outdir)
    if b'No resources found' in get_nodes_output:
        return
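
A construct-level alternative (just a sketch, not a confirmed fix): keep default_capacity=0 but add an explicit nodegroup and make the ALB controller resource depend on it, so the Helm chart is not installed before any node exists. This assumes the Cluster construct exposes the controller through its alb_controller property (the albController property is referenced later in this thread):

from aws_cdk import aws_ec2 as ec2

# sketch: add an explicit nodegroup and install the ALB controller only after it is ready
nodegroup = cluster.add_nodegroup_capacity(
    "default-ng",
    desired_size=2,
    instance_types=[ec2.InstanceType("t3.medium")],
)
if cluster.alb_controller is not None:
    cluster.alb_controller.node.add_dependency(nodegroup)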

@Karatakos

@pahud out of interest, is this still on the backlog or has it been deprioritized? Calling addNodegroupCapacity on the cluster doesn't work with defaultCapacity: 0, so it's not possible to use launch templates to control capacity via CDK -- as far as I've tested.

@smislam

smislam commented Oct 25, 2023

I have been stuck creating a FargateCluster with this issue since 06/22 (#22005 (comment)). Did defaultCapacity work for you? It is not an option for Fargate.

Just tried with the latest version of CDK today and I am still having this issue. Is it possible to escalate this issue, please?

@PavanMudigondaTR

Could someone help me? I have the same issue. Here is my repo: https://github.com/PavanMudigondaTR/install-karpenter-with-cdk

@pahud
Contributor

pahud commented Dec 18, 2023

It's been a while and I am now testing the following code in the latest CDK

export class EksStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props)

    // use my default VPC
    const vpc = getDefaultVpc(this);
    new eks.Cluster(this, 'Cluster', {
      vpc,
      albController: {
        version: eks.AlbControllerVersion.V2_6_2,
      },
      version: eks.KubernetesVersion.V1_27,
      kubectlLayer: new KubectlLayer(this, 'LayerVersion'),
      clusterLogging: [
        eks.ClusterLoggingTypes.API,
        eks.ClusterLoggingTypes.AUTHENTICATOR,
        eks.ClusterLoggingTypes.SCHEDULER,
      ],
      endpointAccess: eks.EndpointAccess.PUBLIC,
      placeClusterHandlerInVpc: true,
      clusterName: 'baking-k8s',
      outputClusterName: true,
      outputMastersRoleArn: true,
      defaultCapacity: 0,
      kubectlEnvironment: { MINIMUM_IP_TARGET: '100', WARM_IP_TARGET: '100' },
    });
  }
}

@mrlikl @Karatakos @smislam @PavanMudigondaTR, I am not sure whether your issues are related to this one, which seems to be specific to the AlbController. If your error does not involve the AlbController, please open a new issue and link to this one.

@YikaiHu EKS in China is a little bit more complicated, please open a separate issue for your case in China and link to this one. Thanks.

@pahud
Contributor

pahud commented Dec 18, 2023

Unfortunately I couldn't deploy it with the following code on my first attempt.

I am making this a p1 for now and will simplify the code to hopefully figure out the root cause.

export class EksStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props)

    // use my default VPC
    const vpc = getDefaultVpc(this);
    new eks.Cluster(this, 'Cluster', {
      vpc,
      albController: {
        version: eks.AlbControllerVersion.V2_6_2,
      },
      version: eks.KubernetesVersion.V1_27,
      kubectlLayer: new KubectlLayer(this, 'LayerVersion'),
      clusterLogging: [
        eks.ClusterLoggingTypes.API,
        eks.ClusterLoggingTypes.AUTHENTICATOR,
        eks.ClusterLoggingTypes.SCHEDULER,
      ],
      endpointAccess: eks.EndpointAccess.PUBLIC,
      placeClusterHandlerInVpc: true,
      clusterName: 'baking-k8s',
      outputClusterName: true,
      outputMastersRoleArn: true,
      defaultCapacity: 0,
      kubectlEnvironment: { MINIMUM_IP_TARGET: '100', WARM_IP_TARGET: '100' },
    });
  }
}

@pahud pahud added p1 and removed p2 labels Dec 18, 2023

This issue has not received a response in a while. If you want to keep this issue open, please leave a comment below and auto-close will be canceled.

@github-actions github-actions bot added the closing-soon This issue will automatically close in 4 days unless further comments are made. label Dec 21, 2023
@PavanMudigondaTR

The issue still persists. Please don't close the ticket, bot.

@github-actions github-actions bot removed closing-soon This issue will automatically close in 4 days unless further comments are made. response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. labels Dec 21, 2023
@smislam

smislam commented Dec 21, 2023

Hey @pahud, thank you so much for looking into this. I concur that the issue still persists. Here is the error:

Node: v20.10.0
Npm: 10.2.5
"aws-cdk-lib": "^2.115.0"
KubernetesVersion.V1_28
AlbControllerVersion.V2_6_2

EksClusterStack | 26/28 | 9:06:12 AM | CREATE_FAILED | Custom::AWSCDK-EKS-HelmChart | EksClusterStackEksCluster922FB9AE-AlbController/Resource/Resource/Default (EksClusterStackEksCluster922FB9AEAlbController1636C356) Received response status [FAILED] from custom resource. Message returned: Error: b'Release "aws-load-balancer-controller" does not exist. Installing it now.\nError: looks like "https://aws.github.io/eks-charts" is not a valid chart repository or cannot be reached: Get "https://aws.github.io/eks-charts/index.yaml": dial tcp 185.199.110.153:443: connect: connection timed out\n'

When I add your suggestion cluster.albController?.node.addDependency(cluster.defaultNodegroup!);, I get the following error:

$eks-cluster\node_modules\constructs\src\dependency.ts:91 const ret = (instance as any)[DEPENDABLE_SYMBOL]; ^ TypeError: Cannot read properties of undefined (reading 'Symbol(@aws-cdk/core.DependableTrait)')

@smislam

smislam commented Dec 21, 2023

@pahud, @mrlikl et al.,

I was able to resolve the issue. What I found is that to install the ALB controller, the handler fetches the Helm chart from a public chart repository (https://aws.github.io/eks-charts in the error above). To reach it, the cluster subnets need egress. In my case I was creating my cluster in a private subnet without egress; you need to create the cluster in a subnet with egress, i.e. SubnetType.PRIVATE_WITH_EGRESS.

Please update your cluster and VPC configurations to see if this resolves it for you. My stack completed successfully.
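
For reference, a minimal sketch of the kind of VPC shape described above (names and sizing are illustrative, the kubectl layer is omitted for brevity, and this is not the exact stack used here): the private subnets get a NAT route out so the kubectl handler can reach https://aws.github.io/eks-charts.

from aws_cdk import aws_ec2 as ec2, aws_eks as eks

vpc = ec2.Vpc(
    self,
    "eks-vpc",
    max_azs=2,
    nat_gateways=1,  # gives the private subnets a default route for egress
    subnet_configuration=[
        ec2.SubnetConfiguration(name="public", subnet_type=ec2.SubnetType.PUBLIC),
        ec2.SubnetConfiguration(name="private-egress", subnet_type=ec2.SubnetType.PRIVATE_WITH_EGRESS),
    ],
)

cluster = eks.Cluster(
    self,
    "eks-cluster",
    version=eks.KubernetesVersion.V1_28,
    vpc=vpc,
    vpc_subnets=[ec2.SubnetSelection(subnet_type=ec2.SubnetType.PRIVATE_WITH_EGRESS)],
    alb_controller=eks.AlbControllerOptions(version=eks.AlbControllerVersion.V2_6_2),
)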

@pahud
Contributor

pahud commented Dec 26, 2023

Thank you @smislam for the insights.

@pahud pahud added the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label Dec 26, 2023

This issue has not received a response in a while. If you want to keep this issue open, please leave a comment below and auto-close will be canceled.

@github-actions github-actions bot added the closing-soon This issue will automatically close in 4 days unless further comments are made. label Dec 28, 2023
@andreprawira

@smislam SubnetType.PRIVATE_WITH_EGRESS causes RuntimeError: There are no 'Private' subnet groups in this VPC. Available types: Isolated,Deprecated_Isolated,Public

@pahud I'm still getting the same error with my Python code, even with default_capacity set. Do you know what I am missing?

vpc = ec2.Vpc.from_lookup(self, "VPCLookup", vpc_id=props.vpc_id)

# provisioning a cluster
cluster = eks.Cluster(
    self,
    "eks-cluster",
    version=eks.KubernetesVersion.V1_28,
    kubectl_layer=lambda_layer_kubectl_v28.KubectlV28Layer(self, "kubectl-layer"),
    cluster_name=f"{props.customer}-eks-cluster",
    default_capacity_instance=ec2.InstanceType("t3.medium"),
    default_capacity=2,
    alb_controller=eks.AlbControllerOptions(version=eks.AlbControllerVersion.V2_6_2),
    vpc=vpc,
    vpc_subnets=[ec2.SubnetSelection(subnet_type=ec2.SubnetType.PRIVATE_ISOLATED)],
    masters_role=iam.Role(self, "masters-role", assumed_by=iam.AccountRootPrincipal()),
)

@github-actions github-actions bot removed closing-soon This issue will automatically close in 4 days unless further comments are made. response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. labels Dec 29, 2023
@pahud
Contributor

pahud commented Dec 29, 2023

@andreprawira

For some reason it will fail if vpc_subnets selection is ec2.SubnetType.PRIVATE_ISOLATED as described in #22005 (comment).

RuntimeError: There are no 'Private' subnet groups in this VPC. Available types: Isolated,Deprecated_Isolated,Public

This means CDK doesn't seem to find any "private with egress" subnets in your VPC. Can you make sure you have private subnets with egress (typically via a NAT gateway)?

@pahud pahud added the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label Dec 29, 2023
@smislam

smislam commented Dec 29, 2023

@andreprawira, it looks like you are using a VPC (already created in another stack) that doesn't have a private subnet with egress, and that is why you are getting that error.

vpc = ec2.Vpc.from_lookup(self, "VPCLookup", vpc_id=props.vpc_id)

You will not be able to use CDK to create your stack with such a configuration, for the reason I mentioned earlier in my comment. So either update your VPC to add a new private subnet with egress, or create an entirely new VPC with SubnetType.PRIVATE_WITH_EGRESS. This will require a NAT (either a gateway or an instance), as @pahud mentioned.

@github-actions github-actions bot removed the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label Dec 30, 2023
@andreprawira

andreprawira commented Dec 30, 2023

@pahud @smislam We have a product in our service catalog that deploys a VPC and IGW to all of our accounts, and within that product we don't use a NAT GW; instead we use a TGW in our network account (meaning all traffic goes in and out through the network account, even for the VPCs in the other accounts). That is why I used a VPC from_lookup: it has already been created.

That being said, is there another way for me to use the alb_controller with the VPC, TGW, and IGW already set up as they are? Also, I hope I am not misunderstanding you when you say I can't use ec2.SubnetType.PRIVATE_ISOLATED, because if I look at my cluster, the subnets it uses are all private subnets (their route tables send traffic to the TGW in the network account and do not route to the IGW).

Furthermore, using vpc_subnets=[ec2.SubnetSelection(subnet_type=ec2.SubnetType.PRIVATE_WITH_EGRESS)] causes RuntimeError: There are no 'Private' subnet groups in this VPC. Available types: Isolated,Deprecated_Isolated,Public. And to answer your question @pahud: I could be wrong, but I don't think I have private subnets with egress if that requires a NAT GW; I do have a TGW, though; shouldn't that work as well?

How do I use ec2.SubnetType.PRIVATE_WITH_EGRESS with a TGW instead of a NAT GW?
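
One hedged workaround for this situation (a sketch I have not verified against a TGW setup): import the VPC with Vpc.from_vpc_attributes and list the TGW-routed subnets as private_subnet_ids, which CDK should then classify as PRIVATE_WITH_EGRESS even though there is no NAT gateway. The VPC ID, AZs and subnet IDs below are placeholders:

from aws_cdk import aws_ec2 as ec2

# placeholders: substitute your real VPC ID, availability zones and TGW-routed subnet IDs
vpc = ec2.Vpc.from_vpc_attributes(
    self,
    "VPCLookup",
    vpc_id="vpc-0123456789abcdef0",
    availability_zones=["us-east-1a", "us-east-1b"],
    private_subnet_ids=["subnet-aaaa1111", "subnet-bbbb2222"],
)

# a PRIVATE_WITH_EGRESS selection should now resolve to those subnets
cluster_subnets = ec2.SubnetSelection(subnet_type=ec2.SubnetType.PRIVATE_WITH_EGRESS)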

@smislam

smislam commented Dec 30, 2023

@andreprawira, your setup should work. There is a bug in older versions of CDK that causes an issue with Transit Gateway; I ran into it a while back. Any chance you are using an older version of CDK?
Can you please try the latest version?

@andreprawira

@smislam I just updated my CDK from version 2.115.0 to 2.117.0, and below is my code:

vpc = ec2.Vpc.from_lookup(self, "VPCLookup", vpc_id=props.vpc_id)

# provisioning a cluster
cluster = eks.Cluster(
    self,
    "eks-cluster",
    version=eks.KubernetesVersion.V1_28,
    kubectl_layer=lambda_layer_kubectl_v28.KubectlV28Layer(self, "kubectl-layer"),
    # place_cluster_handler_in_vpc=True,
    cluster_name=f"{props.customer}-eks-cluster",
    default_capacity_instance=ec2.InstanceType("t3.medium"),
    default_capacity=2,
    alb_controller=eks.AlbControllerOptions(version=eks.AlbControllerVersion.V2_6_2),
    vpc=vpc,
    vpc_subnets=[ec2.SubnetSelection(subnet_type=ec2.SubnetType.PRIVATE_WITH_EGRESS)],
    # masters_role=iam.Role(self, "masters-role", assumed_by=iam.AccountRootPrincipal()),
)

But I am still getting the same RuntimeError: There are no 'Private' subnet groups in this VPC. Available types: Isolated,Deprecated_Isolated,Public

@smislam

smislam commented Dec 30, 2023

That is strange. I am not sure what is happening @andreprawira. We will need @pahud and the AWS CDK team to look deeper into this. Happy coding and a happy New Year!

@pahud
Contributor

pahud commented Jan 2, 2024

@andreprawira

I think you still can use private isolated for the vpc_subnets as below:

vpc_subnets=[ec2.SubnetSelection(subnet_type=ec2.SubnetType.PRIVATE_ISOLATED)],

But if you look at the synthesized template, there is a chance that:

  1. Your Lambda function for the kubectl handler is associated with the isolated subnets, which means:
    a. your kubectl Lambda handler may not be able to reach the AWS EKS API endpoint over the public internet unless the isolated subnets have the relevant VPC endpoints enabled.
    b. your kubectl Lambda handler may not be able to reach the cluster endpoint if it is public only.
  2. Your nodegroup may be deployed in the isolated subnets and may not be able to pull images from ECR Public unless the relevant VPC endpoints or proxy configuration are in place.

Technically, it is possible to deploy an EKS cluster with isolated subnets, but there are a lot of requirements to consider. We don't have a working sample for now, and we will need more feedback from the community before we know how to do that and add it to the documentation.

We have a p1 tracking issue for EKS clusters with isolated-subnet support at #12171 - we will need to close that first, but it should not be relevant to the ALB controller.
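
For anyone who does want to try the isolated-subnet route, here is a rough sketch of the VPC endpoints usually involved (not a verified working sample, per the caveat above; the exact set depends on your workload):

from aws_cdk import aws_ec2 as ec2

# sketch only: endpoints commonly needed so the kubectl handler and the nodes
# can reach AWS APIs and pull images without internet egress
vpc.add_gateway_endpoint("s3", service=ec2.GatewayVpcEndpointAwsService.S3)
vpc.add_interface_endpoint("ecr-api", service=ec2.InterfaceVpcEndpointAwsService.ECR)
vpc.add_interface_endpoint("ecr-docker", service=ec2.InterfaceVpcEndpointAwsService.ECR_DOCKER)
vpc.add_interface_endpoint("sts", service=ec2.InterfaceVpcEndpointAwsService.STS)
vpc.add_interface_endpoint("eks", service=ec2.InterfaceVpcEndpointAwsService.EKS)

Note that the ALB controller Helm chart itself is fetched from https://aws.github.io/eks-charts, which no VPC endpoint covers, so that step would still need a proxy or some other egress path.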

@pahud pahud added the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label Jan 4, 2024

github-actions bot commented Jan 5, 2024

This issue has not received a response in a while. If you want to keep this issue open, please leave a comment below and auto-close will be canceled.

@github-actions github-actions bot added closing-soon This issue will automatically close in 4 days unless further comments are made. closed-for-staleness This issue was automatically closed because it hadn't received any attention in a while. and removed closing-soon This issue will automatically close in 4 days unless further comments are made. labels Jan 5, 2024