AWS Load Balancer Controller v2.1.3 + EKSCTL (Multiple tagged SGs) #3459

Closed
marciogmorales opened this issue Mar 19, 2021 · 6 comments
@marciogmorales

Hello,
I'm using eksctl to launch an Amazon EKS cluster consisting of:

1 Managed Linux node
2 Windows Nodes

The Windows nodes by default have two security groups created and attached:

1 - ClusterSharedNodeSecurityGroup
2 - SG for communication between the control plane and worker nodes in group windows-ng-ltsc

However, when trying to create a Service of type LoadBalancer (NLB), I receive the following error on the Service:

Warning SyncLoadBalancerFailed 23s (x5 over 101s) service-controller Error syncing load balancer: failed to ensure load balancer: Multiple tagged security groups found for instance i-0f7XXXX; ensure only the k8s security group is tagged; the tagged groups were sg-0cf0274a15cb31e9a(eksctl-eks-windows-nodegroup-windows-ng-ltsc-SG-7ZXXXXXZ) sg-04be95eb6f37ad123(eks-cluster-sg-eks-windows-1XXXX7)

I found people complaining about the same thing some years ago, but it sounds like the issue still persists. Any clue on how to solve it?

@aclevername
Contributor

Thanks for opening the issue @marciogmorales !

https://github.com/kubernetes/kubernetes/blob/68b4e26caf6ede7af577db4af62fb405b4dd47e6/staging/src/k8s.io/legacy-cloud-providers/aws/aws.go#L4111-L4138 the error is coming from the AWS cloud provider. It looks for the kubernetes.io/cluster/<name> tag on the security groups attached to an EC2 instance, and it errors when more than one security group has it. We currently add this tag to the default nodegroup SG that allows communication between the nodegroup and the control plane: https://github.com/weaveworks/eksctl/blob/d08516d74e6f4c533406b3e001b951343e04225c/pkg/cfn/builder/vpc.go#L493-L501
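
Roughly, the check behaves like the sketch below (a simplified illustration of the behaviour described above, not the actual cloud-provider code; the names are made up):

package main

import "fmt"

// securityGroup is a simplified stand-in for the EC2 security group data
// (ID, name, tags) that the cloud provider inspects.
type securityGroup struct {
	id   string
	name string
	tags map[string]string
}

// findClusterSecurityGroup illustrates the behaviour: among the security
// groups attached to an instance, exactly one may carry the
// kubernetes.io/cluster/<name> ownership tag.
func findClusterSecurityGroup(clusterName string, groups []securityGroup) (string, error) {
	clusterTagKey := "kubernetes.io/cluster/" + clusterName
	var tagged []securityGroup
	for _, sg := range groups {
		if _, ok := sg.tags[clusterTagKey]; ok {
			tagged = append(tagged, sg)
		}
	}
	switch len(tagged) {
	case 0:
		return "", fmt.Errorf("no tagged security group found")
	case 1:
		return tagged[0].id, nil
	default:
		return "", fmt.Errorf("multiple tagged security groups found; ensure only the k8s security group is tagged")
	}
}

func main() {
	// Two attached SGs carrying the cluster tag reproduces the error condition.
	groups := []securityGroup{
		{id: "sg-aaa", name: "nodegroup-SG", tags: map[string]string{"kubernetes.io/cluster/eks-windows": "owned"}},
		{id: "sg-bbb", name: "eks-cluster-sg", tags: map[string]string{"kubernetes.io/cluster/eks-windows": "owned"}},
	}
	_, err := findClusterSecurityGroup("eks-windows", groups)
	fmt.Println(err)
}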

The error you're getting would indicate that the ClusterSharedNodeSecurityGroup also has this tag, but that does not happen AFAIK: https://github.com/weaveworks/eksctl/blob/d08516d74e6f4c533406b3e001b951343e04225c/pkg/cfn/builder/vpc.go#L397-L400

Which nodegroup type is instance i-0f7XXXX, the Linux or the Windows one? And can you provide all the tags attached to the security groups attached to that instance? Thanks
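
If it helps to gather that, here is one way to dump every tag on the two SGs from the error programmatically (a sketch using the aws-sdk-go v1 API; the group IDs are just the ones from your error message, and the equivalent aws ec2 describe-security-groups CLI call would give the same information):

package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
	// Uses the default credential chain and region configuration.
	sess := session.Must(session.NewSession())
	svc := ec2.New(sess)

	// The two security group IDs reported in the error.
	out, err := svc.DescribeSecurityGroups(&ec2.DescribeSecurityGroupsInput{
		GroupIds: aws.StringSlice([]string{"sg-0cf0274a15cb31e9a", "sg-04be95eb6f37ad123"}),
	})
	if err != nil {
		panic(err)
	}

	// Print every tag on each group so the cluster-ownership tags are easy to spot.
	for _, sg := range out.SecurityGroups {
		fmt.Println(aws.StringValue(sg.GroupId), aws.StringValue(sg.GroupName))
		for _, tag := range sg.Tags {
			fmt.Printf("  %s=%s\n", aws.StringValue(tag.Key), aws.StringValue(tag.Value))
		}
	}
}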

@aclevername aclevername self-assigned this Mar 19, 2021
@marciogmorales
Author

Thanks for the quick feedback.

This is happening on my Windows node (instance i-0f7XXXX). The kubernetes.io/cluster/<name> tag has been created on both SGs.

Here are some screenshots of the Tags created:

ClusterSharedNodeSecurityGroup:
(screenshot of the ClusterSharedNodeSecurityGroup tags)

NodeSecurityGroup:
(screenshot of the NodeSecurityGroup tags)

@aclevername
Contributor

Thanks @marciogmorales. I don't see the kubernetes.io/cluster/<name> tag on the screenshot above for ClusterSharedNodeSecurityGroup. Are you sure you have the correct SGs? Are those two the same ones reported in the error? the tagged groups were sg-0cf0274a15cb31e9a(eksctl-eks-windows-nodegroup-windows-ng-ltsc-SG-7ZXXXXXZ) sg-04be95eb6f37ad123(eks-cluster-sg-eks-windows-1XXXX7)

@aclevername
Contributor

Closing due to inactivity.

@consideRatio
Contributor

consideRatio commented May 31, 2021

Hi @aclevername, I've run into this specific error as well and hope to revive this debugging effort (eksctl version: 0.52.0). I've tried to provide as much information as I can below.

UPDATE: RESOLVED

The issue was caused by having a tag called KubernetesCluster, which I had added to my eksctl config because I thought it was just a way to recognize the resources created by eksctl. It turned out to have another meaning that influenced things. Removing it did the trick.
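
For context on why that tag mattered: as far as I understand it, the legacy AWS cloud provider also recognizes the old KubernetesCluster tag as a cluster-ownership tag, so an extra security group carrying it counts as "tagged" just like one with kubernetes.io/cluster/<name>. A minimal sketch of that check (illustrative only, not the actual source; the real code may differ across versions):

package main

import "fmt"

// hasClusterTag sketches how a security group's tags are judged to belong to
// the cluster: either the kubernetes.io/cluster/<name> key or the legacy
// KubernetesCluster key (with the cluster name as value) counts.
func hasClusterTag(clusterName string, tags map[string]string) bool {
	for key, value := range tags {
		if key == "kubernetes.io/cluster/"+clusterName {
			return true
		}
		if key == "KubernetesCluster" && value == clusterName {
			return true
		}
	}
	return false
}

func main() {
	// Illustrative values: a shared SG carrying only my KubernetesCluster tag
	// would still count as a second "cluster" security group, hence the error.
	fmt.Println(hasClusterTag("jmte", map[string]string{"KubernetesCluster": "jmte"})) // true
}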


Info: background

  • I have created the entire cluster recently
  • I have multiple node groups, but they are all Linux-based and just of different sizes.
  • I have only one running node from all these node groups, from the core-a node group.
  • I have not created any AWS security group manually or similar; to my knowledge they were all created via eksctl.

Info: kubectl apply -f my-svc.yaml

Note that I've deleted this k8s Service and created it again without any change in outcome.

apiVersion: v1
kind: Service
metadata:
  name: proxy-public
  namespace: prod
spec:
  ports:
  - name: https
    port: 443
    targetPort: https
  - name: http
    port: 80
    targetPort: http
  selector:
    component: autohttps
    release: prod
  type: LoadBalancer

Info: kubectl describe svc my-svc

Note the errors are about the security groups eksctl-jmte-cluster-ClusterSharedNodeSecurityGroup-2LRGJW3MQ4O8 and eksctl-jmte-nodegroup-core-a-SG-LGMTEQ7JLTNX.

Name:                     proxy-public
Namespace:                prod
Labels:                   app=jupyterhub
                          app.kubernetes.io/managed-by=Helm
                          chart=jupyterhub-1.0.0-beta.1
                          component=proxy-public
                          heritage=Helm
                          release=prod
Annotations:              meta.helm.sh/release-name: prod
                          meta.helm.sh/release-namespace: prod
Selector:                 component=autohttps,release=prod
Type:                     LoadBalancer
IP:                       10.100.101.229
Port:                     https  443/TCP
TargetPort:               https/TCP
NodePort:                 https  32624/TCP
Endpoints:                192.168.22.121:8443
Port:                     http  80/TCP
TargetPort:               http/TCP
NodePort:                 http  32559/TCP
Endpoints:                192.168.22.121:8080
Session Affinity:         None
External Traffic Policy:  Cluster
Events:
  Type     Reason                  Age               From                Message
  ----     ------                  ----              ----                -------
  Normal   EnsuringLoadBalancer    5s (x2 over 12s)  service-controller  Ensuring load balancer
  Warning  SyncLoadBalancerFailed  4s (x2 over 10s)  service-controller  Error syncing load balancer: failed to ensure load balancer: Multiple tagged security groups found for instance i-0a24c650cd9fbfc68; ensure only the k8s security group is tagged; the tagged groups were sg-0731df37a5a8e6844(eksctl-jmte-cluster-ClusterSharedNodeSecurityGroup-2LRGJW3MQ4O8) sg-00efe23d69c9c0c4a(eksctl-jmte-nodegroup-core-a-SG-LGMTEQ7JLTNX)

Info: security groups details

core-a SG (an unmanaged nodegroup, the first in my eksctl config list) and ClusterSharedNodeSecurityGroup:
(two screenshots of the security group details)

Info: all security groups

These are all the security groups related to this VPC; I omitted a few related to another VPC.

(screenshot of all security groups in the VPC)

Info: about associated instance

instance info: tags section

(screenshot of the instance's tags)

instance info: security section

(screenshot of the instance's security groups)

Info: eksctl-cluster-config.yaml

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: jmte
  region: us-west-2
  version: "1.19"
  tags:
    2i2c.org/project: jmte

availabilityZones: [us-west-2d, us-west-2b, us-west-2a]

iam:
  withOIDC: true

nodeGroups:
  - name: core-a
    availabilityZones: [us-west-2d]
    instanceType: m5.large
    minSize: 0
    maxSize: 2
    desiredCapacity: 1
    volumeSize: 80
    labels:
      hub.jupyter.org/node-purpose: core
    tags:
      k8s.io/cluster-autoscaler/node-template/label/hub.jupyter.org/node-purpose: core
    iam:
      withAddonPolicies:
        autoScaler: true
        efs: true

  - name: user-a
    availabilityZones: [us-west-2d]
    instanceType: m5.xlarge   # 57 pods, 4 cpu, 16 GB
    minSize: 0
    maxSize: 20
    desiredCapacity: 0
    volumeSize: 80
    labels:
      hub.jupyter.org/node-purpose: user
    tags:
      k8s.io/cluster-autoscaler/node-template/label/hub.jupyter.org/node-purpose: user
    iam:
      withAddonPolicies:
        autoScaler: true
        efs: true

  - name: worker-xlarge
    availabilityZones: &availabilityZones [us-west-2d, us-west-2b, us-west-2a]
    minSize: &minSize 0
    maxSize: &maxSize 8
    desiredCapacity: &desiredCapacity 0
    volumeSize: &volumeSize 80
    labels: &labels
      k8s.dask.org/node-purpose: worker
    taints: &taints
      k8s.dask.org_dedicated: worker:NoSchedule
    tags: &tags
      k8s.io/cluster-autoscaler/node-template/label/k8s.dask.org/node-purpose: worker
      k8s.io/cluster-autoscaler/node-template/taint/k8s.dask.org_dedicated: worker:NoSchedule
    iam: &iam
      withAddonPolicies:
        autoScaler: true
        efs: true
    instancesDistribution:
      instanceTypes:
        - m5a.xlarge      # 57 pods, 4 cpu, 16 GB (AMD,   10 GBits network,  100% cost)
        - m5.xlarge       # 57 pods, 4 cpu, 16 GB (Intel, 10 GBits network, ~112% cost)
        # - m5n.xlarge    # 57 pods, 4 cpu, 16 GB (Intel, 25 GBits network, ~139% cost)
      onDemandBaseCapacity: &onDemandBaseCapacity 0
      onDemandPercentageAboveBaseCapacity: &onDemandPercentageAboveBaseCapacity 0
      spotAllocationStrategy: &spotAllocationStrategy capacity-optimized

  - name: worker-2xlarge
    availabilityZones: *availabilityZones
    minSize: *minSize
    maxSize: *maxSize
    desiredCapacity: *desiredCapacity
    volumeSize: *volumeSize
    labels: *labels
    taints: *taints
    tags: *tags
    iam: *iam
    instancesDistribution:
      instanceTypes:
        - m5a.2xlarge     # 57 pods, 8 cpu, 32 GB (AMD,   10 GBits network,  100% cost)
        - m5.2xlarge      # 57 pods, 8 cpu, 32 GB (Intel, 10 GBits network, ~112% cost)
        # - m5n.2xlarge   # 57 pods, 8 cpu, 32 GB (Intel, 25 GBits network, ~139% cost)
      onDemandBaseCapacity: *onDemandBaseCapacity
      onDemandPercentageAboveBaseCapacity: *onDemandPercentageAboveBaseCapacity
      spotAllocationStrategy: *spotAllocationStrategy

  # ... repeating entries for worker-4xlarge, worker-8xlarge, and worker-16xlarge omitted

@consideRatio
Contributor

I updated my comment above with the resolution to the issue for me, see #3459 (comment).
