Support for Security Group specification/override #474

ellistarn · 2021-06-23T00:04:39Z

Issue #, if available:

Description of changes:

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

pkg/cloudprovider/aws/ami.go

pkg/cloudprovider/aws/constraints.go

pkg/cloudprovider/aws/launchtemplate.go

prateekgogia · 2021-06-23T15:22:05Z

pkg/cloudprovider/aws/launchtemplate.go

+		Constraints *Constraints
+		Cluster     v1alpha1.ClusterSpec
+	}{constraints, *provisioner.Spec.Cluster}); err != nil {
+		panic(fmt.Sprintf("Parsing user data from %v, %v, %s", provisioner, constraints, err.Error()))


It seems that if we missed validation and the user has added something wrong to the provisioner, we would panic here. Won't this impact other provisioners too, if there are multiple?

We rely on the cluster being set in so many other places that you'd segfault well before this line. I want to address this holistically at some point, but for now we assume some invariants.

pkg/cloudprovider/aws/securitygroups.go

prateekgogia · 2021-06-23T15:30:45Z

pkg/cloudprovider/aws/securitygroups.go

 			Values: []*string{aws.String(fmt.Sprintf(ClusterTagKeyFormat, clusterName))},
 		}},
 	})
 	if err != nil {
 		return nil, fmt.Errorf("describing security groups with tag key %s, %w", fmt.Sprintf(ClusterTagKeyFormat, clusterName), err)
 	}
+	s.cache.Set(clusterName, output.SecurityGroups, CacheTTL)


Here is a scenario for a customer use case:

The user installs Karpenter and creates pending pods; Karpenter brings up an instance.

The user checks the security group, doesn't like it, and reads in the documentation that they can create their own.

The user creates a new security group, adds the SG name to the pod, and re-creates the pod.

Will we not see the new security group until CacheTTL expires?

Correct. This will fail (and loop) until the security group becomes visible, at which point it heals.

This may not be a great user experience, because in some cases (the AMI ID) we can get the latest value from AWS immediately, while in others, like security groups, we have to wait for the cache to expire.

Maybe we can do something like this:

func (s *SecurityGroupProvider) Get(ctx context.Context, provisioner *v1alpha1.Provisioner, constraints *Constraints) ([]*ec2.SecurityGroup, error) {
	clusterName := provisioner.Spec.Cluster.Name
	// 1. Get security groups from the cache
	var securityGroups []*ec2.SecurityGroup
	if cached, ok := s.cache.Get(clusterName); ok {
		securityGroups = cached.([]*ec2.SecurityGroup)
	}
	// 2. Filter by subnet name and group tag key if constrained
	// 3. If no security groups are found, call the AWS API and set the cache
	if len(securityGroups) == 0 {
		var err error
		securityGroups, err = s.getSecurityGroups(ctx, clusterName)
		if err != nil {
			return nil, err
		}
		// Filter by subnet name and group tag key if constrained
	}
	return securityGroups, nil
}

Essentially, we are removing the cache.Get functionality from getSecurityGroups.

We still need to filter by subnet name and group tag key after we call the AWS API though, right?

pkg/cloudprovider/aws/launchtemplate.go

pkg/cloudprovider/aws/utils/predicates/tags.go

ellistarn · 2021-06-24T16:52:37Z

Hey @prateekgogia. GitHub won't let me respond for some reason, but I've made the changes you requested except for the cache invalidation on security groups. I've measured all of the AWS API calls, and they take between 50 and 100 ms. This makes me comfortable simply reducing the cache TTL to 1 minute.

prateekgogia · 2021-06-24T19:57:48Z

pkg/cloudprovider/aws/cloudprovider.go

+	// resources. Cache hits enable faster provisioning and reduced API load on
+	// AWS APIs, which can have a serious import on performance and scalability.
+	// DO NOT CHANGE THIS VALUE WITHOUT DUE CONSIDERATION
+	CacheTTL = 60 * time.Second


Just noticed that we are setting CacheTTL in two places:

p.cache.Set(name, ami, CacheTTL)

and here:

return &AMIProvider{
	ssm:       ssm,
	clientSet: clientSet,
	cache:     cache.New(CacheTTL, CacheCleanupInterval),
}

So maybe we can just set a smaller TTL for security groups instead of changing it for all resources? WDYT?

prateekgogia

LGTM!

prateekgogia · 2021-06-24T22:36:10Z

pkg/controllers/controller.go

-		return reconcile.Result{Requeue: true}, nil
+	_, err := c.Controller.Reconcile(ctx, resource)
+	if err != nil {
+		zap.S().Errorf("Controller failed to reconcile kind %s, %s", resource.GetObjectKind().GroupVersionKind().Kind, err.Error())


So now we won't requeue right away once we hit an error? I am not sure whether this can cause issues, and we might have missed testing it.
Why do we need to make this change, though?

After a reconcile loop fails, we need to patch the status to say that the reconciliation failed. Whether or not that patch succeeds, we still end up requeueing, because returning an error requeues.

pkg/cloudprovider/aws/ami.go

njtran · 2021-06-24T21:49:16Z

pkg/cloudprovider/aws/cloudprovider.go

+	// resources. Cache hits enable faster provisioning and reduced API load on
+	// AWS APIs, which can have a serious import on performance and scalability.
+	// DO NOT CHANGE THIS VALUE WITHOUT DUE CONSIDERATION
+	CacheTTL = 60 * time.Second
 	// CacheCleanupInterval triggers cache cleanup (lazy eviction) at this interval.
 	CacheCleanupInterval = 10 * time.Minute


I realize this was set in a previous PR, but I wanted to check whether there was a specific reason we picked 10 minutes here and a TTL of 60 seconds above.

No reason. Just a shot in the dark.

The TTL was lowered to address @prateekgogia's concern that 5 minutes is a long time to wait before newly created security groups and subnets can be used.

pkg/cloudprovider/aws/launchtemplate.go

njtran · 2021-06-24T22:06:02Z

pkg/cloudprovider/aws/launchtemplate.go

-	launchTemplate, err := p.getLaunchTemplate(ctx, &options)
+	// 4. Ensure the launch template exists, or create it
+	launchTemplate, err := p.ensureLaunchTemplate(ctx, &launchTemplateOptions{
+		Cluster:        *provisioner.Spec.Cluster,


It looks like the launch templates are pulled from the spec of the provisioner. Just wanted to check: does this mean we can't create different launch templates for the same cluster from different provisioners? Based on this PR's changes, it seems we cannot.

The cache key is the name, which is a hash of launchTemplateOptions. You can have many launch templates per cluster.
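
To make the "name is a hash of the options" idea concrete, here is a minimal sketch. The field set of `launchTemplateOptions`, the `launchTemplateName` helper, the SHA-256 over JSON encoding, and the name format are all assumptions for illustration; the PR's actual hashing approach is not shown in this thread.

```go
package main

import (
	"crypto/sha256"
	"encoding/json"
	"fmt"
)

// launchTemplateOptions is a hypothetical subset of the inputs that
// distinguish one launch template from another.
type launchTemplateOptions struct {
	ClusterName      string
	AMIID            string
	SecurityGroupIDs []string
}

// launchTemplateName derives a deterministic name from the options, so that
// identical options map to the same template and different options do not.
func launchTemplateName(o launchTemplateOptions) (string, error) {
	encoded, err := json.Marshal(o)
	if err != nil {
		return "", fmt.Errorf("encoding launch template options, %w", err)
	}
	sum := sha256.Sum256(encoded)
	// Use a short prefix of the digest to keep the name readable.
	return fmt.Sprintf("karpenter-%s-%x", o.ClusterName, sum[:8]), nil
}

func main() {
	a, _ := launchTemplateName(launchTemplateOptions{
		ClusterName: "demo", AMIID: "ami-1", SecurityGroupIDs: []string{"sg-1"},
	})
	b, _ := launchTemplateName(launchTemplateOptions{
		ClusterName: "demo", AMIID: "ami-1", SecurityGroupIDs: []string{"sg-2"},
	})
	// Same cluster, different security groups: two distinct template names.
	fmt.Println(a != b)
}
```

Note that in this sketch the order of `SecurityGroupIDs` affects the hash, so a caller would likely sort the IDs before hashing.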

pkg/cloudprovider/aws/node.go

njtran

Nice work! LGTM.

prateekgogia reviewed Jun 23, 2021

ellistarn force-pushed the sg branch from ec3e8b4 to a60082c Compare June 23, 2021 18:02

ellistarn changed the title ~~Support for Security Group specification/override~~ [WIP] Support for Security Group specification/override Jun 23, 2021

ellistarn force-pushed the sg branch 4 times, most recently from abbe133 to 3f2e106 Compare June 23, 2021 22:04

ellistarn changed the title ~~[WIP] Support for Security Group specification/override~~ Support for Security Group specification/override Jun 23, 2021

ellistarn force-pushed the sg branch 5 times, most recently from 9f1e67e to 46a02ba Compare June 23, 2021 23:46

prateekgogia suggested changes Jun 24, 2021

ellistarn force-pushed the sg branch from cd4d54b to d34ed52 Compare June 24, 2021 17:01

prateekgogia reviewed Jun 24, 2021

prateekgogia previously approved these changes Jun 24, 2021

ellistarn dismissed prateekgogia’s stale review via a9cd814 June 24, 2021 20:15

ellistarn force-pushed the sg branch from a9cd814 to 20f21cc Compare June 24, 2021 20:17

prateekgogia previously approved these changes Jun 24, 2021

ellistarn dismissed prateekgogia’s stale review via a67b5db June 24, 2021 21:22

ellistarn force-pushed the sg branch from a67b5db to e6c3c94 Compare June 24, 2021 21:22

prateekgogia reviewed Jun 24, 2021

njtran reviewed Jun 24, 2021

njtran previously approved these changes Jun 24, 2021

ellistarn added 4 commits June 24, 2021 16:47

Support for Security Group specification/override

827a2ce

PR comments

f783b4b

Simplified logging statements

e23ae7a

Added kube apiserver version cache

0f41b13

Check if subnets are empty before passing to the instance provider

7a095af

ellistarn dismissed njtran’s stale review via 4ba52f5 June 24, 2021 23:49

ellistarn force-pushed the sg branch from e6c3c94 to 4ba52f5 Compare June 24, 2021 23:49

Patch resource even if error to resolve a test failure

4cd2f76

ellistarn force-pushed the sg branch from 4ba52f5 to 4cd2f76 Compare June 24, 2021 23:50

njtran approved these changes Jun 24, 2021

ellistarn merged commit 08b78c7 into aws:main Jun 25, 2021

ellistarn deleted the sg branch June 25, 2021 03:22

This was referenced Jun 25, 2021

AWS: Security Group Discovery/Override #450

Closed

Upgrade launch templates k8s version when masters upgrade #422

Closed

[AWS Cloud Provider] - Support custom security groups. #376

Closed

gfcroft pushed a commit to gfcroft/karpenter-provider-aws that referenced this pull request Nov 25, 2023

chore: Update drift reason (aws#474)

b0a51da
