Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for Security Group specification/override #474

Merged
merged 6 commits into from
Jun 25, 2021
Merged

Conversation

ellistarn
Copy link
Contributor

Issue #, if available:

Description of changes:

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

pkg/cloudprovider/aws/ami.go Show resolved Hide resolved
pkg/cloudprovider/aws/constraints.go Outdated Show resolved Hide resolved
pkg/cloudprovider/aws/launchtemplate.go Outdated Show resolved Hide resolved
Constraints *Constraints
Cluster v1alpha1.ClusterSpec
}{constraints, *provisioner.Spec.Cluster}); err != nil {
panic(fmt.Sprintf("Parsing user data from %v, %v, %s", provisioner, constraints, err.Error()))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like if missed the validation and there is something wrong user has added to provisioner, we want to panic here? This will impact others too, if there are multiple provisioners?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We rely on cluster being set in so many other places that you'd seg fault well before this line. I want to address this holistically at some point, but for now, we assume some invariants.

pkg/cloudprovider/aws/securitygroups.go Outdated Show resolved Hide resolved
Values: []*string{aws.String(fmt.Sprintf(ClusterTagKeyFormat, clusterName))},
}},
})
if err != nil {
return nil, fmt.Errorf("describing security groups with tag key %s, %w", fmt.Sprintf(ClusterTagKeyFormat, clusterName), err)
}
s.cache.Set(clusterName, output.SecurityGroups, CacheTTL)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is a scenario for a customer use case-

  • Installed Karpenter, created a pending pods, Karpenter brings up an instance
  • User checks the security group doesn't like it, reads the documentation, can create their own security group
  • User creates a new security group, adds the SG name to pod and re-creates the pod

Will we not see the new security group until cacheTTL expires?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, this will fail (and loop) until the security group becomes visible and it eventually heals.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may not be a great user experience because in some cases we can immediately get from AWS the latest (AMIID) and in some cases like security group we have to wait for cache to expire.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May be we can do something like this-

func (s *SecurityGroupProvider) Get(ctx context.Context, provisioner *v1alpha1.Provisioner, constraints *Constraints) ([]*ec2.SecurityGroup, error) {
	// 1. Get Security Groups from cache 
	securityGroups, ok := s.cache.Get(clusterName); 
	// 2. Filter by subnet name and group tag key if constrained
	// 3. If no security groups found, call AWS API to get security groups and set the cache
	if len(securityGroups) == 0 {
                securityGroups, err = s.getSecurityGroups(ctx, provisioner.Spec.Cluster.Name)
	        if err != nil {
		        return nil, err
	        }
                // Filter by subnet name and group tag key if constrained
	}
	return securityGroups, nil
}

Essentially we are removing the cache.Get functionality from getSecurityGroups

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We still need to filter by subnet name and group tag key after we call the AWS API though, right?

@ellistarn ellistarn changed the title Support for Security Group specification/override [WIP] Support for Security Group specification/override Jun 23, 2021
@ellistarn ellistarn force-pushed the sg branch 4 times, most recently from abbe133 to 3f2e106 Compare June 23, 2021 22:04
@ellistarn ellistarn changed the title [WIP] Support for Security Group specification/override Support for Security Group specification/override Jun 23, 2021
@ellistarn ellistarn force-pushed the sg branch 5 times, most recently from 9f1e67e to 46a02ba Compare June 23, 2021 23:46
pkg/cloudprovider/aws/launchtemplate.go Show resolved Hide resolved
Values: []*string{aws.String(fmt.Sprintf(ClusterTagKeyFormat, clusterName))},
}},
})
if err != nil {
return nil, fmt.Errorf("describing security groups with tag key %s, %w", fmt.Sprintf(ClusterTagKeyFormat, clusterName), err)
}
s.cache.Set(clusterName, output.SecurityGroups, CacheTTL)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may not be a great user experience because in some cases we can immediately get from AWS the latest (AMIID) and in some cases like security group we have to wait for cache to expire.

Values: []*string{aws.String(fmt.Sprintf(ClusterTagKeyFormat, clusterName))},
}},
})
if err != nil {
return nil, fmt.Errorf("describing security groups with tag key %s, %w", fmt.Sprintf(ClusterTagKeyFormat, clusterName), err)
}
s.cache.Set(clusterName, output.SecurityGroups, CacheTTL)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May be we can do something like this-

func (s *SecurityGroupProvider) Get(ctx context.Context, provisioner *v1alpha1.Provisioner, constraints *Constraints) ([]*ec2.SecurityGroup, error) {
	// 1. Get Security Groups from cache 
	securityGroups, ok := s.cache.Get(clusterName); 
	// 2. Filter by subnet name and group tag key if constrained
	// 3. If no security groups found, call AWS API to get security groups and set the cache
	if len(securityGroups) == 0 {
                securityGroups, err = s.getSecurityGroups(ctx, provisioner.Spec.Cluster.Name)
	        if err != nil {
		        return nil, err
	        }
                // Filter by subnet name and group tag key if constrained
	}
	return securityGroups, nil
}

Essentially we are removing the cache.Get functionality from getSecurityGroups

pkg/cloudprovider/aws/utils/predicates/tags.go Outdated Show resolved Hide resolved
@ellistarn
Copy link
Contributor Author

Hey @prateekgogia. Github won't let me respond for some reason, but I've made the changes you requested except for the cache invalidation on security groups. I've measured all of the AWS API calls, which are between 50 and 100ms. This makes me comfortable to simply reduce the cache ttl to 1 minute.

// resources. Cache hits enable faster provisioning and reduced API load on
// AWS APIs, which can have a serious import on performance and scalability.
// DO NOT CHANGE THIS VALUE WITHOUT DUE CONSIDERATION
CacheTTL = 60 * time.Second
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just noticed this we are setting cacheTTL at two places?

	p.cache.Set(name, ami, CacheTTL)

and here-

	return &AMIProvider{
		ssm:       ssm,
		clientSet: clientSet,
		cache:     cache.New(CacheTTL, CacheCleanupInterval),
	}

So may be we can just set the smaller cacheTTL for security groups instead of changing for all resources? WDYT?

prateekgogia
prateekgogia previously approved these changes Jun 24, 2021
Copy link
Contributor

@prateekgogia prateekgogia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

prateekgogia
prateekgogia previously approved these changes Jun 24, 2021
return reconcile.Result{Requeue: true}, nil
_, err := c.Controller.Reconcile(ctx, resource)
if err != nil {
zap.S().Errorf("Controller failed to reconcile kind %s, %s", resource.GetObjectKind().GroupVersionKind().Kind, err.Error())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So now we won't requeue right away once we hit an error? I am not sure if this can cause issues, and we might have missed testing this?
Why do we need to make this change though?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After we fail a reconcile loop, we need to patch the status to say that the reconciliation failed. Regardless of if that patch fails or not, we will end up requeue-ing because returning an error requeues.

pkg/cloudprovider/aws/ami.go Show resolved Hide resolved
// resources. Cache hits enable faster provisioning and reduced API load on
// AWS APIs, which can have a serious import on performance and scalability.
// DO NOT CHANGE THIS VALUE WITHOUT DUE CONSIDERATION
CacheTTL = 60 * time.Second
// CacheCleanupInterval triggers cache cleanup (lazy eviction) at this interval.
CacheCleanupInterval = 10 * time.Minute
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realize this was set in a previous PR, but I wanted to check if there was a specific reason we pick 10 minutes here and TTL as 60 seconds above?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No reason. Just a shot in the dark.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This value was lowered to reduce @prateekgogia's concerns about 5 minutes being a long time to start using security groups and subnets after they're created.

pkg/cloudprovider/aws/launchtemplate.go Show resolved Hide resolved
launchTemplate, err := p.getLaunchTemplate(ctx, &options)
// 4. Ensure the launch template exists, or create it
launchTemplate, err := p.ensureLaunchTemplate(ctx, &launchTemplateOptions{
Cluster: *provisioner.Spec.Cluster,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like the launch templates are pulled from the spec of the provisioner. Just wanted to check that we can't create different Launch Templates for the same cluster from different provisioners? Seems like based on this PR's changes, we cannot.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cache key is the name, which is a hash of launchTemplateOptions. You can have many launch templates per cluster.

pkg/cloudprovider/aws/node.go Show resolved Hide resolved
Values: []*string{aws.String(fmt.Sprintf(ClusterTagKeyFormat, clusterName))},
}},
})
if err != nil {
return nil, fmt.Errorf("describing security groups with tag key %s, %w", fmt.Sprintf(ClusterTagKeyFormat, clusterName), err)
}
s.cache.Set(clusterName, output.SecurityGroups, CacheTTL)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We still need to filter by subnet name and group tag key after we call the AWS API though, right?

njtran
njtran previously approved these changes Jun 24, 2021
Copy link
Contributor

@njtran njtran left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! LGTM.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants