-
Notifications
You must be signed in to change notification settings - Fork 983
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support for Security Group specification/override #474
Conversation
Constraints *Constraints | ||
Cluster v1alpha1.ClusterSpec | ||
}{constraints, *provisioner.Spec.Cluster}); err != nil { | ||
panic(fmt.Sprintf("Parsing user data from %v, %v, %s", provisioner, constraints, err.Error())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like if missed the validation and there is something wrong user has added to provisioner, we want to panic here? This will impact others too, if there are multiple provisioners?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We rely on cluster being set in so many other places that you'd seg fault well before this line. I want to address this holistically at some point, but for now, we assume some invariants.
Values: []*string{aws.String(fmt.Sprintf(ClusterTagKeyFormat, clusterName))}, | ||
}}, | ||
}) | ||
if err != nil { | ||
return nil, fmt.Errorf("describing security groups with tag key %s, %w", fmt.Sprintf(ClusterTagKeyFormat, clusterName), err) | ||
} | ||
s.cache.Set(clusterName, output.SecurityGroups, CacheTTL) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is a scenario for a customer use case-
- Installed Karpenter, created a pending pods, Karpenter brings up an instance
- User checks the security group doesn't like it, reads the documentation, can create their own security group
- User creates a new security group, adds the SG name to pod and re-creates the pod
Will we not see the new security group until cacheTTL expires?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct, this will fail (and loop) until the security group becomes visible and it eventually heals.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may not be a great user experience because in some cases we can immediately get from AWS the latest (AMIID) and in some cases like security group we have to wait for cache to expire.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May be we can do something like this-
func (s *SecurityGroupProvider) Get(ctx context.Context, provisioner *v1alpha1.Provisioner, constraints *Constraints) ([]*ec2.SecurityGroup, error) {
// 1. Get Security Groups from cache
securityGroups, ok := s.cache.Get(clusterName);
// 2. Filter by subnet name and group tag key if constrained
// 3. If no security groups found, call AWS API to get security groups and set the cache
if len(securityGroups) == 0 {
securityGroups, err = s.getSecurityGroups(ctx, provisioner.Spec.Cluster.Name)
if err != nil {
return nil, err
}
// Filter by subnet name and group tag key if constrained
}
return securityGroups, nil
}
Essentially we are removing the cache.Get functionality from getSecurityGroups
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We still need to filter by subnet name and group tag key after we call the AWS API though, right?
abbe133
to
3f2e106
Compare
9f1e67e
to
46a02ba
Compare
Values: []*string{aws.String(fmt.Sprintf(ClusterTagKeyFormat, clusterName))}, | ||
}}, | ||
}) | ||
if err != nil { | ||
return nil, fmt.Errorf("describing security groups with tag key %s, %w", fmt.Sprintf(ClusterTagKeyFormat, clusterName), err) | ||
} | ||
s.cache.Set(clusterName, output.SecurityGroups, CacheTTL) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may not be a great user experience because in some cases we can immediately get from AWS the latest (AMIID) and in some cases like security group we have to wait for cache to expire.
Values: []*string{aws.String(fmt.Sprintf(ClusterTagKeyFormat, clusterName))}, | ||
}}, | ||
}) | ||
if err != nil { | ||
return nil, fmt.Errorf("describing security groups with tag key %s, %w", fmt.Sprintf(ClusterTagKeyFormat, clusterName), err) | ||
} | ||
s.cache.Set(clusterName, output.SecurityGroups, CacheTTL) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May be we can do something like this-
func (s *SecurityGroupProvider) Get(ctx context.Context, provisioner *v1alpha1.Provisioner, constraints *Constraints) ([]*ec2.SecurityGroup, error) {
// 1. Get Security Groups from cache
securityGroups, ok := s.cache.Get(clusterName);
// 2. Filter by subnet name and group tag key if constrained
// 3. If no security groups found, call AWS API to get security groups and set the cache
if len(securityGroups) == 0 {
securityGroups, err = s.getSecurityGroups(ctx, provisioner.Spec.Cluster.Name)
if err != nil {
return nil, err
}
// Filter by subnet name and group tag key if constrained
}
return securityGroups, nil
}
Essentially we are removing the cache.Get functionality from getSecurityGroups
Hey @prateekgogia. Github won't let me respond for some reason, but I've made the changes you requested except for the cache invalidation on security groups. I've measured all of the AWS API calls, which are between 50 and 100ms. This makes me comfortable to simply reduce the cache ttl to 1 minute. |
// resources. Cache hits enable faster provisioning and reduced API load on | ||
// AWS APIs, which can have a serious import on performance and scalability. | ||
// DO NOT CHANGE THIS VALUE WITHOUT DUE CONSIDERATION | ||
CacheTTL = 60 * time.Second |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just noticed this we are setting cacheTTL at two places?
p.cache.Set(name, ami, CacheTTL)
and here-
return &AMIProvider{
ssm: ssm,
clientSet: clientSet,
cache: cache.New(CacheTTL, CacheCleanupInterval),
}
So may be we can just set the smaller cacheTTL for security groups instead of changing for all resources? WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
pkg/controllers/controller.go
Outdated
return reconcile.Result{Requeue: true}, nil | ||
_, err := c.Controller.Reconcile(ctx, resource) | ||
if err != nil { | ||
zap.S().Errorf("Controller failed to reconcile kind %s, %s", resource.GetObjectKind().GroupVersionKind().Kind, err.Error()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So now we won't requeue right away once we hit an error? I am not sure if this can cause issues, and we might have missed testing this?
Why do we need to make this change though?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After we fail a reconcile loop, we need to patch the status to say that the reconciliation failed. Regardless of if that patch fails or not, we will end up requeue-ing because returning an error requeues.
// resources. Cache hits enable faster provisioning and reduced API load on | ||
// AWS APIs, which can have a serious import on performance and scalability. | ||
// DO NOT CHANGE THIS VALUE WITHOUT DUE CONSIDERATION | ||
CacheTTL = 60 * time.Second | ||
// CacheCleanupInterval triggers cache cleanup (lazy eviction) at this interval. | ||
CacheCleanupInterval = 10 * time.Minute |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I realize this was set in a previous PR, but I wanted to check if there was a specific reason we pick 10 minutes here and TTL as 60 seconds above?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No reason. Just a shot in the dark.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This value was lowered to reduce @prateekgogia's concerns about 5 minutes being a long time to start using security groups and subnets after they're created.
launchTemplate, err := p.getLaunchTemplate(ctx, &options) | ||
// 4. Ensure the launch template exists, or create it | ||
launchTemplate, err := p.ensureLaunchTemplate(ctx, &launchTemplateOptions{ | ||
Cluster: *provisioner.Spec.Cluster, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like the launch templates are pulled from the spec of the provisioner. Just wanted to check that we can't create different Launch Templates for the same cluster from different provisioners? Seems like based on this PR's changes, we cannot.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The cache key is the name, which is a hash of launchTemplateOptions. You can have many launch templates per cluster.
Values: []*string{aws.String(fmt.Sprintf(ClusterTagKeyFormat, clusterName))}, | ||
}}, | ||
}) | ||
if err != nil { | ||
return nil, fmt.Errorf("describing security groups with tag key %s, %w", fmt.Sprintf(ClusterTagKeyFormat, clusterName), err) | ||
} | ||
s.cache.Set(clusterName, output.SecurityGroups, CacheTTL) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We still need to filter by subnet name and group tag key after we call the AWS API though, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work! LGTM.
Issue #, if available:
Description of changes:
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.