-
Notifications
You must be signed in to change notification settings - Fork 983
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
v1alpha2 API changes #477
v1alpha2 API changes #477
Conversation
4099a87
to
532b426
Compare
@@ -52,6 +52,18 @@ spec: | |||
- endpoint | |||
- name | |||
type: object | |||
expiration: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will there be a way for a user to pause this or opt out specific instances? A use case I could think of is during a big event/launch it's nice to temporarily pause these automatic termination events to prevent potential outages. It would also be nice to have a label that can be added (or removed) to skip a specific instance from expiring if you need to do troubleshooting or debugging on it. If karpenter adds a label it would be nice for a company to know which nodes will be automatically recycled based on a common label.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some sort of locking mechanism at the cluster level would also be helpful. Something that understands the health of the cluster and a percentage of healthy instances to consider before blindly terminating instances. I've been in real world situations where automation was terminating healthy instances while new, unhealthy instances were being created which was quickly degrading the health of the system.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Totally agree. I think we should respect a node.karpenter.sh/do-not-expire
label on nodes. I see this as more of a manual intervention, so I'm wary of plumbing this through the API. I'll include this with the implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something that understands the health of the cluster and a percentage of healthy instances to consider before blindly terminating instances.
See @njtran's doc on node termination budgets, which handles exactly this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Further, if you remove the expiration concept from the provisioner spec, it will never expire nodes (even if the nodes were created before this change).
charts/karpenter/templates/provisioning.karpenter.sh_provisioners.yaml
Outdated
Show resolved
Hide resolved
charts/karpenter/templates/provisioning.karpenter.sh_provisioners.yaml
Outdated
Show resolved
Hide resolved
charts/karpenter/templates/provisioning.karpenter.sh_provisioners.yaml
Outdated
Show resolved
Hide resolved
ecb9057
to
36276c4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there might be some inconsistencies with the api comments and the functionality of the underutilization controller. Did you add in the no-op when underutilization isn't defined?
@@ -63,6 +63,11 @@ codegen: ## Generate code. Must be run if changes are made to ./pkg/apis/... | |||
# CRDs don't currently jive with VolatileTime, which has an Any type. | |||
perl -pi -e 's/Any/string/g' charts/karpenter/templates/provisioning.karpenter.sh_provisioners.yaml | |||
hack/boilerplate.sh | |||
gen-crd-api-reference-docs \ | |||
-api-dir ./pkg/apis/provisioning/v1alpha2 \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a lot of files, so it's hard to grok the new file structure here. Does removing the version names from pkg/controllers directory have any effect on this?
Is there a point to keeping a previous apis
version if the pkg/controllers
folder doesn't distinguish pkg?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. My expectation is that we will only ever have one version of the controller version, but we may at some point support multiple CRD API versions. K8s uses a concept of "storage version" for CRs, and then all CRs are translated (via a map function) into that storage version. Therefore we should only ever need to maintain one version of the controller code, but we may have multiple versions of the APIs. In this case, we're not keeping v1alpha1 around, since we can backwards incompat it.
docs/aws/README.md
Outdated
@@ -107,6 +107,8 @@ spec: | |||
name: ${CLUSTER_NAME} | |||
caBundle: $(aws eks describe-cluster --name ${CLUSTER_NAME} --query "cluster.certificateAuthority.data" --output json) | |||
endpoint: $(aws eks describe-cluster --name ${CLUSTER_NAME} --query "cluster.endpoint" --output json) | |||
utilization: | |||
ttlSeconds: 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we want to suggest the ttlSecondsUtilization here to be 300, right? Setting it to 0 would cause a lot of thrashing.
This should also be a top level parameter now, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From @rothgar's PR, making this 30.
pkg/cloudprovider/aws/ami.go
Outdated
See the License for the specific language governing permissions and | ||
limitations under the License. | ||
*/ | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You added this in a previous PR, right? Do you need a rebase?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I built on a previous PR -- now that it's merged, this will disappear.
See the License for the specific language governing permissions and | ||
limitations under the License. | ||
*/ | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rebase?
See the License for the specific language governing permissions and | ||
limitations under the License. | ||
*/ | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
re: rebase?
@@ -46,16 +46,12 @@ func (c *GenericController) Reconcile(ctx context.Context, req reconcile.Request | |||
} | |||
// 2. Copy object for merge patch base | |||
persisted := resource.DeepCopyObject() | |||
// 3. Set defaults to enforce invariants on object being reconciled | |||
if _, ok := resource.(apis.Defaultable); ok { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So we added this to enforce invariants in case the webhooks were down. Are you removing this because there are no fields that need defaulting right now? In this case, shouldn't we keep this in the case where some more apis need defaulting functionality?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm actually a little wary of the defaulting logic. I decided that we should use something like ptr.Int64Value() wherever possible. This is why I implemented the knative PR.
pkg/utils/conditions/conditions.go
Outdated
See the License for the specific language governing permissions and | ||
limitations under the License. | ||
*/ | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
re: re: re: rebase
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
This reverts commit 8aab1e5.
Issue #, if available:
Description of changes:
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.