Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

v1alpha2 API changes #477

Merged
merged 1 commit into from
Jun 29, 2021
Merged

v1alpha2 API changes #477

merged 1 commit into from
Jun 29, 2021

Conversation

ellistarn
Copy link
Contributor

@ellistarn ellistarn commented Jun 25, 2021

Issue #, if available:

Description of changes:

  • Introduced a concept of node Expiration (not yet implemented)
  • Utilization functionality is now disabled by default and must be specified
  • v1alpha1 is now gone, customers will be forced to upgrade if they upgrade karpenter.
  • Simplified controller folder structure

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@ellistarn ellistarn force-pushed the ttl branch 4 times, most recently from 4099a87 to 532b426 Compare June 25, 2021 17:06
@ellistarn ellistarn changed the title [WIP] v1alpha2 API changes v1alpha2 API changes Jun 25, 2021
@JacobGabrielson JacobGabrielson self-requested a review June 28, 2021 16:49
@@ -52,6 +52,18 @@ spec:
- endpoint
- name
type: object
expiration:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will there be a way for a user to pause this or opt out specific instances? A use case I could think of is during a big event/launch it's nice to temporarily pause these automatic termination events to prevent potential outages. It would also be nice to have a label that can be added (or removed) to skip a specific instance from expiring if you need to do troubleshooting or debugging on it. If karpenter adds a label it would be nice for a company to know which nodes will be automatically recycled based on a common label.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some sort of locking mechanism at the cluster level would also be helpful. Something that understands the health of the cluster and a percentage of healthy instances to consider before blindly terminating instances. I've been in real world situations where automation was terminating healthy instances while new, unhealthy instances were being created which was quickly degrading the health of the system.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Totally agree. I think we should respect a node.karpenter.sh/do-not-expire label on nodes. I see this as more of a manual intervention, so I'm wary of plumbing this through the API. I'll include this with the implementation.

Copy link
Contributor Author

@ellistarn ellistarn Jun 28, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something that understands the health of the cluster and a percentage of healthy instances to consider before blindly terminating instances.

See @njtran's doc on node termination budgets, which handles exactly this.

Copy link
Contributor Author

@ellistarn ellistarn Jun 28, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Further, if you remove the expiration concept from the provisioner spec, it will never expire nodes (even if the nodes were created before this change).

@ellistarn ellistarn force-pushed the ttl branch 3 times, most recently from ecb9057 to 36276c4 Compare June 28, 2021 21:09
Copy link
Contributor

@njtran njtran left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there might be some inconsistencies with the api comments and the functionality of the underutilization controller. Did you add in the no-op when underutilization isn't defined?

@@ -63,6 +63,11 @@ codegen: ## Generate code. Must be run if changes are made to ./pkg/apis/...
# CRDs don't currently jive with VolatileTime, which has an Any type.
perl -pi -e 's/Any/string/g' charts/karpenter/templates/provisioning.karpenter.sh_provisioners.yaml
hack/boilerplate.sh
gen-crd-api-reference-docs \
-api-dir ./pkg/apis/provisioning/v1alpha2 \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a lot of files, so it's hard to grok the new file structure here. Does removing the version names from pkg/controllers directory have any effect on this?

Is there a point to keeping a previous apis version if the pkg/controllers folder doesn't distinguish pkg?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. My expectation is that we will only ever have one version of the controller version, but we may at some point support multiple CRD API versions. K8s uses a concept of "storage version" for CRs, and then all CRs are translated (via a map function) into that storage version. Therefore we should only ever need to maintain one version of the controller code, but we may have multiple versions of the APIs. In this case, we're not keeping v1alpha1 around, since we can backwards incompat it.

@@ -107,6 +107,8 @@ spec:
name: ${CLUSTER_NAME}
caBundle: $(aws eks describe-cluster --name ${CLUSTER_NAME} --query "cluster.certificateAuthority.data" --output json)
endpoint: $(aws eks describe-cluster --name ${CLUSTER_NAME} --query "cluster.endpoint" --output json)
utilization:
ttlSeconds: 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we want to suggest the ttlSecondsUtilization here to be 300, right? Setting it to 0 would cause a lot of thrashing.

This should also be a top level parameter now, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From @rothgar's PR, making this 30.

docs/aws/examples/provisioner.yaml Outdated Show resolved Hide resolved
pkg/apis/provisioning/v1alpha2/provisioner_validation.go Outdated Show resolved Hide resolved
See the License for the specific language governing permissions and
limitations under the License.
*/

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You added this in a previous PR, right? Do you need a rebase?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I built on a previous PR -- now that it's merged, this will disappear.

See the License for the specific language governing permissions and
limitations under the License.
*/

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rebase?

See the License for the specific language governing permissions and
limitations under the License.
*/

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

re: rebase?

@@ -46,16 +46,12 @@ func (c *GenericController) Reconcile(ctx context.Context, req reconcile.Request
}
// 2. Copy object for merge patch base
persisted := resource.DeepCopyObject()
// 3. Set defaults to enforce invariants on object being reconciled
if _, ok := resource.(apis.Defaultable); ok {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we added this to enforce invariants in case the webhooks were down. Are you removing this because there are no fields that need defaulting right now? In this case, shouldn't we keep this in the case where some more apis need defaulting functionality?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm actually a little wary of the defaulting logic. I decided that we should use something like ptr.Int64Value() wherever possible. This is why I implemented the knative PR.

pkg/controllers/reallocation/utilization.go Show resolved Hide resolved
See the License for the specific language governing permissions and
limitations under the License.
*/

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

re: re: re: rebase

Makefile Show resolved Hide resolved
Copy link
Contributor

@njtran njtran left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ellistarn ellistarn merged commit 8aab1e5 into aws:main Jun 29, 2021
@ellistarn ellistarn deleted the ttl branch June 29, 2021 23:27
Copy link
Contributor

@bwagner5 bwagner5 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

ellistarn added a commit to ellistarn/karpenter-provider-aws that referenced this pull request Jun 29, 2021
ellistarn added a commit that referenced this pull request Jun 30, 2021
ellistarn added a commit to ellistarn/karpenter-provider-aws that referenced this pull request Jun 30, 2021
ellistarn added a commit that referenced this pull request Jun 30, 2021
* Revert "Revert "v1alpha2 API changes (#477)" (#490)"

This reverts commit d48734f.

* Verified changes
gfcroft pushed a commit to gfcroft/karpenter-provider-aws that referenced this pull request Nov 25, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants