fix: Fix InstanceType cache invalidation on ICE eviction #5839

Merged
merged 1 commit into from
Mar 13, 2024

Conversation

jonathan-innis
Contributor

@jonathan-innis jonathan-innis commented Mar 12, 2024

Fixes #N/A

Description

This change fixes a bug that causes us to keep ICEd offerings (offerings that hit an insufficient capacity error) in the cache beyond their unavailability cache interval: up to 8m with the current code (3m unavailable offering expiry + 5m instance type expiry). The current code handles the addition of unavailable offerings to the cache, but it doesn't handle their eviction from the cache. This update ensures that cache eviction also increments the sequence number, so the cached instance types are recomputed as soon as an offering becomes available again.

This also lowers the cleanup interval for the cache so that we react more quickly to this eviction.

Before PR (no cache eviction handling) - ~4.5m re-availability

{"level":"INFO","time":"2024-03-12T07:02:54.767Z","logger":"controller.interruption","message":"initiating delete from interruption message","commit":"1d7f91c-dirty","queue":"joinnis-karpenter-demo","messageKind":"SpotInterruptionKind","nodeclaim":"default-7qs7l","action":"CordonAndDrain","node":"ip-192-168-133-190.us-west-2.compute.internal"}
{"level":"INFO","time":"2024-03-12T07:02:54.992Z","logger":"controller.node.termination","message":"got new cache key","commit":"1d7f91c-dirty","node":"ip-192-168-133-190.us-west-2.compute.internal","id":"i-05671079e1e807cc0","key":"1-1-1-55e10a19da109e52-a8c7f832281a39c5-0000000000000000-AL2"}
{"level":"DEBUG","time":"2024-03-12T07:03:22.176Z","logger":"controller.disruption","message":"discovered subnets","commit":"1d7f91c-dirty","subnets":["subnet-01095f4c202083b01 (us-west-2a)","subnet-01a75589aa6237be1 (us-west-2b)","subnet-060f9fb17941d125c (us-west-2c)","subnet-0e60f26cf52c5a6df (us-west-2c)","subnet-0c89532267e680e6f (us-west-2a)","subnet-09d3d3e0bd7f68ee9 (us-west-2b)"]}
{"level":"INFO","time":"2024-03-12T07:03:22.176Z","logger":"controller.disruption","message":"got new cache key","commit":"1d7f91c-dirty","key":"1-1-1-6f0db7749c51be05-a8c7f832281a39c5-0000000000000000-AL2"}
{"level":"INFO","time":"2024-03-12T07:06:22.521Z","logger":"controller.disruption","message":"got new cache key","commit":"1d7f91c-dirty","key":"1-1-1-92036e12e424db69-a8c7f832281a39c5-0000000000000000-AL2"}
{"level":"INFO","time":"2024-03-12T07:06:25.991Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"1d7f91c-dirty","pods":"default/inflate-6cf94b79ff-jqhxw","duration":"9.652202ms"}
{"level":"INFO","time":"2024-03-12T07:06:25.991Z","logger":"controller.provisioner","message":"computed new nodeclaim(s) to fit pod(s)","commit":"1d7f91c-dirty","nodeclaims":1,"pods":1}
{"level":"INFO","time":"2024-03-12T07:06:26.019Z","logger":"controller.provisioner","message":"created nodeclaim","commit":"1d7f91c-dirty","nodepool":"default","nodeclaim":"default-t6nxg","requests":{"cpu":"1150m","memory":"100Mi","pods":"4"},"instance-types":"c5.large"}

After PR (with cache eviction handling) - ~3m re-availability

{"level":"DEBUG","time":"2024-03-12T07:34:23.039Z","logger":"controller.interruption","message":"removing offering from offerings","commit":"4b5d79b-dirty","queue":"joinnis-karpenter-demo","messageKind":"SpotInterruptionKind","nodeclaim":"default-dzrj9","action":"CordonAndDrain","node":"ip-192-168-42-5.us-west-2.compute.internal","reason":"SpotInterruptionKind","instance-type":"c5.xlarge","zone":"us-west-2a","capacity-type":"spot","ttl":"3m0s"}
{"level":"INFO","time":"2024-03-12T07:34:23.061Z","logger":"controller.interruption","message":"initiating delete from interruption message","commit":"4b5d79b-dirty","queue":"joinnis-karpenter-demo","messageKind":"SpotInterruptionKind","nodeclaim":"default-dzrj9","action":"CordonAndDrain","node":"ip-192-168-42-5.us-west-2.compute.internal"}
{"level":"INFO","time":"2024-03-12T07:34:23.297Z","logger":"controller.node.termination","message":"got new cache key","commit":"4b5d79b-dirty","node":"ip-192-168-42-5.us-west-2.compute.internal","id":"i-05aa6eb54c305cb52","key":"1-1-2-ed61da4234eb805d-a8c7f832281a39c5-0000000000000000--AL2-cbf29ce484222325-0000000000000000-0000000000000000"}
{"level":"INFO","time":"2024-03-12T07:37:31.167Z","logger":"controller.disruption","message":"got new cache key","commit":"4b5d79b-dirty","key":"1-1-4-ed61da4234eb805d-a8c7f832281a39c5-0000000000000000--AL2-cbf29ce484222325-0000000000000000-0000000000000000"}
{"level":"INFO","time":"2024-03-12T07:37:34.979Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"4b5d79b-dirty","pods":"default/inflate-6cf94b79ff-mgwm5","duration":"19.0521ms"}
{"level":"INFO","time":"2024-03-12T07:37:34.979Z","logger":"controller.provisioner","message":"computed new nodeclaim(s) to fit pod(s)","commit":"4b5d79b-dirty","nodeclaims":1,"pods":1}

How was this change tested?

make presubmit
/karpenter snapshot

Does this change impact docs?

  • Yes, PR includes docs updates
  • Yes, issue opened: #
  • No

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@jonathan-innis jonathan-innis requested a review from a team as a code owner March 12, 2024 07:25
@jonathan-innis jonathan-innis requested a review from bwagner5 March 12, 2024 07:25

netlify bot commented Mar 12, 2024

Deploy Preview for karpenter-docs-prod ready!

Name Link
🔨 Latest commit be35df0
🔍 Latest deploy log https://app.netlify.com/sites/karpenter-docs-prod/deploys/65f09e0ab17cc90008fe9f93
😎 Deploy Preview https://deploy-preview-5839--karpenter-docs-prod.netlify.app


coveralls commented Mar 12, 2024

Pull Request Test Coverage Report for Build 8253944779

Details

  • 17 of 17 (100.0%) changed or added relevant lines in 2 files are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage decreased (-0.05%) to 82.312%

| Files with Coverage Reduction | New Missed Lines | % |
| --- | --- | --- |
| pkg/providers/amifamily/ami.go | 1 | 90.32% |

| Totals | Coverage Status |
| --- | --- |
| Change from base Build 8253107821 | -0.05% |
| Covered Lines | 5291 |
| Relevant Lines | 6428 |

💛 - Coveralls

Contributor Author

@jonathan-innis jonathan-innis left a comment

/karpenter snapshot

Contributor

Snapshot successfully published to oci://021119463062.dkr.ecr.us-east-1.amazonaws.com/karpenter/snapshot/karpenter:0-4b5d79bd987f1da04dd4568a22ec584783fd303a.
To install you must login to the ECR repo with an AWS account:

aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 021119463062.dkr.ecr.us-east-1.amazonaws.com

helm upgrade --install karpenter oci://021119463062.dkr.ecr.us-east-1.amazonaws.com/karpenter/snapshot/karpenter --version "0-4b5d79bd987f1da04dd4568a22ec584783fd303a" --namespace "kube-system" --create-namespace \
  --set "settings.clusterName=${CLUSTER_NAME}" \
  --set "settings.interruptionQueue=${CLUSTER_NAME}" \
  --set controller.resources.requests.cpu=1 \
  --set controller.resources.requests.memory=1Gi \
  --set controller.resources.limits.cpu=1 \
  --set controller.resources.limits.memory=1Gi \
  --wait

@jonathan-innis jonathan-innis force-pushed the fix-ice-cache-invalidation branch from 4b5d79b to be35df0 Compare March 12, 2024 18:25
Contributor

@jmdeal jmdeal left a comment

LGTM 🚀

@jonathan-innis jonathan-innis merged commit 982f2eb into aws:main Mar 13, 2024
17 checks passed