feat: add cpu sustained clock speed label to instance metadata #7043

Merged

Conversation

aidan-canva
Contributor

Description
Some workloads are sensitive to variations in instance CPU clock speed, either preferring a specific threshold or at least requiring consistency across replicas. This PR adds the EC2 SustainedClockSpeedInGhz value as a Karpenter label (karpenter.k8s.aws/instance-cpu-sustained-clock-speed-mhz) so that workloads can express their preference.

The upstream value from the EC2 API is in GHz and is represented as a float (e.g. 2.4). nodeSelectors only support ints or strings, and most use cases for this will want to use the Gt or Lt operators to set minimum/maximum values. To make this usable, this implementation converts the GHz value into MHz and represents it as an int.

How was this change tested?

  • make test
  • Local deployment / manual testing

Does this change impact docs?
Do codegen docs count? website/content/en/preview/reference/instance-types.md has been updated to reflect this new instance attribute.

  • Yes, PR includes docs updates
  • Yes, issue opened: #
  • No

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.


@aidan-canva aidan-canva requested a review from a team as a code owner September 19, 2024 22:34

netlify bot commented Sep 19, 2024

Deploy Preview for karpenter-docs-prod ready!

Name Link
🔨 Latest commit e36c1a9
🔍 Latest deploy log https://app.netlify.com/sites/karpenter-docs-prod/deploys/673d3a427ebce90008158dd9
😎 Deploy Preview https://deploy-preview-7043--karpenter-docs-prod.netlify.app

@rschalo
Contributor

rschalo commented Sep 23, 2024

Thanks for your contribution! Running the test workflows.

@coveralls

coveralls commented Sep 23, 2024

Pull Request Test Coverage Report for Build 11924912359

Details

  • 5 of 5 (100.0%) changed or added relevant lines in 2 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.01%) to 82.461%

Totals Coverage Status
Change from base Build 11924046371: 0.01%
Covered Lines: 5689
Relevant Lines: 6899

💛 - Coveralls

@njtran
Contributor

njtran commented Sep 30, 2024

@aidan-canva are you able to fix the CI errors?

Contributor

This PR has been inactive for 14 days. StaleBot will close this stale PR after 14 more days of inactivity.

@aidan-canva
Contributor Author

@njtran apologies for the delay, I've been on vacation for a period. I've just pushed a fix which should hopefully address the CI failures. Can I kick checks off myself?

Contributor

This PR has been inactive for 14 days. StaleBot will close this stale PR after 14 more days of inactivity.

@anteliano

Without this feature it is very tricky to avoid slower instance types such as m5a, which runs at 2.5 GHz while the seemingly similar m5 runs at 3.1 GHz; the difference is significant for many workloads.

@aidan-canva
Contributor Author

I'm still motivated to get this PR merged. I believe it was in a mergeable state 3 weeks ago and was just waiting for a repo owner to trigger the CI checks. Since then, some merge conflicts have appeared that need to be resolved (I can do that), but it seems wasteful to do so without an indication that someone can help get this merged.

@rschalo
Contributor

rschalo commented Nov 20, 2024

Hi @aidan-canva, apologies for the delay and timing. We're working on the next minor right now, think you could have this ready for review by Thursday morning and targeting merge Thursday EOD? Also, mind removing the instance type generation from the diff for this PR? We can do a fast-follow PR.

@aidan-canva
Contributor Author

@rschalo - No worries, I appreciate it's a hectic time of year for most of AWS!

I've just merged main, cleaned up some merge conflicts, and validated that this passes tests via make test. I've also reset website/content/en/preview/reference/instance-types.md back to main to avoid further drift/conflicts.

Should be ready for a CI run now.

Contributor

@rschalo rschalo left a comment

/karpenter snapshot

Contributor

Snapshot successfully published to oci://021119463062.dkr.ecr.us-east-1.amazonaws.com/karpenter/snapshot/karpenter:0-e36c1a9db20b28fc61b941cc725f7574293c0f9d.
To install you must login to the ECR repo with an AWS account:

aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 021119463062.dkr.ecr.us-east-1.amazonaws.com

helm upgrade --install karpenter oci://021119463062.dkr.ecr.us-east-1.amazonaws.com/karpenter/snapshot/karpenter --version "0-e36c1a9db20b28fc61b941cc725f7574293c0f9d" --namespace "kube-system" --create-namespace \
  --set "settings.clusterName=${CLUSTER_NAME}" \
  --set "settings.interruptionQueue=${CLUSTER_NAME}" \
  --set controller.resources.requests.cpu=1 \
  --set controller.resources.requests.memory=1Gi \
  --set controller.resources.limits.cpu=1 \
  --set controller.resources.limits.memory=1Gi \
  --wait

@rschalo
Contributor

rschalo commented Nov 21, 2024

Two tests failing from snapshot:

  • [It] should delete pod with do-not-disrupt when it reaches its terminationGracePeriodSeconds which passes locally so appears to be a flake.
  • [It] should provision nodes for a deployment that requests vpc.amazonaws.com/pod-eni (security groups for pods) also seems to be a flake based on the instance type it launches. Will verify again in AM PST.

@aidan-canva
Contributor Author

I looked at a few recently merged PRs and some of them also have these tests failing. Looks like a flake, as you suggested.

Contributor

@rschalo rschalo left a comment

LGTM. We'll need a fast-follow for the instance types and scheduling docs. Thanks again for the contribution!

@rschalo rschalo merged commit c3e7098 into aws:main Nov 21, 2024
30 of 32 checks passed
edibble21 pushed a commit to edibble21/karpenter-provider-aws that referenced this pull request Dec 2, 2024

5 participants