Karpenter Fails to Fall Back to On-Demand in Local Zone with InvalidFleetConfiguration Error #6483
Comments
I suspect this issue may be addressed by #5704, which was included in Karpenter v0.36.0. This PR enables use of the spot pricing API even in isolated VPCs. With information from the pricing API, Karpenter should be able to determine if the instance type is available as a spot offering in the local zone, even in an isolated VPC.
@jmdeal I just checked, and this does not fix the issue, though; Karpenter still fails when it tries to launch the node.
@jmdeal As promised, here are the node pools we currently have set up: on-demand without any weight, and spot with a weight of 50. In local zones (like chi1 and scl1) we don't seem to be able to use spot at all with this setup; it just falls back to on-demand. The on-demand node pool has no weight set, and the spot pool has a weight of 50 (see the sketch below).
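For illustration, a minimal sketch of what this two-pool setup might look like. The pool names, `nodeClassRef`, and requirements below are assumptions, not the reporter's actual manifests; only the weights reflect the setup described above.

```yaml
# Illustrative sketch only: names and requirements are assumptions.
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: spot
spec:
  weight: 50  # higher weight, so this pool is tried first
  template:
    spec:
      nodeClassRef:
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
---
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: on-demand
spec:
  # no weight set, so this pool is only used when the spot pool cannot be satisfied
  template:
    spec:
      nodeClassRef:
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
```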
In our setup, we are using Karpenter with two weighted node pools in an isolated VPC. Spot is set to a weight of 50 and on-demand to 0, so we expect spot to take priority, and this works consistently in most regions. However, one of our regions is a local zone that only has a single instance type available for node claims. We plan to expand the available instance types later, but for now we don't understand why Karpenter fails to fall back to on-demand. It repeatedly throws an InvalidFleetConfiguration error until the node claim times out after 15 minutes. After talking to AWS, we learned that this instance type isn't available as spot in that zone, although it is available as on-demand. We think Karpenter should respond to InvalidFleetConfiguration the same way it responds to an InsufficientInstanceCapacity error.
This issue appears to be similar to issue #2243. It is possible that when this issue was addressed, the fix did not account for private/isolated VPCs or local zones.
Observed Behavior:
Region: us-east-1-dfw-1a
Karpenter Version: 0.35.4
Instance Type: r5d.4xlarge
Error Message:
Launching the node claim and creating an instance fails with a fleet error: InvalidFleetConfiguration: Your requested instance type (r5d.4xlarge) is not supported in your requested Availability Zone (us-east-1-dfw-1a). Karpenter keeps trying to launch the same node claim instead of using the fallback mechanism. Note that when InsufficientInstanceCapacity is encountered in other regions, the fallback works as expected.
Expected Behavior:
Karpenter should handle InvalidFleetConfiguration errors by triggering a fallback mechanism, similar to how it handles InsufficientInstanceCapacity errors. For context, we have two node pools, one for spot and one for on-demand. We expect that if Karpenter encounters this InvalidFleetConfiguration error, it will fall back to on-demand.
Looking here, this could be a potential fix: karpenter-provider-aws/pkg/cloudprovider/aws/errors.go, lines 35 to 41 at commit 37933ea (a rough sketch of the idea follows below).
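A hedged sketch of what such a change could look like, assuming the variable and function names below (they are illustrative, not the actual contents of errors.go): treat InvalidFleetConfiguration as another unfulfillable-capacity error code so the offering is marked unavailable and scheduling can fall back to another node pool instead of retrying until the node claim times out.

```go
// Illustrative sketch only: unfulfillableCapacityErrorCodes and
// isUnfulfillableCapacity are hypothetical names, not the real errors.go API.
package main

import (
	"errors"
	"fmt"

	"github.com/aws/aws-sdk-go/aws/awserr"
)

// CreateFleet error codes that mean "this offering cannot be fulfilled right now".
// Adding InvalidFleetConfiguration here is the proposed (hypothetical) change.
var unfulfillableCapacityErrorCodes = map[string]struct{}{
	"InsufficientInstanceCapacity": {},
	"UnfulfillableCapacity":        {},
	"InvalidFleetConfiguration":    {}, // proposed addition
}

// isUnfulfillableCapacity reports whether err carries one of the codes above.
func isUnfulfillableCapacity(err error) bool {
	var awsErr awserr.Error
	if errors.As(err, &awsErr) {
		_, ok := unfulfillableCapacityErrorCodes[awsErr.Code()]
		return ok
	}
	return false
}

func main() {
	fleetErr := awserr.New("InvalidFleetConfiguration",
		"Your requested instance type (r5d.4xlarge) is not supported in your requested Availability Zone (us-east-1-dfw-1a).", nil)
	// With the code added to the set, this prints true, so the node claim could
	// fail fast and fall back instead of retrying until the 15-minute timeout.
	fmt.Println(isUnfulfillableCapacity(fleetErr))
}
```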
Supporting Information:
In the eu-central-1 region, Karpenter handles insufficient capacity errors as expected and the fallback is successful:
Region: eu-central-1
Instance Type: c7i.16xlarge
Error Message: creating instance, insufficient capacity, with fleet error(s), InsufficientInstanceCapacity: We currently do not have sufficient c7i.16xlarge capacity in zones with support for 'gp3' volumes. Our system will be working on provisioning additional capacity.
Environment Variables:
Karpenter log output:
This log is observed every minute, and after 15 minutes, the node claim times out:
And, for comparison, from a "working" node pool where it immediately falls back:
Reproduction Steps (Please include YAML):
Versions:
Chart Version: 0.35.4
Kubernetes Version: v1.28.1