CA does not work properly while using AWS EC2 IMDSv2 only in EKS #3592
Comments
Got hit with this too, EKS 1.17 |
We worked around this issue by injecting the |
I was not able to work around this issue by injecting
Error log / behavior with IMDSv2 [token required]:
I1130 21:13:10.946968 1 aws_cloud_provider.go:371] Successfully load 392 EC2 Instance Types [...truncated...]
E1130 21:13:14.176281 1 aws_manager.go:262] Failed to regenerate ASG cache: cannot autodiscover ASGs: NoCredentialProviders: no valid providers in chain. Deprecated.
For verbose messaging see aws.Config.CredentialsChainVerboseErrors
F1130 21:13:14.176302 1 aws_cloud_provider.go:376] Failed to create AWS Manager: cannot autodiscover ASGs: NoCredentialProviders: no valid providers in chain. Deprecated.
For verbose messaging see aws.Config.CredentialsChainVerboseErrors
Here's our cluster-autoscaler helm release [chart v9.1.0 setting
resource "helm_release" "cluster_autoscaler" {
  depends_on = [
    module.eks, # Wait for cluster to be ready
  ]
  repository = "https://kubernetes.github.io/autoscaler"
  chart      = "cluster-autoscaler"
  version    = "9.1.0"
  name       = "cluster-autoscaler"
  namespace  = "kube-system"
  values = [
    # Values set from terraform outputs
    <<EOL
awsRegion: ${module.eks.cluster_region}
autoDiscovery:
  clusterName: ${module.eks.cluster_name}
EOL
    ,
    # Workaround issue with IMDSv2
    # Inject AWS_DEFAULT_REGION into environment
    # https://github.com/kubernetes/autoscaler/issues/3592
    <<EOL
extraEnv:
  AWS_DEFAULT_REGION: ${module.eks.cluster_region}
EOL
    ,
  ] # End helm_release.values[]
}
and resulting pod description -- AWS_REGION is already set from the chart:
Name: cluster-autoscaler-aws-cluster-autoscaler-c4b7bdd58-cm2d2
Namespace: kube-system
Priority: 0
Node: ip-10-100-1-57.us-west-2.compute.internal/10.100.1.57
Start Time: Mon, 30 Nov 2020 13:06:38 -0800
Labels: app.kubernetes.io/instance=cluster-autoscaler
app.kubernetes.io/name=aws-cluster-autoscaler
pod-template-hash=c4b7bdd58
Annotations: kubernetes.io/psp: eks.privileged
Status: Running
IP: 10.100.0.110
IPs:
IP: 10.100.0.110
Controlled By: ReplicaSet/cluster-autoscaler-aws-cluster-autoscaler-c4b7bdd58
Containers:
aws-cluster-autoscaler:
Container ID: docker://f91c44b21712ebcf385dfd687c5631dd44ceeb76d25afb765e6b9a5cfc43f96c
Image: us.gcr.io/k8s-artifacts-prod/autoscaling/cluster-autoscaler:v1.18.1
Image ID: docker-pullable://us.gcr.io/k8s-artifacts-prod/autoscaling/cluster-autoscaler@sha256:1f5b11617389b8e4ce15eb45fdbbfd4321daeb63c234d46533449ab780b6ca9a
Port: 8085/TCP
Host Port: 0/TCP
Command:
./cluster-autoscaler
--cloud-provider=aws
--namespace=kube-system
--node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/kg-cet-917-staging-us-west-2
--logtostderr=true
--stderrthreshold=info
--v=4
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 255
Started: Mon, 30 Nov 2020 13:10:10 -0800
Finished: Mon, 30 Nov 2020 13:10:16 -0800
Ready: False
Restart Count: 5
Liveness: http-get http://:8085/health-check delay=0s timeout=1s period=10s #success=1 #failure=3
Environment:
AWS_REGION: us-west-2
AWS_DEFAULT_REGION: us-west-2
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from cluster-autoscaler-aws-cluster-autoscaler-token-dlxmc (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
cluster-autoscaler-aws-cluster-autoscaler-token-dlxmc:
Type: Secret (a volume populated by a Secret)
SecretName: cluster-autoscaler-aws-cluster-autoscaler-token-dlxmc
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 4m43s default-scheduler Successfully assigned kube-system/cluster-autoscaler-aws-cluster-autoscaler-c4b7bdd58-cm2d2 to ip-10-100-1-57.us-west-2.compute.internal
Normal Pulling 4m42s kubelet Pulling image "us.gcr.io/k8s-artifacts-prod/autoscaling/cluster-autoscaler:v1.18.1"
Normal Pulled 4m40s kubelet Successfully pulled image "us.gcr.io/k8s-artifacts-prod/autoscaling/cluster-autoscaler:v1.18.1"
Warning BackOff 2m52s (x9 over 4m10s) kubelet Back-off restarting failed container
Normal Created 2m38s (x5 over 4m40s) kubelet Created container aws-cluster-autoscaler
Normal Started 2m38s (x5 over 4m39s) kubelet Started container aws-cluster-autoscaler
Normal Pulled 2m38s (x4 over 4m16s) kubelet Container image "us.gcr.io/k8s-artifacts-prod/autoscaling/cluster-autoscaler:v1.18.1" already present on machine
|
I was not able to work around this issue by injecting
Also, there are other issues (#3276, #3216) related to loading the instance type list from the pricing API. Thus, I upgraded to the latest version, 1.20, and added
Here are the error log messages with IMDSv2 [token required]:
Rollback to the worker node with IMDSv1.
|
Hi contributors @mwielgus @losipiuk @aleksandra-malinowska @bskiba. This is preventing EKS clusters from being moved to IMDSv2-only nodes, so can this issue be prioritized? I suspect CA does not use token-backed sessions to access IMDS. The CA pod keeps getting OOM-killed and ends up in CrashLoopBackOff. Thank you. |
It appears there are multiple symptoms here.
My guess is that (1) is a spurious error, but it's difficult to tell. @hans72118, can you follow up with your memory settings? I'll take a look at how IMDSv2 works and what the path forward is here to make sure CAS can use these tokens. |
It looks like there's some custom imds logic here https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/aws/aws_util.go#L77. It's not clear why we don't rely on https://docs.aws.amazon.com/sdk-for-go/api/aws/ec2metadata/#EC2Metadata.GetMetadata |
It should be possible to skip this logic by using autoscaler/cluster-autoscaler/main.go Line 175 in 43ab030
Alternatively, it should be possible to skip by including the
@focaaby, it's not clear from your logs or describe pods output that this wasn't working for you. It looks like the CA started up normally and populated all listers/watchers? |
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with Send feedback to sig-contributor-experience at kubernetes/community. |
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with Send feedback to sig-contributor-experience at kubernetes/community. |
We are still experiencing this error:
eks: v1.19.13-eks-8df270
Update: using this terraform snippet in module: https://github.com/terraform-aws-modules/terraform-aws-eks/blob/master/modules/self-managed-node-group/main.tf#L186 works fine:
|
We found this issue after setting HttpTokens to required on the EC2 instance for the k8s node. We found this note here:
So we updated the "HttpPutResponseHopLimit" to 2 and it is working now. |
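For anyone managing nodes with Terraform, the hop-limit fix described above could look roughly like the sketch below. It assumes self-managed nodes created from an `aws_launch_template`; the resource name `workers` and the `name_prefix` value are placeholders, not taken from this thread. The likely reason a limit of 2 helps is that a pod on the container network sits one network hop further from IMDS than the host does, so with a limit of 1 the token response never reaches the pod.

```hcl
# Hypothetical launch template for worker nodes; names are placeholders.
resource "aws_launch_template" "workers" {
  name_prefix = "eks-workers-"

  metadata_options {
    # Keep the instance metadata endpoint reachable.
    http_endpoint = "enabled"

    # Enforce IMDSv2: every metadata request must carry a session token.
    http_tokens = "required"

    # Allow one extra hop so pods that are not on the host network
    # can still receive the token response.
    http_put_response_hop_limit = 2
  }
}
```

Nodes that are already running keep their old metadata options, so they would need to be replaced or have their options modified on the instance before the change takes effect.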
We just ran into this option after disabling IMDSv1 but we've had to set |
This doesn't seem to work with
|
Recently, AWS EKS added support for EC2 Instance Metadata Service v2 (IMDSv2).
In my testing environment, I created a worker node with IMDSv2 only, which requires token-backed sessions to access IMDS.
However, under this condition, CA does not seem to be able to unmarshal the IMDS response.
Checking the CA pod, it keeps getting OOM-killed and ends up in CrashLoopBackOff.
If I switch back to IMDSv1, it works without issue, as follows:
I suspect CA does not use token-backed sessions to access IMDS.