Added several more troubleshooting issues to docs #1618

Merged on Apr 6, 2022 (6 commits).
`website/content/en/preview/troubleshooting.md` (107 additions, 1 deletion)

@@ -138,6 +138,112 @@ Provisioners created without those tags and run in more recent Karpenter version
If you are providing a [custom launch template]({{<ref "./aws/launch-templates" >}}), specifying a `subnetSelector` is still required.
However, specifying a `securityGroupSelector` will cause a validation error.

## Reapply credentials with Terraform connection failure

If you are using the [Getting Started with Terraform]({{<ref "./getting-started/getting-started-with-terraform/" >}}) instructions and you see the following error:

```text
Error: Post "http://localhost/api/v1/namespaces/kube-system/configmaps": dial tcp 127.0.0.1:80: connect: connection refused
with module.eks.kubernetes_config_map.aws_auth[0],
on .terraform/modules/eks/aws_auth.tf line 63, in resource "kubernetes_config_map" "aws_auth":
63: resource "kubernetes_config_map" "aws_auth" {
```

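You can fix the problem by exporting the path to the kubeconfig file generated for your cluster, so that both `kubectl` (`KUBECONFIG`) and the Terraform Kubernetes provider (`KUBE_CONFIG_PATH`) can find it: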

```text
export KUBECONFIG=${PWD}/kubeconfig_${CLUSTER_NAME}
export KUBE_CONFIG_PATH=$KUBECONFIG
```

Alternatively, run the following AWS CLI command to add the cluster to your kubeconfig:

```text
aws eks --region $AWS_DEFAULT_REGION update-kubeconfig --name $CLUSTER_NAME
```

Then rerun:

```text
terraform apply -var cluster_name=$CLUSTER_NAME
```
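If you take the export route, a small wrapper can catch the common case where the kubeconfig file was never written. The following is a minimal POSIX shell sketch: `use_terraform_kubeconfig` is a hypothetical helper name, and it assumes the `kubeconfig_${CLUSTER_NAME}` file sits in the current directory, as in the `export` commands above. Source it rather than executing it, so the exports reach your current shell.

```shell
# Hypothetical helper: point kubectl (KUBECONFIG) and the Terraform
# Kubernetes provider (KUBE_CONFIG_PATH) at the kubeconfig written for a
# cluster. Takes the cluster name as $1, falling back to $CLUSTER_NAME.
use_terraform_kubeconfig() {
  cfg="${PWD}/kubeconfig_${1:-$CLUSTER_NAME}"
  # Fail early if the file is missing rather than exporting a bad path
  if [ ! -f "$cfg" ]; then
    echo "kubeconfig not found: $cfg" >&2
    return 1
  fi
  export KUBECONFIG="$cfg"
  export KUBE_CONFIG_PATH="$KUBECONFIG"
}
```

For example, run `use_terraform_kubeconfig "$CLUSTER_NAME"` in the Terraform working directory, then rerun `terraform apply`.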

## Terraform fails to create instance profile when name is too long

In the Getting Started with Terraform instructions to [Configure the KarpenterNode IAM Role]({{<ref "./getting-started/getting-started-with-terraform/#configure-the-karpenternode-iam-role" >}}), the name assigned to the `aws_iam_instance_profile` cannot exceed 38 characters. If it does, `terraform apply` fails with a message similar to:

```text
Error: expected length of name_prefix to be in the range (1 - 38), got
with module.eks.aws_iam_role.cluster[0],
on .terraform/modules/eks/main.tf line 131, in resource "aws_iam_role" "cluster":
131: name_prefix = var.cluster_iam_role_name != "" ? null : var.cluster_name
```

Note that it is easy to run over the 38-character limit: the example name includes `KarpenterNodeInstanceProfile-` (29 characters) and `-karpenter-demo` (15 characters), which together already take 44 characters before your user name is added.
You can reduce the number of characters consumed by changing `KarpenterNodeInstanceProfile-` to something like `KarpenterNode-` (14 characters), which brings the fixed portion down to 29 characters and leaves nine characters for your user name.
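To catch this before running `terraform apply`, you can check the length of the name up front. This is a minimal POSIX shell sketch: `check_name_length` is a hypothetical helper, not part of Karpenter or the Terraform module, and the 38-character limit is taken from the error message above.

```shell
# Hypothetical helper: check_name_length NAME LIMIT
# Prints the length of NAME and returns 1 if it exceeds LIMIT characters.
check_name_length() {
  name="$1"
  limit="$2"
  len=${#name}
  if [ "$len" -gt "$limit" ]; then
    echo "ERROR: '$name' is $len characters (limit $limit)" >&2
    return 1
  fi
  echo "OK: '$name' is $len characters (limit $limit)"
}
```

For example, `check_name_length "KarpenterNodeInstanceProfile-${CLUSTER_NAME}" 38` fails for a hypothetical cluster name `alice-karpenter-demo` (49 characters), while `KarpenterNode-${CLUSTER_NAME}` comes to 34 characters and passes.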

## Karpenter Role names exceeding 64-character limit

If you use a tool such as AWS CDK to generate your Kubernetes cluster name, the generated name may be too long to incorporate into your KarpenterNodeRole name when you add Karpenter to your cluster, because IAM role names are limited to 64 characters.

Node role names for Karpenter are created in the form `KarpenterNodeRole-${CLUSTER_NAME}` in the [Create the KarpenterNode IAM Role]({{<ref "./getting-started/getting-started-with-eksctl/#create-the-karpenternode-iam-role" >}}) section of the getting started guide.
If a long cluster name causes the Karpenter node role name to exceed 64 characters, creating that object will fail.

Keep in mind that `KarpenterNodeRole-` is just a recommendation from the getting started guide.
Instead of using the name from the eksctl instructions, you can shorten the role name to anything you like, as long as the role has the right permissions.
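If you want a role name that still tracks the cluster name but is guaranteed to fit, one option is to truncate it. This is a minimal POSIX shell sketch assuming plain ASCII names; `node_role_name` is a hypothetical helper, not something the getting started guide provides.

```shell
# Hypothetical helper: node_role_name PREFIX CLUSTER_NAME
# Joins PREFIX and CLUSTER_NAME and truncates the result to IAM's
# 64-character role name limit, warning on stderr when it truncates.
node_role_name() {
  full="$1$2"
  if [ "${#full}" -gt 64 ]; then
    echo "warning: '$full' is ${#full} characters; truncating to 64" >&2
  fi
  # printf precision keeps at most the first 64 bytes (ASCII assumed)
  printf '%.64s\n' "$full"
}
```

For example, `node_role_name KarpenterNodeRole- "$CLUSTER_NAME"`. Keep in mind that truncation can map two long cluster names with a shared prefix to the same role name, so confirm the result is unique in your account.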

## Node terminates before ready on failed encrypted EBS volume

If you are using a custom launch template and an encrypted EBS volume, the IAM principal launching the node may not have sufficient permissions to use the KMS customer managed key (CMK) for the EBS root volume.
This issue also applies to [Block Device Mappings]({{<ref "./aws/provisioning/#block-device-mappings" >}}) specified in the Provisioner.
In either case, this results in the node terminating almost immediately upon creation.

Keep in mind that EBS encryption may be enabled without your knowledge.
For example, an account administrator may have turned on EBS encryption by default, which is a per-Region setting.
See [Encryption by default](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSEncryption.html#encryption-by-default) for details.
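To check whether this applies to you, the AWS CLI can report a Region's EBS encryption defaults. The two `aws ec2` subcommands below are real AWS CLI commands; the wrapper function name is hypothetical, and it assumes your credentials are allowed to call these EC2 APIs.

```shell
# Hypothetical helper: report whether EBS encryption by default is on in a
# Region and which KMS key new volumes use. Region is $1 or $AWS_DEFAULT_REGION.
show_ebs_encryption_defaults() {
  region="${1:-$AWS_DEFAULT_REGION}"
  if [ -z "$region" ]; then
    echo "usage: show_ebs_encryption_defaults REGION" >&2
    return 1
  fi
  # Prints True or False
  aws ec2 get-ebs-encryption-by-default --region "$region" \
    --query EbsEncryptionByDefault --output text
  # Prints the key ID, ARN, or alias that new EBS volumes use by default
  aws ec2 get-ebs-default-kms-key-id --region "$region" \
    --query KmsKeyId --output text
}
```

If the first line is `True` and the second names a customer managed key, the IAM principal launching your nodes needs permission to use that key.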

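To correct the problem, you can use the same approach that AWS EBS uses, which avoids adding particular roles to the KMS key policy. Add statements like the following to the key policy of your CMK, replacing `${REGION}` and `${AWS_ACCOUNT_ID}` with your own values: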

```json
[
  {
    "Sid": "Allow access through EBS for all principals in the account that are authorized to use EBS",
    "Effect": "Allow",
    "Principal": {
      "AWS": "*"
    },
    "Action": [
      "kms:Encrypt",
      "kms:Decrypt",
      "kms:ReEncrypt*",
      "kms:GenerateDataKey*",
      "kms:CreateGrant",
      "kms:DescribeKey"
    ],
    "Resource": "*",
    "Condition": {
      "StringEquals": {
        "kms:ViaService": "ec2.${REGION}.amazonaws.com",
        "kms:CallerAccount": "${AWS_ACCOUNT_ID}"
      }
    }
  },
  {
    "Sid": "Allow direct access to key metadata to the account",
    "Effect": "Allow",
    "Principal": {
      "AWS": "arn:aws:iam::${AWS_ACCOUNT_ID}:root"
    },
    "Action": [
      "kms:Describe*",
      "kms:Get*",
      "kms:List*",
      "kms:RevokeGrant"
    ],
    "Resource": "*"
  }
]
```
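One way to add these statements is to download the key's current policy, merge them into its `Statement` array by hand, and upload the result. This is a minimal POSIX shell sketch built on the real `aws kms get-key-policy` and `aws kms put-key-policy` commands; the helper names and the `key-policy.json` file name are hypothetical. Be careful: `put-key-policy` replaces the entire key policy, so if the uploaded file drops the statements that grant key administration, you can lose the ability to manage the key.

```shell
# Hypothetical helpers for editing a KMS key policy with the AWS CLI.
# save_key_policy KEY_ID [FILE]: write the current "default" key policy to FILE.
save_key_policy() {
  key_id="$1"
  file="${2:-key-policy.json}"
  aws kms get-key-policy --key-id "$key_id" --policy-name default \
    --query Policy --output text > "$file"
}

# apply_key_policy KEY_ID [FILE]: replace the key policy with FILE's contents.
# put-key-policy overwrites the WHOLE policy, so FILE must keep the existing
# statements (including key administrators) plus the new ones.
apply_key_policy() {
  key_id="$1"
  file="${2:-key-policy.json}"
  aws kms put-key-policy --key-id "$key_id" --policy-name default \
    --policy "file://$file"
}
```

For example, run `save_key_policy "$KEY_ID"` (where `KEY_ID` is a placeholder for your CMK's ID or ARN), add the two statements above to `key-policy.json`, then run `apply_key_policy "$KEY_ID"`.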

## Pods using Security Groups for Pods stuck in "Pending" state for up to 30 minutes before transitioning to "Running"

When leveraging [Security Groups for Pods](https://docs.aws.amazon.com/eks/latest/userguide/security-groups-for-pods.html), Karpenter will launch nodes as expected but pods will be stuck in "Pending" state for up to 30 minutes before transitioning to "Running". This is related to an interaction between Karpenter and the [amazon-vpc-resource-controller](https://github.com/aws/amazon-vpc-resource-controller-k8s) when a pod requests `vpc.amazonaws.com/pod-eni` resources. More info can be found in [issue #1252](https://github.com/aws/karpenter/issues/1252).
@@ -157,4 +263,4 @@ spec:
securityGroupSelector:
karpenter.sh/discovery: karpenter-demo
ttlSecondsAfterEmpty: 30
```