
Helm not installing? #28

Closed · mbonig opened this issue Apr 28, 2022 · 21 comments


mbonig commented Apr 28, 2022

OK, I don't think this is exactly an issue with your construct, but I'm seeing a wild problem after installing the latest version. I've set it up to deploy to a cluster, and in the EKS provider logs I can see it running the command:

Running command: ['kubectl', 'apply', '--kubeconfig', '/tmp/kubeconfig', '-f', '/tmp/manifest.yaml', '--prune', '-l', 'aws.cdk.eks/prune-c85936281e947e3c6cf66002393da6e780f6ed634e']

b'provisioner.karpenter.sh/default configured\n'

But when I check the cluster, I see no pods, no deployments, no replicasets. Helm doesn't show anything deployed to the karpenter namespace.

Any ideas why this could be happening?


robertd commented Apr 28, 2022 via email


mbonig commented Apr 29, 2022

Nope, 1.21


robertd commented Apr 29, 2022

Would you mind sharing your CDK code snippet for cluster and Karpenter configuration?


robertd commented Apr 29, 2022

And which version of CDK are you using?


mbonig commented Apr 29, 2022

2.22.0


mbonig commented Apr 29, 2022

> Would you mind sharing your CDK code snippet for cluster and Karpenter configuration?

    const karpenter = new Karpenter(this, 'Karpenterv2', {
      vpc: this.props.vpc,
      cluster: this.cluster,
    });
    karpenter.addProvisioner('default', {
      requirements: {
        archTypes: [ArchType.AMD64],
        capacityTypes: [KarpenterCapcityType.SPOT],
        instanceTypes: [
          InstanceType.of(InstanceClass.T3A, InstanceSize.SMALL),
          InstanceType.of(InstanceClass.C5A, InstanceSize.SMALL),
          InstanceType.of(InstanceClass.T3A, InstanceSize.MEDIUM),
          InstanceType.of(InstanceClass.T3A, InstanceSize.LARGE),
          InstanceType.of(InstanceClass.M5A, InstanceSize.LARGE),
        ],
      },
    });

for the cluster:

    this.cluster = new Cluster(this, 'Cluster', {
      version: KubernetesVersion.V1_21,
      clusterName: this.props.clusterName,
      vpc: this.props.vpc,
      defaultCapacity: this.props.clusterName === 'prod' ? 3 : 1,
      defaultCapacityInstance: InstanceType.of(InstanceClass.T3A, InstanceSize.SMALL),
      endpointAccess: EndpointAccess.PRIVATE,
    });


robertd commented Apr 29, 2022

@mbonig Can you check the contents of this file in your cdk app (node_modules/aws-cdk-lib/lambda-layer-kubectl/layer/Dockerfile) and see which versions the KUBECTL_VERSION and HELM_VERSION variables point to?


mbonig commented Apr 29, 2022

1.20.0 and 3.8.1


mbonig commented Apr 29, 2022

OK, I think I might be seeing the issue... looking at the manifest that's getting deployed, it contains this prune label:

\"labels\":{\"aws.cdk.eks/prune-c85936281e947e3c6cf66002393da6e780f6ed634e\":\"\"}}

and when looking at the command being run against the server:

'--prune', '-l', 'aws.cdk.eks/prune-c85936281e947e3c6cf66002393da6e780f6ed634e'

I'm not familiar with the prune flag, but reading the docs it sounds like maybe it's pruning the resource it's deploying?
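For context, kubectl apply --prune -l <label> deletes previously applied objects that carry the label but are missing from the current manifest; it shouldn't remove the resources in the manifest being applied. If pruning needs to be ruled out, the EKS Cluster construct exposes a prune flag. A minimal sketch, assuming aws-cdk-lib v2:

    import { Cluster, KubernetesVersion } from 'aws-cdk-lib/aws-eks';

    // Sketch only: with prune disabled, CDK applies manifests without the
    // aws.cdk.eks/prune-* label and without passing --prune to kubectl apply.
    const cluster = new Cluster(this, 'Cluster', {
      version: KubernetesVersion.V1_21,
      prune: false,
    });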


mbonig commented Apr 29, 2022

Hmm, actually I'm seeing something different now. I tried taking the manifest I see in my logs and applying it manually, and I'm getting an invalid-format error. Isn't this manifest generated by Helm?

"[{\"apiVersion\":\"karpenter.sh/v1alpha5\",\"kind\":\"Provisioner\",\"metadata\":{\"name\":\"default\",\"labels\":{\"aws.cdk.eks/prune-c85936281e947e3c6cf66002393da6e780f6ed634e\":\"\"}},\"spec\":{\"requirements\":[{\"key\":\"karpenter.sh/capacity-type\",\"operator\":\"In\",\"values\":[\"spot\"]},{\"key\":\"kubernetes.io/arch\",\"operator\":\"In\",\"values\":[\"amd64\"]},{\"key\":\"topology.kubernetes.io/zone\",\"operator\":\"In\",\"values\":[\"us-east-1a\",\"us-east-1b\",\"us-east-1c\"]},{\"key\":\"node.kubernetes.io/instance-type\",\"operator\":\"In\",\"values\":[\"t3a.small\",\"t3a.medium\",\"t3a.large\",\"m5a.medium\"]}],\"labels\":{\"cluster-name\":\"qa5\"},\"provider\":{\"subnetSelector\":{\"karpenter.sh/discovery/qa5\":\"*\"},\"securityGroupSelector\":{\"kubernetes.io/cluster/qa5\":\"owned\"},\"instanceProfile\":\"Karpenterv2InstanceProfileD53B9029\"}}}]"


robertd commented Apr 29, 2022

I'm getting a different error:

2022-04-29T15:25:02.660Z	ERROR	controller.provisioning	Launching node, creating cloud provider machine, with fleet error(s), InvalidParameterValue: Value (karpenterInstanceProfile13C1F80D) for parameter iamInstanceProfile.name is invalid. Invalid IAM Instance Profile name	{"commit": "9b1f078", "provisioner": "spgprovisioner"}

I wonder if Karpenter itself expects the KarpenterNodeInstanceProfile- prefix.

I'll have to troubleshoot this further (and possibly revert instance profile name changes) to see what's going on.


robertd commented Apr 29, 2022

> Hmm, actually I'm seeing something different now. I tried taking the manifest I see in my logs and applying it manually, and I'm getting an invalid-format error. Isn't this manifest generated by Helm?
>
> "[{\"apiVersion\":\"karpenter.sh/v1alpha5\",\"kind\":\"Provisioner\",\"metadata\":{\"name\":\"default\",\"labels\":{\"aws.cdk.eks/prune-c85936281e947e3c6cf66002393da6e780f6ed634e\":\"\"}},\"spec\":{\"requirements\":[{\"key\":\"karpenter.sh/capacity-type\",\"operator\":\"In\",\"values\":[\"spot\"]},{\"key\":\"kubernetes.io/arch\",\"operator\":\"In\",\"values\":[\"amd64\"]},{\"key\":\"topology.kubernetes.io/zone\",\"operator\":\"In\",\"values\":[\"us-east-1a\",\"us-east-1b\",\"us-east-1c\"]},{\"key\":\"node.kubernetes.io/instance-type\",\"operator\":\"In\",\"values\":[\"t3a.small\",\"t3a.medium\",\"t3a.large\",\"m5a.medium\"]}],\"labels\":{\"cluster-name\":\"qa5\"},\"provider\":{\"subnetSelector\":{\"karpenter.sh/discovery/qa5\":\"*\"},\"securityGroupSelector\":{\"kubernetes.io/cluster/qa5\":\"owned\"},\"instanceProfile\":\"Karpenterv2InstanceProfileD53B9029\"}}}]"

This is generated by the snippet below... I'm not sure when the prune flag gets added.

    this.karpenterHelmChart = new HelmChart(this, 'HelmChart', {
      chart: 'karpenter',
      createNamespace: true,
      version: '0.9.0',
      cluster: this.cluster,
      namespace: 'karpenter',
      release: 'karpenter',
      repository: 'https://charts.karpenter.sh',
      timeout: Duration.minutes(15),
      wait: true,
      values: {
        clusterName: this.cluster.clusterName,
        clusterEndpoint: this.cluster.clusterEndpoint,
        serviceAccount: {
          annotations: {
            'eks.amazonaws.com/role-arn': this.karpenterControllerRole.roleArn,
          },
        },
        aws: {
          // instanceProfile is created using L1 construct (CfnInstanceProfile), thus we're referencing logicalId directly 
          // TODO: revisit this when L2 InstanceProfile construct is released
          defaultInstanceProfile: this.instanceProfile.logicalId,
        },
      },
    });
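One guess about where the prune label enters the picture (the internals of addProvisioner are assumed here, not quoted from cdk-karpenter): the chart itself goes through HelmChart, while the Provisioner is applied as a raw Kubernetes manifest, and it's that path that stamps the aws.cdk.eks/prune-* label and drives the --prune flag. Roughly:

    // Hypothetical sketch of how the Provisioner is likely applied under the hood;
    // cluster.addManifest() creates a KubernetesManifest, which is the construct
    // that adds the aws.cdk.eks/prune-* label seen in the provider logs.
    this.cluster.addManifest('DefaultProvisioner', {
      apiVersion: 'karpenter.sh/v1alpha5',
      kind: 'Provisioner',
      metadata: { name: 'default' },
      spec: {
        requirements: [
          { key: 'karpenter.sh/capacity-type', operator: 'In', values: ['spot'] },
        ],
        // provider, labels, etc. as configured via addProvisioner()
      },
    });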


mbonig commented Apr 29, 2022

When I compare that manifest to what Helm templates locally, I get a very different set of resources.

I think I have an idea of what could have happened. I had the original Karpenter deployed, and it took care of all the CRDs. Then I deployed 'v2', which only tries to add the provisioner, since all the other CRDs already existed. Now it's trying to deploy, but all the CRDs and other controllers and deployments are gone, so the Provisioner resource alone isn't enough to succeed.

Maybe??


robertd commented Apr 29, 2022

First of all... thanks for helping me troubleshoot the issue. I think we have two different issues here. I think I'll have to revert #27, because logicalId is not the instance profile name. I believe an L2 construct for InstanceProfile is needed to enable automatic name generation down the road (that shouldn't be too hard to implement in CDK).
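If it helps, for AWS::IAM::InstanceProfile the CloudFormation Ref resolves to the generated instance profile name, so one possible shape of the fix (a sketch under that assumption, not the actual patch) is to pass the L1 construct's ref rather than its logicalId:

    import { CfnInstanceProfile, Role, ServicePrincipal } from 'aws-cdk-lib/aws-iam';

    // Sketch only: Ref on AWS::IAM::InstanceProfile resolves to the physical instance
    // profile name, whereas logicalId is just the template-local identifier, which
    // EC2 Fleet rejects as an invalid iamInstanceProfile.name.
    const nodeRole = new Role(this, 'KarpenterNodeRole', {
      assumedBy: new ServicePrincipal('ec2.amazonaws.com'),
    });
    const instanceProfile = new CfnInstanceProfile(this, 'KarpenterInstanceProfile', {
      roles: [nodeRole.roleName],
    });
    // ...and in the Helm chart values:
    // aws: { defaultInstanceProfile: instanceProfile.ref }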


robertd commented Apr 29, 2022

Also... as a rule of thumb... when I update my clusters I tend to comment out cdk-karpenter and provisioner sections, deploy and then un-comment them and apply. From time to time manifest propagation on the EKS cluster through CDK can be finicky. I wonder if introducing cdk8s to this construct to handle cluster related operations would be a better fit.
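A sketch of one way to get the same two-step redeploy without literally commenting code out (the context key here is hypothetical, not something cdk-karpenter provides): gate the construct behind a CDK context flag and flip it between deploys.

    // Hypothetical flag: `cdk deploy -c deployKarpenter=false` removes Karpenter and its
    // provisioners; a second deploy with `-c deployKarpenter=true` re-applies them.
    if (this.node.tryGetContext('deployKarpenter') !== 'false') {
      const karpenter = new Karpenter(this, 'Karpenterv2', {
        vpc: this.props.vpc,
        cluster: this.cluster,
      });
      karpenter.addProvisioner('default', { /* requirements as before */ });
    }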


robertd commented Apr 29, 2022

A few other questions (some of them unrelated):

  • endpointAccess: EndpointAccess.PRIVATE ... my EKS cluster is using EndpointAccess.PUBLIC_AND_PRIVATE (the default), and I'm not sure if this has anything to do with it.
  • KarpenterCapcityType.SPOT vs CapacityType.SPOT. Just curious, why are you creating your own enum for this?


mbonig commented Apr 29, 2022

> A few other questions (some of them unrelated):
>
>   • endpointAccess: EndpointAccess.PRIVATE ... my EKS cluster is using EndpointAccess.PUBLIC_AND_PRIVATE (the default), and I'm not sure if this has anything to do with it.
>   • KarpenterCapcityType.SPOT vs CapacityType.SPOT. Just curious, why are you creating your own enum for this?

If there was an API connection issue I don't think it'd make it as far in the process as it does.

As for the CapacityType:

import { ArchType, CapacityType as KarpenterCapcityType, Karpenter } from 'cdk-karpenter';

to avoid a name collision with the ECS enum


mbonig commented Apr 29, 2022

> Also... as a rule of thumb... when I update my clusters I tend to comment out cdk-karpenter and provisioner sections, deploy and then un-comment them and apply. From time to time manifest propagation on the EKS cluster through CDK can be finicky. I wonder if introducing cdk8s to this construct to handle cluster related operations would be a better fit.

OK, I'll try this. I was hoping for an easier way forward, because this code will be deployed out to a lot of environments and I don't want reduced capacity along the way. I assume if I delete Karpenter it won't necessarily delete the underlying nodes and remove capacity.


mbonig commented Apr 29, 2022

OK, well now I have a whole new problem. Should I open a new issue for it?

From my Karpenter pod:

Launching node, creating cloud provider machine, with fleet error(s), InvalidParameterValue: Value (Karpenterv2InstanceProfileD53B9029) for parameter iamInstanceProfile.name is invalid. Invalid IAM Instance Profile name    {"commit": "9b1f078", "provisioner": "default"}

Looking in IAM, I do see the instance profile with the expected ARN:

arn:aws:iam::<redacted>:instance-profile/qa5-Cluster-Karpenterv2InstanceProfileD53B9029-aKGFWFXKxtOj


robertd commented Apr 29, 2022

@mbonig Just pinged you on Slack.


robertd commented May 1, 2022

Closing this for now. Reopen if needed.

robertd closed this as completed May 1, 2022