
Post Install Hook doesn't ever sync via ArgoCD #6765

Closed
danielloader opened this issue Aug 15, 2024 · 36 comments
Labels
bug Something isn't working triage/accepted Indicates that the issue has been accepted as a valid issue

Comments

@danielloader

danielloader commented Aug 15, 2024

Description

Observed Behavior:

The chart's post-install hook Job sets ttlSecondsAfterFinished: 0, which, in conjunction with this ArgoCD issue argoproj/argo-cd#6880 (comment), means the hook never syncs and installation via ArgoCD isn't possible.

Expected Behavior:

Chart syncs via ArgoCD installation.

Reproduction Steps (Please include YAML):

Install the karpenter chart via ArgoCD and default values.
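For reproduction, a minimal Application along these lines should be representative (name, project, and destination are illustrative; the OCI repo is the public Karpenter registry):

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: karpenter               # illustrative name
  namespace: argocd
spec:
  project: default
  destination:
    server: https://kubernetes.default.svc
    namespace: kube-system
  source:
    chart: karpenter
    repoURL: public.ecr.aws/karpenter   # OCI Helm repo
    targetRevision: 1.0.0
    helm:
      releaseName: karpenter            # default values, as in the report
  syncPolicy:
    syncOptions:
      - CreateNamespace=true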

Versions:

  • Chart Version: 1.0.0
  • Kubernetes Version (kubectl version): 1.30.0
  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@danielloader danielloader added bug Something isn't working needs-triage Issues that need to be triaged labels Aug 15, 2024
@Vinaum8

Vinaum8 commented Aug 15, 2024

I installed karpenter in a separate namespace and with a different service name than the default.

I can't configure the values for the webhook and it returns the error:

service: karpenter not found.

However, it is obvious that I did not create the service with this name in the kube-system namespace.

I am using version 0.37.1 and I installed karpenter with ArgoCD.

@engedaam engedaam added triage/accepted Indicates that the issue has been accepted as a valid issue and removed needs-triage Issues that need to be triaged labels Aug 15, 2024
@inclemenstv

I had the same bug when I tried to install Karpenter using Argo cd
Chart Version: 1.0.0

@snieg

snieg commented Aug 16, 2024

I had the same bug when I tried to install Karpenter using Argo cd Chart Version: 1.0.0

Same here. As a temporary solution, I had to edit each CRD and change the namespace to the proper one.


The CRDs from the chart refer to these files, and it looks like the namespace is hardcoded to kube-system:

https://github.com/aws/karpenter-provider-aws/blob/main/pkg/apis/crds/karpenter.k8s.aws_ec2nodeclasses.yaml#L1342

@inclemenstv

inclemenstv commented Aug 16, 2024

I had the same bug when I tried to install Karpenter using Argo cd Chart Version: 1.0.0

Same here. As a temporary solution, I had to edit each CRD and change the namespace to the proper one.


Yes, I have the same diff. Did you install the karpenter-crd chart separately from the main karpenter chart? And did you edit the CRDs manually?

@snieg

snieg commented Aug 16, 2024

Yes, I have the same diff. Did you install the karpenter-crd chart separately from the main karpenter chart? And did you edit the CRDs manually?

I installed it from the main chart and edited manually.

@Vinaum8

Vinaum8 commented Aug 16, 2024

Yes, I have the same diff. Did you install the karpenter-crd chart separately from the main karpenter chart? And did you edit the CRDs manually?

I installed it from the main chart and edited manually.

@snieg has your argocd gone out of sync?

@snieg

snieg commented Aug 16, 2024

Yes, I have the same diff. Did you install the karpenter-crd chart separately from the main karpenter chart? And did you edit the CRDs manually?

I installed it from the main chart and edited manually.

@snieg has your argocd gone out of sync?

Yes, after editing CRDs, ArgoCD shows "out of sync" now.

Luckily, I only updated one cluster, I'll hold off on updating until the stable version is actually stable. ;)

@virtualdom

I'm seeing the same behavior when upgrading to v0.35.6 (prerequisite to upgrading to v1.0.0)

@engedaam
Contributor

Can you rely on the karpenter-crd helm chart as the mechanism to maintain the CRDs? https://karpenter.sh/v1.0/upgrading/upgrade-guide/#crd-upgrades We have values there that allow you to edit the conversion block: https://github.com/aws/karpenter-provider-aws/blob/main/charts/karpenter-crd/values.yaml
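For reference, the relevant block in the karpenter-crd chart's values looks roughly like this (values shown are examples; check the linked values.yaml for the actual defaults):

webhook:
  # -- Whether to enable the webhooks and webhook permissions.
  enabled: true
  # -- Service name and namespace the CRD conversion webhook should point at.
  serviceName: karpenter
  serviceNamespace: kube-system   # set to the namespace the controller runs in
  # -- The container port to use for the webhook.
  port: 8443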

@inclemenstv

inclemenstv commented Aug 16, 2024

Can you rely on the karpenter-crd helm chart as the mechanism to maintain the CRDs? https://karpenter.sh/v1.0/upgrading/upgrade-guide/#crd-upgrades We have values there that allow you to edit the conversion block: https://github.com/aws/karpenter-provider-aws/blob/main/charts/karpenter-crd/values.yaml

I tried to install karpenter-crd using ArgoCD, but when I install the main karpenter chart I get a diff, because I can't change the namespace block; one side of the diff has:

service:
  name: karpenter
  namespace: karpenter-fargate

while the other side has namespace: kube-system.

@engedaam
Contributor

What namespace are you intending to run the Karpenter controller? In karpenter-fargate? If so, did you include --set webhook.serviceNamespace as part of the karpenter-crd helm upgrade?
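For reference, passing that as part of the karpenter-crd upgrade might look like this (version and namespace are illustrative, assuming the controller runs in karpenter-fargate):

helm upgrade --install karpenter-crd oci://public.ecr.aws/karpenter/karpenter-crd \
  --version 1.0.0 \
  --namespace karpenter-fargate --create-namespace \
  --set webhook.enabled=true \
  --set webhook.serviceName=karpenter \
  --set webhook.serviceNamespace=karpenter-fargate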

@inclemenstv

What namespace are you intending to run the Karpenter controller? In karpenter-fargate? If so, did you include --set webhook.serviceNamespace as part of the karpenter-crd helm upgrade?

Yes, I had the following values.yaml:

webhook:
  # -- Whether to enable the webhooks and webhook permissions.
  enabled: true
  serviceName: karpenter
  serviceNamespace: karpenter-fargate
  # -- The container port to use for the webhook.
  port: 8443

But when I install karpenter it tries to change the namespace to kube-system.

@inclemenstv

Also, ArgoCD is waiting for completion of the hook batch/Job/karpenter-post-install-hook, but I checked the karpenter-post-install-hook job and it finished:

customresourcedefinition.apiextensions.k8s.io/nodepools.karpenter.sh patched
customresourcedefinition.apiextensions.k8s.io/nodeclaims.karpenter.sh patched
customresourcedefinition.apiextensions.k8s.io/ec2nodeclasses.karpenter.k8s.aws patched

@engedaam
Contributor

Yes, the post-install hook will run and patch the CRDs so that the karpenter service they reference is consistent with what is configured for the Karpenter controller in the helm chart. You can disable the post-install hook by setting --no-hooks when installing the karpenter helm chart, but keep in mind that the service referenced by the CRDs could then be different from what is specified for the karpenter controller.
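For a plain Helm install, that would look roughly like this (version and namespace are illustrative; note that --no-hooks skips all chart hooks, not just this one):

helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter \
  --version 1.0.0 \
  --namespace kube-system \
  --no-hooks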

@inclemenstv

inclemenstv commented Aug 16, 2024

Yes, the post-install hook will run and patch the CRDs so that the karpenter service they reference is consistent with what is configured for the Karpenter controller in the helm chart. You can disable the post-install hook by setting --no-hooks when installing the karpenter helm chart, but keep in mind that the service referenced by the CRDs could then be different from what is specified for the karpenter controller.

But I am deploying Karpenter-crd and karpenter using ArgoCD and I can't use --no-hooks

@danielloader
Author

danielloader commented Aug 16, 2024

It's worth interjecting here: as far as I can tell, without making the ttlSecondsAfterFinished value larger than 0, ArgoCD can never install this chart, no matter how you fiddle with it. And even then I'm not 100% sure this will fix it.

The only other option available to me is to use a tool like Kyverno to put a mutation webhook on the Job and enforce this change in the cluster, but that's a lot of moving parts and things that can go wrong just to be able to change an integer in a manifest.

If we do want this configurable, I think wrapping the manifest object in a helm conditional configured by a value like .Values.postInstallHook.enabled=true would be saner than installing the chart with an arbitrary flag enabled by default.
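A sketch of what that could look like, with postInstallHook.enabled as a hypothetical value (it does not exist in the chart today):

# values.yaml (hypothetical)
postInstallHook:
  enabled: true   # set to false for ArgoCD installs

# templates/post-install-hook.yaml (sketch)
{{- if .Values.postInstallHook.enabled }}
# ... existing karpenter-post-install-hook Job manifest, unchanged ...
{{- end }}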

This is actually an ArgoCD problem, but considering it's a multi-year issue I have no hope it's going to be fixed any time soon.

@snieg

snieg commented Aug 16, 2024

Yes, the post-install hook will run and patch the CRDs so that the karpenter service they reference is consistent with what is configured for the Karpenter controller in the helm chart. (...)

I'm using the main Chart (because it contains both CRD and controller deployment).

And ArgoCD never reaches the post-hook moment (as described in the main message).

Because the newly installed CRDs have the namespace hardcoded to kube-system, everything stops.

ArgoCD gets stuck because it is unable to list the current CRD resources, such as current nodeclaims or nodepools (because it's looking for a service in the wrong namespace) and therefore it never reaches the post-hook part.

@danielloader
Author

Guess if there's a desire to force this into kube-system namespace, might be pertinent to swap all the namespace: {{ .Release.Namespace }} on the objects to namespace: kube-system and take the choice away formally.

@Vinaum8

Vinaum8 commented Aug 16, 2024

Guess if there's a desire to force this into kube-system namespace, might be pertinent to swap all the namespace: {{ .Release.Namespace }} on the objects to namespace: kube-system and take the choice away formally.

@danielloader The ideal is to leave it dynamic for the user to choose.

@snieg

snieg commented Aug 16, 2024

Can you rely on the karpenter-crd helm chart as the mechanism to maintain the CRDs? https://karpenter.sh/v1.0/upgrading/upgrade-guide/#crd-upgrades We have values there that allow you to edit the conversion block: https://github.com/aws/karpenter-provider-aws/blob/main/charts/karpenter-crd/values.yaml

Even if I install CRDs using the karpenter-crd helm chart, the helm chart with the controller will also try to install CRDs; they will overlap, so it will try to restore the namespace to kube-system.

One possible solution would be the ability to parameterize in the values whether or not to install CRDs via the controller helm chart, and to disable the post-hook from the helm template.

... or put the CRDs into the templates and add the ability to set the namespace from values.

@Vinaum8

Vinaum8 commented Aug 17, 2024

@snieg @engedaam Some helm charts have an option to install the CRDs or not; I think it would be interesting to add this to the karpenter/karpenter helm chart, as it would give us an alternative.

example:

crds:
  # This option decides if the CRDs should be installed
  # as part of the Helm installation.
  enabled: false

@tico24

tico24 commented Aug 19, 2024

You have the option to do this with Argo CD by passing in the skipCrds flag even if the values file doesn't offer it (assuming you are using the helm feature of Argo CD directly.. which I am not).

https://argo-cd.readthedocs.io/en/stable/user-guide/helm/#helm-skip-crds
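In an Application that uses the helm source type, that flag sits under the source's helm block (sketch, per the Argo CD docs linked above; chart details are illustrative):

source:
  chart: karpenter
  repoURL: public.ecr.aws/karpenter
  targetRevision: 1.0.0
  helm:
    releaseName: karpenter
    skipCrds: true   # don't render/apply the chart's crds/ directory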

If you do happen to be using Lovely, you can pass

      env:
        - name: LOVELY_HELM_CRDS
          value: "false"

to the application. Using Lovely also allows you to kustomize the helm chart, so the other issue is not relevant to me.

@thomasgouveia

Is there any reason why the karpenter chart doesn't declare a dependency on karpenter-crd?

If it did, it would be straightforward to control whether to deploy the CRDs (as mentioned above by @Vinaum8) as part of the release. Additionally, it would make it possible to customize the values passed to karpenter-crd without needing to install it in a dedicated release. For example:

crds:
  enabled: false

  # Custom values to be passed to karpenter-crd dependency
  serviceNamespace: my-custom-namespace
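A sketch of what declaring that dependency could look like in the karpenter chart's Chart.yaml (the condition value is hypothetical; version and repository are shown for illustration):

apiVersion: v2
name: karpenter
version: 1.0.0
dependencies:
  - name: karpenter-crd
    version: 1.0.0
    repository: oci://public.ecr.aws/karpenter
    condition: crds.enabled   # hypothetical gate, as suggested above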

If anyone here struggles with CRDs that are out of sync using ArgoCD (caused by the namespace that is updated by the post-install hook job), you can temporarily work around this issue by using the ignoreDifferences block in your application template:

ignoreDifferences:
  - group: apiextensions.k8s.io
    kind: CustomResourceDefinition
    name: nodepools.karpenter.sh
    jsonPointers:
      - /spec/conversion/webhook/clientConfig/service/namespace
  - group: apiextensions.k8s.io
    kind: CustomResourceDefinition
    name: ec2nodeclasses.karpenter.k8s.aws
    jsonPointers:
      - /spec/conversion/webhook/clientConfig/service/namespace
  - group: apiextensions.k8s.io
    kind: CustomResourceDefinition
    name: nodeclaims.karpenter.sh
    jsonPointers:
      - /spec/conversion/webhook/clientConfig/service/namespace

@snieg

snieg commented Aug 19, 2024

You have the option to do this with Argo CD by passing in the skipCrds flag even if the values file doesn't offer it (assuming you are using the helm feature of Argo CD directly.. which I unfortunately am not).

https://argo-cd.readthedocs.io/en/stable/user-guide/helm/#helm-skip-crds

This worked, but we are still back to the first point where ArgoCD cannot complete the post-hook (even though it has been completed)

15m         Normal    SuccessfulCreate    job/karpenter-post-install-hook                    Created pod: karpenter-post-install-hook-gdx8c
15m         Normal    Completed           job/karpenter-post-install-hook                    Job completed
9m42s       Normal    SuccessfulCreate    job/karpenter-post-install-hook                    Created pod: karpenter-post-install-hook-qmg6m
9m38s       Normal    Completed           job/karpenter-post-install-hook                    Job completed

@jetersen

As a workaround / permanent fix, I used Argo CD with kustomize + helm to patch it.

To enable kustomize + helm, add the following:

kustomize.buildOptions: "--enable-helm"

Either via the argocd-cm configmap:

apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  kustomize.buildOptions: "--enable-helm"

or adding to helm values

configs:
  cm:
    kustomize.buildOptions: "--enable-helm"

If you want to test it locally, beware that you need to use a recent standalone kustomize. The embedded kubectl kustomize does not support OCI repos because of how it invokes helm pull.

kustomization.yaml:

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

helmCharts:
  - name: karpenter
    repo: oci://public.ecr.aws/karpenter
    version: 1.0.0
    releaseName: karpenter
    namespace: kube-system
    valuesFile: values.yaml

patches:
  - target:
      group: batch
      version: v1
      kind: Job
      name: karpenter-post-install-hook
    patch: |-
      - op: replace
        path: /spec/ttlSecondsAfterFinished
        value: 30
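To test that locally under the assumptions above, run a recent standalone kustomize rather than the kubectl-embedded one:

# from the directory containing kustomization.yaml and values.yaml
kustomize build --enable-helm .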

@snieg

snieg commented Aug 19, 2024

As a workaround / permanent fix, I used Argo CD with kustomize + helm to patch it.

To enable kustomize + helm, add the following:
(...)
kustomization.yaml:

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

helmCharts:
  - name: karpenter
    repo: oci://public.ecr.aws/karpenter
    version: 1.0.0
    releaseName: karpenter
    namespace: kube-system
    valuesFile: values.yaml

patches:
  - target:
      group: batch
      version: v1
      kind: Job
      name: karpenter-post-install-hook
    patch: |-
      - op: replace
        path: /spec/ttlSecondsAfterFinished
        value: 30

This trick worked for me too.

Added below to ApplicationSet:

    spec:
      sources:
        - chart: karpenter
          repoURL: public.ecr.aws/karpenter
          targetRevision: "{{values.version}}"
          helm:
            releaseName: karpenter
(...)
          kustomize:
            patches:
              - target:
                  group: batch
                  version: v1
                  kind: Job
                  name: karpenter-post-install-hook
                patch: |-
                  - op: replace
                    path: /spec/ttlSecondsAfterFinished
                    value: 30

@jetersen

jetersen commented Aug 19, 2024

I was not aware that you could mix helm and kustomize in one application 🤯
Just spent a bunch of time to rewrite the helm chart to kustomize...

@danielloader
Author

danielloader commented Aug 19, 2024

The real wins are deep in the comments; this is new to me too and might have made me considerably happier outside of this problem. Thanks @snieg

Edit (cried):

Failed to load target state: failed to generate manifest for source 2 of 3: rpc error: code = Unknown desc = error getting app source type: multiple application sources defined: Kustomize,Helm

@snieg

snieg commented Aug 19, 2024

I was not aware that you could mix helm and kustomize in one application 🤯 Just spent a bunch of time to rewrite the helm chart to kustomize...

I'm actually using ApplicationSet and I have a mix of helm (for maintaining the controller) and kustomize for NodeClasses and NodePools:

---
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: karpenter
spec:
  generators:
    - clusters:
        selector:
          matchExpressions:
            - key: cluster_name
              operator: In
              values:
                - clusterA
                - clusterB
                - clusterC
        values:
          version: "1.0.0"
    - clusters:
        selector:
          matchExpressions:
            - key: cluster_name
              operator: In
              values:
                - clusterD
                - cluster...
        values:
          version: "0.37.0"  # this is the default version
  template:
    metadata:
      name: karpenter-{{metadata.labels.short_name}}
      labels:
        cluster_newrelic_guid: "{{metadata.labels.newrelic_guid}}"
    spec:
      destination:
        name: "{{name}}"
        namespace: kube-system
      project: infra
      sources:
        - chart: karpenter
          repoURL: public.ecr.aws/karpenter
          targetRevision: "{{values.version}}"
          helm:
            releaseName: karpenter
            valueFiles:
              - "$values/apps/karpenter/values/{{metadata.labels.environment}}/values.yaml"
              - "$values/apps/karpenter/values/{{metadata.labels.environment}}/{{metadata.labels.cluster_name}}/values.yaml"
            # skipCrds: true
        - repoURL: https://...
          targetRevision: HEAD
          ref: values
        - repoURL: https://...
          targetRevision: HEAD
          path: apps/karpenter/values/{{metadata.labels.environment}}/{{metadata.labels.cluster_name}}
          kustomize:
            patches:
              - target:
                  group: batch
                  version: v1
                  kind: Job
                  name: karpenter-post-install-hook
                patch: |-
                  - op: replace
                    path: /spec/ttlSecondsAfterFinished
                    value: 30
      info:
        - name: ApplicationSet
          value: This app belongs "karpenter" Application
      syncPolicy:
        syncOptions:
          - CreateNamespace=true
  syncPolicy:
    preserveResourcesOnDeletion: true # do not delete Applications if ApplicationSet is deleted

@danielloader
Author

I've got to the point where I just don't want this post-install hook. I have a handful of custom objects I'm capable of migrating myself; I just want the option to not have to deal with this any more. I get the good intention, but this has totally blocked my production environment from being able to configure karpenter, and that's not ideal.

@snieg

snieg commented Aug 19, 2024

To summarize, what I tried and my current method to update karpenter from 0.37 to 1.0:

The solution provided by @jetersen doesn't quite work for me; sometimes the post-hook in ArgoCD remains stuck.
I tried updating ArgoCD to 2.12; unfortunately it didn't help.

As a workaround, a method that works for me is to not sync everything.
Select any one resource (even if it is already synced) and sync it, like in the image below:

(screenshot: selecting a single resource in the ArgoCD UI and syncing only that resource)

This way the post-hook will not be triggered and the application in ArgoCD will be marked as synced without any error.

BEFORE updating to 1.0.0, I moved karpenter to the kube-system namespace, so during the update ArgoCD is able to list the old CRD resources correctly and doesn't get stuck, and I don't have to edit CRDs manually.

So to sum up: karpenter must be in the kube-system namespace, and we CANNOT sync everything, only selectively; then the post-hook will not be triggered.

I know that this is a very ugly solution, but for now it's probably the only one that works for me.

@danielloader
Author

I've just forked the chart code into a repo with my PR in it and I'm pulling from there until it's done. If it ever gets merged, or resolved so I don't need to use it, I can revert back to upstream.

@jetersen

jetersen commented Aug 19, 2024

Be careful upgrading with argocd...
I just spent all day fixing karpenter issues...

The missing link for me was that karpenter was downgrading my v1 NodePools to v1beta1 because I missed the CRDs for the production clusters: I did a helm upgrade in the staging cluster, discovered issues, converted to kustomize, and then deployed it to production without the new CRDs (missing includeCRDs in kustomization.yaml). Now I have several dead clusters, because of conflicts and leftover webhook validations.

I had to delete all the webhook validation and mutation configurations for karpenter, and remember to install the CRDs.

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

helmCharts:
  - name: karpenter
    repo: oci://public.ecr.aws/karpenter
    version: 1.0.0
    releaseName: karpenter
    namespace: kube-system
    valuesFile: values.yaml
    includeCRDs: true

This is an Application inside an ApplicationSet called cluster-addons that uses helm templating for its ArgoCD apps.

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: "karpenter-{{ .Values.spec.destination.clusterName }}"
  namespace: argocd
  annotations:
    argocd.argoproj.io/manifest-generate-paths: .
  labels:
    platform: "true"
spec:
  destination:
    server: "{{ .Values.spec.destination.server }}"
    namespace: kube-system
  source:
    path: charts/karpenter
    repoURL: https://...
    targetRevision: main
    kustomize:
      patches:
        - target:
            group: batch
            version: v1
            kind: Job
            name: karpenter-post-install-hook
          patch: |-
            - op: replace
              path: /spec/ttlSecondsAfterFinished
              value: 30
        - target:
            version: v1
            kind: ServiceAccount
            name: karpenter
          patch: |-
            - op: add
              path: /metadata/annotations
              value:
                eks.amazonaws.com/role-arn: "arn:aws:iam::{{ .Values.eks.cluster.accountId }}:role/eks-{{ .Values.eks.cluster.name }}-karpenter-role"
        - target:
            version: v1
            kind: Deployment
            name: karpenter
          patch: |-
            - op: add
              path: /spec/template/spec/containers/0/env/-
              value:
                name: CLUSTER_NAME
                value: "{{ .Values.eks.cluster.name }}"
            - op: add
              path: /spec/template/spec/containers/0/env/-
              value:
                name: INTERRUPTION_QUEUE
                value: "eks-{{ .Values.eks.cluster.name }}-interruption-queue"
            - op: add
              path: /spec/template/spec/containers/0/env/-
              value:
                name: AWS_REGION
                value: "{{ .Values.eks.cluster.region }}"

msvechla added a commit to msvechla/karpenter-provider-aws that referenced this issue Aug 21, 2024
Fixes an argocd issue where helm hooks never finish syncing when they
have ttlSecondsAfterFinished set to 0.

See related argocd issue: argoproj/argo-cd#6880

Suggested workaround as implemented by argocd team: argoproj/argo-helm#2861
@danielloader
Author

danielloader commented Aug 22, 2024

Closing this issue tentatively now that the hook has been removed (but before a new chart version has been published).

#6827

@booleanbetrayal

Just a note that I had opened up a similar issue to this yesterday at #6847. In our case, the name of the Service defined in the static CRDs differed from the ArgoCD / Helm generated resource name for the service, leading to a borked migration and rollback.

@booleanbetrayal

You have the option to do this with Argo CD by passing in the skipCrds flag even if the values file doesn't offer it (assuming you are using the helm feature of Argo CD directly.. which I am not).

The problem with the Application-wide skipCrds flag is that there might be multiple charts that are part of an ArgoCD Application, and this will prevent CRD deployment for all of those charts, whereas only this chart is problematic.
