Some kustomizations using SOPS have long running times #600

Closed
log2 opened this issue Mar 28, 2022 · 1 comment

log2 commented Mar 28, 2022

Some kustomizations take 7m30s plus a seemingly random fraction of a second (or twice that amount) to reconcile, for example:

  Normal  ReconciliationSucceeded  52m   kustomize-controller  Reconciliation finished in 7m30.619913479s, next run in 24h0m0s
  Normal  ReconciliationSucceeded  45m   kustomize-controller  Reconciliation finished in 7m30.409957246s, next run in 24h0m0s

I'm not sure what the actual cause is, but I collected a number of examples and counter-examples, which I attach here.

Examples of long reconciliation time (7m30s or 15m):

SOPS, substituteFrom, secret+HR+PDB
apiVersion: kustomize.toolkit.fluxcd.io/v1beta2
kind: Kustomization
metadata:
  name: lab-common-mongodb
  namespace: flux-system
spec:
  decryption:
    provider: sops
  force: false
  interval: 24h
  path: ...omitted...
  postBuild:
    substituteFrom:
    - kind: ConfigMap
      name: kustomization-config
      optional: false
    - kind: ConfigMap
      name: lab-common-kustomization-config
      optional: false
  prune: false
  retryInterval: 1m
  sourceRef:
    kind: GitRepository
    name: staging-lab
  targetNamespace: lab
  timeout: 1m

SOPS, substitute+substituteFrom, imageRepositories+cronjob+secret+rbac, healthcheck on cronjob
apiVersion: kustomize.toolkit.fluxcd.io/v1beta2
kind: Kustomization
metadata:
  name: common-imagerepositories
  namespace: flux-system
spec:
  decryption:
    provider: sops
  force: false
  healthChecks:
  - apiVersion: batch/v1
    kind: CronJob
    name: ecr-credentials-sync
    namespace: flux-system
  interval: 24h
  path: ...omitted...
  postBuild:
    substitute:
      accountId: "...omitted..."
      imageRepoNamespace: flux-system
      region: ...omitted...
    substituteFrom:
    - kind: ConfigMap
      name: kustomization-config
      optional: false
  prune: true
  retryInterval: 1m
  sourceRef:
    kind: GitRepository
    name: staging-system
  timeout: 1m

SOPS, substitute+substituteFrom, gitrepository+HR+secret+configmap (this reaches 15m of reconciliation time)
apiVersion: kustomize.toolkit.fluxcd.io/v1beta2
kind: Kustomization
metadata:
  name: lab-ha-core
  namespace: flux-system
spec:
  decryption:
    provider: sops
  dependsOn:
  - name: lab-common-prerequisites
    namespace: flux-system
  - name: lab-common-nats
    namespace: flux-system
  - name: lab-common-mongodb
    namespace: flux-system
  - name: lab-ha-prerequisites
    namespace: flux-system
  force: false
  interval: 24h
  path: ...omitted...
  postBuild:
    substitute:
      externalHostnamePrefix: ...omitted...
      externalHostnameSuffix: ...omitted...
      release: ...omitted...
    substituteFrom:
    - kind: ConfigMap
      name: kustomization-config
      optional: false
    - kind: ConfigMap
      name: lab-common-kustomization-config
      optional: false
    - kind: ConfigMap
      name: lab-ha-kustomization-config
      optional: false
    - kind: ConfigMap
      name: lab-common-coturn-params
      optional: false
  prune: true
  retryInterval: 1m
  sourceRef:
    kind: GitRepository
    name: staging-lab
  targetNamespace: lab
  timeout: 1m

Note that in all of the examples above, the reconciliation timeout of 1m is not enforced.

Example of good reconciliation time (<10s):

SOPS, secrets
apiVersion: kustomize.toolkit.fluxcd.io/v1beta2
kind: Kustomization
metadata:
  name: flux-system-secrets
  namespace: flux-system
spec:
  decryption:
    provider: sops
  force: false
  interval: 24h
  path: ./flux-system/secrets
  prune: true
  retryInterval: 1m
  sourceRef:
    kind: GitRepository
    name: staging-flux
  timeout: 1m

Note that in all examples, no secretRef is used to configure the SOPS decryptor, due to #595.
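
For contrast, this is roughly what the omitted field looks like when it is set. It is only a sketch of the `spec.decryption.secretRef` field of the Kustomization API; the secret name `sops-keys` is hypothetical and does not appear anywhere in this setup:

```yaml
# Fragment of a Kustomization spec, not a complete manifest.
spec:
  decryption:
    provider: sops
    secretRef:
      name: sops-keys   # hypothetical Secret holding the decryption keys
```

None of the Kustomizations in this report include the `secretRef` block.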

My environment:

Flux core patches
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - flux-components.yaml
  - priorityclass.yaml
# see https://fluxcd.io/docs/cheatsheets/bootstrap/
patches:
# see https://fluxcd.io/docs/cheatsheets/bootstrap/#increase-the-number-of-workers
# see https://github.com/fluxcd/helm-controller/blob/main/main.go#L79
# see https://github.com/fluxcd/kustomize-controller/blob/main/main.go#L79
- patch: |
    - op: add
      path: /spec/template/spec/containers/0/args/-
      value: --concurrent=8
    - op: add
      path: /spec/template/spec/containers/0/args/-
      value: --kube-api-qps=500
    - op: add
      path: /spec/template/spec/containers/0/args/-
      value: --kube-api-burst=1000
    - op: add
      path: /spec/template/spec/containers/0/args/-
      value: --requeue-dependency=15s
  target:
    kind: Deployment
    name: "(kustomize-controller|helm-controller)"

- patch: |
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: all
    spec:
      template:
        spec:
          containers:
            - name: manager
              resources:
                requests:
                  cpu: 0.1
                  memory: 200Mi
                limits:
                  cpu: 1500m
                  memory: 1500Mi
  target:
    kind: Deployment
    name: "(kustomize-controller|helm-controller)"

# see https://fluxcd.io/docs/cheatsheets/bootstrap/#safe-to-evict
- patch: |
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: all
    spec:
      template:
        metadata:
          annotations:
            cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
  target:
    kind: Deployment
    labelSelector: app.kubernetes.io/part-of=flux

# Fixed priority class for all Flux controllers
- patch: |
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: all
    spec:
      template:
        spec:
          priorityClassName: flux-system
  target:
    kind: Deployment
    labelSelector: app.kubernetes.io/part-of=flux

# Add HA to kustomize, helm and notification controller
- patch: |
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: all
    spec:
      replicas: 2
  target:
    kind: Deployment
    name: "(kustomize-controller|helm-controller|notification-controller)"

# Add customized, extended limits for all remaining Flux controllers
- patch: |
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: all
    spec:
      template:
        spec:
          containers:
            - name: manager
              resources:
                requests:
                  cpu: 0.1
                  memory: 200Mi
                limits:
                  cpu: 0.3
                  memory: 500Mi
  target:
    kind: Deployment
    name: "(image-automation-controller|image-reflector-controller|notification-controller)"

# Add Azure KV options for kustomize controller
- patch: |
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: kustomize-controller
      namespace: flux-system
    spec:
      template:
        metadata:
          labels:
            aadpodidbinding: sops-akv-decryptor  # match the AzureIdentityBinding selector
        spec:
          containers:
            - name: manager
              env:
              - name: AZURE_AUTH_METHOD
                value: msi
  target:
    kind: Deployment
    name: "kustomize-controller"

# Add brave deployment options
- patch: |
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: all
    spec:
      strategy:
        rollingUpdate:
          maxSurge: 100%
          maxUnavailable: 50%
        type: RollingUpdate
  target:
    kind: Deployment
    labelSelector: app.kubernetes.io/part-of=flux

# Enable HTTP retry on kustomize-controller and helm-controller
- patch: |
    - op: add
      path: /spec/template/spec/containers/0/args/-
      value: --http-retry=60
  target:
    kind: Deployment
    name: "(kustomize-controller|helm-controller)"

# Options for source-controller

- patch: |
    - op: add
      path: /spec/template/spec/containers/0/args/-
      value: --concurrent=2
    - op: add
      path: /spec/template/spec/containers/0/args/-
      value: --kube-api-qps=500
    - op: add
      path: /spec/template/spec/containers/0/args/-
      value: --kube-api-burst=1000
    - op: add
      path: /spec/template/spec/containers/0/args/-
      value: --requeue-dependency=10s
  target:
    kind: Deployment
    name: "(source-controller)"

- patch: |
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: all
    spec:
      template:
        spec:
          containers:
            - name: manager
              resources:
                requests:
                  cpu: 0.1
                  memory: 200Mi
                limits:
                  cpu: 2000m
                  memory: 2Gi
  target:
    kind: Deployment
    name: "(source-controller)"


log2 commented Mar 30, 2022

This happened with kustomize-controller up to 0.22.2. With version 0.22.3 (suggested by @hiddeco for another issue), the running times of all reconciliations came back to normal values.
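
For readers who manage Flux through a `kustomization.yaml` like the one in the report, one possible way to pin the fixed controller version is the kustomize `images` transformer. This is a minimal sketch, not necessarily how the upgrade was done here: it assumes the default image name `ghcr.io/fluxcd/kustomize-controller` and a `v`-prefixed tag, neither of which is stated in this thread.

```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - flux-components.yaml
images:
  # Assumption: default upstream image name and tag scheme.
  - name: ghcr.io/fluxcd/kustomize-controller
    newTag: v0.22.3
```

The `images` entry only rewrites the tag of matching container images, so it can sit alongside the existing `patches` list without changing any of those patches.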

log2 closed this as completed Mar 30, 2022