issues with self hosted kubernetes deployment (zrok, ziti-controller, ziti-router) #272

Open
pavars opened this issue Nov 8, 2024 · 15 comments

pavars commented Nov 8, 2024

Hi,

I'm trying to deploy self-hosted zrok with OpenZiti. The idea and the product seem nice, but there is a clear lack of documentation for a properly secured, working configuration, and it seems to still be in the PoC stage. First of all, the Helm charts don't support adding extraEnv variables from Secret mounts (we use the external-secrets operator, which pulls secret data from GCP Secret Manager so that we don't expose keys and secrets in plaintext manifests), which means that the enrollmentJwt and the Ziti admin secret need to be passed into the Helm chart as plaintext values.

Secondly, the Helm hooks that create users and so on sometimes misbehave when deploying with ArgoCD. I've hit so many configuration issues that I keep redeploying zrok and Ziti, and because the users are created as part of the bootstrap process, this leads to drift between the Kubernetes Secrets and the Ziti controller. The biggest issue is that the secrets get regenerated, so what is written in the Ziti controller database no longer matches what is in the K8S secrets, and the initial login doesn't work.

I see that zrok supports a Postgres database, so it can be scaled horizontally and still retain the same data. However, the part of the config responsible for the data store doesn't provide any flexibility to make the required changes; it is hardcoded to use sqlite3.

It is also unclear whether ziti-router is needed, or whether it is enough to expose the ziti-controller edge API as a LoadBalancer service (the docs say one thing, but after testing my conclusion is that ziti-router is required).

I tried mounting the enrollment JWT as an additionalVolume and setting .Values.enrollJwtFile to the mounted path, but that fails miserably. I can see and read the mounted token on the pod filesystem, but for some reason the zrok controller fails; falling back to setting the same token explicitly in enrollmentJwt works fine. I might be wrong, but it feels like enrollmentJwt for ziti-router could also be bootstrapped from a script, so there would be no need to manually log in to the Ziti controller and create the router.

Another problem I ran into was creating new identity configs. When zrok initially starts, it tries to bootstrap and create the required identity. I ran into an issue where the identity "public" already existed, so I had to manually drop the identity and create it again. Additionally, the new identity was created with the ID -D3xLHGw2, and when the zrok frontend tried to start it failed because it didn't recognise the configuration flag "-D3xLHGw2" passed on the CLI. This needs proper escaping, as it seems to be an edge case. In a DR scenario where these resources are recreated, all persistent data would be lost and all clients would have to reauthenticate with new tokens/passwords; correct me if I'm wrong.

The initial config might be good enough to serve zrok/Ziti locally, but it is far from production-ready, or even ready to serve dev resources on a GKE cluster.

Below is our current config for the Helm charts. However, we will probably have to keep our own version of these charts, since they seem to lack vital configuration options. My only concern is with maintaining the scripts which are called for bootstrap etc. I could open a PR for the Helm charts to properly support mounting envFrom: secrets/configmaps if existingSecret is defined, plus an option to configure zrok's ctrl.yaml with a Postgres DB.

--- ziti-controller values
        clientApi:
          advertisedHost: ziti-controller-client.ziti
          advertisedPort: 443
          service:
            enabled: true
            type: ClusterIP
          ingress:
            enabled: false
        ctrlPlaneCasBundle:
          namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: "ziti"
        trust-manager:
          enabled: true
          crds:
            enabled: true
          app:
            trust:
              namespace: ziti
              
--- ziti-router values
        advertisedHost: ziti.dev.company
        # enrollJwtFile: /etc/enrollment-jwt/enrollmentJwt # NOT working
        enrollmentJwt: plainTextToken
        edge:
          advertisedPort: 443
          service:
            enabled: true
            type: ClusterIP
        ctrl:
          endpoint: ziti-controller-ctrl.ziti:443
        tunnel:
          mode: host
        additionalVolumes:
          - name: enrollment-jwt
            volumeType: secret
            mountPath: /etc/enrollment-jwt
            secretName: ziti-router-secret
        env:
          DEBUG: "1"
          
--- ziti-router IngressRouteTCP (for Traefik)
            apiVersion: traefik.containo.us/v1alpha1
            kind: IngressRouteTCP
            metadata:
              name: ziti-router-ingress-dev
              namespace: ziti
              annotations:
                kubernetes.io/ingress.class: traefik-public
            spec:
              entryPoints:
                - websecure
              routes:
              - match: HostSNI(`ziti.dev.company`)
                services:
                  - name: ziti-router-edge
                    port: 443
              tls:
                passthrough: true

--- zrok values
        controller:
          ingress:
            enabled: true
            className: "traefik-public"
            hosts:
              - "zrok.dev.company"
            tls:
               - secretName: wildcard-cert
                 hosts:
                   - zrok.dev.company
        frontend:
          ingress:
            enabled: true
            className: "traefik-public"
            tls:
               - secretName: wildcard-cert
        ziti:
          advertisedHost: ziti-controller-client.ziti
          password: plainTextPassword

        dnsZone: "zrok.dev.company"

qrkourier self-assigned this Nov 8, 2024

@qrkourier
Member

> the new identity was created with ID -D3xLHGw2 and when zrok frontend tried to start it was failing because it doesn't recognise configuration flag "-D3xLHGw2" passed on cli, this needs some proper escaping as seems like this is one of edge cases

Pull request to mitigate leading hyphens in Ziti ID strings - #274

Issue to raise concern about the underlying problem - openziti/ziti#2534
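
To illustrate the failure mode, here is a minimal sketch of how a value that begins with a hyphen gets parsed as flags unless option parsing is ended with `--`. The `show_id` function is a hypothetical stand-in built on bash's `getopts`; it is not zrok's actual argument parser, which may behave differently in detail.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for a CLI that accepts one flag (-v) and one
# positional argument (an identity ID). Not zrok's real parser.
show_id() {
    local OPTIND=1 opt
    while getopts ":v" opt; do
        case "${opt}" in
            v) ;;  # known flag, accepted silently
            *) echo "unknown flag: -${OPTARG}"; return 2 ;;
        esac
    done
    shift $((OPTIND - 1))
    echo "identity: ${1:-}"
}

ZITI_ID="-D3xLHGw2"

# Without "--" the ID is parsed as a bundle of flags and rejected.
show_id "${ZITI_ID}" || true    # prints: unknown flag: -D

# "--" ends option parsing, so the ID is treated as a positional argument.
show_id -- "${ZITI_ID}"         # prints: identity: -D3xLHGw2
```

Quoting alone does not help here, because the shell strips the quotes before the program sees its arguments; only the `--` separator (or an ID that never starts with a hyphen) avoids the ambiguity.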

@qrkourier
Member

qrkourier commented Nov 12, 2024

> helm charts don't support adding extraEnv variables from Secret Mounts (We are using external-secrets operator that pulls in secret data from GCP Secret Manager so we don't expose keys and secrets in plaintext manifests) which means that enrollmentJWT, ziti admin secret needs to be passed into helm chart as plaintext values
>
> I see that there is support for postgres database for Zrok, so it can be scaled horizontally and still retain the same data, however the config part responsible for data store doesn't provide any flexibility to make required changes, it is hardcoded to use sqlite3.
>
> I could open a PR for helm charts to include support for mount envFrom: secrets/ configmaps properly if existingSecret is defined and also option to configure zrok ctrl.yaml with postgres DB.

That would be most welcome. ☺️

There's a pattern in the ziti-controller and ziti-router charts for mounting additional volumes, but you may have a better way already established you could use for extraEnv vars from secret mounts or existing identities, or both.

Another user reported this issue too in #273

> My only concern is with maintaining scripts which are called for bootstrap etc.

That's understandable. The scripts for zrok controller and zrok frontend show a bias for simplicity at the cost of flexibility. If we need significantly more flexibility, it would be wise to consider if there's another approach better suited than shell scripts. I'm reluctant to try to accomplish too much with shell scripts because they can become quite challenging to maintain.

@qrkourier
Member

qrkourier commented Nov 12, 2024

> Secondly, the helm hooks which create users, etc sometimes misbehave when deploying with ArgoCD. ...snip... Biggest issue is that secrets get regenerated and what is written in Ziti controller database doesn't match up to what is in the K8S secrets so initial login doesn't work.

Do you have any theories how this breaks with ArgoCD? Does ArgoCD significantly depart from the typical workflow of running the helm CLI to render templates and call KubeAPI to create the Helm Release?

> Zrok initially starts it tries to bootstrap and create required identity - ran into issue that identity "public" already exists

This sounds like it could stem from the same root. I haven't seen the problem you're describing myself, so I'm guessing it could be related to how ArgoCD works.

Here's the part of bootstrap-frontend.bash that atomically provisions the first zrok account if the account token secret does not already exist.

        # granted permission to read secrets in namespace by SA managed by this chart
        if kubectl -n {{ .Release.Namespace }} get secret \
            {{ include "zrok.fullname" . }}-ziggy-account-token &>/dev/null; then
            echo "INFO: ziggy account enable token secret exists"
        else
            echo "INFO: ziggy account enable token secret does not exist, creating secret"
            # create a default user account named "ziggy" and save the enable token in a Secret resource
            zrok admin create account \
                ziggy@{{ .Values.dnsZone }} \
                {{ $ziggyPassword | b64dec | quote }} \
            | xargs -I TOKEN kubectl -n {{ .Release.Namespace }} create secret generic \
                {{ include "zrok.fullname" . }}-ziggy-account-token \
                --from-literal=token=TOKEN
            # xargs -r is NOT used here because this command must fail loudly if the account token was not created
        fi

And, here's the part of that script that creates the zrok "public" frontend if it does not already exist in Ziti.

        # if default "public" frontend already exists
        ZROK_PUBLIC_TOKEN=$(getZrokPublicFrontend token)
        if [[ -n "${ZROK_PUBLIC_TOKEN}" ]]; then
            
            # ensure the Ziti ID of the public frontend's identity is the same in Ziti and zrok
            ZROK_PUBLIC_ZID=$(getZrokPublicFrontend zid)
            if [[ "${ZITI_PUBLIC_ID}" != "${ZROK_PUBLIC_ZID}" ]]; then
                echo "ERROR: existing Ziti Identity named 'public' with id '$ZITI_PUBLIC_ID' is from a previous zrok"\
                "instance life cycle. Delete it then re-run zrok." >&2
                exit 1
            fi

            echo "INFO: updating frontend"
            zrok admin update frontend "${ZROK_PUBLIC_TOKEN}" \
                --url-template "{{ .Values.frontend.ingress.scheme }}://{token}.{{ .Values.dnsZone }}"
        else
            echo "INFO: creating frontend"
            zrok admin create frontend -- "${ZITI_PUBLIC_ID}" public \
                "{{ .Values.frontend.ingress.scheme }}://{token}.{{ .Values.dnsZone }}"
        fi

@qrkourier
Member

> it is also unclear whether ziti-router is needed or is it enough to set ziti-controller-edge api as LoadBalancer service

A zrok instance requires a Ziti network, and a Ziti network requires at least one router and controller. The router(s) and controller(s) are typically separate deployments, and we're starting to explore using StatefulSets to describe multi-router and multi-controller deployments.

> feels like enrollmentJwt for ziti-router could also be bootstrapped from a script, so there is no need to manually login to ziti controller and create the router.

I was thinking the same thing but never finished working on that branch. I like the idea of the Ziti controller immediately creating a first router named like "default" or "public" and storing the enrollment token in a K8S secret to simplify the router deployment that typically follows on its heels. Another option in mind is a separate umbrella chart like "ziti-stack" that orchestrates the router enrollment parcel to the controller deployment. That might work, but an Operator feels like the better tool for the job of automating life cycle, ops, etc.

@qrkourier
Member

> set .Values.enrollJwtFile to the mounted volume but that fails

Now I see enrollJwtFile is obsolete. Whatever strategy emerges for mounting extra secrets will easily adapt to meet the same need that value must've originally met. For example, if input value existingSecretEnrollmentJwt is passed, then the template should mount that secret on a predictable path and use it during enrollment.

pull request to prune the obsolete value: #275

@pavars
Author

pavars commented Nov 20, 2024

Sorry for the late response. After some fiddling around, I managed to start zrok with Ziti. The ziti-controller needs to start first, then the ziti-router, and the router policies need to be created on the ziti-controller. After that, zrok can be started, and it will successfully create a private/public share.

        ziti edge create edge-router router-dev \
            --role-attributes "public" --tunneler-enabled --jwt-output-file /tmp/router-dev.jwt

        ziti edge create edge-router-policy all-endpoints-public-routers \
            --edge-router-roles "#public" --identity-roles "#all"

        ziti edge create service-edge-router-policy all-routers-all-services \
            --edge-router-roles "#all" --service-roles "#all"
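
The idea raised earlier in the thread, bootstrapping the router's enrollment token from a script instead of by hand, could look roughly like the sketch below. It reuses the `ziti edge create edge-router` command above and stores the resulting JWT in a Secret with `kubectl create secret generic --from-file`. The variable names, their defaults, the `enrollmentJwt` key name, and the dry-run wrapper are all assumptions for illustration. The sketch also assumes a prior successful `ziti edge login` and is not how the charts work today.

```shell
#!/usr/bin/env bash
# Sketch only: create an edge router and keep its enrollment token in a Secret.
# ROUTER_NAME, NAMESPACE, SECRET_NAME, and JWT_FILE are hypothetical defaults.
set -euo pipefail

ROUTER_NAME="${ROUTER_NAME:-router-dev}"
NAMESPACE="${NAMESPACE:-ziti}"
SECRET_NAME="${SECRET_NAME:-ziti-router-secret}"
JWT_FILE="${JWT_FILE:-/tmp/${ROUTER_NAME}.jwt}"
DRY_RUN="${DRY_RUN:-1}"   # 1 prints the commands, 0 executes them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [[ "${DRY_RUN}" == "1" ]]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

bootstrap_router() {
    # Create the router and write its one-time enrollment token to a file.
    run ziti edge create edge-router "${ROUTER_NAME}" \
        --role-attributes "public" --tunneler-enabled \
        --jwt-output-file "${JWT_FILE}"

    # Store the token under the (assumed) key "enrollmentJwt".
    run kubectl -n "${NAMESPACE}" create secret generic "${SECRET_NAME}" \
        --from-file="enrollmentJwt=${JWT_FILE}"
}

bootstrap_router
```

With the default `DRY_RUN=1` the script only prints the two commands, which makes it safe to review before running it against a real controller with `DRY_RUN=0`.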

> Do you have any theories how this breaks with ArgoCD? Does ArgoCD significantly depart from the typical workflow of running the helm CLI to render templates and call KubeAPI to create the Helm Release?

The issues with ArgoCD mostly arose when deleting resources, which in turn also deleted the secret and the PVC that stored the sqlite database. This meant zrok tried to bootstrap once again, but the identities already existed in Ziti, which caused an error. In theory, the hooks and all the resources could be managed better with the Kubernetes operator pattern and some custom CRDs; I'm not very experienced with that, but it would probably make the most sense. An umbrella chart would probably be easier to maintain, but it gives less flexibility. ArgoCD essentially renders the Helm chart with helm template and applies those manifests with kubectl; most hooks work the same way and are mapped to Argo CD hooks on injection.

> That would be most welcome. ☺️
>
> There's a pattern in the ziti-controller and ziti-router charts for mounting additional volumes, but you may have a better way already established you could use for extraEnv vars from secret mounts or existing identities, or both.

I will try to get to it this week and open a PR.

@pavars
Author

pavars commented Nov 21, 2024

There might be a problem with setting the DB password for zrok when using ArgoCD. I am not sure whether zrok supports environment variables in the ctrl.yaml file, which is generated here: https://github.com/openziti/helm-charts/blob/main/charts/zrok/templates/controller-secrets-configmap.yaml#L272

I wanted to use a lookup on the Secret to replace the value of the DB password, but I'm afraid that won't work with ArgoCD: argoproj/argo-cd#5202

An easier way would be to just set env variables there so that the application reads them from the environment. If that is not an option, then as a dirty workaround an initContainer could expand the file with envsubst and mount it on zrok.
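
A minimal sketch of that envsubst workaround follows: a template is rendered at start-up so that the password only ever arrives through the environment (for example from a secretKeyRef). The variable name `ZROK_DB_PASSWORD`, the `render_config` helper, and the simplified `store` excerpt are assumptions for illustration, not the chart's actual template.

```shell
#!/usr/bin/env bash
# Sketch only: render a config template with envsubst (part of GNU gettext).
# ZROK_DB_PASSWORD and the "store" excerpt are hypothetical examples.

render_config() {
    # Limit substitution to one variable so any other "$" in the file survives.
    envsubst '${ZROK_DB_PASSWORD}' < "$1"
}

template="$(mktemp)"
cat > "${template}" <<'EOF'
store:
  type: postgres
  path: "host=postgres user=zrok password=${ZROK_DB_PASSWORD} dbname=zrok"
EOF

if command -v envsubst >/dev/null 2>&1; then
    # In a pod this value would come from the container environment.
    export ZROK_DB_PASSWORD="example-only"
    render_config "${template}"
else
    echo "envsubst not found; it ships with GNU gettext" >&2
fi
rm -f "${template}"
```

In an init container, the rendered output would be redirected to a file on a shared emptyDir volume that the zrok container then mounts as its config.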

@qrkourier
Member

Correct, zrok doesn't support env vars in its configs yet. Here's a couple of GitHub issues tracking improved config handling, including env vars:

I used envsubst for the Docker zrok sample.

Does this accurately summarize the password issue with ArgoCD?

ArgoCD consumes and applies the manifests generated by the helm template command. The template command does not query KubeAPI, and so any Helm/Sprig functions in the templates are unable to effect logic like "generate a password unless mySecretPassword Secret resource already exists." In this scenario, a new password is always generated, so there's a mismatch between the assumptions built in to the templates and the assumptions of a template-to-GitOps workflow like Helm+ArgoCD.

@pavars
Author

pavars commented Nov 26, 2024

> Does this accurately summarize the password issue with ArgoCD?

Yes. If the secret exists, it shouldn't try to recreate the secret, but since Helm lookup doesn't work properly there, it tries to regenerate the password, and ArgoCD shows ziti-controller as out-of-sync.

Another issue is with hooks: zrok has a pre-delete hook, which is not actually supported by ArgoCD and should probably be moved to a post-delete hook. I don't see that it would break anything.

@qrkourier
Member

I couldn't think of a way to refactor the charts to be compatible with a GitOps workflow without adding manual steps to the main Helm-driven workflow, which involves calling the Kube API to manage existing resources and trigger life cycle hooks. I'm not giving up on GitOps by any means.

In the meantime, maybe you could insert a Kustomize step in your GitOps workflow like this:

  1. Render manifest with helm templates
  2. On first run, save the generated values in a patch.yaml file
  3. On subsequent runs, patch the manifest with Kustomize's patchesStrategicMerge from patch.yaml
  4. Commit manifest to Git
  5. Push to Git remote for ArgoCD to apply

@pavars
Author

pavars commented Jan 6, 2025

Hey, Happy New Year! Hope you had good holidays :)

I started working on some changes which are actually working but probably need more work, as I got sick before the holidays and basically stopped there. I need to run some additional tests, but so far secrets are working with envsubst. I added one breaking change to support existingSecret for jwtToken, so that the resource definition is uniform with the other Helm charts, but that can be moved to a separate variable to avoid breaking changes. I will open a PR; then let me know what you think.

I'm afraid that I would have to refactor our whole ArgoCD repo to add Kustomize steps for a few ad hoc cases.

So am I understanding correctly that moving the pre-delete hook to a post-delete hook would not work properly?

@qrkourier
Member

The purpose of that pre-delete hook is to delete the public frontend's identity secret from the cluster. When driven by Helm, it needs to run before the service account is deleted because that's how it gets permission to manage secrets in the release namespace.

@qrkourier
Member

Since you're generating templates from the chart, you could delete that hook entirely if you're managing the life cycle of the identity secret another way.

@pavars
Author

pavars commented Jan 10, 2025

Yeah, we are using the external secrets operator to manage secrets to avoid putting them in plaintext in Git, but all it does is create a Secret in K8S from GCP Secret Manager. The rest is then handled by the zrok scripts during the bootstrap process (adding identities, etc.). Maybe it's not a bad idea to add something like .Values.useArgoCD, which would change how the hooks are configured and also add the serviceAccount and configmap as pre-sync hooks, because these resources need to exist before the rest of the application is deployed; otherwise it fails or needs manual intervention. Right now, on resource deletion the pre-delete hook is basically skipped and resources aren't cleaned up properly.

@jan94
Contributor

jan94 commented Jan 23, 2025

@pavars as @qrkourier has recognized, I have already opened a PR that is also about the Helm pre-upgrade hook. My suggested change allows omitting this hook by setting a value in the respective values file. I assume this mechanism should also help in your case.
