Kubeflow 1.7 terraform not working for static pipeline credentials #714

Closed
AlexandreBrown opened this issue May 1, 2023 · 14 comments
Labels: bug (Something isn't working), work in progress (Has been assigned and is in progress)

Comments

@AlexandreBrown
Contributor

AlexandreBrown commented May 1, 2023

Describe the bug
[screenshot of the failing Terraform deployment]

Steps To Reproduce

  1. Create the Dockerfile
FROM ubuntu:18.04

ARG KUBEFLOW_RELEASE_VERSION
ARG AWS_RELEASE_VERSION

WORKDIR /tmp/

RUN apt update \
    && apt install --yes \
        git \
        curl \
        unzip \
        tar \
        make \
        sudo \
        vim \
        wget \
    && git clone https://github.com/awslabs/kubeflow-manifests.git \
    && cd kubeflow-manifests \
    && git checkout ${AWS_RELEASE_VERSION} \
    && git clone --branch ${KUBEFLOW_RELEASE_VERSION} https://github.com/kubeflow/manifests.git upstream \
    && make install-tools


WORKDIR /tmp/kubeflow-manifests/deployments/cognito-rds-s3/terraform

# Disable automatic subdomain creation since our root domain is not on AWS Route53
ARG CREATE_SUBDOMAIN="false"

ARG CLUSTER_REGION
# ENV because we need it in other Dockerfiles
ENV CLUSTER_REGION=${CLUSTER_REGION}

ARG CLUSTER_NAME
# ENV because we need it in other Dockerfiles
ENV CLUSTER_NAME=${CLUSTER_NAME}

ARG EKS_VERSION

# Name of an existing Route53 root domain (e.g. example.com)
ARG ROOT_DOMAIN
# Name of the subdomain to create (e.g. platform.example.com)
ARG SUBDOMAIN
# ENV because we need it in other Dockerfiles
ENV SUBDOMAIN=${SUBDOMAIN}

ARG USER_POOL_NAME

ARG USE_RDS="true"

ARG USE_S3="true"

ARG USE_COGNITO="true"

ARG LOAD_BALANCER_SCHEME=internet-facing

ARG NOTEBOOK_ENABLE_CULLING=true

ARG NOTEBOOK_CULL_IDLE_TIMEOUT_MINUTES=120

ARG NOTEBOOK_IDLENESS_CHECK_PERIOD=10

# For prod use 30 days
ARG SECRET_RECOVERY_WINDOW_IN_DAYS=0

ARG NODE_INSTANCE_TYPE

ARG AWS_ACCESS_KEY_ID
ENV AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID}

ARG AWS_SECRET_ACCESS_KEY
ENV AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY}

# Used to specify that we want to use static IAM credentials for Kubeflow Pipelines (required since KFP v2 does not support IRSA yet).
ENV PIPELINE_S3_CREDENTIAL_OPTION="static"

ARG MINIO_AWS_ACCESS_KEY_ID

ARG MINIO_AWS_SECRET_ACCESS_KEY

ENV AWS_DEFAULT_REGION=${CLUSTER_REGION}

ARG S3_BUCKET_NAME

ARG PATH_TO_BACKUP

ARG BUCKET_REGION

RUN echo "terraform { \n\
  backend \"s3\" { \n\
    bucket = \"$S3_BUCKET_NAME\" \n\
    key    = \"$PATH_TO_BACKUP\" \n\
    region = \"$BUCKET_REGION\" \n\
  } \n\
}" > backend.tf


RUN    echo "minio_aws_access_key_id=\"${MINIO_AWS_ACCESS_KEY_ID}\"" >> sample.auto.tfvars \
    && echo "minio_aws_secret_access_key=\"${MINIO_AWS_SECRET_ACCESS_KEY}\"" >> sample.auto.tfvars \
    && echo "pipeline_s3_credential_option=\"${PIPELINE_S3_CREDENTIAL_OPTION}\"" >> sample.auto.tfvars \
    && echo "create_subdomain=\"${CREATE_SUBDOMAIN}\"" >> sample.auto.tfvars \
    && echo "cluster_name=\"${CLUSTER_NAME}\"" >> sample.auto.tfvars \
    && echo "cluster_region=\"${CLUSTER_REGION}\"" >> sample.auto.tfvars \
    && echo "eks_version=\"${EKS_VERSION}\"" >> sample.auto.tfvars \
    && echo "generate_db_password=\"true\"" >> sample.auto.tfvars \
    && echo "aws_route53_root_zone_name=\"${ROOT_DOMAIN}\"" >> sample.auto.tfvars \
    && echo "aws_route53_subdomain_zone_name=\"${SUBDOMAIN}\"" >> sample.auto.tfvars \
    && echo "cognito_user_pool_name=\"${USER_POOL_NAME}\"" >> sample.auto.tfvars \
    && echo "use_rds=\"${USE_RDS}\"" >> sample.auto.tfvars \
    && echo "use_s3=\"${USE_S3}\"" >> sample.auto.tfvars \
    && echo "use_cognito=\"${USE_COGNITO}\"" >> sample.auto.tfvars \
    && echo "load_balancer_scheme=\"${LOAD_BALANCER_SCHEME}\"" >> sample.auto.tfvars \
    && echo "notebook_enable_culling=\"${NOTEBOOK_ENABLE_CULLING}\"" >> sample.auto.tfvars \
    && echo "notebook_cull_idle_time=\"${NOTEBOOK_CULL_IDLE_TIMEOUT_MINUTES}\"" >> sample.auto.tfvars \
    && echo "notebook_idleness_check_period=\"${NOTEBOOK_IDLENESS_CHECK_PERIOD}\"" >> sample.auto.tfvars \
    && echo "secret_recovery_window_in_days=\"${SECRET_RECOVERY_WINDOW_IN_DAYS}\"" >> sample.auto.tfvars \
    && echo "node_instance_type=\"${NODE_INSTANCE_TYPE}\"" >> sample.auto.tfvars \
    && echo "deletion_protection=\"false\"" >> sample.auto.tfvars \
    && echo "force_destroy_s3_bucket=\"true\"" >> sample.auto.tfvars \
    && terraform init \
    && terraform plan
  2. Build it
docker build \
    --build-arg S3_BUCKET_NAME=$DEPLOYMENT_STATE_S3_BUCKET_NAME \
    --build-arg PATH_TO_BACKUP=$PATH_TO_BACKUP \
    --build-arg BUCKET_REGION=$BUCKET_REGION \
    --build-arg AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID \
    --build-arg AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY \
    --build-arg MINIO_AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID \
    --build-arg MINIO_AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY \
    --build-arg KUBEFLOW_RELEASE_VERSION=$KUBEFLOW_RELEASE_VERSION \
    --build-arg AWS_RELEASE_VERSION=$AWS_RELEASE_VERSION \
    --build-arg CLUSTER_NAME=$CLUSTER_NAME \
    --build-arg CLUSTER_REGION=$CLUSTER_REGION \
    --build-arg EKS_VERSION=$EKS_VERSION \
    --build-arg ROOT_DOMAIN=$ROOT_DOMAIN \
    --build-arg SUBDOMAIN=$SUBDOMAIN \
    --build-arg USER_POOL_NAME=$USER_POOL_NAME \
    --build-arg NODE_INSTANCE_TYPE=$NODE_INSTANCE_TYPE \
    -t kf-deployment \
    . \
    -f kubeflow.Dockerfile

Note that for testing purposes, MINIO_AWS_ACCESS_KEY_ID and MINIO_AWS_SECRET_ACCESS_KEY use the same credentials as the ones used to deploy (it's an admin test account with full access).
  3. Deploy

docker run --rm kf-deployment make deploy

Environment

  • Kubernetes version 1.25
  • Using EKS (yes/no), if so version? Yes, 1.25
  • Kubeflow version v1.7.0
  • AWS build number v1.7.0-aws-b1.0.0
  • AWS service targeted (S3, RDS, etc.) cognito-rds-s3
AlexandreBrown added the bug (Something isn't working) label on May 1, 2023
@ryansteakley
Contributor

Hey @AlexandreBrown, can you share any pod logs? Can you see where the issue comes from, i.e. has the secret been created at all?
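
For reference, a rough sketch of the commands typically used to gather this information (namespace, deployment, and container names are assumptions based on a default Kubeflow on AWS install):

# list all pods and spot anything not Running
kubectl get pods -A
# fetch logs from the Argo workflow-controller (container name assumed)
kubectl -n kubeflow logs deploy/workflow-controller -c workflow-controller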

ryansteakley added the work in progress (Has been assigned and is in progress) label on May 1, 2023
@AlexandreBrown
Contributor Author

@ryansteakley Thanks for quickly tackling this. I deleted the cluster, but I will re-try it and send you the logs.

@AlexandreBrown
Contributor Author

Logs for reference:

NAMESPACE          NAME                                                       READY   STATUS             RESTARTS       AGE
ack-system         ack-sagemaker-controller-sagemaker-chart-cb9d5549b-cdbh6   1/1     Running            0              20m
cert-manager       cert-manager-7cd97d8d8f-bj9rj                              1/1     Running            0              21m
cert-manager       cert-manager-cainjector-5f44d58c4b-6k29n                   1/1     Running            0              21m
cert-manager       cert-manager-webhook-566bd88f7b-t7hpj                      1/1     Running            0              21m
istio-system       aws-authservice-7d9d757476-7tn7z                           1/1     Running            0              6m16s
istio-system       cluster-local-gateway-6955b67f54-d6jwj                     1/1     Running            0              5m30s
istio-system       istio-ingressgateway-67f7b5f88d-h2hxt                      1/1     Running            0              20m
istio-system       istiod-56f7cf9bd6-7bm5m                                    1/1     Running            0              20m
knative-eventing   eventing-controller-c6f5fd6cd-tktl8                        1/1     Running            0              5m10s
knative-eventing   eventing-webhook-79cd6767-vrhgl                            1/1     Running            0              5m10s
knative-serving    activator-67849589d6-t4gm4                                 2/2     Running            0              5m53s
knative-serving    autoscaler-6dbcdd95c7-z2x42                                2/2     Running            0              5m53s
knative-serving    controller-b9b8855b8-j9k5v                                 2/2     Running            0              5m53s
knative-serving    domain-mapping-75cc6d667f-q5b92                            2/2     Running            0              5m53s
knative-serving    domainmapping-webhook-6dfb78c944-4hvg6                     2/2     Running            0              5m53s
knative-serving    net-istio-controller-5fcd96d76f-lctpv                      2/2     Running            0              5m53s
knative-serving    net-istio-webhook-7ff9fdf999-hjx7v                         2/2     Running            0              5m53s
knative-serving    webhook-69cc5b9849-5cc5v                                   2/2     Running            0              5m53s
kube-system        aws-load-balancer-controller-67868c678f-tpwrq              1/1     Running            0              22m
kube-system        aws-load-balancer-controller-67868c678f-zrnt4              1/1     Running            0              22m
kube-system        aws-node-4hrfn                                             1/1     Running            0              21m
kube-system        aws-node-4mkkz                                             1/1     Running            0              21m
kube-system        aws-node-p5kss                                             1/1     Running            0              21m
kube-system        aws-node-ppxsn                                             1/1     Running            0              21m
kube-system        aws-node-wgkd7                                             1/1     Running            2 (22m ago)    22m
kube-system        cluster-proportional-autoscaler-coredns-57cbbccfc6-5ndb8   1/1     Running            0              22m
kube-system        coredns-8fd4db68f-gj96k                                    1/1     Running            0              27m
kube-system        coredns-8fd4db68f-snrhr                                    1/1     Running            0              27m
kube-system        csi-secrets-store-secrets-store-csi-driver-9ndzg           3/3     Running            0              22m
kube-system        csi-secrets-store-secrets-store-csi-driver-ks9s4           3/3     Running            0              22m
kube-system        csi-secrets-store-secrets-store-csi-driver-m85ww           3/3     Running            0              22m
kube-system        csi-secrets-store-secrets-store-csi-driver-svrr7           3/3     Running            0              22m
kube-system        csi-secrets-store-secrets-store-csi-driver-x75d7           3/3     Running            0              22m
kube-system        ebs-csi-controller-7c9d445f4c-7gxcd                        6/6     Running            0              22m
kube-system        ebs-csi-controller-7c9d445f4c-p5cnp                        6/6     Running            0              22m
kube-system        ebs-csi-node-5x5j2                                         3/3     Running            0              22m
kube-system        ebs-csi-node-g7s49                                         3/3     Running            0              22m
kube-system        ebs-csi-node-lwnp8                                         3/3     Running            0              22m
kube-system        ebs-csi-node-ntrps                                         3/3     Running            0              22m
kube-system        ebs-csi-node-s45xt                                         3/3     Running            0              22m
kube-system        efs-csi-controller-6dcb464885-brz5g                        3/3     Running            0              22m
kube-system        efs-csi-controller-6dcb464885-lr7c2                        3/3     Running            0              22m
kube-system        efs-csi-node-476sp                                         3/3     Running            0              22m
kube-system        efs-csi-node-b2d5j                                         3/3     Running            0              22m
kube-system        efs-csi-node-fvjqh                                         3/3     Running            0              22m
kube-system        efs-csi-node-jn7fl                                         3/3     Running            0              22m
kube-system        efs-csi-node-ldqph                                         3/3     Running            0              22m
kube-system        fsx-csi-controller-855b5d9f64-cm69p                        4/4     Running            0              22m
kube-system        fsx-csi-controller-855b5d9f64-trh5h                        4/4     Running            0              22m
kube-system        fsx-csi-node-8sn5j                                         3/3     Running            0              22m
kube-system        fsx-csi-node-c7dsg                                         3/3     Running            0              22m
kube-system        fsx-csi-node-cfjhx                                         3/3     Running            0              22m
kube-system        fsx-csi-node-t5f4f                                         3/3     Running            0              22m
kube-system        fsx-csi-node-wspnl                                         3/3     Running            0              22m
kube-system        kube-proxy-5lz8l                                           1/1     Running            0              23m
kube-system        kube-proxy-h8ttb                                           1/1     Running            0              23m
kube-system        kube-proxy-m52w4                                           1/1     Running            0              23m
kube-system        kube-proxy-n4x6d                                           1/1     Running            0              23m
kube-system        kube-proxy-tz9dl                                           1/1     Running            0              23m
kube-system        secrets-store-csi-driver-provider-aws-4m9lw                1/1     Running            0              22m
kube-system        secrets-store-csi-driver-provider-aws-59mzv                1/1     Running            0              22m
kube-system        secrets-store-csi-driver-provider-aws-5pxcd                1/1     Running            0              22m
kube-system        secrets-store-csi-driver-provider-aws-nxwf4                1/1     Running            0              22m
kube-system        secrets-store-csi-driver-provider-aws-vfggq                1/1     Running            0              22m
kubeflow           aws-secrets-sync-5c94c68ffc-qgjqx                          2/2     Running            0              8m51s
kubeflow           cache-server-76cb8f97f9-9qzcs                              2/2     Running            0              4m7s
kubeflow           kubeflow-pipelines-profile-controller-5b559b8d64-87gdd     1/1     Running            0              4m7s
kubeflow           metacontroller-0                                           1/1     Running            0              4m6s
kubeflow           metadata-envoy-deployment-5b6c575b98-fhzhz                 1/1     Running            0              4m7s
kubeflow           metadata-grpc-deployment-784b8b5fb4-kw4qs                  2/2     Running            1 (4m ago)     4m7s
kubeflow           metadata-writer-5899c74595-t7xw5                           2/2     Running            0              4m7s
kubeflow           ml-pipeline-547fd4964f-vvtd8                               2/2     Running            0              4m6s
kubeflow           ml-pipeline-persistenceagent-798dbf666f-7p6fc              2/2     Running            0              4m7s
kubeflow           ml-pipeline-scheduledworkflow-859ff9cf7b-vj42r             2/2     Running            0              4m7s
kubeflow           ml-pipeline-ui-75b9f4494b-jqcfc                            2/2     Running            0              4m6s
kubeflow           ml-pipeline-viewer-crd-56f7cfd7d9-b7j88                    2/2     Running            1 (4m1s ago)   4m7s
kubeflow           ml-pipeline-visualizationserver-64447ffc76-kl4xm           2/2     Running            0              4m6s
kubeflow           workflow-controller-6547f784cd-8m9hb                       1/2     CrashLoopBackOff   5 (63s ago)    4m6s

logs of workflow-controller:

time="2023-05-01T21:26:27Z" level=info msg="index config" indexWorkflowSemaphoreKeys=true
time="2023-05-01T21:26:27Z" level=info msg="cron config" cronSyncPeriod=10s
time="2023-05-01T21:26:27Z" level=info msg="Memoization caches will be garbage-collected if they have not been hit after" gcAfterNotHitDuration=30s
time="2023-05-01T21:26:27.429Z" level=info msg="not enabling pprof debug endpoints"
time="2023-05-01T21:26:27.430Z" level=info msg="config map" name=workflow-controller-configmap
time="2023-05-01T21:26:27.438Z" level=info msg="Get configmaps 200"
time="2023-05-01T21:26:27.439Z" level=fatal msg="Failed to register watch for controller config map: error converting YAML to JSON: yaml: line 7: did not find expected ',' or '}'"

@AlexandreBrown
Contributor Author

AlexandreBrown commented May 2, 2023

@ryansteakley After analyzing the configmap used during the deployment, we can see that it has an invalid character:
[screenshot of the workflow-controller-configmap showing the invalid character]
If I manually edit the configmap while module.kubeflow_components.module.kubeflow_pipelines.module.helm_addon.helm_release.addon[0] is still being created and replace this character with a regular comma, it fixes the issue.
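
For anyone hitting this before the fix lands, a rough sketch of the manual workaround (namespace and resource names are taken from the logs above; the rollout restart is an assumption about how to pick up the edited configmap):

# open the configmap and replace the invalid character on the offending line with a regular comma
kubectl -n kubeflow edit configmap workflow-controller-configmap
# restart the controller so it re-reads the corrected configmap
kubectl -n kubeflow rollout restart deployment workflow-controller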

@ryansteakley
Contributor

Thanks, as discussed we have merged in #715 to resolve this issue.

@AlexandreBrown
Contributor Author

@ryansteakley Did you test it out? I tried a deployment today and it's still using the old configmap for some reason.

@ryansteakley
Contributor

@AlexandreBrown Have you pulled the latest from the GitHub repo and run a deployment with the old one uninstalled/cleaned up?

@AlexandreBrown
Contributor Author

@ryansteakley I will re-try on a new deployment just to be sure.

@AlexandreBrown
Contributor Author

AlexandreBrown commented May 3, 2023

@ryansteakley Interesting, I tried a new deployment (fresh, no previous tfstate, and with Docker build caching disabled so it re-clones the latest changes of the release tag).
The deployment succeeds now (make deploy no longer fails), but when I check the pods I still get:

kubeflow                    workflow-controller-f974577d9-jmcf6                        1/2     CrashLoopBackOff   6 (2m5s ago)     7m53s

Logs:

time="2023-05-03T03:37:48Z" level=info msg="index config" indexWorkflowSemaphoreKeys=true
time="2023-05-03T03:37:48Z" level=info msg="cron config" cronSyncPeriod=10s
time="2023-05-03T03:37:48Z" level=info msg="Memoization caches will be garbage-collected if they have not been hit after" gcAfterNotHitDuration=30s
time="2023-05-03T03:37:48.297Z" level=info msg="not enabling pprof debug endpoints"
time="2023-05-03T03:37:48.297Z" level=info msg="config map" name=workflow-controller-configmap
time="2023-05-03T03:37:48.307Z" level=info msg="Get configmaps 200"
time="2023-05-03T03:37:48.307Z" level=fatal msg="Failed to register watch for controller config map: error converting YAML to JSON: yaml: line 7: did not find expected ',' or '}'"

@ryansteakley
Contributor

ryansteakley commented May 3, 2023

Let me dive into this.
@AlexandreBrown
I've created a brand new, fresh Terraform deployment with the static credentials option and verified running a pipeline, uploading to S3, etc. I am not running into the issue anymore. Do you have the bandwidth to start from scratch?

@AlexandreBrown
Contributor Author

AlexandreBrown commented May 3, 2023

@ryansteakley After some testing, it looks like the issue was not the Docker image but rather the git checkout.
If you try the following locally:

export KUBEFLOW_RELEASE_VERSION=v1.7.0
export AWS_RELEASE_VERSION=v1.7.0-aws-b1.0.0
git clone https://github.com/awslabs/kubeflow-manifests.git \
    && cd kubeflow-manifests \
    && git checkout ${AWS_RELEASE_VERSION} \
    && git clone --branch ${KUBEFLOW_RELEASE_VERSION} https://github.com/kubeflow/manifests.git upstream
git log --oneline | head -n 1

I get:

bba1bc91 Minor validation and docs update (#695)

It is the last commit made before the release, but I do not get the newer commits, which include the fix.
I tested this on 2 different machines.
Any idea how to fix this, and can you reproduce it?
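
A quick way to check whether a given fix commit is reachable from the checked-out tag (the SHA below is a placeholder, not the actual fix commit):

# exits 0 only if the fix commit is an ancestor of the tag
git merge-base --is-ancestor <fix-commit-sha> v1.7.0-aws-b1.0.0 \
    && echo "tag contains the fix" \
    || echo "tag predates the fix"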

@AlexandreBrown
Contributor Author

AlexandreBrown commented May 3, 2023

OK, if I check out the latest commit SHA instead of the release tag, then it works.

export AWS_RELEASE_VERSION=cbc9105e531c05d22a53516204f844f02936d072

Is this expected?
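
For completeness, the same clone steps as above, but pinned to the commit SHA instead of the release tag:

export KUBEFLOW_RELEASE_VERSION=v1.7.0
export AWS_RELEASE_VERSION=cbc9105e531c05d22a53516204f844f02936d072
git clone https://github.com/awslabs/kubeflow-manifests.git \
    && cd kubeflow-manifests \
    && git checkout ${AWS_RELEASE_VERSION} \
    && git clone --branch ${KUBEFLOW_RELEASE_VERSION} https://github.com/kubeflow/manifests.git upstream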

@ryansteakley
Contributor

I see, this is due to the fact that the docs instruct users to pull the tag, which does not change as we push fixes to the release branch. We are making a new release today, which should resolve this, and the instructions will point to that new tag.
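
Since Git tags are immutable snapshots, picking up post-release fixes means tracking either the release branch or a specific commit rather than the tag. A rough sketch (the branch name shown is hypothetical, not a documented value):

# option 1: track the moving release branch (hypothetical branch name)
git checkout release-v1.7.0-aws
# option 2: pin to a specific commit known to contain the fix
git checkout cbc9105e531c05d22a53516204f844f02936d072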

@AlexandreBrown
Contributor Author

@ryansteakley Awesome, thanks for clarifying.
Closing this issue now.
