[WIP]feat(deployment):changed containerRuntimeExecutor to pns #4965

Closed
capri-xiyue wants to merge 5 commits

@capri-xiyue capri-xiyue commented Jan 7, 2021

Description of your changes:
Fixes #1654.
Changes containerRuntimeExecutor to pns.

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: capri-xiyue
To complete the pull request process, please assign bobgy after the PR has been reviewed.
You can assign the PR to them by writing /assign @bobgy in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@capri-xiyue capri-xiyue changed the title [WIP]changed containerRuntimeExecutor to pns [WIP]feat:changed containerRuntimeExecutor to pns Jan 7, 2021
@capri-xiyue capri-xiyue changed the title [WIP]feat:changed containerRuntimeExecutor to pns [WIP]feat(deployment):changed containerRuntimeExecutor to pns Jan 7, 2021
@capri-xiyue

capri-xiyue commented Jan 8, 2021

@Ark-kun @Bobgy
I changed containerRuntimeExecutor to pns. All sample tests passed.

Based on argoproj/argo-workflows#2679, that Argo issue appears to block the k8sapi and kubelet executors, and it may also block runAsNonRoot: true. It does not appear to block changing the GCP default containerRuntimeExecutor from docker to pns, because the docker executor cannot run as non-root (I checked https://argoproj.github.io/argo/workflow-executors/). Switching from docker to pns therefore should not affect our users, who could not use runAsNonRoot with docker anyway.
By the way, I manually tested the following artifact-passing workflow with runAsNonRoot, and it works with pns.

# This example demonstrates the ability to pass artifacts
# from one step to the next.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: artifact-passing-non-root
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 8737 # any non-root user
    privileged: false
  entrypoint: artifact-example
  templates:
  - name: artifact-example
    steps:
    - - name: generate-artifact
        template: whalesay
    - - name: consume-artifact
        template: print-message
        arguments:
          artifacts:
          - name: message
            from: "{{steps.generate-artifact.outputs.artifacts.hello-art}}"

  - name: whalesay
    container:
      image: docker/whalesay:latest
      command: [sh, -c]
      args: ["sleep 1; cowsay hello world | tee /tmp/hello_world.txt"]
    outputs:
      artifacts:
      - name: hello-art
        path: /tmp/hello_world.txt

  - name: print-message
    inputs:
      artifacts:
      - name: message
        path: /tmp/message
    container:
      image: alpine:latest
      command: [sh, -c]
      args: ["cat /tmp/message"]

@capri-xiyue

/retest

@capri-xiyue

[Screenshot: Screen Shot 2021-01-07 at 6 40 57 PM]

I added some logging to confirm that "pns" is used in this PR.

@Bobgy

Bobgy commented Jan 8, 2021

The PNS executor is used in https://github.com/kubeflow/website/blob/master/content/en/docs/pipelines/installation/localcluster-deployment.md.
We should first release the next KFP version, and then update that documentation after the release.

@capri-xiyue

capri-xiyue commented Jan 8, 2021

After changing it to pns, the "sidecar" sample test takes much longer to complete.
With "docker", it takes only 2 or 3 minutes.
With "pns", it takes about 30 minutes.
I can reproduce this in my own KFP cluster.
It appears to get stuck during the download step, yet the logs show that the download completed. I am not sure why it gets stuck; this needs further investigation.

@capri-xiyue

capri-xiyue commented Jan 9, 2021

> After changing it to pns, the "sidecar" sample test takes much longer to complete.
> With "docker", it takes only 2 or 3 minutes.
> With "pns", it takes about 30 minutes.
> I can reproduce this in my own KFP cluster.
> It appears to get stuck during the download step, yet the logs show that the download completed. I am not sure why it gets stuck; this needs further investigation.

Both docker and pns have to force-kill the echo sidecar container.

Here are the logs of docker:

xiyue-macbookpro:pipelines xiyue$ k logs -f -n kubeflow pipeline-with-sidecar-swqzq-2847679697 -c wait
time="2021-01-08T04:42:52Z" level=info msg="Starting Workflow Executor" version=v2.7.5+ede163e.dirty
time="2021-01-08T04:42:52Z" level=info msg="Creating a docker executor"
time="2021-01-08T04:42:52Z" level=info msg="Executor (version: v2.7.5+ede163e.dirty, build_date: 2020-04-21T01:12:08Z) initialized (pod: kubeflow/pipeline-with-sidecar-swqzq-2847679697) with template:\n{\"name\":\"download\",\"arguments\":{},\"inputs\":{\"parameters\":[{\"name\":\"sleep_sec\",\"value\":\"30\"}]},\"outputs\":{\"parameters\":[{\"name\":\"download-downloaded\",\"valueFrom\":{\"path\":\"/tmp/results.txt\"}}],\"artifacts\":[{\"name\":\"download-downloaded\",\"path\":\"/tmp/results.txt\"}]},\"metadata\":{\"annotations\":{\"sidecar.istio.io/inject\":\"false\"},\"labels\":{\"pipelines.kubeflow.org/cache_enabled\":\"true\"}},\"container\":{\"name\":\"\",\"image\":\"busybox:latest\",\"command\":[\"sh\",\"-c\"],\"args\":[\"sleep 30; wget localhost:5678 -O /tmp/results.txt\"],\"resources\":{}},\"sidecars\":[{\"name\":\"echo\",\"image\":\"hashicorp/http-echo:latest\",\"args\":[\"-text=\\\"hello world\\\"\"],\"resources\":{}}],\"archiveLocation\":{\"archiveLogs\":true,\"s3\":{\"endpoint\":\"minio-service.kubeflow:9000\",\"bucket\":\"mlpipeline\",\"insecure\":true,\"accessKeySecret\":{\"name\":\"mlpipeline-minio-artifact\",\"key\":\"accesskey\"},\"secretKeySecret\":{\"name\":\"mlpipeline-minio-artifact\",\"key\":\"secretkey\"},\"key\":\"artifacts/pipeline-with-sidecar-swqzq/pipeline-with-sidecar-swqzq-2847679697\"}}}"
time="2021-01-08T04:42:52Z" level=info msg="Waiting on main container"
time="2021-01-08T04:42:53Z" level=info msg="main container started with container ID: b2e0d5856d10815d4d7483bbf384b8d226efe394ad989b2166c286893582bbe6"
time="2021-01-08T04:42:53Z" level=info msg="Starting annotations monitor"
time="2021-01-08T04:42:53Z" level=info msg="docker wait b2e0d5856d10815d4d7483bbf384b8d226efe394ad989b2166c286893582bbe6"
time="2021-01-08T04:42:53Z" level=info msg="Starting deadline monitor"
time="2021-01-08T04:43:03Z" level=info msg="/argo/podmetadata/annotations updated"
time="2021-01-08T04:43:23Z" level=info msg="Main container completed"
time="2021-01-08T04:43:23Z" level=info msg="Saving logs"
time="2021-01-08T04:43:23Z" level=info msg="Annotations monitor stopped"
time="2021-01-08T04:43:23Z" level=info msg="[docker logs b2e0d5856d10815d4d7483bbf384b8d226efe394ad989b2166c286893582bbe6]"
time="2021-01-08T04:43:23Z" level=info msg="S3 Save path: /tmp/argo/outputs/logs/main.log, key: artifacts/pipeline-with-sidecar-swqzq/pipeline-with-sidecar-swqzq-2847679697/main.log"
time="2021-01-08T04:43:23Z" level=info msg="Creating minio client minio-service.kubeflow:9000 using static credentials"
time="2021-01-08T04:43:23Z" level=info msg="Saving from /tmp/argo/outputs/logs/main.log to s3 (endpoint: minio-service.kubeflow:9000, bucket: mlpipeline, key: artifacts/pipeline-with-sidecar-swqzq/pipeline-with-sidecar-swqzq-2847679697/main.log)"
time="2021-01-08T04:43:23Z" level=info msg="Saving output parameters"
time="2021-01-08T04:43:23Z" level=info msg="Saving path output parameter: download-downloaded"
time="2021-01-08T04:43:23Z" level=info msg="Copying /tmp/results.txt from base image layer"
time="2021-01-08T04:43:23Z" level=info msg="[sh -c docker cp -a b2e0d5856d10815d4d7483bbf384b8d226efe394ad989b2166c286893582bbe6:/tmp/results.txt - | tar -ax -O]"
time="2021-01-08T04:43:23Z" level=info msg="Deadline monitor stopped"
time="2021-01-08T04:43:23Z" level=info msg="Successfully saved output parameter: download-downloaded"
time="2021-01-08T04:43:23Z" level=info msg="Saving output artifacts"
time="2021-01-08T04:43:23Z" level=info msg="Staging artifact: download-downloaded"
time="2021-01-08T04:43:23Z" level=info msg="Copying /tmp/results.txt from container base image layer to /tmp/argo/outputs/artifacts/download-downloaded.tgz"
time="2021-01-08T04:43:23Z" level=info msg="Archiving b2e0d5856d10815d4d7483bbf384b8d226efe394ad989b2166c286893582bbe6:/tmp/results.txt to /tmp/argo/outputs/artifacts/download-downloaded.tgz"
time="2021-01-08T04:43:23Z" level=info msg="sh -c docker cp -a b2e0d5856d10815d4d7483bbf384b8d226efe394ad989b2166c286893582bbe6:/tmp/results.txt - | gzip > /tmp/argo/outputs/artifacts/download-downloaded.tgz"
time="2021-01-08T04:43:23Z" level=info msg="Archiving completed"
time="2021-01-08T04:43:23Z" level=info msg="S3 Save path: /tmp/argo/outputs/artifacts/download-downloaded.tgz, key: artifacts/pipeline-with-sidecar-swqzq/pipeline-with-sidecar-swqzq-2847679697/download-downloaded.tgz"
time="2021-01-08T04:43:23Z" level=info msg="Creating minio client minio-service.kubeflow:9000 using static credentials"
time="2021-01-08T04:43:23Z" level=info msg="Saving from /tmp/argo/outputs/artifacts/download-downloaded.tgz to s3 (endpoint: minio-service.kubeflow:9000, bucket: mlpipeline, key: artifacts/pipeline-with-sidecar-swqzq/pipeline-with-sidecar-swqzq-2847679697/download-downloaded.tgz)"
time="2021-01-08T04:43:23Z" level=info msg="Successfully saved file: /tmp/argo/outputs/artifacts/download-downloaded.tgz"
time="2021-01-08T04:43:23Z" level=info msg="No Script output reference in workflow. Capturing script output ignored"
time="2021-01-08T04:43:23Z" level=info msg="Annotating pod with output"
time="2021-01-08T04:43:23Z" level=info msg="Killing sidecars"
time="2021-01-08T04:43:23Z" level=info msg="Killing sidecar echo (a83a892076cd1c6cf510955b841578ce42152bec7e1c33d68510eba9973b9588)"
time="2021-01-08T04:43:23Z" level=info msg="docker kill --signal TERM a83a892076cd1c6cf510955b841578ce42152bec7e1c33d68510eba9973b9588"
time="2021-01-08T04:43:23Z" level=info msg="[docker wait a83a892076cd1c6cf510955b841578ce42152bec7e1c33d68510eba9973b9588]"
time="2021-01-08T04:43:33Z" level=info msg="Timed out (10s) for containers to terminate gracefully. Killing forcefully"
time="2021-01-08T04:43:33Z" level=info msg="[docker kill --signal KILL a83a892076cd1c6cf510955b841578ce42152bec7e1c33d68510eba9973b9588]"
time="2021-01-08T04:43:33Z" level=info msg="Containers [a83a892076cd1c6cf510955b841578ce42152bec7e1c33d68510eba9973b9588] killed successfully"
time="2021-01-08T04:43:33Z" level=info msg="Alloc=5839 TotalAlloc=12502 Sys=70592 NumGC=4 Goroutines=10"

Here are the logs of pns:

(pipelines) xiyue-macbookpro:pipelines xiyue$ k logs -f -n kubeflow pipeline-with-sidecar-wwzvq-3142309382 -c wait
time="2021-01-08T04:47:44Z" level=info msg="Starting Workflow Executor" version=v2.7.5+ede163e.dirty
time="2021-01-08T04:47:44Z" level=info msg="Creating PNS executor (namespace: kubeflow, pod: pipeline-with-sidecar-wwzvq-3142309382, pid: 6, hasOutputs: true)"
time="2021-01-08T04:47:44Z" level=info msg="Executor (version: v2.7.5+ede163e.dirty, build_date: 2020-04-21T01:12:08Z) initialized (pod: kubeflow/pipeline-with-sidecar-wwzvq-3142309382) with template:\n{\"name\":\"download\",\"arguments\":{},\"inputs\":{\"parameters\":[{\"name\":\"sleep_sec\",\"value\":\"30\"}]},\"outputs\":{\"parameters\":[{\"name\":\"download-downloaded\",\"valueFrom\":{\"path\":\"/tmp/results.txt\"}}],\"artifacts\":[{\"name\":\"download-downloaded\",\"path\":\"/tmp/results.txt\"}]},\"metadata\":{\"annotations\":{\"sidecar.istio.io/inject\":\"false\"},\"labels\":{\"pipelines.kubeflow.org/cache_enabled\":\"true\"}},\"container\":{\"name\":\"\",\"image\":\"busybox:latest\",\"command\":[\"sh\",\"-c\"],\"args\":[\"sleep 30; wget localhost:5678 -O /tmp/results.txt\"],\"resources\":{}},\"sidecars\":[{\"name\":\"echo\",\"image\":\"hashicorp/http-echo:latest\",\"args\":[\"-text=\\\"hello world\\\"\"],\"resources\":{}}],\"archiveLocation\":{\"archiveLogs\":true,\"s3\":{\"endpoint\":\"minio-service.kubeflow:9000\",\"bucket\":\"mlpipeline\",\"insecure\":true,\"accessKeySecret\":{\"name\":\"mlpipeline-minio-artifact\",\"key\":\"accesskey\"},\"secretKeySecret\":{\"name\":\"mlpipeline-minio-artifact\",\"key\":\"secretkey\"},\"key\":\"artifacts/pipeline-with-sidecar-wwzvq/pipeline-with-sidecar-wwzvq-3142309382\"}}}"
time="2021-01-08T04:47:44Z" level=info msg="Waiting on main container"
time="2021-01-08T04:47:44Z" level=warning msg="Polling root processes (1m0s)"
time="2021-01-08T04:47:45Z" level=warning msg="Failed to stat /proc/19/root: stat /proc/19/root: permission denied"
time="2021-01-08T04:47:45Z" level=info msg="pid 19: &{root 4096 2147484141 {284213262 63745678065 0x2d10800} {211 3287564 1 16877 0 0 0 0 4096 4096 8 {1610081265 435224215} {1610081265 284213262} {1610081265 371219573} [0 0 0]}}"
time="2021-01-08T04:47:45Z" level=info msg="Secured filehandle on /proc/19/root"
time="2021-01-08T04:47:45Z" level=info msg="containerID dca8f93802e68b0507a78039b4e8256d990f6827d2ec15fd5fed411be3a9e746 mapped to pid 19"
time="2021-01-08T04:47:45Z" level=info msg="pid 19: &{root 4096 2147484141 {284213262 63745678065 0x2d10800} {211 3287564 1 16877 0 0 0 0 4096 4096 8 {1610081265 435224215} {1610081265 284213262} {1610081265 371219573} [0 0 0]}}"
time="2021-01-08T04:47:45Z" level=info msg="pid 19: &{root 4096 2147484141 {284213262 63745678065 0x2d10800} {211 3287564 1 16877 0 0 0 0 4096 4096 8 {1610081265 435224215} {1610081265 284213262} {1610081265 371219573} [0 0 0]}}"
time="2021-01-08T04:47:45Z" level=info msg="pid 19: &{root 4096 2147484141 {284213262 63745678065 0x2d10800} {211 3287564 1 16877 0 0 0 0 4096 4096 8 {1610081265 435224215} {1610081265 284213262} {1610081265 371219573} [0 0 0]}}"
time="2021-01-08T04:47:45Z" level=info msg="pid 19: &{root 4096 2147484141 {284213262 63745678065 0x2d10800} {211 3287564 1 16877 0 0 0 0 4096 4096 8 {1610081265 435224215} {1610081265 284213262} {1610081265 371219573} [0 0 0]}}"
time="2021-01-08T04:47:45Z" level=info msg="pid 19: &{root 4096 2147484141 {284213262 63745678065 0x2d10800} {211 3287564 1 16877 0 0 0 0 4096 4096 8 {1610081265 435224215} {1610081265 284213262} {1610081265 371219573} [0 0 0]}}"
time="2021-01-08T04:47:45Z" level=info msg="pid 19: &{root 4096 2147484141 {284213262 63745678065 0x2d10800} {211 3287564 1 16877 0 0 0 0 4096 4096 8 {1610081265 435224215} {1610081265 284213262} {1610081265 371219573} [0 0 0]}}"
time="2021-01-08T04:47:45Z" level=info msg="pid 19: &{root 4096 2147484141 {284213262 63745678065 0x2d10800} {211 3287564 1 16877 0 0 0 0 4096 4096 8 {1610081265 435224215} {1610081265 284213262} {1610081265 371219573} [0 0 0]}}"
time="2021-01-08T04:47:45Z" level=info msg="pid 19: &{root 4096 2147484141 {284213262 63745678065 0x2d10800} {211 3287564 1 16877 0 0 0 0 4096 4096 8 {1610081265 435224215} {1610081265 284213262} {1610081265 371219573} [0 0 0]}}"
time="2021-01-08T04:47:45Z" level=info msg="pid 19: &{root 4096 2147484141 {284213262 63745678065 0x2d10800} {211 3287564 1 16877 0 0 0 0 4096 4096 8 {1610081265 435224215} {1610081265 284213262} {1610081265 371219573} [0 0 0]}}"
time="2021-01-08T04:47:45Z" level=info msg="pid 19: &{root 4096 2147484141 {284213262 63745678065 0x2d10800} {211 3287564 1 16877 0 0 0 0 4096 4096 8 {1610081265 435224215} {1610081265 284213262} {1610081265 371219573} [0 0 0]}}"
time="2021-01-08T04:47:45Z" level=info msg="pid 19: &{root 4096 2147484141 {284213262 63745678065 0x2d10800} {211 3287564 1 16877 0 0 0 0 4096 4096 8 {1610081265 435224215} {1610081265 284213262} {1610081265 371219573} [0 0 0]}}"
time="2021-01-08T04:47:46Z" level=info msg="pid 19: &{root 4096 2147484141 {284213262 63745678065 0x2d10800} {211 3287564 1 16877 0 0 0 0 4096 4096 8 {1610081265 435224215} {1610081265 284213262} {1610081265 371219573} [0 0 0]}}"
time="2021-01-08T04:47:46Z" level=warning msg="Failed to stat /proc/26/root: stat /proc/26/root: permission denied"
time="2021-01-08T04:47:46Z" level=info msg="pid 19: &{root 4096 2147484141 {284213262 63745678065 0x2d10800} {211 3287564 1 16877 0 0 0 0 4096 4096 8 {1610081265 435224215} {1610081265 284213262} {1610081265 371219573} [0 0 0]}}"
time="2021-01-08T04:47:46Z" level=info msg="pid 26: &{root 4096 2147484141 {54269112 63745678066 0x2d10800} {218 3287616 1 16877 0 0 0 0 4096 4096 8 {1610081265 980263745} {1610081266 54269112} {1610081266 54269112} [0 0 0]}}"
time="2021-01-08T04:47:46Z" level=info msg="Secured filehandle on /proc/26/root"
time="2021-01-08T04:47:46Z" level=info msg="containerID 830229497dcc3d02214cf1ebd8694e771bd957b1490857b4373eb5ab9a9fa09d mapped to pid 26"
time="2021-01-08T04:47:46Z" level=info msg="pid 19: &{root 4096 2147484141 {284213262 63745678065 0x2d10800} {211 3287564 1 16877 0 0 0 0 4096 4096 8 {1610081265 435224215} {1610081265 284213262} {1610081265 371219573} [0 0 0]}}"
time="2021-01-08T04:47:46Z" level=info msg="pid 26: &{root 4096 2147484141 {54269112 63745678066 0x2d10800} {218 3287616 1 16877 0 0 0 0 4096 4096 8 {1610081266 110273174} {1610081266 54269112} {1610081266 54269112} [0 0 0]}}"
time="2021-01-08T04:47:46Z" level=info msg="pid 19: &{root 4096 2147484141 {284213262 63745678065 0x2d10800} {211 3287564 1 16877 0 0 0 0 4096 4096 8 {1610081265 435224215} {1610081265 284213262} {1610081265 371219573} [0 0 0]}}"
time="2021-01-08T04:47:46Z" level=info msg="pid 26: &{root 4096 2147484141 {54269112 63745678066 0x2d10800} {218 3287616 1 16877 0 0 0 0 4096 4096 8 {1610081266 110273174} {1610081266 54269112} {1610081266 54269112} [0 0 0]}}"
time="2021-01-08T04:47:46Z" level=info msg="pid 19: &{root 4096 2147484141 {284213262 63745678065 0x2d10800} {211 3287564 1 16877 0 0 0 0 4096 4096 8 {1610081265 435224215} {1610081265 284213262} {1610081265 371219573} [0 0 0]}}"
time="2021-01-08T04:47:46Z" level=info msg="pid 26: &{root 4096 2147484141 {54269112 63745678066 0x2d10800} {218 3287616 1 16877 0 0 0 0 4096 4096 8 {1610081266 110273174} {1610081266 54269112} {1610081266 54269112} [0 0 0]}}"
time="2021-01-08T04:47:46Z" level=info msg="pid 19: &{root 4096 2147484141 {284213262 63745678065 0x2d10800} {211 3287564 1 16877 0 0 0 0 4096 4096 8 {1610081265 435224215} {1610081265 284213262} {1610081265 371219573} [0 0 0]}}"
time="2021-01-08T04:47:46Z" level=info msg="pid 26: &{root 4096 2147484141 {54269112 63745678066 0x2d10800} {218 3287616 1 16877 0 0 0 0 4096 4096 8 {1610081266 110273174} {1610081266 54269112} {1610081266 54269112} [0 0 0]}}"
time="2021-01-08T04:47:46Z" level=info msg="pid 19: &{root 4096 2147484141 {284213262 63745678065 0x2d10800} {211 3287564 1 16877 0 0 0 0 4096 4096 8 {1610081265 435224215} {1610081265 284213262} {1610081265 371219573} [0 0 0]}}"
time="2021-01-08T04:47:46Z" level=info msg="pid 26: &{root 4096 2147484141 {54269112 63745678066 0x2d10800} {218 3287616 1 16877 0 0 0 0 4096 4096 8 {1610081266 110273174} {1610081266 54269112} {1610081266 54269112} [0 0 0]}}"
time="2021-01-08T04:47:46Z" level=info msg="pid 19: &{root 4096 2147484141 {284213262 63745678065 0x2d10800} {211 3287564 1 16877 0 0 0 0 4096 4096 8 {1610081265 435224215} {1610081265 284213262} {1610081265 371219573} [0 0 0]}}"
time="2021-01-08T04:47:46Z" level=info msg="pid 26: &{root 4096 2147484141 {54269112 63745678066 0x2d10800} {218 3287616 1 16877 0 0 0 0 4096 4096 8 {1610081266 110273174} {1610081266 54269112} {1610081266 54269112} [0 0 0]}}"
time="2021-01-08T04:47:46Z" level=info msg="pid 19: &{root 4096 2147484141 {284213262 63745678065 0x2d10800} {211 3287564 1 16877 0 0 0 0 4096 4096 8 {1610081265 435224215} {1610081265 284213262} {1610081265 371219573} [0 0 0]}}"
time="2021-01-08T04:47:46Z" level=info msg="pid 26: &{root 4096 2147484141 {54269112 63745678066 0x2d10800} {218 3287616 1 16877 0 0 0 0 4096 4096 8 {1610081266 110273174} {1610081266 54269112} {1610081266 54269112} [0 0 0]}}"
time="2021-01-08T04:47:46Z" level=info msg="pid 19: &{root 4096 2147484141 {284213262 63745678065 0x2d10800} {211 3287564 1 16877 0 0 0 0 4096 4096 8 {1610081265 435224215} {1610081265 284213262} {1610081265 371219573} [0 0 0]}}"
time="2021-01-08T04:47:46Z" level=info msg="pid 26: &{root 4096 2147484141 {54269112 63745678066 0x2d10800} {218 3287616 1 16877 0 0 0 0 4096 4096 8 {1610081266 110273174} {1610081266 54269112} {1610081266 54269112} [0 0 0]}}"
time="2021-01-08T04:47:46Z" level=info msg="main container started with container ID: dca8f93802e68b0507a78039b4e8256d990f6827d2ec15fd5fed411be3a9e746"
time="2021-01-08T04:47:46Z" level=info msg="Starting annotations monitor"
time="2021-01-08T04:47:46Z" level=info msg="Main pid identified as 19"
time="2021-01-08T04:47:46Z" level=info msg="Successfully secured file handle on main container root filesystem"
time="2021-01-08T04:47:46Z" level=info msg="Closing root filehandle for non-main pid 26"
time="2021-01-08T04:47:46Z" level=info msg="Waiting for main pid 19 to complete"
time="2021-01-08T04:47:46Z" level=info msg="Starting deadline monitor"
time="2021-01-08T04:47:46Z" level=info msg="Stopped root processes polling due to successful securing of main root fs"
time="2021-01-08T04:47:56Z" level=info msg="/argo/podmetadata/annotations updated"
time="2021-01-08T04:48:15Z" level=info msg="Main pid 19 completed"
time="2021-01-08T04:48:15Z" level=info msg="Main container completed"
time="2021-01-08T04:48:15Z" level=info msg="Saving logs"
time="2021-01-08T04:48:15Z" level=info msg="Annotations monitor stopped"
time="2021-01-08T04:48:15Z" level=info msg="Deadline monitor stopped"
time="2021-01-08T04:48:15Z" level=info msg="S3 Save path: /tmp/argo/outputs/logs/main.log, key: artifacts/pipeline-with-sidecar-wwzvq/pipeline-with-sidecar-wwzvq-3142309382/main.log"
time="2021-01-08T04:48:15Z" level=info msg="Creating minio client minio-service.kubeflow:9000 using static credentials"
time="2021-01-08T04:48:15Z" level=info msg="Saving from /tmp/argo/outputs/logs/main.log to s3 (endpoint: minio-service.kubeflow:9000, bucket: mlpipeline, key: artifacts/pipeline-with-sidecar-wwzvq/pipeline-with-sidecar-wwzvq-3142309382/main.log)"
time="2021-01-08T04:48:15Z" level=info msg="Saving output parameters"
time="2021-01-08T04:48:15Z" level=info msg="Saving path output parameter: download-downloaded"
time="2021-01-08T04:48:15Z" level=info msg="Copying /tmp/results.txt from base image layer"
time="2021-01-08T04:48:15Z" level=info msg="Successfully saved output parameter: download-downloaded"
time="2021-01-08T04:48:15Z" level=info msg="Saving output artifacts"
time="2021-01-08T04:48:15Z" level=info msg="Staging artifact: download-downloaded"
time="2021-01-08T04:48:15Z" level=info msg="Copying /tmp/results.txt from container base image layer to /tmp/argo/outputs/artifacts/download-downloaded.tgz"
time="2021-01-08T04:48:15Z" level=info msg="Taring /tmp/results.txt"
time="2021-01-08T04:48:15Z" level=info msg="S3 Save path: /tmp/argo/outputs/artifacts/download-downloaded.tgz, key: artifacts/pipeline-with-sidecar-wwzvq/pipeline-with-sidecar-wwzvq-3142309382/download-downloaded.tgz"
time="2021-01-08T04:48:15Z" level=info msg="Creating minio client minio-service.kubeflow:9000 using static credentials"
time="2021-01-08T04:48:15Z" level=info msg="Saving from /tmp/argo/outputs/artifacts/download-downloaded.tgz to s3 (endpoint: minio-service.kubeflow:9000, bucket: mlpipeline, key: artifacts/pipeline-with-sidecar-wwzvq/pipeline-with-sidecar-wwzvq-3142309382/download-downloaded.tgz)"
time="2021-01-08T04:48:15Z" level=info msg="Successfully saved file: /tmp/argo/outputs/artifacts/download-downloaded.tgz"
time="2021-01-08T04:48:15Z" level=info msg="No Script output reference in workflow. Capturing script output ignored"
time="2021-01-08T04:48:15Z" level=info msg="Annotating pod with output"
time="2021-01-08T04:48:15Z" level=info msg="Killing sidecars"
time="2021-01-08T04:48:15Z" level=info msg="Killing sidecar echo (830229497dcc3d02214cf1ebd8694e771bd957b1490857b4373eb5ab9a9fa09d)"
time="2021-01-08T04:48:15Z" level=info msg="Sending SIGTERM to pid 26"
time="2021-01-08T04:48:25Z" level=warning msg="Timed out (10s) waiting for pid 26 to complete after SIGTERM. Issing SIGKILL"
time="2021-01-08T04:52:44Z" level=info msg="Alloc=4061 TotalAlloc=14330 Sys=70592 NumGC=7 Goroutines=8"
time="2021-01-08T04:57:44Z" level=info msg="Alloc=4061 TotalAlloc=14333 Sys=70592 NumGC=9 Goroutines=8"
time="2021-01-08T05:02:44Z" level=info msg="Alloc=4061 TotalAlloc=14336 Sys=70592 NumGC=12 Goroutines=8"
time="2021-01-08T05:07:44Z" level=info msg="Alloc=4061 TotalAlloc=14339 Sys=70592 NumGC=14 Goroutines=8"
time="2021-01-08T05:12:44Z" level=info msg="Alloc=4061 TotalAlloc=14342 Sys=70592 NumGC=16 Goroutines=8"
time="2021-01-08T05:17:44Z" level=info msg="Alloc=4061 TotalAlloc=14345 Sys=70592 NumGC=19 Goroutines=8"
time="2021-01-08T05:18:25Z" level=info msg="Alloc=4064 TotalAlloc=14348 Sys=70592 NumGC=19 Goroutines=7"
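
The stall can be read straight from the timestamps in the two logs above. As a rough illustration, here is a small Python sketch that measures how long the wait container kept running after the "Timed out (10s)" line. It assumes the time="..." prefix shown in these logs; the helper names (parse_time, elapsed_after) are made up for this example, and the log excerpts are abbreviated copies of the lines quoted above.

```python
import re
from datetime import datetime, timedelta
from typing import Sequence

# Matches the time="..." prefix used by the executor logs quoted above.
TIME_RE = re.compile(r'time="([^"]+)"')


def parse_time(line: str) -> datetime:
    """Extract the UTC timestamp from one executor log line."""
    match = TIME_RE.search(line)
    if match is None:
        raise ValueError(f"no timestamp in line: {line!r}")
    return datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%SZ")


def elapsed_after(lines: Sequence[str], marker: str) -> timedelta:
    """Time between the first line containing `marker` and the last line."""
    stamped = [line for line in lines if TIME_RE.search(line)]
    start = next(parse_time(line) for line in stamped if marker in line)
    return parse_time(stamped[-1]) - start


# Abbreviated excerpts of the two "wait" container logs quoted above.
DOCKER_LOG = [
    'time="2021-01-08T04:43:33Z" level=info msg="Timed out (10s) for containers to terminate gracefully. Killing forcefully"',
    'time="2021-01-08T04:43:33Z" level=info msg="Alloc=5839 TotalAlloc=12502 Sys=70592 NumGC=4 Goroutines=10"',
]
PNS_LOG = [
    'time="2021-01-08T04:48:25Z" level=warning msg="Timed out (10s) waiting for pid 26 to complete after SIGTERM. Issing SIGKILL"',
    'time="2021-01-08T04:52:44Z" level=info msg="Alloc=4061 TotalAlloc=14330 Sys=70592 NumGC=7 Goroutines=8"',
    'time="2021-01-08T05:18:25Z" level=info msg="Alloc=4064 TotalAlloc=14348 Sys=70592 NumGC=19 Goroutines=7"',
]

if __name__ == "__main__":
    print("docker:", elapsed_after(DOCKER_LOG, "Timed out (10s)"))
    print("pns:   ", elapsed_after(PNS_LOG, "Timed out (10s)"))
```

On these excerpts the sketch reports 0:00:00 for docker and 0:30:00 for pns, which matches the roughly 30-minute run time of the "sidecar" sample test.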

With "pns", the 30-minute hang is caused by https://github.com/argoproj/argo/blob/release-2.7/workflow/executor/pns/pns.go#L271.
We currently use Argo 2.7.5, in which the pns executor sleeps for 30 minutes before sending SIGKILL to the container.

The "docker" executor does not wait 30 minutes before sending SIGKILL: https://github.com/argoproj/argo/blob/54154c61eb4fe9f052b04328fb00128568dc20d0/workflow/executor/docker/docker.go#L143

Starting from release 2.9, Argo no longer waits 30 minutes before sending SIGKILL for "pns": https://github.com/argoproj/argo/blob/5759a0e198d333fa8c3e0aeee433d93808c0dc72/workflow/executor/pns/pns.go#L260
@Bobgy @Ark-kun I think it is not user-friendly to make users wait 30 minutes whenever a force kill is needed, and the 30-minute sleep in Argo 2.7.5 exists only for debugging (FYI: argoproj/argo-workflows#2995). Can we upgrade Argo to release 2.9 or later?
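
For readers unfamiliar with the pattern, here is a minimal Python sketch of the terminate-then-kill sequence discussed above, using the 10-second grace period seen in the logs as the default. It is an illustration only, not Argo's code (Argo is written in Go): stop_process and grace_seconds are hypothetical names, and it assumes a POSIX system. The sketch sends SIGKILL as soon as the grace period expires, which is the behaviour described above for the docker executor and for pns from release 2.9 on; the 2.7.5 pns executor instead sleeps 30 minutes between the timeout and the SIGKILL.

```python
import subprocess
import sys


def stop_process(proc: subprocess.Popen, grace_seconds: float = 10.0) -> bool:
    """Stop `proc`; return True if it had to be killed forcefully."""
    proc.terminate()  # sends SIGTERM on POSIX
    try:
        proc.wait(timeout=grace_seconds)
        return False
    except subprocess.TimeoutExpired:
        proc.kill()  # sends SIGKILL on POSIX, with no extra delay
        proc.wait()  # reap the process
        return True


if __name__ == "__main__":
    # Stand-in for a sidecar that does not exit on SIGTERM,
    # as the echo container did in the logs above.
    child_code = (
        "import signal, time\n"
        "signal.signal(signal.SIGTERM, signal.SIG_IGN)\n"
        "print('ready', flush=True)\n"
        "time.sleep(60)\n"
    )
    sidecar = subprocess.Popen(
        [sys.executable, "-c", child_code], stdout=subprocess.PIPE
    )
    sidecar.stdout.readline()  # wait until SIGTERM is being ignored
    forced = stop_process(sidecar, grace_seconds=1.0)
    print("forced kill:", forced)
```

The stand-in child prints "ready" only after it starts ignoring SIGTERM, so the parent never signals it too early and the forced-kill path is exercised on every run.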

@Bobgy

Bobgy commented Jan 9, 2021

Really impressive investigation, great job!

I believe there are existing efforts to upgrade to Argo 2.11, which is probably a better candidate.

@Bobgy

Bobgy commented Jan 9, 2021

#4553

@Ark-kun

Ark-kun commented Jan 15, 2021

Also #4693

@chensun chensun force-pushed the master branch 2 times, most recently from 7542f0e to 44d22a6 Compare February 12, 2021 09:23
@capri-xiyue

/retest

@capri-xiyue

/retest

@Ark-kun

Ark-kun commented Mar 5, 2021

Just to check: Do the sample pipelines that use artifact passing work?
I remember that this was the biggest issue with the PNS executor - it could not grab files from the containers.

@capri-xiyue

> Just to check: Do the sample pipelines that use artifact passing work?
> I remember that this was the biggest issue with the PNS executor - it could not grab files from the containers.

It worked. I verified it.

@Bobgy

Bobgy commented Mar 5, 2021

Note that the Argo image hasn't been updated yet; I am working on that. This PR needs to wait for it.

@stale

stale bot commented Jun 3, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the lifecycle/stale label Jun 3, 2021
Labels: cla: yes, do-not-merge/work-in-progress, lifecycle/stale, size/S