Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wait container failing: level=info msg="stopping progress monitor (context done)" error="context canceled" #10490

Closed
2 of 3 tasks
lvijnck opened this issue Feb 8, 2023 · 12 comments · Fixed by #10523
Closed
2 of 3 tasks
Labels
area/executor P1 High priority. All bugs with >=5 thumbs up that aren’t P0, plus: Any other bugs deemed high priority type/bug

Comments

@lvijnck
Copy link

lvijnck commented Feb 8, 2023

Pre-requisites

  • I have double-checked my configuration
  • I can confirm the issues exists when I tested with :latest
  • I'd like to contribute the fix myself (see contributing guide)

What happened/what you expected to happen?

I am trying to implement a mechanism to clone a repository and pass it downstream. This has been failing due to the "unable to resolve delta" that was happening in 3.4.4. I just updated to argo v3.4.5.

I am now getting errors on Argo's "wait" container. I performed several tests and concluded the following:

  • The repo is succesfully cloned, as the status is correctly displayed
  • Error does not happen when I omit the output artifact
  • Error does not happen for public repositories (even with output artifact defined)

Version

v.3.4.5

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflows that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: test-clone-repo
spec:
  entrypoint: run
  templates:
    - name: run
      inputs:
        parameters:
          artifacts:
          - name: repo-artifact
            path: /src
            git:
              repo: "https://gitlab.com/url"
              revision: "main"
              usernameSecret:
                name: argo-git-creds
                key: username
              passwordSecret:
                name: argo-git-creds
                key: token
      container:
        image: "python:3.9-buster"
        command: [ sh, -c ]
        args: [ "git status && ls && pwd" ]
        workingDir: /src
      outputs:
        artifacts:
          - name: repo-artifact
            path: /src

Logs from the workflow controller

time="2023-02-08T19:14:09.342Z" level=info msg="Processing workflow" namespace=fix-arg-workflows workflow=test-clone-repo-drqpb                                                                                                  │
│ time="2023-02-08T19:14:09.342Z" level=info msg="Task-result reconciliation" namespace=fix-arg-workflows numObjs=0 workflow=test-clone-repo-drqpb                                                                                 │
│ time="2023-02-08T19:14:09.342Z" level=info msg="Pod failed: Error (exit code 2)" displayName=test-clone-repo-drqpb namespace=fix-arg-workflows pod=test-clone-repo-drqpb templateName=run workflow=test-clone-repo-drqpb         │
│ time="2023-02-08T19:14:09.342Z" level=info msg="node changed" namespace=fix-arg-workflows new.message="Error (exit code 2)" new.phase=Error new.progress=0/1 nodeID=test-clone-repo-drqpb old.message= old.phase=Running old.pro │
│ time="2023-02-08T19:14:09.342Z" level=info msg="TaskSet Reconciliation" namespace=fix-arg-workflows workflow=test-clone-repo-drqpb                                                                                               │
│ time="2023-02-08T19:14:09.342Z" level=info msg=reconcileAgentPod namespace=fix-arg-workflows workflow=test-clone-repo-drqpb                                                                                                      │
│ time="2023-02-08T19:14:09.342Z" level=info msg="Updated phase Running -> Error" namespace=fix-arg-workflows workflow=test-clone-repo-drqpb                                                                                       │
│ time="2023-02-08T19:14:09.342Z" level=info msg="Updated message  -> Error (exit code 2)" namespace=fix-arg-workflows workflow=test-clone-repo-drqpb                                                                              │
│ time="2023-02-08T19:14:09.342Z" level=info msg="Marking workflow completed" namespace=fix-arg-workflows workflow=test-clone-repo-drqpb                                                                                           │
│ time="2023-02-08T19:14:09.342Z" level=info msg="Marking workflow as pending archiving" namespace=fix-arg-workflows workflow=test-clone-repo-drqpb                                                                                │
│ time="2023-02-08T19:14:09.342Z" level=info msg="Checking daemoned children of " namespace=fix-arg-workflows workflow=test-clone-repo-drqpb                                                                                       │
│ time="2023-02-08T19:14:09.342Z" level=info msg="Workflow to be dehydrated" Workflow Size=2398                                                                                                                                    │
│ time="2023-02-08T19:14:09.348Z" level=info msg="cleaning up pod" action=deletePod key=fix-arg-workflows/test-clone-repo-drqpb-1340600742-agent/deletePod                                                                         │
│ time="2023-02-08T19:14:09.352Z" level=info msg="Delete pods 404"                                                                                                                                                                 │
│ time="2023-02-08T19:14:09.359Z" level=info msg="Update workflows 200"                                                                                                                                                            │
│ time="2023-02-08T19:14:09.359Z" level=info msg="Workflow update successful" namespace=fix-arg-workflows phase=Error resourceVersion=80452200 workflow=test-clone-repo-drqpb                                                      │
│ time="2023-02-08T19:14:09.359Z" level=info msg="Create events 201"                                                                                                                                                               │
│ time="2023-02-08T19:14:09.371Z" level=info msg="DeleteCollection workflowtaskresults 200"                                                                                                                                        │
│ time="2023-02-08T19:14:09.371Z" level=info msg="Create events 201"                                                                                                                                                               │
│ time="2023-02-08T19:14:09.371Z" level=info msg="archiving workflow" namespace=fix-arg-workflows uid=4621fee1-4e34-46a5-a718-a955b880a4e2 workflow=test-clone-repo-drqpb                                                          │
│ time="2023-02-08T19:14:09.377Z" level=info msg="cleaning up pod" action=labelPodCompleted key=fix-arg-workflows/test-clone-repo-drqpb/labelPodCompleted                                                                          │
│ time="2023-02-08T19:14:09.394Z" level=info msg="Patch pods 200"                                                                                                                                                                  │
│ time="2023-02-08T19:14:09.424Z" level=info msg="Patch workflows 200"                                                                                                                                                             │
│ time="2023-02-08T19:14:12.044Z" level=info msg="Get leases 200"                                                                                                                                                                  │
│ time="2023-02-08T19:14:12.051Z" level=info msg="Update leases 200"                                                                                                                                                               │
│ time="2023-02-08T19:14:17.059Z" level=info msg="Get leases 200"                                                                                                                                                                  │
│ time="2023-02-08T19:14:17.068Z" level=info msg="Update leases 200"                                                                                                                                                               │
│ time="2023-02-08T19:14:18.099Z" level=info msg="Watch workflows 200"                                                                                                                                                             │
│                   
│ time="2023-02-08T19:14:21.086Z" level=info msg=healthz age=5m0s err="<nil>" instanceID= labelSelector="!workflows.argoproj.io/phase,!workflows.argoproj.io/controller-instanceid" managedNamespace=                              │
│ time="2023-02-08T19:14:22.076Z" level=info msg="Get leases 200"                                                                                                                                                                  │
│ time="2023-02-08T19:14:22.083Z" level=info msg="Update leases 200"                                                                                                                                                               │
│ time="2023-02-08T19:14:27.090Z" level=info msg="Get leases 200"                                                                                                                                                                  │
│ time="2023-02-08T19:14:27.096Z" level=info msg="Update leases 200"                                                                                                                                                               │
│ time="2023-02-08T19:14:27.303Z" level=info msg="cleaning up pod" action=killContainers key=fix-arg-workflows/test-clone-repo-drqpb/killContainers                                                                                │
│ time="2023-02-08T19:14:32.111Z" level=info msg="Get leases 200"                                                                                                                                                                  │
│ time="2023-02-08T19:14:32.119Z" level=info msg="Update leases 200"                                                                                                                                                               │
│

Logs from in your workflow's wait container

time="2023-02-08T18:56:22.825Z" level=info msg="Starting Workflow Executor" version=v3.4.5                                                                                                                                       │
│ time="2023-02-08T18:56:22.827Z" level=info msg="Using executor retry strategy" Duration=1s Factor=1.6 Jitter=0.5 Steps=5                                                                                                         │
│ time="2023-02-08T18:56:22.827Z" level=info msg="Executor initialized" deadline="0001-01-01 00:00:00 +0000 UTC" includeScriptOutput=false namespace=fix-arg-workflows podName=test-clone-repo-8nnkq template="{\"name\":\"run\",\ │
│ time="2023-02-08T18:56:22.828Z" level=info msg="Starting deadline monitor"                                                                                                                                                       │
│ time="2023-02-08T18:56:35.829Z" level=info msg="Main container completed" error="<nil>"                                                                                                                                          │
│ time="2023-02-08T18:56:35.829Z" level=info msg="Deadline monitor stopped"                                                                                                                                                        │
│ time="2023-02-08T18:56:35.829Z" level=info msg="stopping progress monitor (context done)" error="context canceled"                                                                                                               │
│ time="2023-02-08T18:56:35.829Z" level=info msg="No Script output reference in workflow. Capturing script output ignored"                                                                                                         │
│ time="2023-02-08T18:56:35.829Z" level=info msg="No output parameters"                                                                                                                                                            │
│ time="2023-02-08T18:56:35.829Z" level=info msg="Saving output artifacts"                                                                                                                                                         │
│ time="2023-02-08T18:56:35.829Z" level=info msg="Staging artifact: repo-artifact"                                                                                                                                                 │
│ time="2023-02-08T18:56:35.829Z" level=info msg="Staging /src from mirrored volume mount /mainctrfs/src"                                                                                                                          │
│ time="2023-02-08T18:56:35.829Z" level=info msg="Taring /mainctrfs/src"                                                                                                                                                           │
│ time="2023-02-08T18:56:46.803Z" level=info msg="archived 651 files/dirs in /mainctrfs/src"                                                                                                                                       │
│ time="2023-02-08T18:56:46.803Z" level=info msg="Successfully staged /src from mirrored volume mount /mainctrfs/src"                                                                                                              │
│ time="2023-02-08T18:56:46.803Z" level=info msg="S3 Save path: /tmp/argo/outputs/artifacts/repo-artifact.tgz, key: fix-arg-workflows/test-clone-repo-8nnkq/test-clone-repo-8nnkq/repo-artifact.tgz"                               │
│ time="2023-02-08T18:56:46.803Z" level=info msg="Creating minio client using IAM role"                                                                                                                                            │
│ time="2023-02-08T18:56:46.811Z" level=info msg="Saving file to s3" bucket=bla-k8s endpoint=s3.amazonaws.com key=fix-arg-workflows/test-clone-repo-8nnkq/test-clone-repo-8nnkq/repo-artifact.tgz path=/tmp/argo/outputs/artifact │
│
@lvijnck lvijnck changed the title Wait container failing: 'msg="stopping progress monitor (context done)" error="context canceled"' Wait container failing: 'level=info msg="stopping progress monitor (context done)" error="context canceled"' Feb 8, 2023
@tooptoop4
Copy link
Contributor

tooptoop4 commented Feb 8, 2023

i am getting several "Error (exit code 2)" on v3.4.5 too (nothing special about the workflows but seems to affect the shorter running ones ie < 30 seconds long, reading inline source artifact i guess s3, not git), but this never happened on v3.4.4

can see the spike of 100s of instances (appearance of "Error (exit code 2)" in controller logs) of this after upgrading (the 2 bars on right are after upgrade, whereas in 30 days before that only happened 4 times):
image

@lvijnck
Copy link
Author

lvijnck commented Feb 8, 2023

This is actually not related to the repo artifact, but it seems to be related to artifact passing. I am also getting this on other stuff I am passing through workflows.

How did such regression come from a new update?

@sxllwx
Copy link
Contributor

sxllwx commented Feb 9, 2023

stopping progress monitor (context done)

I think we should pay more attention to the "unable to resolve delta" error. We may need the following information to help troubleshoot the problem (information can be desensitized):

  • Provide full log of main-container
  • Provide the complete yaml of the failed workflow (if it can be reproduced, it will speed up troubleshooting).
  • Check whether the configmap of workflow-controller contains ActiveDeadlineSeconds related settings,
  • Try to rule out whether it is caused by a git server error.

@tooptoop4
Copy link
Contributor

@sxllwx i don't have any "unable to resolve delta" messages

@sxllwx
Copy link
Contributor

sxllwx commented Feb 9, 2023

i am getting several "Error (exit code 2)" on v3.4.5 too (nothing special about the workflows but seems to affect the shorter running ones ie < 30 seconds long, reading inline source artifact i guess s3, not git), but this never happened on v3.4.4

can see the spike of 100s of instances (appearance of "Error (exit code 2)" in controller logs) of this after upgrading (the 2 bars on right are after upgrade, whereas in 30 days before that only happened 4 times):

emmm... I'm answering lvijnck's question. 😅

For your case, you can provide the following information for troubleshooting:

  1. The log of workflow-controller.
  2. (init, main, wait) container related logs.
  3. The manifest of the workflow.

@lvijnck
Copy link
Author

lvijnck commented Feb 9, 2023

@sxllwx Thanks for the response! Briefly updated the original post. The "unable to resolve reference delta" was in Argo v3.4.4, hence why I updated to 3.4.5. As a current workaround, I have setup a custom template where I implement the cloning myself.

As soon as I did set this up, my pipeline continued to crash signalling the exit 2 above. Some additional observations:

  1. It's not related to the Git input artifact, i.e.,
    • I was successfully able to swap out my repository for another private one
    • I was able to clone any other public repo, e.g., the argo workflows repo and pass it as an artifact
  2. It seems to be related to output artifacts
    • When I omit the output artifact I no longer get the error
    • The error is also happening for other steps in my pipeline with output artifacts

I ended up downgrading to v.3.4.4. and the issue was no longer present.

The only log notifications that resembles an error is the following:

│ time="2023-02-08T18:56:35.829Z" level=info msg="stopping progress monitor (context done)" error="context canceled"

Follow-ups:

  • Workflow above contains full example (only secrets to repo are omitted)
  • Adding the Helm deployment below (using ArgoCD + Helm to deploy)
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: argo-workflows
  namespace: argocd
spec:
  project: default
  source:
    chart: argo-workflows
    repoURL: https://argoproj.github.io/argo-helm
    targetRevision: 0.22.10
    helm:
      releaseName: argo-workflows
      values: |
        server:
          baseHref: /
          extraArgs:
            - --auth-mode=server
        useDefaultArtifactRepo: true
        useStaticCredentials: false
        artifactRepository:
          archiveLogs: true
          s3:
            insecure: false
            bucket: blabla-k8s
            endpoint: s3.amazonaws.com
            region: us-east-2
            keyFormat: "{{workflow.namespace}}/{{workflow.name}}/{{pod.name}}"
        controller:
          persistence:
            connectionPool:
              maxIdleConns: 100
              maxOpenConns: 0
            nodeStatusOffLoad: true
            # enable archiving of old workflows
            archive: true
            archiveTTL: 90d
            postgresql:
              host: host
              port: 5432
              database: postgres
              tableName: argo_workflows
              # the database secrets must be in the same namespace of the controller
              userNameSecret:
                name: argo-db-creds
                key: username
              passwordSecret:
                name: argo-db-creds
                key: password
  destination:
    server: https://kubernetes.default.svc
    namespace: argo-workflows
  syncPolicy:
    syncOptions:
    - CreateNamespace=true
    automated:
      prune: true
      selfHeal: true
    retry:
      limit: 20
      backoff:
        duration: 10s
        factor: 2

@sxllwx
Copy link
Contributor

sxllwx commented Feb 10, 2023

Thank you for the information

Can you provide the manifest of the Pod with Error (exit code 2)?

@tooptoop4
Copy link
Contributor

tooptoop4 commented Feb 12, 2023

@danajp perhaps #10039 is the cause? #9179 (comment) comment caught my attention "Most users will never need tty. Replacing os.Exec with pty is risky (reasons should be obvious). I think benefit outweights risk." there is also creack/pty#127

@danajp
Copy link
Contributor

danajp commented Feb 12, 2023

@tooptoop4 Seems unlikely. pty is only used for templates that set tty: true. It's also only used in the main container, not the waitcontainer.

@sxllwx
Copy link
Contributor

sxllwx commented Feb 13, 2023

The following facts have been confirmed so far:

  1. argoexec-wait may receive SIGTERM (15) more than once
  2. If we call cancel, it will cause argoexec-wait written in golang All signal processing of the app is reset to the default processing method.

There is a question here, why argo wait will receive sig-term? Who can provide some ideas?

@sxllwx
Copy link
Contributor

sxllwx commented Feb 14, 2023

Update:

Here is a summary:

It is indeed a problem introduced by #10140, and I believe #10523 can fix it

Some things we need to clarify (will involve a lot of code):


Why does #10140 bring (exit code 2)?

https://github.com/argoproj/argo-workflows/blob/v3.4.5/cmd/argoexec/commands/wait.go#L36-L46

The cancel here will be executed when the function exits. The semantics of cancel (ps: it should be called stop here) is: argoexec wait will not be willing to receive signals (sigterm )(ref: https://github.com/golang/go/blob/go1.19.2/src/os/signal/signal.go#L262-L276). If the signal is received again, argoexec wait will use the default behavior provided by golang, which is exit(2)

https://github.com/argoproj/argo-workflows/blob/v3.4.4/cmd/argoexec/commands/wait.go#L36-L46

Here cancel will be executed when func waitContainer exits. func waitContainer can receive multiple sigterms during execution.


Why is it sporadic?

From the above analysis, we can see that whether exit(2) depends on the timing of receiving sigterm for the second time, our workflow can succeed in the following situations:

  • The cleanup logic performed by our argoexec wait can be done very quickly. (@lvijnck If the repo-code is very large, will also encounter this problem. I use kubernetes/kubernetes, and it is relatively easy to reproduce)
  • When our workflow-controller executes podCleanupItem (this function will send a signal to argoexec wait)

I hope my answer can clarify your doubts (you are welcome to question again): @lvijnck @tooptoop4 @alexec

@omarkalloush
Copy link

Do you have an estimation when will this fix be released?

@caelan-io caelan-io added the P1 High priority. All bugs with >=5 thumbs up that aren’t P0, plus: Any other bugs deemed high priority label Feb 16, 2023
alexec added a commit that referenced this issue Feb 23, 2023
terrytangyuan pushed a commit that referenced this issue Mar 29, 2023
@agilgur5 agilgur5 changed the title Wait container failing: 'level=info msg="stopping progress monitor (context done)" error="context canceled"' Wait container failing: level=info msg="stopping progress monitor (context done)" error="context canceled" Apr 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/executor P1 High priority. All bugs with >=5 thumbs up that aren’t P0, plus: Any other bugs deemed high priority type/bug
Projects
None yet
7 participants