Workflow stuck when step fails in a loop #1339

Closed
jgonthier opened this issue Apr 24, 2019 · 2 comments

@jgonthier

Is this a BUG REPORT or FEATURE REQUEST?:
BUG REPORT

What happened:
When each element of a loop runs multiple steps, a failed step in one element causes the subsequent elements to get stuck: a stuck element finishes its current step successfully but never proceeds to its next step, and it never fails either. The example below shows this; the argo get output in particular may make the problem clearer.

For example, the following workflow:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: loops-
spec:
  entrypoint: loop-example
  templates:
  - name: loop-example
    steps:
    - - name: loop-steps
        template: loop-steps-t
        arguments:
          parameters:
          - {name: mystring, value: '{{item}}'}
        withItems:
        - item1
        - item2
        - item3

# Loop steps
  - name: loop-steps-t
    inputs:
      parameters:
      - {name: mystring}
    steps:
    - - name: item2-fail
        template: item2-fail
        arguments:
          parameters:
          - {name: item, value: '{{inputs.parameters.mystring}}'}
    - - name: print-message
        template: whalesay
        arguments:
          parameters:
          - {name: message, value: '{{inputs.parameters.mystring}}'}

  - name: item2-fail
    inputs:
      parameters:
      - {name: item}
    container:
      image: python:alpine3.6
      command: [python, -c]
      # fail for item2, sleep for others
      args: ["exec(\"import sys\\nimport time\\nif '{{inputs.parameters.item}}' == 'item2':\\n  sys.exit(1)\\nelse:\\n  time.sleep(5)\\nsys.exit(0)\")"]

  - name: whalesay
    inputs:
      parameters:
      - {name: message}
    container:
      image: docker/whalesay:latest
      command: [cowsay]
      args: ["{{inputs.parameters.message}}"]

results in the following:

$ argo get loops-5f6zr
Name:                loops-5f6zr
Namespace:           default
ServiceAccount:      default
Status:              Running
Created:             Wed Apr 24 12:34:06 -0400 (5 minutes ago)
Started:             Wed Apr 24 12:34:06 -0400 (5 minutes ago)
Duration:            5 minutes 9 seconds

STEP                        PODNAME                 DURATION  MESSAGE
 ● loops-5f6zr                                                
 └-·-✔ loop-steps(0:item1)                                    
   | ├---✔ item2-fail       loops-5f6zr-3149497651  7s        
   | └---✔ print-message    loops-5f6zr-2495876399  7s        
   ├-✖ loop-steps(1:item2)                                    child 'loops-5f6zr-2388831633' failed
   | └---✖ item2-fail       loops-5f6zr-2388831633  5s        failed with exit code 1
   └-● loop-steps(2:item3)                                    
     └---✔ item2-fail       loops-5f6zr-2617972919  6s        
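
For readability, the one-line `args` string of the `item2-fail` template can be restated as an ordinary Python function. This is only a sketch of the same logic: the function name `exit_code_for` and the injectable `sleep` parameter are hypothetical additions for illustration, and the value that Argo substitutes for `{{inputs.parameters.item}}` is passed in as a plain argument instead.

```python
import time


def exit_code_for(item: str, sleep=time.sleep) -> int:
    """Return the exit code the item2-fail container would produce for `item`.

    `sleep` is injectable (an assumption made for this sketch) so the logic
    can be exercised without actually waiting.
    """
    if item == "item2":
        # The failing element exits immediately with a non-zero code.
        return 1
    # Every other element sleeps for 5 seconds, then exits successfully.
    sleep(5)
    return 0
```

So `item1` and `item3` should complete their first step after roughly 5 seconds, while `item2` fails right away. The stuck `loop-steps(2:item3)` element in the output above has finished this step but never started `print-message`.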

Environment:

  • Argo version:
$ argo version
argo: v2.2.1
  BuildDate: 2018-10-11T16:25:59Z
  GitCommit: 3b52b26190163d1f72f3aef1a39f9f291378dafb
  GitTreeState: clean
  GitTag: v2.2.1
  GoVersion: go1.10.3
  Compiler: gc
  Platform: darwin/amd64
  • Kubernetes version:
$ kubectl version -o yaml
clientVersion:
  buildDate: "2019-02-04T04:48:03Z"
  compiler: gc
  gitCommit: 721bfa751924da8d1680787490c54b9179b1fed0
  gitTreeState: clean
  gitVersion: v1.13.3
  goVersion: go1.11.5
  major: "1"
  minor: "13"
  platform: darwin/amd64
serverVersion:
  buildDate: "2019-03-01T22:49:39Z"
  compiler: gc
  gitCommit: 7c34c0d2f2d0f11f397d55a46945193a0e22d8f3
  gitTreeState: clean
  gitVersion: v1.11.8-eks-7c34c0
  goVersion: go1.10.8
  major: "1"
  minor: 11+
  platform: linux/amd64

Let me know if any more info is needed.

@jessesuen
Member

This appears to be fixed in v2.3, where we fixed some stuck workflows. I submitted your example and it seems to work as expected (ignore the timestamps; the minikube VM time is out of sync):

Name:                loops-hdqrw
Namespace:           default
ServiceAccount:      default
Status:              Failed
Message:             child 'loops-hdqrw-3604865890' failed
Created:             Sun Apr 21 10:27:43 -0700 (2 weeks ago)
Started:             Tue May 07 11:50:34 -0700 (34 seconds ago)
Finished:            Tue May 07 11:51:08 -0700 (now)
Duration:            34 seconds

STEP                        PODNAME                 DURATION  MESSAGE
 ✖ loops-hdqrw                                                child 'loops-hdqrw-3604865890' failed
 └-·-✔ loop-steps(0:item1)
   | ├---✔ item2-fail       loops-hdqrw-755516374   16d
   | └---✔ print-message    loops-hdqrw-747117632   16d
   ├-✖ loop-steps(1:item2)                                    child 'loops-hdqrw-2409741172' failed
   | └---✖ item2-fail       loops-hdqrw-2409741172  16d       failed with exit code 1
   └-✔ loop-steps(2:item3)
     ├---✔ item2-fail       loops-hdqrw-3013206290  16d
     └---✔ print-message    loops-hdqrw-100592476   16d
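
The behaviour shown in this output can be summarised with a small toy model. It is a sketch of the expected semantics only, not Argo's controller logic: all names (`run_element`, `run_loop`, `item2_fail`, `print_message`) are hypothetical, and the elements run one after another here, whereas Argo runs them in parallel.

```python
from typing import Callable, Dict, List, Tuple

# A step takes the loop item and returns True on success (toy assumption).
Step = Callable[[str], bool]


def run_element(item: str, steps: List[Step]) -> str:
    """Run one loop element's steps in order, stopping at the first failure."""
    for step in steps:
        if not step(item):
            return "Failed"
    return "Succeeded"


def run_loop(items: List[str], steps: List[Step]) -> Tuple[str, Dict[str, str]]:
    """Run every element independently, then derive the overall phase."""
    phases = {item: run_element(item, steps) for item in items}
    # One failed element fails the whole loop, but must not block the others.
    overall = "Failed" if "Failed" in phases.values() else "Succeeded"
    return overall, phases


def item2_fail(item: str) -> bool:
    # Mirrors the item2-fail template: only item2 fails.
    return item != "item2"


def print_message(item: str) -> bool:
    # Mirrors the whalesay step, which is assumed to always succeed.
    return True


overall, phases = run_loop(["item1", "item2", "item3"], [item2_fail, print_message])
```

In this model `item1` and `item3` complete both steps, `item2` stops at its first step, and the loop as a whole is marked failed, which matches the v2.3 output above.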

@jgonthier
Author

Great, thanks!

icecoffee531 pushed a commit to icecoffee531/argo-workflows that referenced this issue Jan 5, 2022
…#1339 (argoproj#1363)

* fix: disable bool simplifier due to performance issue

Signed-off-by: Derek Wang <[email protected]>