
inputs.parameters.image was not supplied using retryStrategy in a DAG since 2.4.0 #1659

Closed
kclaes opened this issue Oct 8, 2019 · 5 comments · Fixed by #1669

kclaes commented Oct 8, 2019

Is this a BUG REPORT or FEATURE REQUEST?:
BUG REPORT

What happened:
Using retryStrategy on a container causes the retry step to fail with inputs.parameters.image was not supplied, even though the resulting child pod succeeds.
(screenshot attached: 2019-10-08_17-33-38)

What you expected to happen:
The retry step should succeed if the child-pod succeeds within the retry limit.

How to reproduce it (as minimally and precisely as possible):
Workflow: wf-strip.yaml.txt
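The attachment is not inlined here, but a minimal reconstruction from the node status below (template names dag and container1, parameter image set to python:alpine3.6) looks roughly like this; the container command, args and retry limit are my assumptions:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: argo24-retry-
spec:
  entrypoint: dag
  templates:
  - name: dag
    dag:
      tasks:
      - name: task1
        template: container1
        arguments:
          parameters:
          - name: image
            value: python:alpine3.6
  - name: container1
    inputs:
      parameters:
      - name: image
    container:
      image: "{{inputs.parameters.image}}"
      command: [python, -c]      # assumed; the real command is in the attached workflow
      args: ["print('hello')"]
    # the last two lines of the file; removing them makes the workflow succeed
    retryStrategy:
      limit: 2                   # assumed limit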

Anything else we need to know?:
Removing the last two lines from the attached workflow (i.e. removing the retryStrategy) makes the workflow succeed. The same workflow also runs without issue on 2.3.0.

Environment:

  • Argo version: 2.4.0

  • Kubernetes version:

clientVersion:
  buildDate: "2019-08-19T11:13:49Z"
  compiler: gc
  gitCommit: 96fac5cd13a5dc064f7d9f4f23030a6aeface6cc
  gitTreeState: clean
  gitVersion: v1.14.6
  goVersion: go1.12.9
  major: "1"
  minor: "14"
  platform: windows/amd64
serverVersion:
  buildDate: "2019-02-28T13:30:26Z"
  compiler: gc
  gitCommit: c27b913fddd1a6c480c229191a087698aa92f0b1
  gitTreeState: clean
  gitVersion: v1.13.4
  goVersion: go1.11.5
  major: "1"
  minor: "13"
  platform: linux/amd64

Other debugging information (if applicable):

  • workflow result:
  Nodes:
    Argo 24 - Retry:
      Children:
        argo24-retry-2505001171
      Display Name:   argo24-retry
      Finished At:    2019-10-08T15:22:46Z
      Id:             argo24-retry
      Name:           argo24-retry
      Phase:          Error
      Started At:     2019-10-08T15:22:41Z
      Template Name:  dag
      Type:           DAG
    Argo 24 - Retry - 1352431966:
      Boundary ID:   argo24-retry
      Display Name:  task1(0)
      Finished At:   2019-10-08T15:22:45Z
      Id:            argo24-retry-1352431966
      Inputs:
        Parameters:
          Name:       image
          Value:      python:alpine3.6
      Name:           argo24-retry.task1(0)
      Phase:          Succeeded
      Started At:     2019-10-08T15:22:41Z
      Template Name:  container1
      Type:           Pod
    Argo 24 - Retry - 2505001171:
      Boundary ID:  argo24-retry
      Children:
        argo24-retry-1352431966
      Display Name:   task1
      Finished At:    2019-10-08T15:22:42Z
      Id:             argo24-retry-2505001171
      Message:        inputs.parameters.image was not supplied
      Name:           argo24-retry.task1
      Phase:          Error
      Started At:     2019-10-08T15:22:41Z
      Template Name:  container1
      Type:           Retry
  Phase:              Error
  • workflow-controller logs:
time="2019-10-08T15:22:41Z" level=info msg="Processing workflow" namespace=dev-personal workflow=argo24-retry
time="2019-10-08T15:22:41Z" level=info msg="Updated phase  -> Running" namespace=dev-personal workflow=argo24-retry
time="2019-10-08T15:22:41Z" level=info msg="DAG node argo24-retry (argo24-retry) initialized Pending" namespace=dev-personal workflow=argo24-retry
time="2019-10-08T15:22:41Z" level=info msg="node argo24-retry (argo24-retry) phase Pending -> Running" namespace=dev-personal workflow=argo24-retry
time="2019-10-08T15:22:41Z" level=info msg="All of node argo24-retry.task1 dependencies [] completed" namespace=dev-personal workflow=argo24-retry
time="2019-10-08T15:22:41Z" level=info msg="Retry node argo24-retry.task1 (argo24-retry-2505001171) initialized Running" namespace=dev-personal workflow=argo24-retry
time="2019-10-08T15:22:41Z" level=info msg="Pod node argo24-retry.task1(0) (argo24-retry-1352431966) initialized Pending" namespace=dev-personal workflow=argo24-retry
time="2019-10-08T15:22:41Z" level=info msg="Created pod: argo24-retry.task1(0) (argo24-retry-1352431966)" namespace=dev-personal workflow=argo24-retry
time="2019-10-08T15:22:41Z" level=info msg="Workflow update successful" namespace=dev-personal workflow=argo24-retry
time="2019-10-08T15:22:42Z" level=info msg="Processing workflow" namespace=dev-personal workflow=argo24-retry
time="2019-10-08T15:22:42Z" level=info msg="Updating node argo24-retry.task1(0) (argo24-retry-1352431966) message: ContainerCreating"
time="2019-10-08T15:22:42Z" level=info msg="node argo24-retry.task1 (argo24-retry-2505001171) phase Running -> Error" namespace=dev-personal workflow=argo24-retry
time="2019-10-08T15:22:42Z" level=info msg="node argo24-retry.task1 (argo24-retry-2505001171) message: inputs.parameters.image was not supplied" namespace=dev-personal workflow=argo24-retry
time="2019-10-08T15:22:42Z" level=info msg="node argo24-retry.task1 (argo24-retry-2505001171) finished: 2019-10-08 15:22:42.510629459 +0000 UTC" namespace=dev-personal workflow=argo24-retry
time="2019-10-08T15:22:42Z" level=info msg="Workflow update successful" namespace=dev-personal workflow=argo24-retry
time="2019-10-08T15:22:43Z" level=info msg="Processing workflow" namespace=dev-personal workflow=argo24-retry
time="2019-10-08T15:22:43Z" level=info msg="Processing workflow" namespace=dev-personal workflow=argo24-retry
time="2019-10-08T15:22:45Z" level=info msg="Processing workflow" namespace=dev-personal workflow=argo24-retry
time="2019-10-08T15:22:45Z" level=info msg="Updating node argo24-retry.task1(0) (argo24-retry-1352431966) status Pending -> Running"
time="2019-10-08T15:22:45Z" level=info msg="Workflow update successful" namespace=dev-personal workflow=argo24-retry
time="2019-10-08T15:22:46Z" level=info msg="Processing workflow" namespace=dev-personal workflow=argo24-retry
time="2019-10-08T15:22:46Z" level=info msg="Processing workflow" namespace=dev-personal workflow=argo24-retry
time="2019-10-08T15:22:46Z" level=info msg="Updating node argo24-retry.task1(0) (argo24-retry-1352431966) status Running -> Succeeded"
time="2019-10-08T15:22:46Z" level=info msg="node argo24-retry (argo24-retry) phase Running -> Error" namespace=dev-personal workflow=argo24-retry
time="2019-10-08T15:22:46Z" level=info msg="node argo24-retry (argo24-retry) finished: 2019-10-08 15:22:46.806242053 +0000 UTC" namespace=dev-personal workflow=argo24-retry
time="2019-10-08T15:22:46Z" level=info msg="Checking daemoned children of argo24-retry" namespace=dev-personal workflow=argo24-retry
time="2019-10-08T15:22:46Z" level=info msg="Updated phase Running -> Error" namespace=dev-personal workflow=argo24-retry
time="2019-10-08T15:22:46Z" level=info msg="Marking workflow completed" namespace=dev-personal workflow=argo24-retry
time="2019-10-08T15:22:46Z" level=info msg="Checking daemoned children of " namespace=dev-personal workflow=argo24-retry
time="2019-10-08T15:22:46Z" level=info msg="Workflow update successful" namespace=dev-personal workflow=argo24-retry
time="2019-10-08T15:22:47Z" level=info msg="Labeled pod dev-personal/argo24-retry-1352431966 completed"

thundergolfer commented Oct 9, 2019

After upgrading to 2.4.1 in our dev cluster we are also seeing a very similar regression. The difference in our case is that the workflow doesn't actually stop with Error; it gets stuck:

Status:
  Finished At:  <nil>
  Nodes:
    Example - Pipeline - Hbnbf:
      Children:
        example-pipeline-hbnbf-1001031843
        example-pipeline-hbnbf-1754353090
      Display Name:   example-pipeline-hbnbf
      Finished At:    <nil>
      Id:             example-pipeline-hbnbf
      Name:           example-pipeline-hbnbf
      Phase:          Running                                     <<<<--- Stuck
      Started At:     2019-10-09T21:32:35Z
      Template Name:  example-pipeline
      Type:           DAG
    Example - Pipeline - Hbnbf - 1001031843:
      Boundary ID:  example-pipeline-hbnbf
      Children:
        example-pipeline-hbnbf-1450750190
      Display Name:   launch-cluster
      Finished At:    2019-10-09T21:32:36Z
      Id:             example-pipeline-hbnbf-1001031843
      Message:        inputs.parameters.flavor was not supplied <<<<----- Same
      Name:           example-pipeline-hbnbf.launch-cluster
      Phase:          Error
      Started At:     2019-10-09T21:32:35Z
      Template Name:  create-cluster
      Type:           Retry
    Example - Pipeline - Hbnbf - 1450750190:
      Boundary ID:   example-pipeline-hbnbf
      Display Name:  launch-cluster(0)
      Finished At:   2019-10-09T21:40:43Z
      Id:            example-pipeline-hbnbf-1450750190
      Inputs:
        Parameters:
          Name:   flavor
          Value:  ds-dev
          Name:   role-arn
          Value:  arn:aws:iam::111111111111111:role/xxx
          Name:   cluster-name
          Value:  ExampleCluster
          Name:   app-bundle-name
          Value:  example-pipeline

argo -n argo get example-pipeline-hbnbf

Name:                example-pipeline-hbnbf
Namespace:           argo
ServiceAccount:      argo
Status:              Running
Created:             Thu Oct 10 08:32:35 +1100 (2 hours ago)
Started:             Thu Oct 10 08:32:35 +1100 (2 hours ago)
Duration:            2 hours 7 minutes
Parameters:
  dsFlavor:          ds-dev
  roleArn:           arn:aws:iam::11111111111111:role/xxx
  artifactVersion:   add6ce40
  clustersConfig:    { .... }

STEP                                          PODNAME                            DURATION  MESSAGE
 ● example-pipeline-hbnbf (example-pipeline)
 ├-✔ launch-cluster(0) (create-cluster)       example-pipeline-hbnbf-1450750190  8m
 └-✔ notify-started (notify-started)          example-pipeline-hbnbf-1754353090  9s
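
For context, the failing node's template is presumably declared along these lines. This is only a sketch reconstructed from the status above; the container image, command and retry limit are placeholders:

  - name: create-cluster
    retryStrategy:
      limit: 3                                  # placeholder; any retryStrategy triggers it
    inputs:
      parameters:
      - name: flavor
      - name: role-arn
      - name: cluster-name
      - name: app-bundle-name
    container:
      image: example/cluster-launcher:latest    # placeholder image
      command: [launch-cluster]                 # placeholder command
      args: ["--flavor", "{{inputs.parameters.flavor}}"]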


thundergolfer commented Oct 10, 2019

Further info:

Every workflow using retryStrategy that got scheduled in our dev cluster got stuck in Running, so the issue is consistent. Once stuck, the workflow does not respond to argo terminate.

sarabala1979 (Member) commented

Can you provide the init and wait container logs? You can get them with kubectl logs <pod-name> -c init and kubectl logs <pod-name> -c wait.


thundergolfer commented Oct 10, 2019

For which pod?

In my example above, both pods associated with the succeeded steps (launch-cluster(0) and notify-started) complete normally. I don't think there is a pod associated with the parent of launch-cluster(0), which is the component in the Error state.

sarabala1979 (Member) commented

I am able to reproduce this in my dev environment. I will work on a fix. Thanks for finding it.

@sarabala1979 sarabala1979 self-assigned this Oct 10, 2019
@sarabala1979 sarabala1979 added this to the v2.4 milestone Oct 10, 2019