Pod failed with error: Pod was active on the node longer than the specified deadline -remain in status Running #9934

Closed
2 of 3 tasks
shiraOvadia opened this issue Oct 31, 2022 · 17 comments

Comments

@shiraOvadia
Contributor

shiraOvadia commented Oct 31, 2022

Pre-requisites

  • I have double-checked my configuration
  • I can confirm the issue exists when I tested with :latest
  • I'd like to contribute the fix myself (see contributing guide)

What happened/what you expected to happen?

I set a timeout of 10 seconds in a template:
activeDeadlineSeconds: 10
After 10 seconds the pod fails with the error: "Pod was active on the node longer than the specified deadline".
A few minutes later the pod is deleted in Kubernetes, but the workflow step remains in status Pending or Running.
The flow is blocked and does not continue to the next template.
I expected the pod to end up in a Failed status.

In the previous version the same workflow worked fine.

Version

V3.4.2

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: hello-world-  # Name of this Workflow
spec:
  entrypoint: engine
  podGC:
    strategy: OnPodSuccess
  templates:
  - name: whalesay            # Defining the "whalesay" template
    activeDeadlineSeconds: 10
    container:
      image: docker/whalesay
      command: ['sh','-c']
      args: ["cowsay hello world && sleep 600"]   # This template runs "cowsay" in the "whalesay" image with arguments "hello world"
      resources:
        requests:
          memory: "3Gi"
          cpu: "2000m"
        limits:
          memory: "3Gi"
          cpu: "2000m"
  - name: engine
    parallelism: 7000
    steps:
      - - name: whalesay
          template: whalesay
          withSequence:
            count:  1

Logs from the workflow controller

kubectl logs -n argo deploy/workflow-controller | grep ${workflow}
workflow.log

Logs from your workflow's wait container

kubectl logs -n argo -c wait -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded
wait.log

@shiraOvadia shiraOvadia added type/bug type/regression Regression from previous behavior (a specific type of bug) labels Oct 31, 2022
@shiraOvadia shiraOvadia changed the title Podim failed with error: Pod was active on the node longer than the specified deadline - still in status Running Pod failed with error: Pod was active on the node longer than the specified deadline -remain in status Running Oct 31, 2022
@shiraOvadia
Contributor Author

This is a fatal error. It prevents us from upgrading argo version to latest.
Can someone please help here?

@sarabala1979
Member

@shiraOvadia it looks like the k8s API is taking time to update pods when activeDeadlineSeconds is reached. You can try timeout instead of activeDeadlineSeconds.

// Timeout allows to set the total node execution timeout duration counting from the node's start time.
// This duration also includes time in which the node spends in Pending state. This duration may not be applied to Step or DAG templates.

https://github.com/argoproj/argo-workflows/blob/ed351ff084c4524ff4b2a45b53e539f91f5d423a/sdks/python/client/docs/IoArgoprojWorkflowV1alpha1Template.md
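
For reference, a minimal sketch (not from this issue) of that workaround applied to the whalesay template above. The duration-string format follows the examples later in this thread, and the 10s value simply mirrors the original activeDeadlineSeconds; both are illustrative assumptions:

  - name: whalesay
    # timeout counts from the node's start time and also includes time
    # spent in Pending, per the doc comment quoted above
    timeout: 10s
    container:
      image: docker/whalesay
      command: ['sh', '-c']
      args: ["cowsay hello world && sleep 600"]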

@sarabala1979 sarabala1979 added the P3 Low priority label Nov 8, 2022
@shiraOvadia
Contributor Author

I also tried timeout, and the behavior was the same as activeDeadlineSeconds.
The pods stay in status Running and never change to Failed/Error.

@HRusby

HRusby commented Nov 14, 2022

I'm also experiencing this issue.

Example Workflow
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: active-deadline-test
spec:
  entrypoint: active-deadline-test
  templates:
  - name: active-deadline-test
    parallelism: 10
    steps:
    - - name: active-deadline-test-timeout
        inline:
          activeDeadlineSeconds: '5'
          script:
              image: alpine:{{.Chart.AppVersion}}
              command: [bin/bash]
              source: |
                sleep 100s

My suspicion is that the deadlineExceeded node isn't having its phase updated correctly here: https://github.com/argoproj/argo-workflows/blob/master/workflow/controller/steps.go#L249-L258
I think ErrDeadlineExceeded should get the same (or at least similar) handling as ErrTimeout; see the equivalent section of dag.go.

Using timeout instead of activeDeadlineSeconds did however work

Using timeout instead
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: active-deadline-test
spec:
  entrypoint: active-deadline-test
  templates:
  - name: active-deadline-test
    parallelism: 10
    dag:
      tasks:
        - name: test-timeout-set
          template: test-timeout
          arguments:
            parameters:
              - name: timeout
                value: '5s'
        - name: test-timeout-unset
          template: test-timeout
        - name: test-timeout-set-empty
          template: test-timeout
          arguments:
            parameters:
              - name: timeout
                value: ''
        - name: test-timeout-set-zero
          template: test-timeout
          arguments:
            parameters:
              - name: timeout
                value: '0s'

  - name: test-timeout
    inputs:
      parameters:
        - name: timeout
          default: ''
    timeout: '{{`{{inputs.parameters.timeout}}`}}'
    script:
        image: alpine:{{.Chart.AppVersion}}
        command: [bin/bash]
        source: |
          sleep 100s

--- EDIT ---
Update to this: it seems a longer timeout ends up with the same behaviour as activeDeadlineSeconds, so it remains in Running and doesn't exit.

@Thearas
Contributor

Thearas commented Dec 30, 2022

@sarabala1979 Hi, I got the same issue with timeout.
The pod status is DeadlineExceeded but the workflow step phase is still Running.

@stale

This comment was marked as resolved.

@stale stale bot added the problem/stale This has not had a response in some time label Jan 21, 2023
@tahiraha

tahiraha commented Jan 24, 2023

I ran into the same issue with timeout and activeDeadlineSeconds; it looks like it's still happening. There's a similar issue of pods hanging when OOMKilled, already logged here: #10063

@umialpha

This comment was marked as spam.

@stale stale bot removed the problem/stale This has not had a response in some time label Mar 3, 2023
@stale

This comment was marked as resolved.

@stale stale bot added the problem/stale This has not had a response in some time label Mar 25, 2023
@HRusby

HRusby commented Mar 30, 2023

@sarabala1979 We're still experiencing this issue.

@stale stale bot removed the problem/stale This has not had a response in some time label Mar 30, 2023
@JPZ13 JPZ13 added P2 Important. All bugs with >=3 thumbs up that aren’t P0 or P1, plus: Any other bugs deemed important and removed P3 Low priority labels Mar 30, 2023
@ramandeepsharma

ramandeepsharma commented Apr 17, 2023

We are also facing the same issue with v3.4.7. Is there any ETA to fix this issue?

@kalpanathanneeru21

Is anyone working on this issue? The latest version has the fixes for all the vulnerabilities, but because of this workflow failure issue we are not able to upgrade to latest.

@sakshimalani

This comment was marked as spam.

@juliev0 juliev0 added P1 High priority. All bugs with >=5 thumbs up that aren’t P0, plus: Any other bugs deemed high priority and removed P2 Important. All bugs with >=3 thumbs up that aren’t P0 or P1, plus: Any other bugs deemed important labels Jun 8, 2023
@stale

This comment was marked as resolved.

@stale stale bot added the problem/stale This has not had a response in some time label Sep 17, 2023
@terrytangyuan terrytangyuan removed the problem/stale This has not had a response in some time label Sep 20, 2023
@cdemarco-drw

This comment was marked as spam.

@shuangkun shuangkun self-assigned this Mar 29, 2024
@shuangkun
Member

I tested it and think it has been solved by #12761

@agilgur5 agilgur5 added the area/controller Controller issues, panics label Apr 19, 2024
@agilgur5 agilgur5 added this to the v3.5.x patches milestone Apr 19, 2024
@agilgur5

Effectively superseded by #12329. We should probably backport #12761 to release-3.4 then

@agilgur5 agilgur5 added the solution/superseded This PR or issue has been superseded by another one (slightly different from a duplicate) label Apr 19, 2024