Pod failed with error: Pod was active on the node longer than the specified deadline -remain in status Running #9934

Closed
2 of 3 tasks
shiraOvadia opened this issue Oct 31, 2022 · 17 comments

Comments

@shiraOvadia
Contributor

shiraOvadia commented Oct 31, 2022

Pre-requisites

  • I have double-checked my configuration
  • I can confirm the issue exists when I tested with :latest
  • I'd like to contribute the fix myself (see contributing guide)

What happened/what you expected to happen?

I set a timeout of 10 seconds in a template:
activeDeadlineSeconds: 10
After 10 seconds the pod fails with the error: "Pod was active on the node longer than the specified deadline".
A few minutes later the pod is deleted in Kubernetes, but the workflow step remains in status Pending or Running.
The flow is blocked and does not continue to the next template.
I expected the pod to end up in a Failed status.

In the previous version the same workflow worked fine.

Version

V3.4.2

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: hello-world-  # Name of this Workflow
spec:
  entrypoint: engine
  podGC:
    strategy: OnPodSuccess
  templates:
  - name: whalesay            # Defining the "whalesay" template
    activeDeadlineSeconds: 10
    container:
      image: docker/whalesay
      command: ['sh','-c']
      args: ["cowsay hello world && sleep 600"]   # This template runs "cowsay" in the "whalesay" image with arguments "hello world"
      resources:
        requests:
          memory: "3Gi"
          cpu: "2000m"
        limits:
          memory: "3Gi"
          cpu: "2000m"
  - name: engine
    parallelism: 7000
    steps:
      - - name: whalesay
          template: whalesay
          withSequence:
            count:  1

Logs from the workflow controller

kubectl logs -n argo deploy/workflow-controller | grep ${workflow}
workflow.log

Logs from your workflow's wait container

kubectl logs -n argo -c wait -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded
wait.log

@shiraOvadia shiraOvadia added type/bug type/regression Regression from previous behavior (a specific type of bug) labels Oct 31, 2022
@shiraOvadia shiraOvadia changed the title Podim failed with error: Pod was active on the node longer than the specified deadline - still in status Running Pod failed with error: Pod was active on the node longer than the specified deadline -remain in status Running Oct 31, 2022
@shiraOvadia
Contributor Author

This is a fatal error. It prevents us from upgrading argo version to latest.
Can someone please help here?

@sarabala1979
Member

@shiraOvadia it looks like the k8s API is taking time to update pods when activeDeadlineSeconds is reached. You can try timeout instead of activeDeadlineSeconds.

// Timeout allows to set the total node execution timeout duration counting from the node's start time.
// This duration also includes time in which the node spends in Pending state. This duration may not be applied to Step or DAG templates.

https://github.com/argoproj/argo-workflows/blob/ed351ff084c4524ff4b2a45b53e539f91f5d423a/sdks/python/client/docs/IoArgoprojWorkflowV1alpha1Template.md
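
For reference, a minimal sketch (not from this issue) of that workaround applied to the whalesay template above. The duration-string format follows the examples later in this thread, and the 10s value simply mirrors the original activeDeadlineSeconds; both are illustrative assumptions:

  - name: whalesay
    # timeout counts from the node's start time and also includes time
    # spent in Pending, per the doc comment quoted above
    timeout: 10s
    container:
      image: docker/whalesay
      command: ['sh', '-c']
      args: ["cowsay hello world && sleep 600"]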

@sarabala1979 sarabala1979 added the P3 Low priority label Nov 8, 2022
@shiraOvadia
Contributor Author

I also tried timeout, and the behavior was the same as activeDeadlineSeconds.
The pods stay in status Running and never change to Failed/Error.

@HRusby

HRusby commented Nov 14, 2022

I'm also experiencing this issue.

Example Workflow
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: active-deadline-test
spec:
  entrypoint: active-deadline-test
  templates:
  - name: active-deadline-test
    parallelism: 10
    steps:
    - - name: active-deadline-test-timeout
        inline:
          activeDeadlineSeconds: '5'
          script:
              image: alpine:{{.Chart.AppVersion}}
              command: [bin/bash]
              source: |
                sleep 100s

My suspicion is that the deadlineExceeded node isn't having its phase updated correctly here: https://github.com/argoproj/argo-workflows/blob/master/workflow/controller/steps.go#L249-L258
I think ErrDeadlineExceeded should get the same (or at least similar) handling as ErrTimeout; see the equivalent section of dag.go.

Using timeout instead of activeDeadlineSeconds did however work

Using timeout instead
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: active-deadline-test
spec:
  entrypoint: active-deadline-test
  templates:
  - name: active-deadline-test
    parallelism: 10
    dag:
      tasks:
        - name: test-timeout-set
          template: test-timeout
          arguments:
            parameters:
              - name: timeout
                value: '5s'
        - name: test-timeout-unset
          template: test-timeout
        - name: test-timeout-set-empty
          template: test-timeout
          arguments:
            parameters:
              - name: timeout
                value: ''
        - name: test-timeout-set-zero
          template: test-timeout
          arguments:
            parameters:
              - name: timeout
                value: '0s'

  - name: test-timeout
    inputs:
      parameters:
        - name: timeout
          default: ''
    timeout: '{{`{{inputs.parameters.timeout}}`}}'
    script:
        image: alpine:{{.Chart.AppVersion}}
        command: [bin/bash]
        source: |
          sleep 100s

--- EDIT ---
Update to this: it seems a longer timeout ends up with the same behaviour as activeDeadlineSeconds, so it remains in Running and doesn't exit.

@Thearas
Contributor

Thearas commented Dec 30, 2022

@sarabala1979 Hi, I got the same issue with timeout.
The pod status is DeadlineExceeded but the workflow step phase is still Running.

@stale

This comment was marked as resolved.

@stale stale bot added the problem/stale This has not had a response in some time label Jan 21, 2023
@tahiraha

tahiraha commented Jan 24, 2023

I ran into the same issue with timeout and activeDeadlineSeconds; it looks like it's still happening. There's a similar issue of pods hanging when OOMKilled, already logged here: #10063

@umialpha

This comment was marked as spam.

@stale stale bot removed the problem/stale This has not had a response in some time label Mar 3, 2023
@stale

This comment was marked as resolved.

@stale stale bot added the problem/stale This has not had a response in some time label Mar 25, 2023
@HRusby

HRusby commented Mar 30, 2023

@sarabala1979 We're still experiencing this issue.

@stale stale bot removed the problem/stale This has not had a response in some time label Mar 30, 2023
@JPZ13 JPZ13 added P2 Important. All bugs with >=3 thumbs up that aren’t P0 or P1, plus: Any other bugs deemed important and removed P3 Low priority labels Mar 30, 2023
@ramandeepsharma

ramandeepsharma commented Apr 17, 2023

We are also facing the same issue with v3.4.7. Is there any ETA to fix this issue?

@kalpanathanneeru21

Is anyone working on this issue? The latest version has the fixes for all the vulnerabilities, but because of this workflow failure issue we are not able to upgrade to latest.

@sakshimalani

This comment was marked as spam.

@juliev0 juliev0 added P1 High priority. All bugs with >=5 thumbs up that aren’t P0, plus: Any other bugs deemed high priority and removed P2 Important. All bugs with >=3 thumbs up that aren’t P0 or P1, plus: Any other bugs deemed important labels Jun 8, 2023
@stale

This comment was marked as resolved.

@stale stale bot added the problem/stale This has not had a response in some time label Sep 17, 2023
@terrytangyuan terrytangyuan removed the problem/stale This has not had a response in some time label Sep 20, 2023
@cdemarco-drw

This comment was marked as spam.

@shuangkun shuangkun self-assigned this Mar 29, 2024
@shuangkun
Member

I tested it and think it has been solved by #12761

@agilgur5 agilgur5 added the area/controller Controller issues, panics label Apr 19, 2024
@agilgur5 agilgur5 added this to the v3.5.x patches milestone Apr 19, 2024
@agilgur5

Effectively superseded by #12329. We should probably backport #12761 to release-3.4 then

@agilgur5 agilgur5 added the solution/superseded This PR or issue has been superseded by another one (slightly different from a duplicate) label Apr 19, 2024