This repository has been archived by the owner on Jun 20, 2024. It is now read-only.

There may be a deadlock in the task scheduler that freezes or slows pipeline execution #401

Open
clintval opened this issue Mar 1, 2022 · 0 comments

clintval commented Mar 1, 2022

We have an older Dagr pipeline that has been run many times (though it has since been updated to use 040d12e).

In very rare, non-reproducible cases, we appear to hit a deadlock that causes the pipeline to halt or slow to a glacial pace.

Conditions that may relate to the issue, or could simply be coincidences:

  • Some tasks were rescheduled on a retry after failing, and eventually succeeded
  • Some tasks have been started but others are unknown to the task manager
  • In one case with no time limit, a job that was estimated to take a few hours ran for days before we terminated it

Final logs (before prematurely cancelling the job) look like:

TaskManager | Warning] ********************************************************************************
TaskManager | Warning] A single step in execution was > 30s (31s).
TaskManager | Warning] Found 14 tasks with status: is unknown
TaskManager | Warning] Found 6 tasks with status: has been started
TaskManager | Warning] Found 49 tasks with status: has succeeded
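
For anyone watching for the same symptom, here is a minimal sketch of tallying the `Found N tasks with status: ...` warnings shown above, so that successive reports can be compared to see whether the counts have stopped moving. The names `STATUS_LINE` and `tally_statuses` are hypothetical helpers, not part of Dagr, and the pattern assumes only the line shape visible in this log excerpt.

```python
import re
from collections import Counter

# Matches the tail of a warning such as:
#   "... Found 14 tasks with status: is unknown"
# The line shape is taken from the excerpt above; other formats are not handled.
STATUS_LINE = re.compile(r"Found (\d+) tasks? with status: (.+?)\s*$")


def tally_statuses(log_lines):
    """Return a Counter mapping each status text to its reported task count.

    If a status is reported more than once, the last value seen wins.
    """
    counts = Counter()
    for line in log_lines:
        match = STATUS_LINE.search(line)
        if match:
            counts[match.group(2)] = int(match.group(1))
    return counts
```

If the "is unknown" and "has been started" counts are identical across many consecutive reports, that is at least consistent with the stall described here, though it does not prove a deadlock.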

Because this is rare, and we can enforce TTL policies on runs of this pipeline, it's not critical that we fix any underlying issue.
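
As a rough illustration of the TTL workaround, a wrapper along these lines can cap the wall-clock time of a pipeline run. This is only a sketch under assumptions: `run_with_ttl` is a hypothetical helper, the command to launch is whatever normally starts the pipeline, and it is not how our own TTL policy is actually implemented.

```python
import subprocess


def run_with_ttl(command, ttl_seconds):
    """Run `command` and give up if it exceeds `ttl_seconds` of wall-clock time.

    Returns the process exit code, or None if the TTL expired.
    """
    try:
        completed = subprocess.run(command, timeout=ttl_seconds)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child and waits for it before re-raising,
        # so the process is already gone when we get here.
        return None
    return completed.returncode
```

Note that `subprocess.run` only kills the direct child on timeout. If the pipeline spawns its own worker processes, those may need separate cleanup (for example via a process group or the cluster scheduler's own time limit).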

Simply posting the issue in case anyone else hits something similar, and wants to feel less alone!
