Fix zombie task handling with multiple schedulers #24906

jedcunningham · 2022-07-07T20:25:26Z

Each scheduler was looking at all running tasks for zombies, so multiple
schedulers could end up handling the same zombie. This causes problems with
retries (e.g. a task being marked as FAILED instead of UP_FOR_RETRY) and
callbacks (e.g. on_failure_callback being called multiple times).

When the second scheduler tries to determine whether the task can be retried,
and the task is already in UP_FOR_RETRY (the first scheduler has already finished),
it sees the "next" try_number (as the task is no longer running),
which leads it to mark the task FAILED instead.

The easy fix is to restrict each scheduler to its own TIs, as
orphaned running TIs will be adopted anyway.
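
To make the race concrete, here is a minimal, self-contained sketch of the behavior described above. It is a toy model, not Airflow's code: `ToyTI`, `find_zombies`, `handle_zombie` and `run` are hypothetical names, the `try_number` logic is a simplified assumption based on the description, and `queued_by_job_id` is used here only as a stand-in for "the scheduler that owns this TI".

```python
from dataclasses import dataclass

RUNNING, UP_FOR_RETRY, FAILED = "running", "up_for_retry", "failed"


@dataclass
class ToyTI:
    """Toy stand-in for a task instance (hypothetical, not Airflow's model)."""

    state: str = RUNNING
    raw_try_number: int = 1    # attempt currently (or most recently) executed
    max_tries: int = 1         # assumed: attempts up to this number may retry
    queued_by_job_id: int = 1  # assumed: id of the scheduler that owns the TI
    callbacks_fired: int = 0

    @property
    def try_number(self) -> int:
        # Assumed behavior: a running TI reports its current attempt,
        # any other state reports the "next" attempt.
        if self.state == RUNNING:
            return self.raw_try_number
        return self.raw_try_number + 1

    def handle_zombie(self) -> None:
        # Stand-in for the failure handling: fire the callback, then
        # decide between retrying and failing.
        self.callbacks_fired += 1
        if self.try_number <= self.max_tries:
            self.state = UP_FOR_RETRY
        else:
            self.state = FAILED


def find_zombies(tis, scheduler_job_id, own_only):
    """Return running TIs, optionally restricted to this scheduler's own."""
    return [
        ti
        for ti in tis
        if ti.state == RUNNING
        and (not own_only or ti.queued_by_job_id == scheduler_job_id)
    ]


def run(own_only):
    ti = ToyTI()
    # Both schedulers look for zombies before either one handles them.
    found = {job_id: find_zombies([ti], job_id, own_only) for job_id in (1, 2)}
    for job_id in (1, 2):
        for zombie in found[job_id]:
            zombie.handle_zombie()
    return ti.state, ti.callbacks_fired


print(run(own_only=False))  # both schedulers handle the same zombie
print(run(own_only=True))   # only the owning scheduler handles it
```

In this model, the unrestricted run ends with the task FAILED and the callback fired twice, while the restricted run leaves it UP_FOR_RETRY with a single callback.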

collinmcnulty · 2022-07-07T20:29:08Z

I know we want atomic changes but do you also want to get rid of line 1373 which does nothing?

jedcunningham · 2022-07-07T20:37:07Z

Turns out it does do something: it distinguishes TaskInstance from LocalTaskJob. I just completely overlooked it yesterday, and there is test coverage 😉.

collinmcnulty · 2022-07-07T20:39:05Z

Maybe that deserves a comment then?

jedcunningham · 2022-07-07T20:57:42Z

@collinmcnulty, how's that?

collinmcnulty · 2022-07-07T21:08:55Z

I meant that it might be good to add a comment noting that the line that seemingly does nothing actually helps distinguish between LocalTaskJob and TaskInstance. I know that's a bit off topic from the thrust of this PR, so feel free to ignore it, but it's something we discussed while troubleshooting, so maybe we can save the next person from rediscovering that the line does have a purpose after all.

jedcunningham · 2022-07-07T21:38:06Z

I don't think we need a comment in that section; frankly, I'm not sure it would have helped me. I was just moving too quickly and didn't look closely enough.

dstandish · 2022-07-08T17:23:48Z

sorry @jedcunningham had a pending review that i forgot to finish

jedcunningham · 2022-07-08T17:26:28Z

Haha no worries, it happens. Were there changes you wanted?

Each scheduler was looking at all running tasks for zombies, leading to multiple schedulers handling the zombies. This causes problems with retries (e.g. being marked as FAILED instead of UP_FOR_RETRY) and callbacks (e.g. `on_failure_callback` being called multiple times). When the second scheduler tries to determine if the task is able to be retried, and it's already in UP_FOR_RETRY (the first scheduler already finished), it sees the "next" try_number (as it's no longer running), which then leads it to be FAILED instead. The easy fix is to simply restrict each scheduler to its own TIs, as orphaned running TIs will be adopted anyways. (cherry picked from commit 1c0d0a5)

jedcunningham requested review from kaxil, ashb and XD-DENG as code owners July 7, 2022 20:25

boring-cyborg bot added the area:Scheduler including HA (high availability) scheduler label Jul 7, 2022

ashb added this to the Airflow 2.3.4 milestone Jul 7, 2022

ashb approved these changes Jul 7, 2022

Update comment

f4460c0

jedcunningham added the type:bug-fix Changelog: Bug Fixes label Jul 7, 2022

jedcunningham merged commit 1c0d0a5 into apache:main Jul 8, 2022

jedcunningham deleted the zombie_race branch July 8, 2022 16:49

potiuk mentioned this pull request Aug 5, 2022

Ensure that zombie tasks for dags with errors get cleaned up #25550

Merged

ephraimbuddy mentioned this pull request Aug 20, 2022

Status of testing of Apache Airflow 2.3.4rc1 #25846

Closed

53 tasks

eladkal mentioned this pull request Aug 20, 2022

Each scheduler may process same zombie tasks in HA mode #25843

Closed

2 tasks
