Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix zombie task handling with multiple schedulers #24906

Merged
merged 2 commits into from
Jul 8, 2022

Conversation

jedcunningham
Copy link
Member

Each scheduler was looking at all running tasks for zombies, leading to
multiple schedulers handling the zombies. This causes problems with
retries (e.g. being marked as FAILED instead of UP_FOR_RETRY) and
callbacks (e.g. on_failure_callback being called multiple times).

When the second scheduler tries to determine if the task is able to be retried,
and it's already in UP_FOR_RETRY (the first scheduler already finished),
it sees the "next" try_number (as it's no longer running),
which then leads it to be FAILED instead.

The easy fix is to simply restrict each scheduler to its own TIs, as
orphaned running TIs will be adopted anyways.

Each scheduler was looking at all running tasks for zombies, leading to
multiple schedulers handling the zombies. This causes problems with
retries (e.g. being marked as FAILED instead of UP_FOR_RETRY) and
callbacks (e.g. `on_failure_callback` being called multiple times).

When the second scheduler tries to determine if the task is able to be retried,
and it's already in UP_FOR_RETRY (the first scheduler already finished),
it sees the "next" try_number (as it's no longer running),
which then leads it to be FAILED instead.

The easy fix is to simply restrict each scheduler to its own TIs, as
orphaned running TIs will be adopted anyways.
@boring-cyborg boring-cyborg bot added the area:Scheduler including HA (high availability) scheduler label Jul 7, 2022
@collinmcnulty
Copy link
Contributor

I know we want atomic changes but do you also want to get rid of line 1373 which does nothing?

@jedcunningham
Copy link
Member Author

Turns out it does do something, TaskInstance vs LocalTaskJob. I just completely overlooked it yesterday, and there is test coverage 😉.

@collinmcnulty
Copy link
Contributor

Maybe that deserves a comment then?

@ashb ashb added this to the Airflow 2.3.4 milestone Jul 7, 2022
@jedcunningham
Copy link
Member Author

@collinmcnulty, how's that?

@collinmcnulty
Copy link
Contributor

I meant that it might be good to comment that the line that seemingly does nothing actually helps distinguish between localtaskjob and taskinstance. I know that's a bit off topic from the thrust of this PR so feel free to ignore, but it's something we discussed in troubleshooting so maybe we can save the next person from re-discovering that the line does have a purpose after all.

@jedcunningham jedcunningham added the type:bug-fix Changelog: Bug Fixes label Jul 7, 2022
@jedcunningham
Copy link
Member Author

I don't think we need a comment in that section, frankly I'm not sure it would have helped me. I was just moving too quickly and didn't look closely enough.

@jedcunningham jedcunningham merged commit 1c0d0a5 into apache:main Jul 8, 2022
@jedcunningham jedcunningham deleted the zombie_race branch July 8, 2022 16:49
@dstandish
Copy link
Contributor

sorry @jedcunningham had a pending review that i forgot to finish

@jedcunningham
Copy link
Member Author

Haha no worries, it happens. Were there changes you wanted?

ephraimbuddy pushed a commit to astronomer/airflow that referenced this pull request Jul 11, 2022
Each scheduler was looking at all running tasks for zombies, leading to
multiple schedulers handling the zombies. This causes problems with
retries (e.g. being marked as FAILED instead of UP_FOR_RETRY) and
callbacks (e.g. `on_failure_callback` being called multiple times).

When the second scheduler tries to determine if the task is able to be retried,
and it's already in UP_FOR_RETRY (the first scheduler already finished),
it sees the "next" try_number (as it's no longer running),
which then leads it to be FAILED instead.

The easy fix is to simply restrict each scheduler to its own TIs, as
orphaned running TIs will be adopted anyways.

(cherry picked from commit 1c0d0a5)
ephraimbuddy pushed a commit that referenced this pull request Aug 15, 2022
Each scheduler was looking at all running tasks for zombies, leading to
multiple schedulers handling the zombies. This causes problems with
retries (e.g. being marked as FAILED instead of UP_FOR_RETRY) and
callbacks (e.g. `on_failure_callback` being called multiple times).

When the second scheduler tries to determine if the task is able to be retried,
and it's already in UP_FOR_RETRY (the first scheduler already finished),
it sees the "next" try_number (as it's no longer running),
which then leads it to be FAILED instead.

The easy fix is to simply restrict each scheduler to its own TIs, as
orphaned running TIs will be adopted anyways.

(cherry picked from commit 1c0d0a5)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:Scheduler including HA (high availability) scheduler type:bug-fix Changelog: Bug Fixes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants