Fix zombie task handling with multiple schedulers #24906
Each scheduler was looking at all running tasks for zombies, so multiple schedulers could end up handling the same zombie. This causes problems with retries (e.g. a task being marked FAILED instead of UP_FOR_RETRY) and callbacks (e.g. `on_failure_callback` being called multiple times). When the second scheduler tries to determine whether the task can be retried and the task is already in UP_FOR_RETRY (the first scheduler has already finished), it sees the "next" try_number (as the task is no longer running), which leads to the task being marked FAILED instead. The easy fix is to restrict each scheduler to its own TIs, as orphaned running TIs will be adopted anyway.
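To make the race concrete, here is a minimal toy sketch of the behaviour described above. It does not use Airflow's real models or queries: `ToyTI`, `zombies_for`, `simulate`, and the `queued_by_job_id` field name are hypothetical stand-ins, and the `try_number` rule (current attempt while running, next attempt otherwise) is assumed from the description. Heartbeat checks are left out for brevity.

```python
from dataclasses import dataclass
from typing import List, Optional

RUNNING, UP_FOR_RETRY, FAILED = "running", "up_for_retry", "failed"


@dataclass
class ToyTI:
    """Hypothetical stand-in for a task instance; not Airflow's model."""

    task_id: str
    queued_by_job_id: int  # assumed field: scheduler job that queued this TI
    state: str = RUNNING
    attempts: int = 1  # attempts started so far
    max_tries: int = 1  # highest attempt number that may still be retried
    failure_callbacks: int = 0

    @property
    def try_number(self) -> int:
        # Assumed rule: report the current attempt while running,
        # otherwise report the number of the *next* attempt.
        return self.attempts if self.state == RUNNING else self.attempts + 1

    def handle_zombie(self) -> None:
        # Each handler fires the failure callback and re-evaluates retry
        # eligibility from the (state-dependent) try_number.
        self.failure_callbacks += 1
        self.state = UP_FOR_RETRY if self.try_number <= self.max_tries else FAILED


def zombies_for(tis: List[ToyTI], scheduler_job_id: Optional[int] = None) -> List[ToyTI]:
    """Running TIs to treat as zombies, optionally limited to one scheduler."""
    return [
        ti
        for ti in tis
        if ti.state == RUNNING
        and (scheduler_job_id is None or ti.queued_by_job_id == scheduler_job_id)
    ]


def simulate(restrict: bool) -> ToyTI:
    ti = ToyTI(task_id="t1", queued_by_job_id=1)
    scheduler_ids = [1, 2]
    # Both schedulers select their zombies before either has handled any.
    picked = [zombies_for([ti], sid if restrict else None) for sid in scheduler_ids]
    for zombies in picked:
        for zombie in zombies:
            zombie.handle_zombie()
    return ti


if __name__ == "__main__":
    for restrict in (False, True):
        ti = simulate(restrict)
        print(f"restrict={restrict}: state={ti.state}, callbacks={ti.failure_callbacks}")
```

Without the restriction, the second handler sees the task in UP_FOR_RETRY, reads the "next" try_number, and marks it FAILED after firing the callback a second time. With the restriction, only the scheduler that owns the TI handles it, so it ends in UP_FOR_RETRY with a single callback.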
I know we want atomic changes but do you also want to get rid of line 1373 which does nothing?
Turns out it does do something,
Maybe that deserves a comment then?
@collinmcnulty, how's that?
I meant that it might be good to comment that the line that seemingly does nothing actually helps distinguish between localtaskjob and taskinstance. I know that's a bit off topic from the thrust of this PR so feel free to ignore, but it's something we discussed in troubleshooting so maybe we can save the next person from re-discovering that the line does have a purpose after all.
I don't think we need a comment in that section, frankly I'm not sure it would have helped me. I was just moving too quickly and didn't look closely enough.
sorry @jedcunningham had a pending review that i forgot to finish
Haha no worries, it happens. Were there changes you wanted?
Each scheduler was looking at all running tasks for zombies, leading to multiple schedulers handling the zombies. This causes problems with retries (e.g. being marked as FAILED instead of UP_FOR_RETRY) and callbacks (e.g. `on_failure_callback` being called multiple times). When the second scheduler tries to determine if the task is able to be retried, and it's already in UP_FOR_RETRY (the first scheduler already finished), it sees the "next" try_number (as it's no longer running), which then leads it to be FAILED instead. The easy fix is to simply restrict each scheduler to its own TIs, as orphaned running TIs will be adopted anyways. (cherry picked from commit 1c0d0a5)