Fix race condition that could stall scheduling #712
Merged
During scheduling, the controller sends the task assignments to the workers, then waits for the tasks to start up. Each worker engine then constructs its graph and starts the "local nodes", i.e., the ones that it is responsible for running.
Each operator on startup follows these steps:
on_start
run
If any of these steps panic, a TaskFailed message is sent to the controller.
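As a rough illustration of the failure reporting described above, here is a minimal sketch. It is not the engine's actual code: `run_operator`, the message dictionary, and the `controller` queue are hypothetical names, Python exceptions stand in for panics, and only the two step names that appear above (`on_start` and `run`) are modeled.

```python
import queue


def run_operator(task_id, on_start, run, controller):
    """Run an operator's startup steps in order (hypothetical helper).

    If any step raises, a TaskFailed-style message is put on the
    controller queue and False is returned; otherwise True is returned.
    """
    try:
        on_start()  # startup hook
        run()       # main processing step
    except Exception as exc:  # an exception stands in for a panic here
        controller.put(
            {"type": "TaskFailed", "task_id": task_id, "error": str(exc)}
        )
        return False
    return True
```

With this shape the controller hears about a failure only through the message, so a step that blocks forever instead of failing produces no signal at all.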
However, if an operator panicked in step 2 at the wrong time, the pipeline could end up stuck while the controller thought it was healthy in the running state.
Why?
For the problem to occur, all three issues are required.
This PR fixes the first and third issue, and ensures that a pipeline will either get into a true running state or fail and get restarted by the controller:
on_start
, once the operators are actually running; this means that we won't transition to running until the operators are actually running
Fixing the second issue, for example by allowing the barrier to be canceled on panic, is left as a future improvement.
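To make the idea of a cancelable barrier concrete, here is a small sketch of what such a future improvement might look like. It is not part of this PR and not the engine's implementation: it uses Python's `threading.Barrier`, whose `abort()` method breaks the barrier for all waiters, and `start_operators` and the outcome strings are hypothetical names chosen for illustration.

```python
import threading


def start_operators(operators, timeout=5.0):
    """Run every operator's on_start on its own thread, then meet at a barrier.

    `operators` maps a name to an on_start callable (hypothetical shape).
    Returns a dict mapping each name to "running", "aborted", or "failed".
    """
    barrier = threading.Barrier(len(operators))
    outcomes = {}
    lock = threading.Lock()

    def worker(name, on_start):
        try:
            on_start()
            # Wait until every local operator has finished on_start.
            barrier.wait(timeout=timeout)
            result = "running"
        except threading.BrokenBarrierError:
            # Another operator failed (or the wait timed out): stop waiting.
            result = "aborted"
        except Exception:
            # This operator failed: break the barrier so peers do not hang.
            barrier.abort()
            result = "failed"
        with lock:
            outcomes[name] = result

    threads = [
        threading.Thread(target=worker, args=(name, on_start))
        for name, on_start in operators.items()
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outcomes
```

Without the `abort()` call, the healthy operators would sit at the barrier until the timeout (or forever, with no timeout), which is the kind of stall described above. With it, every waiter is released promptly and can report the failure.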