Fix race condition that could stall scheduling #712

Merged 1 commit into master from scheduling_race on Aug 8, 2024
Conversation

mwylde (Member) commented on Aug 8, 2024

During scheduling, the controller sends the task assignments to the workers, then waits for the tasks to start up. Each worker engine then constructs its graph and starts up the "local nodes" (i.e., the ones it is responsible for running).

Each operator on startup follows these steps:

  1. Notify the controller of task startup
  2. Call on_start
  3. Wait for all other operators to have started
  4. Call run

If any of these steps panic, a TaskFailed message is sent to the controller.

However, if an operator panicked in step 2 at the wrong time, the pipeline could end up stuck while the controller believed it was healthy and in the running state.

Why?

  • The controller determined that the pipeline was ready once it had received all of the task start notifications, but those were sent upfront, before on_start
  • A failed operator closes its queues, which should cause a cascade of panics that shuts down the pipeline cleanly. However, if it panics during on_start, no other operator will start running, because they are all still waiting on the barrier (step 3), so they will never try to produce messages to the closed queues
  • The failed task sends a TaskFailed message to the controller, but because this is a "RunningMessage," it was ignored during scheduling

For the problem to occur, all three issues must be present.

This PR fixes the first and third issues, ensuring that a pipeline will either reach a true running state or fail and be restarted by the controller:

  • We now send TaskStarted notifications after on_start, so the controller won't transition the pipeline to running until the operators are actually running
  • The scheduling state now handles TaskFailed notifications and will restart scheduling on receiving one
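The fixed controller-side behavior can be sketched like this. The message and outcome types here are assumptions for illustration, not the controller's real types:

```rust
// Illustrative sketch of the controller waiting out startup after the fix.
enum RunningMessage {
    TaskStarted { task: usize },
    TaskFailed { task: usize },
}

#[derive(Debug, PartialEq)]
enum SchedulingOutcome {
    Running,
    Restart,
}

fn wait_for_startup(
    messages: impl IntoIterator<Item = RunningMessage>,
    total_tasks: usize,
) -> SchedulingOutcome {
    let mut started = 0;
    for msg in messages {
        match msg {
            // TaskStarted is now sent after on_start, so this count only
            // includes operators that are actually running.
            RunningMessage::TaskStarted { .. } => {
                started += 1;
                if started == total_tasks {
                    return SchedulingOutcome::Running;
                }
            }
            // Previously ignored during scheduling; now it aborts the wait
            // so the controller can restart scheduling.
            RunningMessage::TaskFailed { .. } => return SchedulingOutcome::Restart,
        }
    }
    SchedulingOutcome::Restart
}
```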

Fixing the second issue—for example by allowing the barrier to be canceled on panic—is left as a future improvement.

@mwylde mwylde merged commit c072ef8 into master Aug 8, 2024
6 checks passed
@mwylde mwylde deleted the scheduling_race branch August 8, 2024 23:23
mwylde added a commit that referenced this pull request Aug 8, 2024