Fix race condition in scheduler reset #179
Merged
In this PR I fix a race condition when resetting executors that made the flushing tests in faasm segfault sporadically.
The race condition happened between:

- `Scheduler::reset()`, then `Executor::finish` and, for each thread in `threadPoolThreads`: (1) check not null, (2) enqueue shutdown task, (3) join.
- A pool thread timing out, setting `selfShutdown = true` and assigning itself to `null`; all of this before (3).

The consequence was that we tried to join a `nullptr` in (3) and segfaulted.

The proposed solution changes the order in the `Executor::finish` loop to: (1) enqueue shutdown task, (2) check if the thread is `null`, (3) if not, join the thread. This way, when we start killing the thread pool, the thread either: (1) is still blocked dequeuing, so adding the `POOL_SHUTDOWN` task will have the desired effect when it wakes up, or (2) has already timed out, will destroy itself, and will be pointing to `null` when we check for it.

I think the chances that the thread has timed out when we enqueue, is not `null` when we check, and is `null` when we join are very remote, as there's only one instruction between timing out and setting oneself to null, and I haven't been able to re-create it.

Note that the task queues are cleared at the end of `Executor::finish`, so having non-empty queues with `POOL_SHUTDOWN` tasks is not really a problem.

I also include a test that makes the current `master` branch crash in fewer than 20 tries (fewer than 10, in fact, in each of the 20 runs I tried locally), but passes now.
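
For reference, the reordered shutdown loop can be sketched roughly as below. This is a minimal, self-contained illustration and not the actual code in this PR: `TaskQueue`, `Pool`, `finishPool` and `runDemo` are hypothetical names, the value of `POOL_SHUTDOWN` is made up, and a thread that has already timed out is simulated by simply leaving its slot null. The sketch does not reproduce the race itself, and it does not close the remaining narrow window described above.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical marker value for the shutdown task (assumption for this sketch)
constexpr int POOL_SHUTDOWN = -1;

// Minimal blocking queue, a stand-in for the real task queue type
class TaskQueue {
  public:
    void enqueue(int task) {
        {
            std::lock_guard<std::mutex> lock(mx);
            tasks.push(task);
        }
        cv.notify_one();
    }

    // Blocks until a task is available
    int dequeue() {
        std::unique_lock<std::mutex> lock(mx);
        cv.wait(lock, [this] { return !tasks.empty(); });
        int task = tasks.front();
        tasks.pop();
        return task;
    }

    void clear() {
        std::lock_guard<std::mutex> lock(mx);
        while (!tasks.empty()) {
            tasks.pop();
        }
    }

    std::size_t size() {
        std::lock_guard<std::mutex> lock(mx);
        return tasks.size();
    }

  private:
    std::mutex mx;
    std::condition_variable cv;
    std::queue<int> tasks;
};

// Hypothetical pool: one queue per thread slot; a null slot means "thread gone"
struct Pool {
    explicit Pool(std::size_t n) : threads(n), queues(n) {}
    std::vector<std::unique_ptr<std::thread>> threads;
    std::vector<TaskQueue> queues;
};

// Shutdown loop in the new order; returns how many threads were joined
int finishPool(Pool& pool) {
    int joined = 0;
    for (std::size_t i = 0; i < pool.threads.size(); i++) {
        // (1) Enqueue the shutdown task unconditionally
        pool.queues[i].enqueue(POOL_SHUTDOWN);
        // (2) Skip slots whose thread has already gone away
        if (pool.threads[i] == nullptr) {
            continue;
        }
        // (3) Join the threads that are still alive
        pool.threads[i]->join();
        joined++;
    }
    // Leftover POOL_SHUTDOWN tasks (from null slots) are dropped here
    for (auto& q : pool.queues) {
        q.clear();
    }
    return joined;
}

int runDemo() {
    Pool pool(3);
    // Slots 0 and 2 get live workers; slot 1 stays null to simulate a
    // thread that timed out and reset itself before the shutdown started
    for (std::size_t i : { std::size_t(0), std::size_t(2) }) {
        pool.threads[i] = std::make_unique<std::thread>([&pool, i] {
            while (pool.queues[i].dequeue() != POOL_SHUTDOWN) {
            }
        });
    }
    int joined = finishPool(pool);
    assert(pool.queues[1].size() == 0); // stale shutdown task was cleared
    return joined;
}
```

In this sketch `runDemo()` joins the two live workers and skips the null slot, so the stale `POOL_SHUTDOWN` task left in the second queue is harmless because the queues are cleared at the end.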