Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix race condition in scheduler reset #179

Merged
merged 2 commits into from
Nov 24, 2021
Merged

Fix race condition in scheduler reset #179

merged 2 commits into from
Nov 24, 2021

Conversation

csegarragonz
Copy link
Collaborator

@csegarragonz csegarragonz commented Nov 23, 2021

In this PR I fix a race condition when resetting executors that made the flushing tests in faasm segfault sporadically.

The race condition happened between:

  • The main thread calling Scheduler::reset() then Executor::finish and, for each thread in threadPoolThreads: (1) check not null, (2) enqueue shutdown task, (3) join.
  • The thread pool thread that: (i) times out dequeueing between (1) and (2), (ii) breaks out the main executor loop setting selfShutdown = true and assigning itself to null; all of this before (3).

The consequence was that we tried to join a nullptr in (3) and segfault.

The proposed solution changes the order in the Executor::finish loop to: (1) enqueue shutdown task, (2) check if thread is null, (3) if not join the thread. This way, when we start killing the thread pool, the thread either: (1) is still blocked dequeuing, thus adding the POOL_SHUTDOWN task will have the desired effect when it wakes up, or (2) has already timed-out, and will self-destroy itself, and will be pointing to null when we check for it.

I think the chances that the thread has timed out when we enqueue but it is not null when we check and is null when we join are very remote as there's only one instruction between timing out and setting oneself to null, and I haven't been able to re-create it.

Note that the tasks queues are cleared at the end of Executor::finish so it is not really a problem having non-empty queues with POOL_SHUTDOWN tasks.

I also include a test that makes the current master branch crash with less than 20 tries (less than 10 in fact the 20 times I tried locally), but passes now.

@csegarragonz csegarragonz self-assigned this Nov 23, 2021
@csegarragonz csegarragonz added the bug Something isn't working label Nov 23, 2021
@csegarragonz csegarragonz requested review from Shillaker and removed request for Shillaker November 23, 2021 16:48

// We sleep for the same timeout threads have, to force a race condition
// between the scheduler's flush and the thread's own cleanup timeout
SLEEP_MS(conf.boundTimeout);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could end up being quite a long sleep depending on the default config, and it's multiplied by 20, so it would be good to set this to something very short at the start of this test (e.g. 100ms).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, thanks

tests/test/scheduler/test_executor.cpp Outdated Show resolved Hide resolved
tests/test/scheduler/test_executor.cpp Outdated Show resolved Hide resolved
@csegarragonz csegarragonz merged commit 4930c61 into master Nov 24, 2021
@csegarragonz csegarragonz deleted the fix-race branch November 24, 2021 14:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants