Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use notify_one() in CallbackNotifier #140

Merged

Conversation

pentschev
Copy link
Member

It is unclear why but for some reason notify_all() is causing futexes never to return in some situations. This occurs very frequently in CI and is also less frequently reproducible locally.

The typical stack trace for the blocked thread is shown below:

Thread 6 (Thread 0x7f13ec84f700 (LWP 2823667) "pytest"):
#0  futex_wait (private=<optimized out>, expected=32765, futex_word=0x7ffd5186a874) at ../sysdeps/nptl/futex-internal.h:141
#1  futex_wait_simple (private=<optimized out>, expected=32765, futex_word=0x7ffd5186a874) at ../sysdeps/nptl/futex-internal.h:172
#2  __condvar_quiesce_and_switch_g1 (private=<optimized out>, g1index=<synthetic pointer>, wseq=<optimized out>, cond=0x7ffd5186a860) at pthread_cond_common.c:416
#3  __pthread_cond_broadcast (cond=0x7ffd5186a860) at pthread_cond_broadcast.c:73
#4  0x00007f140fe5f23c in ucxx::BaseDelayedSubmissionCollection<std::function<void ()> >::process() (this=0x560d0effafd0) at /repo/cpp/include/ucxx/delayed_submission.h:154
#5  0x00007f140fe5f399 in ucxx::DelayedSubmissionCollection::processPost (this=<optimized out>) at /repo/cpp/src/delayed_submission.cpp:84
#6  0x00007f140fe7ed71 in ucxx::WorkerProgressThread::progressUntilSync(std::function<bool ()>, bool const&, std::function<void (void*)>, void*, std::shared_ptr<ucxx::DelayedSubmissionCollection>) (progressFunction=..., stop=@0x560d0f6527f8: false, startCallback=..., startCallbackArg=<optimized out>, delayedSubmissionCollection=...) at /opt/conda/envs/test/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/shared_ptr_base.h:1295
#7  0x00007f140fe7f3ee in std::__invoke_impl<void, void (*)(std::function<bool ()>, bool const&, std::function<void (void*)>, void*, std::shared_ptr<ucxx::DelayedSubmissionCollection>), std::function<bool ()>, std::reference_wrapper<bool>, std::function<void (void*)>, void*, std::shared_ptr<ucxx::DelayedSubmissionCollection> >(std::__invoke_other, void (*&&)(std::function<bool ()>, bool const&, std::function<void (void*)>, void*, std::shared_ptr<ucxx::DelayedSubmissionCollection>), std::function<bool ()>&&, std::reference_wrapper<bool>&&, std::function<void (void*)>&&, void*&&, std::shared_ptr<ucxx::DelayedSubmissionCollection>&&) (__f=<optimized out>, __f=<optimized out>) at /opt/conda/envs/test/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/invoke.h:61
#8  std::__invoke<void (*)(std::function<bool ()>, bool const&, std::function<void (void*)>, void*, std::shared_ptr<ucxx::DelayedSubmissionCollection>), std::function<bool ()>, std::reference_wrapper<bool>, std::function<void (void*)>, void*, std::shared_ptr<ucxx::DelayedSubmissionCollection> >(void (*&&)(std::function<bool ()>, bool const&, std::function<void (void*)>, void*, std::shared_ptr<ucxx::DelayedSubmissionCollection>), std::function<bool ()>&&, std::reference_wrapper<bool>&&, std::function<void (void*)>&&, void*&&, std::shared_ptr<ucxx::DelayedSubmissionCollection>&&) (__fn=<optimized out>) at /opt/conda/envs/test/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/invoke.h:96
#9  std::thread::_Invoker<std::tuple<void (*)(std::function<bool ()>, bool const&, std::function<void (void*)>, void*, std::shared_ptr<ucxx::DelayedSubmissionCollection>), std::function<bool ()>, std::reference_wrapper<bool>, std::function<void (void*)>, void*, std::shared_ptr<ucxx::DelayedSubmissionCollection> > >::_M_invoke<0ul, 1ul, 2ul, 3ul, 4ul, 5ul>(std::_Index_tuple<0ul, 1ul, 2ul, 3ul, 4ul, 5ul>) (this=<optimized out>) at /opt/conda/envs/test/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/std_thread.h:259
#10 std::thread::_Invoker<std::tuple<void (*)(std::function<bool ()>, bool const&, std::function<void (void*)>, void*, std::shared_ptr<ucxx::DelayedSubmissionCollection>), std::function<bool ()>, std::reference_wrapper<bool>, std::function<void (void*)>, void*, std::shared_ptr<ucxx::DelayedSubmissionCollection> > >::operator()() (this=<optimized out>) at /opt/conda/envs/test/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/std_thread.h:266
#11 std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (*)(std::function<bool ()>, bool const&, std::function<void (void*)>, void*, std::shared_ptr<ucxx::DelayedSubmissionCollection>), std::function<bool ()>, std::reference_wrapper<bool>, std::function<void (void*)>, void*, std::shared_ptr<ucxx::DelayedSubmissionCollection> > > >::_M_run() (this=<optimized out>) at /opt/conda/envs/test/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/std_thread.h:211
#12 0x00007f140f92fe95 in std::execute_native_thread_routine (__p=<optimized out>) at ../../../../../libstdc++-v3/src/c++11/thread.cc:104
#13 0x00007f1412647609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#14 0x00007f1412412133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

It is unclear why but for some reason `notify_all()` is causing futexes
never to return in some situations. This occurs very frequently in CI
and is also less frequently reproducible locally.
@pentschev pentschev requested a review from a team as a code owner December 4, 2023 21:18
@pentschev
Copy link
Member Author

I need to confirm that this is the only change that really (seems to have) fixed the bug, I have some other local changes I've been testing against for 8 hours and it didn't deadlock once, where before some of the runs would start deadlocking in <1 hour. This has been the latest change that actually resolved the problem, I just want to confirm the other changes have no part in it.

@pentschev
Copy link
Member Author

@wence- can you think of any reason notify_all() would be causing issues here? I don't think the way we currently use CallbackNotifier should make any difference on utilizing notify_one vs notify_all but I maybe ignoring some detail.

@wence-
Copy link
Contributor

wence- commented Dec 5, 2023

I think switching to notify_one is fine. I think originally we had that, and then I suggested moving to notify_all in case we ever used it in single producer, multiple consumer mode. But we're not doing that.

@pentschev
Copy link
Member Author

I think switching to notify_one is fine. I think originally we had that, and then I suggested moving to notify_all in case we ever used it in single producer, multiple consumer mode. But we're not doing that.

Yes, I agree. I'm wondering though if you have any ideas why notify_all() could be problematic the way we're using it, AFAIU they should be equivalent for a single consumer mode, so the fact we get stuck at __pthread_cond_broadcast seems very odd.

Copy link
Contributor

@wence- wence- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we adjust the docstrings in callback_notifier.h to this change. They currently say things about notify_all and "waiting threads", whereas now (which is fine), this becomes a mechanism for communicating between two threads.

@pentschev
Copy link
Member Author

Can we adjust the docstrings in callback_notifier.h to this change. They currently say things about notify_all and "waiting threads", whereas now (which is fine), this becomes a mechanism for communicating between two threads.

Done in 54414d1, would you mind taking another look?

@wence-
Copy link
Contributor

wence- commented Dec 5, 2023

Yes, I agree. I'm wondering though if you have any ideas why notify_all() could be problematic the way we're using it, AFAIU they should be equivalent for a single consumer mode, so the fact we get stuck at __pthread_cond_broadcast seems very odd.

My vague understanding is that pthread_cond_broadcast is one of the tricksiest parts of the pthread signalling code to get right, so 🤷

@pentschev
Copy link
Member Author

Thanks @wence- for reviewing!

@pentschev
Copy link
Member Author

/merge

@rapids-bot rapids-bot bot merged commit 197a0a9 into rapidsai:branch-0.36 Dec 5, 2023
20 checks passed
@pentschev pentschev deleted the callback-notifier-notify_one branch January 4, 2024 12:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants