
Fix the destruction of interruptible token registry #1229

Merged

Conversation

achirkin (Contributor) commented Feb 2, 2023

Because there's no way to control the order of destruction between global and thread-local static objects, the token registry may sometimes be accessed after it has already been destroyed (in the program exit handlers).

This fix wraps the registry in a shared pointer and keeps weak pointers in the deleters that caused the problem, so the registry is never accessed after it has been destroyed.

Closes #1225
Closes #1275
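
A minimal sketch of the pattern, with invented names (the real change lives in raft's `interruptible.hpp` and is more involved): the registry is owned by a `shared_ptr`, while each token's deleter captures only a `weak_ptr` to it, so a deleter that fires after the registry has been destroyed sees an expired pointer and skips the cleanup instead of touching freed memory.

```cpp
#include <map>
#include <memory>
#include <mutex>
#include <thread>

struct token {};

using registry_t = std::map<std::thread::id, std::weak_ptr<token>>;

// The global owner; its destruction order relative to thread-local statics
// is unspecified.
static auto registry = std::make_shared<registry_t>();
static std::mutex registry_mutex;

std::shared_ptr<token> get_token() {
  std::lock_guard<std::mutex> guard(registry_mutex);
  auto id = std::this_thread::get_id();
  auto& weak_store = (*registry)[id];
  auto stored = weak_store.lock();
  if (!stored) {
    // The deleter captures a weak_ptr to the registry, not the registry
    // itself, so it can detect that the registry is already gone.
    std::weak_ptr<registry_t> weak_registry{registry};
    stored = std::shared_ptr<token>(new token, [weak_registry, id](token* t) {
      delete t;
      if (auto reg = weak_registry.lock()) {
        // The registry is still alive: drop the stale entry.
        std::lock_guard<std::mutex> g(registry_mutex);
        reg->erase(id);
      }
      // Otherwise the program is exiting; touching the registry would be a
      // use-after-free, so do nothing.
    });
    weak_store = stored;
  }
  return stored;
}
```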

achirkin requested a review from a team as a code owner February 2, 2023 18:47
github-actions bot added the cpp label Feb 2, 2023
@@ -203,21 +205,25 @@ class interruptible {
   {
     std::lock_guard<std::mutex> guard_get(mutex_);
     // the following constructs an empty shared_ptr if the key does not exist.
-    auto& weak_store = registry_[thread_id];
+    auto& weak_store = (*registry_)[thread_id];
achirkin (Contributor Author) commented on the changed line:

NB: since we know the registry can only be deleted on program exit, accessing it here without checks shouldn't cause any problems.

A team member replied:

Thanks for this fix, @achirkin! Does the reproducer in #1225 also no longer segfault with this change?

achirkin (Contributor Author) replied Feb 2, 2023:

At least on my machine, yes! It has been running for about ten minutes now.

cjnolet (Member) commented Feb 3, 2023

Can you think of any way we might add a test for this, just to make sure users don't encounter this issue in the future? I've not tried running our own tests in a loop, but I've also not seen any indication that raft or cuml have suffered from this issue either. It makes me wonder why, and whether there's something we can do to force (or at least observe) the behavior so we can test it.

achirkin (Contributor Author) commented Feb 3, 2023

Good point. Perhaps we can run a simple program that uses interruptible tokens in a subprocess, many times in a loop, and check the exit codes.
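
A hypothetical sketch of that idea (not part of this PR; the binary name `interruptible_exit_repro` and the iteration count are invented): a driver that repeatedly launches a small reproducer, assumed to create interruptible tokens and exit immediately, and fails on the first abnormal exit status.

```cpp
#include <cstdlib>
#include <iostream>

int main() {
  for (int i = 0; i < 500; ++i) {
    // std::system returns an implementation-defined status; treating any
    // non-zero value as a failure covers both non-zero exit codes and
    // termination by signal.
    int status = std::system("./interruptible_exit_repro");
    if (status != 0) {
      std::cerr << "iteration " << i << ": abnormal exit, status = " << status << '\n';
      return 1;
    }
  }
  std::cout << "all runs exited cleanly\n";
  return 0;
}
```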

cjnolet changed the base branch from branch-23.02 to branch-23.04 February 3, 2023 19:51
cjnolet (Member) commented Feb 3, 2023

@achirkin, I think we can go ahead and merge this in the meantime so we can re-establish the feature, and we can create an issue to revisit it with a test in the future.

codecov-commenter commented Feb 4, 2023

Codecov Report

❗ No coverage uploaded for pull request base (branch-23.04@88cb31d).
Patch has no changes to coverable lines.

Additional details and impacted files
@@               Coverage Diff               @@
##             branch-23.04    #1229   +/-   ##
===============================================
  Coverage                ?   87.99%           
===============================================
  Files                   ?       21           
  Lines                   ?      483           
  Branches                ?        0           
===============================================
  Hits                    ?      425           
  Misses                  ?       58           
  Partials                ?        0           



cjnolet (Member) commented Feb 4, 2023

@achirkin I've rerun the dask wheel tests a few times now, and there's a repeatable error in the raft-dask wheel tests, which I think could indicate that it's somehow related to these changes. From the logs, it almost seems like a timeout is occurring.

cjnolet (Member) commented Feb 8, 2023

@achirkin I still notice this in the raft-dask test logs:

[72cb0896ebde:1701 :0:1718] Caught signal 7 (Bus error: invalid address alignment)
==== backtrace (tid:   1718) ====
 0  /pyenv/versions/3.8.16/lib/python3.8/site-packages/raft_dask/common/../../raft_dask_cu11.libs/libucs-786cbefd.so.0.0.0(ucs_handle_error+0x2d4) [0xffff73c48d9c]
 1  /pyenv/versions/3.8.16/lib/python3.8/site-packages/raft_dask/common/../../raft_dask_cu11.libs/libucs-786cbefd.so.0.0.0(+0x29f2c) [0xffff73c48f2c]
 2  /pyenv/versions/3.8.16/lib/python3.8/site-packages/raft_dask/common/../../raft_dask_cu11.libs/libucs-786cbefd.so.0.0.0(+0x2a360) [0xffff73c49360]
 3  linux-vdso.so.1(__kernel_rt_sigreturn+0) [0xffff93ebf598]
 4  /pyenv/versions/3.8.16/lib/python3.8/site-packages/raft_dask/common/comms_utils.cpython-38-aarch64-linux-gnu.so(_ZN4raft13interruptible14get_token_implILb1EEESt10shared_ptrIS0_ENSt6thread2idE+0x58) [0xffff843671e8]
 5  /pyenv/versions/3.8.16/lib/python3.8/site-packages/raft_dask/common/comms_utils.cpython-38-aarch64-linux-gnu.so(_ZN4raft5comms6detail25test_collective_allgatherERKNS_16device_resourcesEi+0x1f4) [0xffff84345534]
 6  /pyenv/versions/3.8.16/lib/python3.8/site-packages/raft_dask/common/comms_utils.cpython-38-aarch64-linux-gnu.so(+0x48f98) [0xffff84345f98]
 7  /pyenv/versions/3.8.16/lib/libpython3.8.so.1.0(_PyObject_MakeTpCall+0xa8) [0xffff93b939a8]
 8  /pyenv/versions/3.8.16/lib/libpython3.8.so.1.0(+0x6c904) [0xffff93b68904]
 9  /pyenv/versions/3.8.16/lib/libpython3.8.so.1.0(_PyEval_EvalFrameDefault+0x17c8) [0xffff93b6a450]
10  /pyenv/versions/3.8.16/lib/libpython3.8.so.1.0(+0x6b804) [0xffff93b67804]
11  /pyenv/versions/3.8.16/lib/libpython3.8.so.1.0(PyVectorcall_Call+0x70) [0xffff93b95540]
12  /pyenv/versions/3.8.16/lib/libpython3.8.so.1.0(_PyEval_EvalFrameDefault+0x1e38) [0xffff93b6aac0]
13  /pyenv/versions/3.8.16/lib/libpython3.8.so.1.0(+0x6b804) [0xffff93b67804]
14  /pyenv/versions/3.8.16/lib/libpython3.8.so.1.0(+0x6c890) [0xffff93b68890]
15  /pyenv/versions/3.8.16/lib/libpython3.8.so.1.0(_PyEval_EvalFrameDefault+0x17c8) [0xffff93b6a450]
16  /pyenv/versions/3.8.16/lib/libpython3.8.so.1.0(+0x6b804) [0xffff93b67804]
17  /pyenv/versions/3.8.16/lib/libpython3.8.so.1.0(PyVectorcall_Call+0x70) [0xffff93b95540]
18  /pyenv/versions/3.8.16/lib/libpython3.8.so.1.0(_PyEval_EvalFrameDefault+0x1e38) [0xffff93b6aac0]
19  /pyenv/versions/3.8.16/lib/libpython3.8.so.1.0(+0x6b804) [0xffff93b67804]
20  /pyenv/versions/3.8.16/lib/libpython3.8.so.1.0(+0x6c890) [0xffff93b68890]
21  /pyenv/versions/3.8.16/lib/libpython3.8.so.1.0(_PyEval_EvalFrameDefault+0x14bc) [0xffff93b6a144]
22  /pyenv/versions/3.8.16/lib/libpython3.8.so.1.0(+0x6b804) [0xffff93b67804]
23  /pyenv/versions/3.8.16/lib/libpython3.8.so.1.0(PyVectorcall_Call+0x70) [0xffff93b95540]
24  /pyenv/versions/3.8.16/lib/libpython3.8.so.1.0(_PyEval_EvalFrameDefault+0x1e38) [0xffff93b6aac0]
25  /pyenv/versions/3.8.16/lib/libpython3.8.so.1.0(+0x6b804) [0xffff93b67804]
26  /pyenv/versions/3.8.16/lib/libpython3.8.so.1.0(+0x6c890) [0xffff93b68890]
27  /pyenv/versions/3.8.16/lib/libpython3.8.so.1.0(_PyEval_EvalFrameDefault+0x14bc) [0xffff93b6a144]
28  /pyenv/versions/3.8.16/lib/libpython3.8.so.1.0(+0x6b804) [0xffff93b67804]
29  /pyenv/versions/3.8.16/lib/libpython3.8.so.1.0(+0x6c890) [0xffff93b68890]
30  /pyenv/versions/3.8.16/lib/libpython3.8.so.1.0(_PyEval_EvalFrameDefault+0x14bc) [0xffff93b6a144]
31  /pyenv/versions/3.8.16/lib/libpython3.8.so.1.0(+0x6b804) [0xffff93b67804]
32  /pyenv/versions/3.8.16/lib/libpython3.8.so.1.0(+0x9b1f4) [0xffff93b971f4]
33  /pyenv/versions/3.8.16/lib/libpython3.8.so.1.0(PyVectorcall_Call+0x70) [0xffff93b95540]
34  /pyenv/versions/3.8.16/lib/libpython3.8.so.1.0(+0x2202dc) [0xffff93d1c2dc]
35  /pyenv/versions/3.8.16/lib/libpython3.8.so.1.0(+0x1c7004) [0xffff93cc3004]
36  /usr/lib/aarch64-linux-gnu/libpthread.so.0(+0x7624) [0xffff9395f624]
37  /usr/lib/aarch64-linux-gnu/libc.so.6(+0xd149c) [0xffff93a5a49c]
=================================
2023-02-08 22:53:13,880 - distributed.nanny - WARNING - Restarting worker
Initializing comms!
Initialization complete.
Destroying comms.
Initializing comms!
Initialization complete.
2023-02-08 22:53:15,131 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2023-02-08 22:53:15,131 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
Traceback (most recent call last):
  File "./ci/wheel_smoke_test_raft_dask.py", line 84, in <module>
    wait(dfs, timeout=5)
  File "/pyenv/versions/3.8.16/lib/python3.8/site-packages/distributed/client.py", line 4916, in wait
    result = client.sync(_wait, fs, timeout=timeout, return_when=return_when)
  File "/pyenv/versions/3.8.16/lib/python3.8/site-packages/distributed/utils.py", line 338, in sync
    return sync(
  File "/pyenv/versions/3.8.16/lib/python3.8/site-packages/distributed/utils.py", line 405, in sync
    raise exc.with_traceback(tb)
  File "/pyenv/versions/3.8.16/lib/python3.8/site-packages/distributed/utils.py", line [378](https://github.com/rapidsai/raft/actions/runs/4128816378/jobs/7134159569#step:9:379), in f
    result = yield future
  File "/pyenv/versions/3.8.16/lib/python3.8/site-packages/tornado/gen.py", line 769, in run
    value = future.result()
  File "/pyenv/versions/3.8.16/lib/python3.8/site-packages/distributed/client.py", line 4884, in _wait
    await future
  File "/pyenv/versions/3.8.16/lib/python3.8/asyncio/tasks.py", line 501, in wait_for
    raise exceptions.TimeoutError()
asyncio.exceptions.TimeoutError

More specifically, I see this in the stack trace: `_ZN4raft13interruptible14get_token_implILb1EEESt10shared_ptrIS0_ENSt6thread2idE`, which replicates what cugraph was seeing. Can you verify whether you see this error locally when compiling and running the raft-dask tests? Even more strange is this error: `[72cb0896ebde:1701 :0:1718] Caught signal 7 (Bus error: invalid address alignment)`.
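
The mangled frame can be decoded with the Itanium ABI demangler that GCC and Clang ship in `<cxxabi.h>` (running the symbol through `c++filt` gives the same result); a minimal sketch:

```cpp
#include <cxxabi.h>
#include <cstdio>
#include <cstdlib>

int main() {
  const char* sym =
    "_ZN4raft13interruptible14get_token_implILb1EEESt10shared_ptrIS0_ENSt6thread2idE";
  int status = 0;
  // __cxa_demangle mallocs the result; status 0 means success.
  char* name = abi::__cxa_demangle(sym, nullptr, nullptr, &status);
  if (status == 0 && name != nullptr) {
    std::puts(name);
    // Prints:
    // std::shared_ptr<raft::interruptible> raft::interruptible::get_token_impl<true>(std::thread::id)
  }
  std::free(name);
  return 0;
}
```

That is, the fault is raised inside the token getter that this PR modifies.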

ahendriksen and others added 2 commits February 9, 2023 07:14
As explained in rapidsai#1246 (comment), ptxas chokes on the Minkowski distance when `VecLen==4` and `IdxT==uint32_t`.

This PR removes the `VecLen == 4` specialization for the Minkowski distance.

Follow up to: rapidsai#1239

Authors:
  - Allard Hendriksen (https://github.com/ahendriksen)
  - Corey J. Nolet (https://github.com/cjnolet)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)
  - Sean Frye (https://github.com/sean-frye)

URL: rapidsai#1254
achirkin added the '5 - DO NOT MERGE (Hold off on merging; see PR for details)' label Feb 13, 2023
achirkin (Contributor Author) commented:

Marked do-not-merge just to be on the safe side: I haven't been able to confirm that it doesn't crash on ARM yet.

cjnolet (Member) commented Feb 13, 2023

@achirkin it looks like we have another MRE when used w/ OpenMP: #1275. This may have already been fixed, but it's probably worth checking the test in so it's repeatable.

achirkin (Contributor Author) commented Feb 14, 2023

> @achirkin it looks like we have another MRE when used w/ OpenMP: #1275. This may have already been fixed, but it's probably worth checking the test in so it's repeatable.

Yes, thanks, that looks like the same issue. I managed to reproduce it, and this PR indeed fixes it. Yet, one thing concerns me: what if this crash is caused by something else and is merely exposed here? Like issue #740, which was solved by #764 (not very probable, though, because the reproducer in #1275 is really minimal in terms of dependencies).
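
A hypothetical sketch of the shape of that OpenMP reproducer (the actual code is in #1275; the header path is assumed, and the program would be compiled with `-fopenmp`):

```cpp
#include <raft/core/interruptible.hpp>

int main() {
#pragma omp parallel
  {
    // Each OpenMP thread obtains its cancellation token, which registers an
    // entry in the global registry and caches the token per thread.
    auto token = raft::interruptible::get_token();
    (void)token;
  }
  // At exit, the destruction order of the global registry and the
  // thread-local statics is unspecified; before this PR, that could lead to
  // a use-after-free in the token deleters.
  return 0;
}
```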

achirkin removed the '5 - DO NOT MERGE (Hold off on merging; see PR for details)' label Feb 15, 2023
achirkin (Contributor Author) commented:

I've managed to run the tests, including the previously failing wheel_smoke_test_raft_dask.py (https://github.com/rapidsai/raft/actions/runs/4128816378/jobs/7134159569), on an aarch64 machine. Unfortunately, I couldn't reproduce the crash even with the version that failed in CI. On the bright side, the case from #1275 is definitely fixed by this PR.

I'm removing the do-not-merge label, since I couldn't find any way to provoke the segfault in the current state of the PR.

cjnolet (Member) commented Feb 15, 2023

@achirkin I'm okay with giving these changes a go. Many of the other downstream projects have been notified that we are going to merge this fix. Worst case, we find another bug and fix it, but the passing tests (and passing MREs) have given me confidence in these changes.

cjnolet (Member) commented Feb 15, 2023

/merge

rapids-bot merged commit 27ca9b9 into rapidsai:branch-23.04 Feb 15, 2023