Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Segmentation fault in interruptible.hpp #1225

Closed
cjnolet opened this issue Feb 1, 2023 · 4 comments · Fixed by #1229
Closed

[BUG] Segmentation fault in interruptible.hpp #1225

cjnolet opened this issue Feb 1, 2023 · 4 comments · Fixed by #1229
Labels
bug Something isn't working

Comments

@cjnolet
Copy link
Member

cjnolet commented Feb 1, 2023

cugraph has been encountering a segfault in interruptible.hpp. It seems to only appear when taking the Python path through their code, for some reason, but it's very consistent. The reproducer for this issue (assumes #1224 has been reverted) is:

import cugraph
import cudf
if __name__ == "__main__":

    edgelist = cudf.DataFrame( { "src": [0, 1, 2], "dst": [1, 2, 4], "wgt": [0.0, 0.1, 0.2], })
    G = cugraph.Graph()

    G.from_cudf_edgelist(edgelist, source="src", destination="dst", edge_attr="wgt", store_transposed=True)

    r = cugraph.pagerank(G)

And it's important to run this in a loop because it doesn't happen everytime:

while true; do python test_failure.py; if [$? -ne 0]; then break; fi; done

The following shows the error happening on line 214 here, which appears it might be related to an object being used in a callback after it's been cleaned up (something like registry_.find(xxx) after registry_ has already been deallocated, maybe?)

image

cc @achirkin @tfeher

I'm not sure why this doesn't seem to be happening in pylibraft, cuml, or cuopt but it's definitely consistently reproducible in cugraph.

@ChuckHastings @alexbarghi-nv @rlratzel @seunghwak @BradReesWork FYI

@cjnolet cjnolet added the bug Something isn't working label Feb 1, 2023
@achirkin
Copy link
Contributor

achirkin commented Feb 2, 2023

Does cugraph run something in multiple processes there? The registry_ is a static, global per-process variable, and the error seems to be happening at __run_exit_handlers, which happens at the program exit.
A wild guess: could it be related to the order of destruction between the registry_ and the mutex_? What if the swap the two lines in here?

/** Global registry of thread-local cancellation stores. */
static inline std::unordered_map<std::thread::id, std::weak_ptr<interruptible>> registry_;
/** Protect the access to the registry. */
static inline std::mutex mutex_;

(update: no, this didn't work)

@cjnolet
Copy link
Member Author

cjnolet commented Feb 2, 2023

@achirkin, I'm wondering if we could use somehting at the global level or non-member thread-local level like a std::atomic<bool> flag that could be flipped to true when the deleter is called on the registry_? Can you think of any other way we might be able to check for this state so we can handle it gracefully?

cc @jrhemstad for thoughts as well- it looks like we have a race condition happening in the callback for interruptible where it's trying to use registry_ after it's already been deleted. The order of deconstruction for static members seems to be inconsistent from behavior that @seunghwak has been seeing.

@seunghwak
Copy link
Contributor

seunghwak commented Feb 2, 2023

As I mentioned in the slack thread,

This is due to the arbitrary destruction order between s and registry_.
https://github.com/rapidsai/raft/blob/branch-23.02/cpp/include/raft/core/interruptible.hpp#L134
https://github.com/rapidsai/raft/blob/branch-23.02/cpp/include/raft/core/interruptible.hpp#L182

It seems like there is no guaranteed destruction order between a thread local static variable defined inside a member function and static class member variables.

If s is destroyed after registry_ is destroyed, registry_ will be accessed after destruction.

I confirmed this is indeed happening by defining

class MyClass {
 public:
  MyClass() { std::cout << "MyClass() called" << std::endl; }
  ~MyClass() { std::cout << "~MyClass() called." << std::endl; }
};

and adding static inline MyClass my_class_; after registry_ and mutex_ (https://github.com/rapidsai/raft/blob/branch-23.02/cpp/include/raft/core/interruptible.hpp#L184).

Also added the printout statement in the custom deleter (https://github.com/rapidsai/raft/blob/branch-23.02/cpp/include/raft/core/interruptible.hpp#L213); this custom deleter will be called when s is destroyed.

In testing, the print statement from ~MyClass sometimes appears before the print statement from the custom destructor, meaning the custom destructor can access registry_ after destruction.

@achirkin
Copy link
Contributor

achirkin commented Feb 2, 2023

Here's a rather ugly, but seemingly working, fix #1229 - wrap the registry_ in a shared pointer and use weak_ptr.lock() to avoid accessing it after it's been deleted.

rapids-bot bot pushed a commit that referenced this issue Feb 15, 2023
Because there's no way to control the order of destruction between the global and thread-local static objects, the token registry may sometimes be accessed after it has already been destructed (in the program exit handlers).

This fix wraps the registry in a shared pointer and keeps the weak pointers in the deleters which cause the problem, thus it avoids accessing the registry after it's been destroyed.

Closes #1225
Closes #1275

Authors:
  - Artem M. Chirkin (https://github.com/achirkin)
  - Corey J. Nolet (https://github.com/cjnolet)
  - Allard Hendriksen (https://github.com/ahendriksen)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)

URL: #1229
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
Development

Successfully merging a pull request may close this issue.

3 participants