Failure in interaction between thrust code on main program and shared object #736
Comments
Attached repro files
I'm not actually sure if this is supported, but I'll ask around internally. Dynamic linking with CUDA device code can be fickle, from what I remember. IIRC, you may be able to work around this by statically linking all libraries that use CUDA together and then dynamically linking with the result, but it's been a while since I've run into this.
We build a library into a .so and call it from Python -- I sure hope that's a supported profile. We write the lib's unit tests in C++ and link them to the built .so, and that's where we run into this problem. We could work around it, I guess, by linking statically to the object files. As long as the Python + .so use case is supported, that's OK.
Thanks for reporting the issue, @hwinkler. This looks similar to issue NVIDIA/cub#545, which was fixed in PR NVIDIA/cub#547. Could you check whether the issue has been resolved in the meantime?
@jrhemstad I can still reproduce this bug on b4d490b (branch/2.3.x); maybe this issue should be reopened. Here's the code for reproducing it: test.zip
Thanks @wrvsrx, @gevtushenko will look into it!
@wrvsrx the commit you mentioned seems to be from the main branch rather than branch/2.3.x. If you can still see the issue, please add
Sorry, I tested on main rather than branch/2.3.x.
@wrvsrx thank you for the info. I'm unable to reproduce the issue using an equivalent setup. A few follow-up questions:
I tested this behavior on some other systems and found that the problem only happens when I create environments using nix. When I use nix to create an environment on Ubuntu 22.04 or NixOS, I can reproduce the problem. However, if I use the nvcc installed globally on Ubuntu, the problem disappears. Maybe I should report the problem to the NixOS CUDA team. Sorry for taking up your time. For your information, I can still see this issue after specifying the architecture. Setting

If you are interested in testing this problem in a nix environment, here are the steps:
I have a weird, reproducible failure that occurs when a main program using Thrust loads a Linux shared library that also uses Thrust.
The error occurs only when I compile with -O0 or -O1.
Here's the main program, file test.cu:

It calls into a function, sortbug(), that lives in a separate .so file:

If I remove the unused function ifMerelyPresentMainWillFail() from test.cu, the test program succeeds. It also succeeds if I compile everything with -O3 or -O2, and it succeeds if I move the code in the shared object into the main program file. But if I don't do any of that, the program fails, though not the same way in all cases.
In some cases, the assert in test.cu fails: the sortbug() function ran, but did not sort the keys. In other cases, the program throws a thrust::system::system_error:

transform: failed to synchronize: cudaErrorInvalidConfiguration: invalid configuration argument
Here's the Makefile: