Test that RAPIDS_NO_INITIALIZE means no cuInit #12361
When RAPIDS_NO_INITIALIZE is set, importing cudf is not allowed to create a CUDA context. This is quite delicate since calls arbitrarily far down the import stack _might_ create one.
To spot such problems, build a small shared library that interposes our own version of cuInit, and run a test importing cudf in a subprocess with that library LD_PRELOADed. If everything is kosher, we should not observe any calls to cuInit.
If one observes bad behaviour, the culprit can then be manually tracked down in a debugger by breaking on our cuInit implementation.
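The subprocess side of the test described above could be sketched roughly as follows. This is not the code from the patch: the helper names and the marker string are hypothetical, and the sketch assumes the interposed cuInit announces itself on stdout or stderr.

```python
import os
import subprocess
import sys


def preload_env(libintercept, base_env=None):
    """Return an environment that preloads the interposer library."""
    env = dict(os.environ if base_env is None else base_env)
    env["RAPIDS_NO_INITIALIZE"] = "1"
    env["LD_PRELOAD"] = str(libintercept)
    return env


def cuinit_was_called(output, marker="cuInit has been called"):
    """Look for the (hypothetical) message printed by the interposed cuInit."""
    return marker in output


def import_calls_cuinit(libintercept, code="import cudf"):
    """Run `code` in a fresh interpreter with the interposer preloaded."""
    proc = subprocess.run(
        [sys.executable, "-c", code],
        env=preload_env(libintercept),
        capture_output=True,
        text=True,
    )
    return cuinit_was_called(proc.stdout + proc.stderr)
```

A test would then assert that `import_calls_cuinit(path_to_library)` is false for a bare `import cudf`.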
We could also consider using
Some signposts
target_link_libraries(cudfcuinit_intercept PRIVATE conda_env)
endif()
target_link_libraries(cudfcuinit_intercept PUBLIC CUDA::cudart cuda dl)
Need some help here, I'm completely flying blind, and this is wrong AFAICT.
Basically, I have a single file that I want to compile into a shared library and link against libdl and libcuda.
Why do you need to link to libdl? I think linking to libc is sufficient for your purposes (dlfcn). The linking to CUDA seems reasonable here (although if you care specifically about whether it's dynamically or statically linked you will want to set the CUDA_RUNTIME_LIBRARY property).
I think it is only glibc 2.34 and later where you don't need to link libdl to get access to dlsym and friends (see https://sourceware.org/pipermail/libc-alpha/2021-August/129718.html and bminor/glibc@77f876c) unless I am misunderstanding something.
In any case, would be very happy for someone who knows what they are doing to help rewrite this part of the patch completely.
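To make the glibc point concrete, here is a small sketch that checks whether the running C library predates 2.34, the release named in the comment above as the one that folded dlsym and friends into libc. The function names are made up for illustration; `os.confstr("CS_GNU_LIBC_VERSION")` is only meaningful on glibc systems, so the sketch returns None elsewhere.

```python
import os
import re


def parse_glibc_version(text):
    """Parse a string such as 'glibc 2.31' into (2, 31); None if not glibc."""
    match = re.fullmatch(r"glibc (\d+)\.(\d+)(?:\.\d+)?", text.strip())
    return (int(match.group(1)), int(match.group(2))) if match else None


def needs_separate_libdl(version):
    """True when dlsym lives in libdl rather than libc (glibc before 2.34)."""
    return version is not None and version < (2, 34)


def running_glibc_version():
    """Query the running C library; returns None when it is not glibc."""
    try:
        return parse_glibc_version(os.confstr("CS_GNU_LIBC_VERSION"))
    except (ValueError, OSError, AttributeError):
        return None
```

In a build system the same decision would normally be left to the toolchain (for example by linking against the platform's dl library variable), so this is only a way to inspect a given environment.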
I actually don't need cudart at all, only -lcuda, which I think I should get with CUDA::cuda_driver?
}
}
} // namespace
#endif
This same stuff could easily be extended to address @jrhemstad's request in #11546 that one test that RMM is the only allocator of memory.
original_dlsym = (dlsym_t)dlvsym(RTLD_NEXT, "dlsym", "GLIBC_2.2.5");
if (original_dlsym) {
original_cuGetProcAddress = (proc_t)original_dlsym(RTLD_NEXT, "cuGetProcAddress");
}
For driver calls there are two ways python libraries resolve them:
- [numba does this] dlopen libcuda.so and then dlsym on the handle
- [cuda-python does this] dlopen libcuda, dlsym cuGetProcAddress and then call cuGetProcAddress to get the driver symbol
So unfortunately, it's not sufficient to just define cuInit in this shared library and override the symbol resolution via LD_PRELOAD. We have to instead patch into dlsym and cuGetProcAddress. The latter is easy, the former is hard (we can't just dlsym(RTLD_NEXT, ...) here because that would call the local function). Instead, we use GLIBC's versioned lookup dlvsym, but now we need to match the glibc version exactly in the running environment (this is the one my conda environment has).
I guess I could spin over a bunch of versions until I find the right one.
Any other suggestions gratefully received.
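The "spin over a bunch of versions" idea can be illustrated from Python with ctypes. This is only a sketch of the lookup, not a replacement for the C++ interposer: the candidate list is an assumption, and it searches the global scope (a NULL handle, which is RTLD_DEFAULT on glibc) rather than RTLD_NEXT as the patch does.

```python
import ctypes

# Candidate symbol versions to try; the list is an assumption for illustration.
# GLIBC_2.2.5 is the value used in the patch; GLIBC_2.17 is believed to be the
# baseline on aarch64, and GLIBC_2.34 is where dlsym moved into libc.
CANDIDATES = ("GLIBC_2.2.5", "GLIBC_2.17", "GLIBC_2.34")


def find_symbol_version(symbol="dlsym", candidates=CANDIDATES):
    """Return the first version string for which dlvsym resolves `symbol`."""
    try:
        process = ctypes.CDLL(None)  # handle for the already-loaded process image
        dlvsym = process.dlvsym  # glibc extension; absent on other C libraries
    except (OSError, AttributeError, TypeError):
        return None
    dlvsym.restype = ctypes.c_void_p
    dlvsym.argtypes = [ctypes.c_void_p, ctypes.c_char_p, ctypes.c_char_p]
    for version in candidates:
        # A NULL handle is RTLD_DEFAULT on glibc: search the global scope.
        if dlvsym(None, symbol.encode(), version.encode()):
            return version
    return None
```

Calling `find_symbol_version()` in the target environment shows which of the candidate versions would resolve there; it returns None on C libraries without dlvsym.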
> but now we need to match the glibc version exactly in the running environment (this is the one my conda environment has).
Versioning is not quite as bad as this, 2.2.5 is a magic number but will be stable forever (due to glibc's forward-compat guarantee).
Could you patch into numba and cuda-python instead?
> Could you patch into numba and cuda-python instead?
I can patch into numba, because the cuda interface is implemented in python, but can't do that for either cuda-python (or cupy) because their cuda interface is implemented in cython (so compiled) and hence monkey-patching won't work.
I also want to avoid a situation where some further third-party dependency is pulled in that also brings up a cuda context (perhaps directly via the C API). Since eventually everyone actually calls into the driver API, this seems like the best place to hook in.
There are some Linux options that I think are more reliable and do not require patching.
How about something like this: https://stackoverflow.com/questions/5103443/how-to-check-what-shared-libraries-are-loaded-at-run-time-for-a-given-process ?
I could try to work up a script based on this if you'd like.
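A script along the lines of the linked answer might read `/proc/<pid>/maps` and report which shared objects are mapped. The sketch below is an assumption about what such a script could look like (the function names are invented, and it is Linux only). It can show whether libcuda has been loaded into a process, which is a weaker property than whether cuInit has been called.

```python
import os


def mapped_libraries(maps_text):
    """Extract distinct shared-object paths from /proc/<pid>/maps content."""
    libs = set()
    for line in maps_text.splitlines():
        fields = line.split(None, 5)  # address perms offset dev inode [path]
        if len(fields) == 6:
            path = fields[5].strip()
            if ".so" in os.path.basename(path):
                libs.add(path)
    return libs


def has_libcuda(pid="self"):
    """Report whether the driver library is mapped into process `pid`."""
    with open(f"/proc/{pid}/maps") as f:
        return any(
            os.path.basename(path).startswith("libcuda.so")
            for path in mapped_libraries(f.read())
        )
```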
Will that tell me if cuInit is called? I think not.
I guess we can run the process and inspect with nvml and try and match that way
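One way to try the NVML idea without writing bindings is to go through `nvidia-smi`, which reports the processes NVML knows about. The sketch below is illustrative only: the helper names are invented, the target process has to still be alive when the query runs, and it presumably detects an active compute context rather than the cuInit call itself.

```python
import subprocess


def compute_pids(query_output):
    """Parse `nvidia-smi --query-compute-apps=pid --format=csv,noheader` output."""
    return {int(token) for token in query_output.split() if token.isdigit()}


def process_has_context(pid):
    """Ask NVML (through nvidia-smi) whether `pid` is listed as a compute process."""
    out = subprocess.run(
        ["nvidia-smi", "--query-compute-apps=pid", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return pid in compute_pids(out)
```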
This SO post seems to settle on basically the same thing that you do (funnily enough, there's another post about how Citrix copy-pasted this solution disregarding the issues and broke some users).
Due to the extensive dlopening/dlsyming happening, I am not sure that either strace or ltrace or anything like them will be sufficient to detect the calls, which would have been the easier route here as David suggests. If all functions were called by name then I think ltrace would have been sufficient, but as it is you'll only see the dlopen of libcuda.so and then the dlsym of some arbitrary memory address. You could hope that the dlsym calls always use a name for the handle that includes cuInit; I think that would show up? It would probably only catch a subset of cases though.
&ptr,
CUDA_VERSION,
CU_GET_PROC_ADDRESS_DEFAULT
#if CUDA_VERSION >= 12000
ABI change.
location = Path(__file__)
cpp_build_dir = location / ".." / ".." / ".." / ".." / ".." / "cpp" / "build"
libintercept = (cpp_build_dir / "libcudfcuinit_intercept.so").resolve()
What's the right way to reference this? Right now I'm assuming the build directory exists (because I didn't manage to wrangle cmake to install the library). Equally, however, I'm not sure we really want to install this library?
Oh boy, this is fun. I don't think there is a perfect solution here. FWIW my approach to this in #11875 was to move building the preload lib out of the main libcudf build, build it separately as part of CI, and then just launch tests with the preload library directly from the CLI in CI. That functionality was disabled as part of the Jenkins->GHA migration. Given that you're working on this, it may be time to investigate how to reenable that functionality within GHA.
@robertmaynard do you think that preload libraries like this or the stream verification lib should be built within the main CMakeLists.txt for the library, or shipped along with the conda packages? I had avoided that mostly because in the end we need the paths to the library anyway in order to preload, so it's not a great fit, but I know others had expressed different opinions. Depending on what direction we take with that we will need to adapt the solution in this pytest for how the library is discovered I think.
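Whichever direction is chosen, the pytest side needs some way to discover the library. One possible shape, sketched here under assumptions, is to prefer an explicit path from the environment and fall back to the build tree. The environment variable name and the helper are hypothetical and appear nowhere in the patch.

```python
import os
from pathlib import Path

# Hypothetical variable name, chosen for illustration only.
ENV_VAR = "CUDF_CUINIT_INTERCEPT_LIB"


def find_intercept_lib(build_dir, environ=None, name="libcudfcuinit_intercept.so"):
    """Prefer an explicit path from the environment, else look in the build tree."""
    environ = os.environ if environ is None else environ
    candidates = []
    if environ.get(ENV_VAR):
        candidates.append(Path(environ[ENV_VAR]))
    candidates.append(Path(build_dir) / name)
    for candidate in candidates:
        if candidate.is_file():
            return candidate.resolve()
    return None  # the caller can skip the test when nothing is found
```

A CI job that builds the preload library separately could export the variable, while a local developer build would be picked up from the build directory.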
I presume that the stream verification lib is also a single library. My first thought had been to just compile to the .so as part of the test, referencing the source directory. But then I realised that I need someone to provide information about the compiler configuration and so forth.
Force-pushed from bd83d28 to e086ab3.
Codecov Report
Base: 86.58% // Head: 86.58% // Increases project coverage by
Additional details and impacted files
@@ Coverage Diff @@
## branch-23.02 #12361 +/- ##
==============================================
Coverage 86.58% 86.58%
==============================================
Files 155 155
Lines 24368 24507 +139
==============================================
+ Hits 21098 21219 +121
- Misses 3270 3288 +18
void* dlsym(void* handle, const char* name_)
{
std::string name{name_};
TODO: error handling in all these wrappers in case the resolution of the original functions failed (at which point we can only abort)
Given the level of complexity we'd be introducing here (dlsyming dlsym itself seems like a massive minefield) I wonder if there might not be an easier approach altogether. We were at one point making a push to ensure that |
How about using nvprof as suggested above?
I think we should consider the two separate efforts. Yes, enabling |
I'm fine with the nvprof solution too. That seems like the simplest and most direct approach for this particular problem. My mentioning of Anyway I don't want to derail in this discussion too far. If the nvprof solution is sufficient no need to try to also address the |
That's kind of all it does (since
We would, likely, see more opaque errors. |
For the purposes of testing, it seems like just running with |
I couldn't get |
Closing in favour of #12545
An alternate approach to that tried in #12361, here we just script GDB and check if we hit a breakpoint in cuInit. When RAPIDS_NO_INITIALIZE is set in the environment, merely importing cudf should not call into the CUDA runtime/driver (i.e. no cuInit should be called). Conversely, to check that we are scripting GDB properly, when we create a cudf object, we definitely _should_ hit cuInit.
Authors:
- Lawrence Mitchell (https://github.com/wence-)
Approvers:
- Ashwin Srinath (https://github.com/shwina)
- Vyas Ramasubramani (https://github.com/vyasr)
URL: #12545
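For readers who have not scripted GDB before, the approach could look something like the sketch below. It is not the script from #12545: the helper names are invented, and it assumes GDB reports a stop with text containing "Breakpoint 1, " while a breakpoint that stays pending is reported differently.

```python
import os
import subprocess
import sys


def gdb_command(code, breakpoint="cuInit"):
    """Build a batch-mode GDB invocation that runs `code` under a breakpoint."""
    return [
        "gdb", "-batch",
        "-ex", "set breakpoint pending on",  # libcuda is not loaded at startup
        "-ex", f"break {breakpoint}",
        "-ex", "run",
        "--args", sys.executable, "-c", code,
    ]


def hit_breakpoint(gdb_output):
    """A stop is announced with 'Breakpoint 1, '; a pending breakpoint is not."""
    return "Breakpoint 1, " in gdb_output


def code_calls_cuinit(code, extra_env=None):
    """Run `code` under GDB and report whether the cuInit breakpoint was hit."""
    env = dict(os.environ, **(extra_env or {}))
    proc = subprocess.run(gdb_command(code), env=env, capture_output=True, text=True)
    return hit_breakpoint(proc.stdout + proc.stderr)
```

The two checks from the description then become: `code_calls_cuinit("import cudf", {"RAPIDS_NO_INITIALIZE": "1"})` should be false, while a snippet that creates a cudf object should make it true.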