Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test that RAPIDS_NO_INITIALIZE means no cuInit #12361

Closed

Conversation

wence-
Copy link
Contributor

@wence- wence- commented Dec 12, 2022

Description

When RAPIDS_NO_INITIALIZE is set, importing cudf is not allowed to create a CUDA context. This is quite delicate since calls arbitrarily far down the import stack might create one.

To spot such problems, build a small shared library that interposes our own version of cuInit, and run a test importing cudf in a subprocess with that library LD_PRELOADed. If everything is kosher, we should not observe any calls to cuInit.

If one observes bad behaviour, the culprit can then be manually tracked down in a debugger by breaking on our cuInit implementation.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

When RAPIDS_NO_INITIALIZE is set, importing cudf is not allowed to
create a CUDA context. This is quite delicate since calls arbitrarily
far down the import stack _might_ create one.

To spot such problems, build a small shared library that interposes
our own version of cuInit, and run a test importing cudf in a
subprocess with that library LD_PRELOADed. If everything is kosher, we
should not observe any calls to cuInit.

If one observes bad behaviour, the culprit can then be manually
tracked down in a debugger by breaking on our cuInit implementation.
@wence- wence- added tech debt 5 - DO NOT MERGE Hold off on merging; see PR for details non-breaking Non-breaking change labels Dec 12, 2022
@wence- wence- requested review from a team as code owners December 12, 2022 17:26
@github-actions github-actions bot added CMake CMake build issue Python Affects Python cuDF API. libcudf Affects libcudf (C++/CUDA) code. labels Dec 12, 2022
@shwina
Copy link
Contributor

shwina commented Dec 12, 2022

We could also consider using nvprof:

nvprof python -c 'import cudf' &| grep cuInit   
                    0.02%  30.542us         2  15.271us  2.9350us  27.607us  cuInit
RAPIDS_NO_INITIALIZE=1 nvprof python -c 'import cudf' &| grep cuInit
<no output>

Copy link
Contributor Author

@wence- wence- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some signposts

target_link_libraries(cudfcuinit_intercept PRIVATE conda_env)
endif()
target_link_libraries(cudfcuinit_intercept PUBLIC CUDA::cudart cuda dl)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need some help here, I'm completely flying blind, and this is wrong AFAICT.

Basically, I have a single file that I want to compile into a shared library and link against libdl and libcuda.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you need to link to libdl? I think linking to libc is sufficient for your purposes (dlfcn). The linking to CUDA seems reasonable here (although if you care specifically about whether it's dynamically or statically linked you will want to set the CUDA_RUNTIME_LIBRARY property.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is only glibc 2.34 and later where you don't need to link libdl to get access to dlsym and friends (see https://sourceware.org/pipermail/libc-alpha/2021-August/129718.html and bminor/glibc@77f876c) unless I am misunderstanding something.

In any case, would be very happy for someone who knows what they are doing to help rewrite this part of the patch completely.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually don't need cudart at all, only -lcuda, which I think I should get with CUDA::cuda_driver ?

}
}
} // namespace
#endif
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This same stuff could easily be extended to address @jrhemstad's request in #11546 that one test that RMM is the only allocator of memory.

original_dlsym = (dlsym_t)dlvsym(RTLD_NEXT, "dlsym", "GLIBC_2.2.5");
if (original_dlsym) {
original_cuGetProcAddress = (proc_t)original_dlsym(RTLD_NEXT, "cuGetProcAddress");
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For driver calls there are two ways python libraries resolve them:

  1. [numba does this] dlopen libcuda.so and then dlsym on the handle
  2. [cuda-python does this] dlopen libcuda, dlsym cuGetProcAddress and then call cuGetProcAddress to get the driver symbol

So unfortunately, it's not sufficient to just define cuInit in this shared library and override the symbol resolution via LD_PRELOAD. We have to instead patch into dlsym and cuGetProcAddress. The latter is easy, the former is hard (we can't just dlsym(RTLD_NEXT, ...) here because that would call the local function. Instead, we use GLIBC's versioned lookup dlvsym, but now we need to match the glibc version exactly in the running environment (this is the one my conda environment has).

I guess I could spin over a bunch of versions until I find the right one.

Any other suggestions gratefully received.

Copy link
Contributor Author

@wence- wence- Dec 12, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but now we need to match the glibc version exactly in the running environment (this is the one my conda environment has).

Versioning is not quite as bad as this, 2.2.5 is a magic number but will be stable forever (due to glibc's forward-compat guarantee).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you patch into numba and cuda-python instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you patch into numba and cuda-python instead?

I can patch into numba, because the cuda interface is implemented in python, but can't do that for either cuda-python (or cupy) because their cuda interface is implemented in cython (so compiled) and hence monkey-patching won't work.

I also want to avoid a situation where some further third-party dependency is pulled in that also brings up a cuda context (perhaps directly via the C API). Since eventually everyone actually calls into the driver API, this seems like the best place to hook in.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are some Linux options that I think are more reliable and do not require patching.
How about something like this: https://stackoverflow.com/questions/5103443/how-to-check-what-shared-libraries-are-loaded-at-run-time-for-a-given-process ?
I could try to work up a script based on this if you'd like.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will that tell me if cuInit is called? I think no

Copy link
Contributor Author

@wence- wence- Dec 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we can run the process and inspect with nvml and try and match that way

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This SO post seems to settle on basically the same thing that you do (funnily enough, there's another post about how Citrix copy-pasted this solution disregarding the issues and broke some users).

Due to the extensive dlopening/dlsyming happening, I am not sure that either strace or ltrace or anything like them will be sufficient to detect the calls, which would have been the easier route here as David suggests. If all functions were called by name then I think ltrace would have been sufficient, but as it is you'll only see the dlopen of libcuda.so and then the dlsym of some arbitrary memory address. You could hope that the dlsym calls always use a name for the handle that includes cuInit; I think that would show up? It would probably only catch a subset of cases though.

&ptr,
CUDA_VERSION,
CU_GET_PROC_ADDRESS_DEFAULT
#if CUDA_VERSION >= 12000
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ABI change.

Comment on lines +10 to +12
location = Path(__file__)
cpp_build_dir = location / ".." / ".." / ".." / ".." / ".." / "cpp" / "build"
libintercept = (cpp_build_dir / "libcudfcuinit_intercept.so").resolve()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the right way to reference this? Right now I'm assuming the build directory exists (because I didn't manage to wrangle cmake to install the library). Equally, however, I'm not sure we really want to install this library?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh boy, this is fun. I don't think there is a perfect solution here. FWIW my approach to this in #11875 was to move building the preload lib out of the main libcudf build, build it separately as part of CI, and then just launch tests with the preload library directly from the CLI in CI. That functionality was disabled as part of the Jenkins->GHA migration. Given that you're working on this, it may be time to investigate how to reenable that functionality within GHA.

@robertmaynard do you think that preload libraries like this or the stream verification lib should be built within the main CMakeLists.txt for the library, or shipped along with the conda packages? I had avoided that mostly because in the end we need the paths to the library anyway in order to preload, so it's not a great fit, but I know others had expressed different opinions. Depending on what direction we take with that we will need to adapt the solution in this pytest for how the library is discovered I think.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Presume that the stream verification lib is also a single library. My first thought had been to just compile to the .so as part of the test, referencing the source directory. But then I realised that I need someone to provide information about the compiler configuration and so forth.

@wence- wence- force-pushed the wence/feature/test-no-cuInit-import branch from bd83d28 to e086ab3 Compare December 12, 2022 17:58
@codecov
Copy link

codecov bot commented Dec 12, 2022

Codecov Report

Base: 86.58% // Head: 86.58% // Increases project coverage by +0.00% 🎉

Coverage data is based on head (a8800e9) compared to base (b6dccb3).
Patch coverage: 74.41% of modified lines in pull request are covered.

❗ Current head a8800e9 differs from pull request most recent head 8dee3f9. Consider uploading reports for the commit 8dee3f9 to get more accurate results

Additional details and impacted files
@@              Coverage Diff               @@
##           branch-23.02   #12361    +/-   ##
==============================================
  Coverage         86.58%   86.58%            
==============================================
  Files               155      155            
  Lines             24368    24507   +139     
==============================================
+ Hits              21098    21219   +121     
- Misses             3270     3288    +18     
Impacted Files Coverage Δ
python/cudf/cudf/api/extensions/accessor.py 93.33% <0.00%> (ø)
python/cudf/cudf/core/buffer/utils.py 96.00% <ø> (ø)
python/cudf/cudf/core/column/categorical.py 89.34% <ø> (ø)
python/cudf/cudf/core/frame.py 94.05% <ø> (ø)
python/cudf/cudf/core/buffer/spill_manager.py 74.37% <63.07%> (-5.63%) ⬇️
python/cudf/cudf/core/dataframe.py 93.49% <64.28%> (-0.11%) ⬇️
python/cudf/cudf/core/_base_index.py 81.38% <87.50%> (+0.09%) ⬆️
python/cudf/cudf/core/column/column.py 87.95% <91.30%> (-0.01%) ⬇️
python/cudf/cudf/core/algorithms.py 90.00% <100.00%> (-0.48%) ⬇️
python/cudf/cudf/core/buffer/spillable_buffer.py 93.53% <100.00%> (+0.67%) ⬆️
... and 38 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.


void* dlsym(void* handle, const char* name_)
{
std::string name{name_};
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: error handling in all these wrappers in case the resolution of the original functions failed (at which point we can only abort)

@vyasr
Copy link
Contributor

vyasr commented Dec 16, 2022

Given the level of complexity we'd be introducing here (dlsyming dlsym itself seems like a massive minefield) I wonder if there might not be an easier approach altogether. We were at one point making a push to ensure that import cudf worked on a system with no GPUs. I assume that cuInit will fail if called on a system with no GPUs. Would refocusing on enabling that give us the same benefit as this PR?

@shwina
Copy link
Contributor

shwina commented Jan 3, 2023

I wonder if there might not be an easier approach altogether.

How about using nvprof as suggested above?

We were at one point making a push to ensure that import cudf worked on a system with no GPUs. I assume that cuInit will fail if called on a system with no GPUs. Would refocusing on enabling that give us the same benefit as this PR?

I think we should consider the two separate efforts. Yes, enabling import cudf on CPU only machines can be achieved by ensuring we don't initialize the CUDA context anywhere, but it could also be achieved in other ways (e.g., by choosing to ignore the RuntimeError thrown in Python land when cuInit fails).

@vyasr
Copy link
Contributor

vyasr commented Jan 4, 2023

I'm fine with the nvprof solution too. That seems like the simplest and most direct approach for this particular problem. My mentioning of import cudf is because this seems like a good opportunity to knock out two birds with one stone. I've never really looked at what RAPIDS_NO_INITIALIZE does beyond seeing the early return from validate_setup. What happens if we remove setup validation altogether? Do we see much more opaque errors, or do we just see the errors at a later point in time?

Anyway I don't want to derail in this discussion too far. If the nvprof solution is sufficient no need to try to also address the import cudf question here unless there's an obvious way to remove CUDA context initialization unconditionally upon import.

@wence-
Copy link
Contributor Author

wence- commented Jan 9, 2023

I've never really looked at what RAPIDS_NO_INITIALIZE does beyond seeing the early return from validate_setup.

That's kind of all it does (since validate_setup calls into the cuda runtime to inspect various properties of the system and is called as part of importing __init__.py).

What happens if we remove setup validation altogether? Do we see much more opaque errors, or do we just see the errors at a later point in time?

We would, likely, see more opaque errors. validate_setup checks that certain prerequisites of the cuda/hardware versions are satisfied and produces a readable downstream error message for the caller; I presume that just trying to access those features on invalid hardware would produce much more inscrutable errors.

@wence-
Copy link
Contributor Author

wence- commented Jan 9, 2023

For the purposes of testing, it seems like just running with nvprof is the preferred approach, so I'll have a go at that.

@wence-
Copy link
Contributor Author

wence- commented Jan 13, 2023

For the purposes of testing, it seems like just running with nvprof is the preferred approach, so I'll have a go at that.

I couldn't get nvprof or nsys to work as I would like, but did something with GDB instead (see #12545). I'll leave this open until we definitively decide on a solution one way or another.

wence- added a commit to wence-/cudf that referenced this pull request Jan 13, 2023
An alternate approach to that tried in rapidsai#12361, here we just script GDB
and check if we hit a breakpoint in cuInit. When RAPIDS_NO_INITIALIZE
is set in the environment, merely importing cudf should not call into
the CUDA runtime/driver (i.e. no cuInit should be called). Conversely,
to check that we are scripting GDB properly, when we create a cudf
object, we definitely _should_ hit cuInit.
@wence-
Copy link
Contributor Author

wence- commented Jan 20, 2023

Closing in favour of #12545

@wence- wence- closed this Jan 20, 2023
rapids-bot bot pushed a commit that referenced this pull request Jan 20, 2023
An alternate approach to that tried in #12361, here we just script GDB and check if we hit a breakpoint in cuInit. When RAPIDS_NO_INITIALIZE is set in the environment, merely importing cudf should not call into the CUDA runtime/driver (i.e. no cuInit should be called). Conversely, to check that we are scripting GDB properly, when we create a cudf object, we definitely _should_ hit cuInit.

Authors:
  - Lawrence Mitchell (https://github.com/wence-)

Approvers:
  - Ashwin Srinath (https://github.com/shwina)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #12545
@wence- wence- deleted the wence/feature/test-no-cuInit-import branch October 24, 2024 09:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - DO NOT MERGE Hold off on merging; see PR for details CMake CMake build issue libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants