Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test that RAPIDS_NO_INITIALIZE means no cuInit #12361

Closed
Closed
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
19 changes: 19 additions & 0 deletions cpp/CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -669,6 +669,25 @@ target_link_libraries(
$<TARGET_NAME_IF_EXISTS:cuFile_interface>
)

add_library(cudfcuinit_intercept SHARED src/utilities/cuinit_intercept.cpp)
set_target_properties(
cudfcuinit_intercept
PROPERTIES BUILD_RPATH "\$ORIGIN"
INSTALL_RPATH "\$ORIGIN"
# set target compile options
CXX_STANDARD 17
CXX_STANDARD_REQUIRED ON
CUDA_STANDARD 17
CUDA_STANDARD_REQUIRED ON
POSITION_INDEPENDENT_CODE ON
INTERFACE_POSITION_INDEPENDENT_CODE ON
)

if(TARGET conda_env)
target_link_libraries(cudfcuinit_intercept PRIVATE conda_env)
endif()
target_link_libraries(cudfcuinit_intercept PUBLIC CUDA::cudart cuda dl)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need some help here, I'm completely flying blind, and this is wrong AFAICT.

Basically, I have a single file that I want to compile into a shared library and link against libdl and libcuda.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you need to link to libdl? I think linking to libc is sufficient for your purposes (dlfcn). The linking to CUDA seems reasonable here (although if you care specifically about whether it's dynamically or statically linked you will want to set the CUDA_RUNTIME_LIBRARY property.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is only glibc 2.34 and later where you don't need to link libdl to get access to dlsym and friends (see https://sourceware.org/pipermail/libc-alpha/2021-August/129718.html and bminor/glibc@77f876c) unless I am misunderstanding something.

In any case, would be very happy for someone who knows what they are doing to help rewrite this part of the patch completely.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually don't need cudart at all, only -lcuda, which I think I should get with CUDA::cuda_driver ?

# Add Conda library, and include paths if specified
if(TARGET conda_env)
target_link_libraries(cudf PRIVATE conda_env)
Expand Down
116 changes: 116 additions & 0 deletions cpp/src/utilities/cuinit_intercept.cpp
Original file line number Diff line number Diff line change
@@ -0,0 +1,116 @@
/*
* Copyright (c) 2022, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <cuda.h>
#include <dlfcn.h>
#include <iostream>

#if defined(__GLIBC__) && __GLIBC__ >= 2 && defined(__GLIBC_MINOR__) && __GLIBC_MINOR__ >= 1
namespace {
static int cuInitCount{0};
using init_t = CUresult (*)(unsigned int);
using proc_t = CUresult (*)(const char*,
void**,
int,
cuuint64_t
#if CUDA_VERSION >= 12000
,
CUdriverProcAddressQueryResult*
#endif
);
using dlsym_t = void* (*)(void*, const char*);
static init_t original_cuInit{nullptr};
static proc_t original_cuGetProcAddress{nullptr};
static dlsym_t original_dlsym{nullptr};

static __attribute__((constructor)) void init_cuInit_hack()
{
// Hack hack hack, relies on matching the exact glibc version
original_dlsym = (dlsym_t)dlvsym(RTLD_NEXT, "dlsym", "GLIBC_2.2.5");
if (original_dlsym) {
original_cuGetProcAddress = (proc_t)original_dlsym(RTLD_NEXT, "cuGetProcAddress");
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For driver calls there are two ways python libraries resolve them:

  1. [numba does this] dlopen libcuda.so and then dlsym on the handle
  2. [cuda-python does this] dlopen libcuda, dlsym cuGetProcAddress and then call cuGetProcAddress to get the driver symbol

So unfortunately, it's not sufficient to just define cuInit in this shared library and override the symbol resolution via LD_PRELOAD. We have to instead patch into dlsym and cuGetProcAddress. The latter is easy, the former is hard (we can't just dlsym(RTLD_NEXT, ...) here because that would call the local function. Instead, we use GLIBC's versioned lookup dlvsym, but now we need to match the glibc version exactly in the running environment (this is the one my conda environment has).

I guess I could spin over a bunch of versions until I find the right one.

Any other suggestions gratefully received.

Copy link
Contributor Author

@wence- wence- Dec 12, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but now we need to match the glibc version exactly in the running environment (this is the one my conda environment has).

Versioning is not quite as bad as this, 2.2.5 is a magic number but will be stable forever (due to glibc's forward-compat guarantee).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you patch into numba and cuda-python instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you patch into numba and cuda-python instead?

I can patch into numba, because the cuda interface is implemented in python, but can't do that for either cuda-python (or cupy) because their cuda interface is implemented in cython (so compiled) and hence monkey-patching won't work.

I also want to avoid a situation where some further third-party dependency is pulled in that also brings up a cuda context (perhaps directly via the C API). Since eventually everyone actually calls into the driver API, this seems like the best place to hook in.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are some Linux options that I think are more reliable and do not require patching.
How about something like this: https://stackoverflow.com/questions/5103443/how-to-check-what-shared-libraries-are-loaded-at-run-time-for-a-given-process ?
I could try to work up a script based on this if you'd like.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will that tell me if cuInit is called? I think no

Copy link
Contributor Author

@wence- wence- Dec 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we can run the process and inspect with nvml and try and match that way

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This SO post seems to settle on basically the same thing that you do (funnily enough, there's another post about how Citrix copy-pasted this solution disregarding the issues and broke some users).

Due to the extensive dlopening/dlsyming happening, I am not sure that either strace or ltrace or anything like them will be sufficient to detect the calls, which would have been the easier route here as David suggests. If all functions were called by name then I think ltrace would have been sufficient, but as it is you'll only see the dlopen of libcuda.so and then the dlsym of some arbitrary memory address. You could hope that the dlsym calls always use a name for the handle that includes cuInit; I think that would show up? It would probably only catch a subset of cases though.

}

extern "C" {
CUresult cuInit(unsigned int flags)
{
if (!original_cuInit) {
void* ptr{nullptr};
CUresult err = original_cuGetProcAddress("cuInit",
&ptr,
CUDA_VERSION,
CU_GET_PROC_ADDRESS_DEFAULT
#if CUDA_VERSION >= 12000
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ABI change.

,
nullptr
#endif
);
if (err != CUDA_SUCCESS) { return err; }
if (ptr) { original_cuInit = (init_t)(ptr); }
}
std::cerr << "cuInit has been called " << ++cuInitCount << " times" << std::endl;
if (original_cuInit) {
return original_cuInit(flags);
} else {
return CUDA_ERROR_NOT_INITIALIZED;
}
}

CUresult cuGetProcAddress(const char* symbol,
void** pfn,
int cudaVersion,
cuuint64_t flags
#if CUDA_VERSION >= 12000
,
CUdriverProcAddressQueryResult* symbolStatus
#endif
)
{
if (!original_cuGetProcAddress) { return CUDA_ERROR_NOT_SUPPORTED; }
CUresult err = original_cuGetProcAddress(symbol,
pfn,
cudaVersion,
flags
#if CUDA_VERSION >= 12000
,
symbolStatus
#endif
);
if (std::string{symbol} == "cuInit") {
original_cuInit = (init_t)(*pfn);
*pfn = (void*)cuInit;
}
return err;
}

void* dlsym(void* handle, const char* name_)
{
std::string name{name_};
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: error handling in all these wrappers in case the resolution of the original functions failed (at which point we can only abort)

if (name == "cuInit") {
return (void*)cuInit;
} else if (name == "cuGetProcAddress") {
return (void*)cuGetProcAddress;
} else {
return original_dlsym(handle, name_);
}
}
}
} // namespace
#endif
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This same stuff could easily be extended to address @jrhemstad's request in #11546 that one test that RMM is the only allocator of memory.

28 changes: 28 additions & 0 deletions python/cudf/cudf/tests/test_nocuinit.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,28 @@
# Copyright (c) 2022, NVIDIA CORPORATION.

import os
import subprocess
import sys
from pathlib import Path

import pytest

location = Path(__file__)
cpp_build_dir = location / ".." / ".." / ".." / ".." / ".." / "cpp" / "build"
libintercept = (cpp_build_dir / "libcudfcuinit_intercept.so").resolve()
Comment on lines +10 to +12
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the right way to reference this? Right now I'm assuming the build directory exists (because I didn't manage to wrangle cmake to install the library). Equally, however, I'm not sure we really want to install this library?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh boy, this is fun. I don't think there is a perfect solution here. FWIW my approach to this in #11875 was to move building the preload lib out of the main libcudf build, build it separately as part of CI, and then just launch tests with the preload library directly from the CLI in CI. That functionality was disabled as part of the Jenkins->GHA migration. Given that you're working on this, it may be time to investigate how to reenable that functionality within GHA.

@robertmaynard do you think that preload libraries like this or the stream verification lib should be built within the main CMakeLists.txt for the library, or shipped along with the conda packages? I had avoided that mostly because in the end we need the paths to the library anyway in order to preload, so it's not a great fit, but I know others had expressed different opinions. Depending on what direction we take with that we will need to adapt the solution in this pytest for how the library is discovered I think.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Presume that the stream verification lib is also a single library. My first thought had been to just compile to the .so as part of the test, referencing the source directory. But then I realised that I need someone to provide information about the compiler configuration and so forth.



@pytest.mark.skipif(
not libintercept.exists(),
reason="libcudfcuinit_intercept.so not built, can't check for cuInit",
)
def test_import_no_cuinit():
env = os.environ.copy()
env["RAPIDS_NO_INITIALIZE"] = "1"
env["LD_PRELOAD"] = str(libintercept)
output = subprocess.check_output(
[sys.executable, "-c", "import cudf"],
env=env,
stderr=subprocess.STDOUT,
)
assert "cuInit has been called" not in output.decode()