cuDF: Need to canonicalize dlopen'd library names #12708

leofang · 2023-02-06T19:06:26Z

Documenting an offline discussion with @jakirkham in preparation of CUDA 12 bring-up on conda-forge.

Currently, there're some places where libXXXXX.so without any SONAME being dlopen'd, for example,

cudf/cpp/src/io/utilities/file_io_utilities.cpp

Line 120 in c7db81a

cf_lib = dlopen("libcufile.so", RTLD_LAZY | RTLD_LOCAL | RTLD_NODELETE);

There will become problematic because the libXXXXX.so symlink is supposed to exist only in the libXXXXX-dev package by the stand practice of Linux distros, not in libXXXXX which only contains libXXXXX.so.{major}. The dlopen'd names need to be canonicalized.

The text was updated successfully, but these errors were encountered:

vyasr · 2023-02-08T20:14:54Z

What is the right approach to do this when we want to build libcudf artifacts for both CUDA 11 and 12? Will we need to guard the dlopens with preprocessor checks of CUDART_VERSION or something similar? This assumes the APIs haven't changed of course, in cases where the APIs are different we will need to operate conditionally anyway.

leofang · 2023-02-08T21:32:07Z

The right thing to do is to dlopen libXXXXX.so.major. Keeping libXXXXX.so (no SONAME) as a fallback is probably OK. Going forward this will be compliant with the new conda-forge packaging scheme.

vyasr · 2023-02-08T22:18:42Z

Right, but say I am compiling libcudf twice, once for CUDA 11 and once for CUDA 12. Is the expectation that we do something like

#if CUDART_VERSION > 12000
dlopen(libXXX.so.12)
#else
dlopen(libXXX.so.11)
#endif

in the code?

davidwendt · 2023-02-08T22:48:03Z

This seems to defeat the purposes of dlopen where we want to loosely couple to the shared library at build time.
Perhaps the version should be determined somehow at runtime? There is a good chance I'm misunderstanding this.

vyasr · 2023-02-09T00:17:45Z

It's a bit tricky because we want to couple loosely in some ways (maybe it's OK for the library not to exist, or maybe we're OK with using any minor version available) but we still need to couple tightly in other ways (if our main binary is compiled against CUDA 12 then we probably have to use CUDA 12 versions of the dlopened libraries to guarantee ABI compatibility). But I'm not sure if my snippet is the best solution, or if you would still include additional runtime checks for different versions to provide more useful error messages, or something else.

jakirkham · 2023-02-09T02:52:20Z

cc @vuule (who may have more thoughts on this)

vuule · 2023-02-09T22:26:05Z

Pretty much on the same page as Vyas here. If we generate the .so name at compile time based on the runtime version, we can use it in dlopen. With something like https://github.com/rapidsai/kvikio/blob/branch-23.04/cpp/include/kvikio/shim/utils.hpp#L49 we can fall back to the old name.
It would be nice to generate the .so name in a way that automatically supports future CUDA versions. With hardcoded supported versions, we would silently fail the dlopen call if we forget to update the code at next CUDA version release.

robertmaynard · 2023-02-16T16:31:16Z

The right thing to do is to dlopen libXXXXX.so.major. Keeping libXXXXX.so (no SONAME) as a fallback is probably OK. Going forward this will be compliant with the new conda-forge packaging scheme.

This isn't the approach the CTK takes. The SOVERSION values of CTK libraries is not tightly coupled to the CTK major version. For example libcufle in CTK 12.0 ships libcufile.so.0 since it hasn't reved the ABI compat since release. Same for curand, and cusolver

If we had to encode this we need to parse the versions.json that is packaged with the CTK to determine the mapping from CTK version and SOVERSION.

jakirkham · 2023-04-07T23:11:35Z

Suppose we could create a shim library that links against the right libcufile.so.X during build time and then dlopened that shim library at runtime.

This might strike the balance Vyas described above. Namely linking to the version that we built and tested against (so have some confidence has the feature set we need to run) while simultaneously handling the case were libcufile.so.X is not present for whatever reason (not installed by user, missing from CTK that version, not available for user's architecture, etc.).

Should also handle the linking to the right SOVERSION as Rob mentioned (without hopefully needing to bake too much logic about CTK versioning in cuDF).

Thoughts? 🙂

bdice · 2023-04-07T23:25:47Z

I suppose that a shim library could work. I kind of dislike the idea of having a bunch of shim libraries for all the dlopens in RAPIDS… but it does seem that it would solve the problem. I think I’d rather handle the problem directly with hardcoded versions even if it requires manual updates and fallback logic. Would it be possible to overcome the problem that @vuule mentioned (below) by explicitly testing that the dlopen succeeds? I don’t think we can be guaranteed that new versions would work out of the box, so having manual/hardcoded logic to enable new versions doesn’t sound too awful to me. The net LOC should be lower than the shim approach and it would be more self-contained in libcudf source code.

With hardcoded supported versions, we would silently fail the dlopen call if we forget to update the code at next CUDA version release.

robertmaynard · 2023-04-10T13:41:02Z

Suppose we could create a shim library that links against the right libcufile.so.X during build time and then dlopened that shim library at runtime.

I think we would always have to dlopen/dlsym so that we don't get a DT_NEEDED entry for libcufile.so.X

wence- · 2023-04-11T09:10:46Z

I don’t think we can be guaranteed that new versions would work out of the box, so having manual/hardcoded logic to enable new versions doesn’t sound too awful to me.

If the libraries are not lying about their ABI compatibility, then hard-coded versions "just work", no?

e.g. in 11.8, libcufft is major versioned as .10, whereas in 12.0 it is versioned as .11. This says to me that these two libraries are not ABI compatible, and so it would be unsafe to dlopen the unversioned .so as a "loosely coupled" thing (even if it existed). So if we have code that dlopens libcufft.so.10 and fails, I don't think it can fall back to trying libcufft.so.11 unless there is further conditional handling of potential ABI differences.

In cudf, the dlopen of libcufile appears to be the only one, which accidentally happens to be safe because they've never revved the ABI version.

jakirkham · 2023-04-11T20:09:24Z

Yeah we discussed this today and the consensus was to just hard-code for now. There is only one version to track atm. So that seems like the simplest fix.

We can revisit when there is a SOVERSION bump, which will likely correspond to a new CUDA 12.x that we try to support.

Closes #12708 Authors: - Ashwin Srinath (https://github.com/shwina) - Lawrence Mitchell (https://github.com/wence-) Approvers: - Bradley Dice (https://github.com/bdice) - Divye Gala (https://github.com/divyegala) - Vukasin Milovanovic (https://github.com/vuule) URL: #13210

bdice mentioned this issue Mar 31, 2023

Add libcublas recipe conda-forge/staged-recipes#21901

Merged

10 tasks

jakirkham mentioned this issue Mar 31, 2023

Include libcublas.so in libcublas conda-forge/libcublas-feedstock#1

Closed

jakirkham changed the title ~~Need to canonicalize dlopen'd library names~~ cuDF: Need to canonicalize dlopen'd library names Apr 10, 2023

vyasr mentioned this issue Apr 10, 2023

[BUG] Embedded Python interpreter: RuntimeError: Function "cuInit" not found #12862

Closed

vyasr self-assigned this Apr 11, 2023

jakirkham mentioned this issue Apr 19, 2023

KvikIO: Need to canonicalize dlopen'd library names rapidsai/kvikio#200

Closed

jakirkham unassigned vyasr Apr 21, 2023

This was referenced Apr 24, 2023

Canonicalize libcufile soname #13209

Closed

Use canonicalized name for dlopen'd libraries (libcufile) #13210

Merged

rapids-bot bot closed this as completed in #13210 Apr 26, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cuDF: Need to canonicalize dlopen'd library names #12708

cuDF: Need to canonicalize dlopen'd library names #12708

leofang commented Feb 6, 2023 •

edited

Loading

vyasr commented Feb 8, 2023 •

edited

Loading

leofang commented Feb 8, 2023

vyasr commented Feb 8, 2023

davidwendt commented Feb 8, 2023

vyasr commented Feb 9, 2023

jakirkham commented Feb 9, 2023

vuule commented Feb 9, 2023

robertmaynard commented Feb 16, 2023 •

edited

Loading

jakirkham commented Apr 7, 2023

bdice commented Apr 7, 2023 •

edited

Loading

robertmaynard commented Apr 10, 2023

wence- commented Apr 11, 2023

jakirkham commented Apr 11, 2023

cuDF: Need to canonicalize dlopen'd library names #12708

cuDF: Need to canonicalize dlopen'd library names #12708

Comments

leofang commented Feb 6, 2023 • edited Loading

vyasr commented Feb 8, 2023 • edited Loading

leofang commented Feb 8, 2023

vyasr commented Feb 8, 2023

davidwendt commented Feb 8, 2023

vyasr commented Feb 9, 2023

jakirkham commented Feb 9, 2023

vuule commented Feb 9, 2023

robertmaynard commented Feb 16, 2023 • edited Loading

jakirkham commented Apr 7, 2023

bdice commented Apr 7, 2023 • edited Loading

robertmaynard commented Apr 10, 2023

wence- commented Apr 11, 2023

jakirkham commented Apr 11, 2023

leofang commented Feb 6, 2023 •

edited

Loading

vyasr commented Feb 8, 2023 •

edited

Loading

robertmaynard commented Feb 16, 2023 •

edited

Loading

bdice commented Apr 7, 2023 •

edited

Loading