
[FEA] Establish conventions to enable fully CUDA-less builds #823

Open
cjnolet opened this issue Sep 13, 2022 · 7 comments
Labels: feature request (New feature or request), inactive-30d

Comments

cjnolet (Member) commented Sep 13, 2022

Currently there are 3 types of build environments in which a user might use RAFT headers:

  1. CUDA-runtime installed, gcc & nvcc
  2. CUDA-runtime installed, gcc & !nvcc
  3. No CUDA-runtime installed, gcc & !nvcc

For the most part, RAFT currently supports the first two build environments, providing separate hpp and cuh header extensions to distinguish code that expects only the CUDA runtime (hpp) from code that must also be compiled with nvcc (cuh).
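To make that convention concrete, here is a hypothetical pair of headers (the file and function names are invented for illustration): the hpp file calls CUDA runtime APIs but needs no nvcc, while the cuh file defines a kernel and therefore does.

```cpp
// copy_async.hpp -- hypothetical: uses only the CUDA runtime API, so it
// builds with a plain host compiler (gcc) linked against libcudart.
#include <cuda_runtime_api.h>
#include <cstddef>

inline void copy_to_device(void* dst, const void* src, std::size_t bytes, cudaStream_t stream)
{
  cudaMemcpyAsync(dst, src, bytes, cudaMemcpyHostToDevice, stream);
}

// fill.cuh -- hypothetical: defines a __global__ kernel, so it can only
// be compiled by nvcc.
template <typename T>
__global__ void fill_kernel(T* out, T value, std::size_t n)
{
  std::size_t i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) { out[i] = value; }
}
```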

The third environment needs to be supported for Triton workloads, and it's currently a challenge for users to determine which headers don't require the CUDA runtime. It's even challenging for us, the RAFT developers, to list those files and to explain to users a coherent set of rules for finding them.

We should start thinking about what these conventions might look like. For example, should we go a step further than separating the header extensions and also separate the namespaces, to distinguish CUDA-runtime code from non-CUDA code (e.g. raft::core::runtime / raft::distance::runtime vs raft::core::device / raft::core::host)?

This issue also relates to #806.

@cjnolet cjnolet added the feature request New feature or request label Sep 13, 2022
@cjnolet cjnolet changed the title [FEA] Enable non-CUDA builds [FEA] Enable fully non-CUDA builds Sep 13, 2022
@cjnolet cjnolet changed the title [FEA] Enable fully non-CUDA builds [FEA] Enable fully CUDA-less builds Sep 13, 2022
cjnolet (Member, Author) commented Sep 13, 2022

cc @wphicks @divyegala @mhoemmen @teju85 for thoughts and ideas here.

mhoemmen (Contributor) commented

@cjnolet I would prefer being able to spell algorithms the same way regardless of where they run, and telling them whether to run on host or device based on an execution policy parameter. This would align with the C++ Standard, as well as other NVIDIA products like Thrust.

"Spell the same way" means same function name, same namespace, but different overloads. Ultimately those overloads need to call implementation details that might depend on whether a compiler or library (e.g., CUDA runtime) is available. The library question can be handled by the same implementation strategies that C++ BLAS wrappers use. The compiler question is a bit harder. I could see users opting into this explicitly by including headers. By analogy, users need to include <execution> in order to use parallel execution policies in Standard Algorithms.

cjnolet (Member, Author) commented Sep 13, 2022

"Spell the same way" means same function name, same namespace, but different overloads.

@mhoemmen I should have been clearer in my description of this problem. Just to clarify: I am not suggesting we use host and device in our namespaces. Indeed, the direction we are taking to support device-only vs host-only public APIs for the primitive functions is still generally the execution-environment direction we've discussed previously.

It might help if I back up for a moment and provide some background: when we first started building these primitives, we never even considered the question "Can I build without CUDA even being installed on my system?". As a result, we've created a lot of headers, spread out across the library, which all assume at least the CUDA runtime is installed and available. Some of the headers (hpp extensions) can be built without nvcc, but for the most part, the headers in RAFT were written assuming at least the CUDA runtime would be installed and accessible on the system.

There are several headers spread across the library that don't actually require the CUDA runtime: raft/core/comms.hpp, raft/core/error.hpp, and raft/distance/distance_type.hpp, for example. Right now, there isn't an easy way for a user to know which hpp headers are actually safe to use without the CUDA runtime installed, and expecting users to follow the includes of every file is probably not the best approach.

There just doesn't seem to be a good way today to tell users "Here's the list of RAFT headers which are safe to use in a public API and which don't require CUDA-runtime to be installed at all". @teju85 started adding runtime to namespaces which require CUDA-runtime to be installed. Is that the answer? I'm mostly looking for thoughts on a strategy to let users know up front which files don't require it (and document said strategy so we follow it consistently).
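To illustrate what that namespace convention might look like (everything below is an invented sketch, not RAFT's actual layout): a nested runtime namespace would flag, right at the call site, that the enclosing code requires the CUDA runtime to be installed.

```cpp
namespace raft {

// Plain C++: safe with no CUDA installation at all, in the spirit of
// raft/core/error.hpp or raft/distance/distance_type.hpp.
enum class DistanceType { L2Expanded, InnerProduct };

// Anything nested under `runtime` would signal that the CUDA runtime
// (libcudart) must be installed, even when nvcc is not required.
namespace runtime {
void device_synchronize();  // hypothetical wrapper over cudaDeviceSynchronize
}  // namespace runtime

}  // namespace raft
```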

@cjnolet cjnolet changed the title [FEA] Enable fully CUDA-less builds [FEA] Establish conventions to enable fully CUDA-less builds Sep 14, 2022
wphicks (Contributor) commented Sep 14, 2022

I think this is probably worth an offline discussion, but I'll go ahead and give a brief description of how I'm handling this in the FIL CPU-GPU interop work. There, I distinguish between "consumable" and "implementation" headers, which is a little different from "public" and "detail." Consumable headers may still not be part of the public API, but they are intended to be consumed by other code, whereas implementation headers should never be #included by anything except the consumable header they are associated with.

As an example, here is a consumable header which is used for copying both CPU and GPU memory chunks. Note that it is within a detail directory, so I don't expect this to be part of the public-facing API. Its CPU implementation header is here, and its GPU implementation header is here.

Within the consumable header, we conditionally include the GPU implementation header based on whether the ENABLE_GPU compile flag has been turned on. Therefore, the consumable header is always safe to include, and anything which requires the CUDA runtime API is isolated to the GPU implementation header. This gives us a single, obvious dividing line which CUDA headers never cross.
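Condensed into a single sketch (the file names are invented; ENABLE_GPU is taken from the description above), that layout looks something like this:

```cpp
// ---- detail/copy.hpp: the "consumable" header, always safe to include.
#pragma once
#include "copy_cpu.hpp"
#ifdef ENABLE_GPU
#include "copy_gpu.hpp"  // the only doorway through which CUDA headers enter
#endif

// ---- detail/copy_cpu.hpp: CPU implementation header, plain C++ only.
#include <algorithm>
#include <cstddef>

inline void copy_bytes_cpu(char* dst, const char* src, std::size_t n)
{
  std::copy_n(src, n, dst);
}

// ---- detail/copy_gpu.hpp: GPU implementation header; including it
// requires the CUDA runtime to be installed.
#include <cuda_runtime_api.h>

inline void copy_bytes_gpu(char* dst, const char* src, std::size_t n, cudaStream_t stream)
{
  cudaMemcpyAsync(dst, src, n, cudaMemcpyDefault, stream);
}
```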

I don't know that we would want to adopt this across all of RAFT. There are some parts of RAFT which just don't make sense without access to the CUDA runtime API, and there the additional layer of indirection may not offer anything useful. The nice thing about this mechanism though is that if we decide that a particular header does need to provide CPU/GPU interop later, then the existing header just becomes the "consumable" header, and its functionality can be split out into implementation headers without requiring any downstream or internal changes whatsoever. So, we could progressively apply this mechanism as needed, organically growing the functionality that is available with or without CUDA.

cjnolet (Member, Author) commented Sep 14, 2022

Thanks @wphicks! The copy example you've provided looks very similar to what @divyegala and I (and @mhoemmen) have been discussing around execution environments, and our new mdspan-ified API should enable that by separating the RAFT functions into overloads meant for device, those meant for host, and combinations of such. That tells me we could already be close to alignment, which is great.

As for the larger discussion about RAFT's conventions for documenting the existing non-CUDA-compilable files and making them easier to find, I think some solutions are becoming clearer as I've been working through this PR to move things around for 22.10. It's a draft at the moment, but I've been attempting to document the conventions in that PR as they take shape.

wphicks (Contributor) commented Sep 19, 2022

Fantastic! This sounds like a great improvement and well-aligned with other CPU-GPU interop work.

> by separating the RAFT functions into overloads meant for device, those meant for host, and combinations of such

The only thing I'd want us to think about a little bit is ensuring that this does not lock us into a design that requires that we know whether data is on device/host. One key element that can substantially simplify parts of our design is allowing some code to be agnostic to that. It may be that the determination of device/host was made upstream of a particular function call or that we have objects that fully encapsulate device-specific behavior, but in both cases it can be useful to explicitly say "We don't care!" when it comes to the memory type.

In general, that shouldn't be an issue, but we should be wary of interop mechanisms that assume that we must rely on overloads to make that distinction.
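For instance (an invented sketch, not an existing API): a helper templated on the view type compiles and runs identically whether the view wraps host or device memory, because it never asks.

```cpp
#include <cstddef>

// Memory-type-agnostic helper: it accepts any mdspan-like view, whether
// the underlying allocation lives on the host or the device. The
// host/device decision was made upstream, when the view was created;
// this code explicitly does not care.
template <typename ViewType>
bool is_empty(const ViewType& view)
{
  return view.size() == 0;
}
```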

github-actions (bot) commented

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.
