Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Default value of CMAKE_CUDA_ARCHITECTURES of NATIVE breaks cmake CMakeDetermineCUDACompiler #84

Closed
csadorf opened this issue Oct 20, 2022 · 15 comments

Comments

@csadorf
Copy link
Contributor

csadorf commented Oct 20, 2022

The default value of NATIVE for CMAKE_CUDA_ARCHITECTURES defined here

export CUDAARCHS="${CUDAARCHS:-${CMAKE_CUDA_ARCHITECTURES:-NATIVE}}"

breaks CMakeDetermineCUDACompiler with error message

  CMAKE_CUDA_ARCHITECTURES:

    NATIVE

  is not one of the following:

    * a semicolon-separated list of integers, each optionally
      followed by '-real' or '-virtual'
    * a special value: all, all-major, native

See also: d9f617e#r87361335

@csadorf
Copy link
Contributor Author

csadorf commented Oct 20, 2022

The support for the native argument for CUDA_ARCHITECTURES was added to cmake here.

It appears the magic value was always native, not NATIVE, so I am a bit puzzled why this has not been an issue before. Especially, because the default value of NATIVE is found in multiple places within rapids:

rg 'ARCHITECTURE.*NATIVE'
cugraph/build.sh
193:        CUGRAPH_CMAKE_CUDA_ARCHITECTURES="NATIVE"
215:        CUGRAPH_CMAKE_CUDA_ARCHITECTURES="NATIVE"

cuml/cpp/README.md
47:| CMAKE_CUDA_ARCHITECTURES |  List of GPU architectures, semicolon-separated | Empty  | List the GPU architectures to compile the GPU targets for. Set to "NATIVE" to auto detect GPU architecture of the system, set to "ALL" to compile for all RAPIDS supported archs: ["60" "62" "70" "72" "75" "80" "86"].  |

cuml/build.sh
237:        CUML_CMAKE_CUDA_ARCHITECTURES="NATIVE"

compose/etc/bash-utils.sh
1571:    export CUDAARCHS="${CUDAARCHS:-${CMAKE_CUDA_ARCHITECTURES:-NATIVE}}"

cuspatial/build.sh
144:    CUSPATIAL_CMAKE_CUDA_ARCHITECTURES="-DCMAKE_CUDA_ARCHITECTURES=NATIVE"

raft/build.sh
295:        RAFT_CMAKE_CUDA_ARCHITECTURES="NATIVE"

cudf/build.sh
262:        CUDF_CMAKE_CUDA_ARCHITECTURES="${CUDF_CMAKE_CUDA_ARCHITECTURES:-NATIVE}"
263:        if [[ "$CUDF_CMAKE_CUDA_ARCHITECTURES" == "NATIVE" ]]; then

@csadorf
Copy link
Contributor Author

csadorf commented Oct 20, 2022

After some digging I found that the NATIVE value is a valid argument for the rapids_cuda_set_architectures() function defined in rapids-cmake/cuda/set_architectures.cmake, but is not a valid argument for CMAKE_CUDA_ARCHITECTURES. The latter only supports native as of version 3.24.

However, even when rapids_cuda_set_architectures() is not directly called, it will be invoked by rapids_init_architectures(..) with rapids_cuda_set_architectures(NATIVE) in case that the CMAKE_CUDA_ARCHITECTURES variable is set to that value. See here.

This means the bug is only triggered if CMakeDetermineCUDACompiler is run before rapids_cuda_init_architectures(...) is called. I am working on determining the MRE.

csadorf added a commit to csadorf/rapids-compose that referenced this issue Oct 21, 2022
@trxcllnt
Copy link
Owner

It looks like cuML does call rapids_cuda_init_architectures() before calling project(CUML), so using NATIVE here should be OK.

It seems rapids_cuda_set_architectures doesn't accept native (it does accept NATIVE), therefore using the lower-case form breaks building cugraph (because they enumerate the CMAKE_CUDA_ARCHITECTURES list manually).

Can we change this back to NATIVE?

@csadorf
Copy link
Contributor Author

csadorf commented Nov 15, 2022

Can we change this back to NATIVE?

Looks like it was already changed? It seems like even just adding CMAKE_CUDA_ARCHITECTURES=NATIVE is causing some issues for cuML. The python builds fail now with:

-- Detecting CXX compile features - done
CMake Error at /raid/sadorf/dev/rapids/rapids-compose/etc/conda/cuda_11.5/envs/rapids/share/cmake-3.24/Modules/CMakeDetermineCUDACompiler.cmake:279 (message):
  CMAKE_CUDA_ARCHITECTURES:

    NATIVE

  is not one of the following:

    * a semicolon-separated list of integers, each optionally
      followed by '-real' or '-virtual'
    * a special value: all, all-major, native

Call Stack (most recent call first):
  CMakeLists.txt:54 (enable_language)

I can will try to find a fix for this. that does not require changing the value back.

@trxcllnt
Copy link
Owner

I think the issue was with setting the CUDAARCHS envvar, which CMake says is used to initialize the value of CMAKE_CUDA_ARCHITECTURES. For a while CMake had a bug which made this not true, but now it seems that's been fixed.

I pushed a fix yesterday that sets CUDAARCHS=native, but also explicitly sets CMAKE_CUDA_ARCHITECTURES=NATIVE, which will take precedence over the CUDAARCHS envvar in C++ builds.

We don't explicitly pass CMAKE_CUDA_ARCHITECTURES to Python builds, so cuML's scikit-build will still initialize CMAKE_CUDA_ARCHITECTURES from the CUDAARCHS envvar (i.e. lowercase native).

I tested via build-cuml-python and it seemed to be good, but let me know if you're still seeing issues.

@csadorf
Copy link
Contributor Author

csadorf commented Nov 16, 2022

I tested via build-cuml-python and it seemed to be good, but let me know if you're still seeing issues.

Not sure what's different in my setup, but I still needed to initiate build-cuml-python twice, it fails with the above quoted error on the first attempt.

@trxcllnt
Copy link
Owner

Do you have CMAKE_CUDA_ARCHITECTURES set in your .env file?

@trxcllnt
Copy link
Owner

One option could be to set CUDAARCHS=native in your .env file. That would cause it to always use native for both CMAKE_CUDA_ARCHITECTURES and CUDAARCHS.

@csadorf
Copy link
Contributor Author

csadorf commented Nov 17, 2022

One option could be to set CUDAARCHS=native in your .env file. That would cause it to always use native for both CMAKE_CUDA_ARCHITECTURES and CUDAARCHS.

That doe solve the problem for me, but I think we should try to get this sorted out so that things just work-out-of-the box. I assume there is something we would need to adjust in the cuml python build process?

@vyasr
Copy link
Collaborator

vyasr commented Nov 26, 2022

I would agree that we should sort this out for all users. I wonder if there is a simpler path of least resistance here. rapids-cmake adds support for two special flags, ALL (build for all architectures) and NATIVE as discussed on this thread. Both of these improvements have been upstreamed to CMake itself (in lowercase form ALL->all and NATIVE->native), but we have not yet updated our minimum required CMake version enough to use both. We currently require 3.23, and while support for all was added in 3.23, support native is only in 3.24. Once we bump our required CMake version across RAPIDS I think we should be able to remove the specialized functionality in rapids-cmake unless there are subtle differences that I am not aware of. @robertmaynard is that something that you planned for when we transition to 3.24? That migration was on my todo list, but not something I got around to in 22.12 because of wheels-related work. Will this problem simply be fixed if we bump our minimum required CMake version to 3.24 and make the associated rapids-cmake changes? If so, I can front-load that in 23.02 since we also have a number of other improvements to the build process that I want to make around Python/CMake builds and wheels that will be facilitated by the move.

@robertmaynard
Copy link
Contributor

robertmaynard commented Nov 28, 2022

So there are differences between how NATIVE from rapids-cmake work and how native in CMake 3.24. These differences will have an impact on projects like cugraph.

The biggest difference is that NATIVE is transformed into an explicit list of architectures after the project call. This allows projects like cugraph to iterate the values to subset further. Cmake 3.24+ native will either stay as the string native or become a set of architecutres based on the nvcc compiler version ( will stay native when the compiler supports -arch=native ).

Once we bump our required CMake version across RAPIDS I think we should be able to remove the specialized functionality in rapids-cmake

I agree, while rapids-cmake version of these functions has side-effects that projects like cugraph use this is needed as the long term goal is to remove features from rapids-cmake as they arrive in CMake proper.

Will this problem simply be fixed if we bump our minimum required CMake version to 3.24 and make the associated rapids-cmake changes

No. To me we have a secondary issue that compose is setting CUDAARCHS to NATIVE with no DCMAKE_CUDA_ARCHITECTURES value which breaks projects.

The issue is that rapids-cmake rapids_init_architectures presuming only valid values for CUDAARCHS and doesn't expect it would contain ALL or NATIVE. I think we should extend rapids-cmake to also look for all/native/ALL/NATIVE and if used with CMake < 3.24 we map it to the current NATIVE / ALL behavior in both the env var and the cmake var.

I think the issue was with setting the CUDAARCHS envvar, which CMake says is used to initialize the value of CMAKE_CUDA_ARCHITECTURES. For a while CMake had a bug which made this not true, but now it seems that's been fixed.

I believe that was this rapids-cmake bug: rapidsai/rapids-cmake#268 which means that rapids-cmake 22.10+ will consider CUDAARCHS

It seems like even just adding CMAKE_CUDA_ARCHITECTURES=NATIVE is causing some issues for cuML. The python builds fail now with....

As outline above there is a rapids-cmake bug where it doesn't transform CUDAARCHS values when no CMAKE_CUDA_ARCHITECTURES is also provided.

But that won't unblock you as python cuml is incorrectly setting up rapids_cuda_init_architectures here: https://github.com/rapidsai/cuml/blob/branch-22.12/python/CMakeLists.txt#L54 and https://github.com/rapidsai/cuml/blob/branch-22.12/python/CMakeLists.txt#L88. As noted in the docs for the function this requires a call to project with the same name for the activation to occur. In cuml pythons case it will need todo the following to have the NATIVE hooks actiaved:

  rapids_cuda_init_architectures(cuml-python)
  enable_language(CUDA)
  include("${CMAKE_PROJECT_cuml-python_INCLUDE}")

@vyasr
Copy link
Collaborator

vyasr commented Nov 28, 2022

But that won't unblock you as python cuml is incorrectly setting up rapids_cuda_init_architectures here: https://github.com/rapidsai/cuml/blob/branch-22.12/python/CMakeLists.txt#L54 and https://github.com/rapidsai/cuml/blob/branch-22.12/python/CMakeLists.txt#L88. As noted in the docs for the function this requires a call to project with the same name for the activation to occur. In cuml pythons case it will need todo the following to have the NATIVE hooks actiaved:

I've already fixed this bug in the wheels PR for cuML. I believe it was introduced as part of the patch that added the extra get_raft.cmake invocation to handle the lack of COMPONENTS support in rapids-cmake (now fixed). However, that uncovered an additional problem (which manifests in the form of the same error), which is that when cuML finds a preexisting raft package with the distance COMPONENT it conditionally enables the CUDA language before rapids-cmake has a chance to parse out the NATIVE. Is it worth trying to embed that requirement into libcuml's build system, or should we just modify the Python build?

@robertmaynard
Copy link
Contributor

We should modify the python build to manage CMAKE_CUDA_ARCHITECTURES ( via rapids-cmake ) before we call find_package(cuml)

@vyasr
Copy link
Collaborator

vyasr commented Nov 28, 2022

That's specifically necessary here because of the transitive call to find raft that occurs, correct? We normally only need to do this if we're compiling the C++ libraries (not if we find them preinstalled).

@robertmaynard
Copy link
Contributor

Yes it is due to the fact that some header dependency ( thrust or cuco ) is enabling the CUDA language

rapids-bot bot pushed a commit to rapidsai/rapids-cmake that referenced this issue Jan 4, 2023
CMake 3.23+ now offers support for `all` and `all-major` as special keywords to `CMAKE_CUDA_ARCHITECTURES`. This means that rapids-cmake overload on `ALL` now creates confusion and breaks expectations ( trxcllnt/rapids-compose#84 https://gitlab.kitware.com/cmake/cmake/-/issues/23739, #104 ) 

This PR introduces the new magic keyword of `RAPIDS` which behaves the same as `ALL`, and will become the new apporved approach for projects that want to build for RAPIDS supported platforms.

Fixes #314


More details on the plans of deprecation for `ALL` can be found at: #318

Authors:
  - Robert Maynard (https://github.com/robertmaynard)

Approvers:
  - Bradley Dice (https://github.com/bdice)

URL: #327
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants