[BUG] compile error tdigest_aggregation.cu on cuda 12.2 on arm64 #14610
Comments
I am no expert on any of this. All I did was look at the nightly build, see that it failed, and report the failure. I can possibly rerun/retest things in the environment, but I am not sure I even have direct access to it myself right now. Getting a minimal repro case is going to take me a very long time. @sameerz is there someone on our team who can help out with this?
Don’t worry @revans2. I will see what I can do to reproduce and work with the CCCL team.
I think this comes down to an issue with a particular host compiler as well. Do you know which C++ compiler was used?
I can't reproduce it locally, so I suspect it is due to a bug in a specific compiler on ARM.
This removes `cuda::proclaim_return_type` from a device lambda because that lambda is going to be nested inside another device lambda, which is in turn enclosed by `cuda::proclaim_return_type`.

This PR is to fix a compile issue that we encountered:
```
/usr/local/cuda/include/cuda/std/detail/libcxx/include/__functional/invoke.h(402): error: calling a __device__ function("cudf::tdigest::detail::_NV_ANON_NAMESPACE::build_output_column(int, ::std::unique_ptr< ::cudf::column, ::std::default_delete< ::cudf::column> > &&, ::std::unique_ptr< ::cudf::column, ::std::default_delete< ::cudf::column> > &&, ::std::unique_ptr< ::cudf::column, ::std::default_delete< ::cudf::column> > &&, ::std::unique_ptr< ::cudf::column, ::std::default_delete< ::cudf::column> > &&, ::std::unique_ptr< ::cudf::column, ::std::default_delete< ::cudf::column> > &&, bool, ::rmm::cuda_stream_view, ::rmm::mr::device_memory_resource *) ::[lambda(int) (instance 2)]::operator ()(int) const") from a __host__ __device__ function("__invoke") is not allowed
```
Note: The issue is reproducible only in our build environment: ARM architecture, cuda 12 + rockylinux8.

Closes #14610.

Authors:
- Nghia Truong (https://github.com/ttnghia)

Approvers:
- Michael Schellenberger Costa (https://github.com/miscco)
- Karthikeyan (https://github.com/karthikeyann)
- https://github.com/nvdbaranec
- Bradley Dice (https://github.com/bdice)

URL: #14607
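The nesting that the commit message describes can be modelled on the host. The sketch below is plain C++ rather than CUDA, so it does not reproduce the compile error. It only shows the shape of the fix: the inner lambda is left unwrapped and only the outer callable states its return type. `proclaimed`, `proclaim`, and `transform_sizes` are hypothetical names made up for this illustration; they are not the libcu++ implementation of `cuda::proclaim_return_type` and not the cudf code.

```cpp
#include <cassert>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical host-only stand-in for a return-type-proclaiming wrapper.
// Assumption: this only models the idea that the wrapper, not the wrapped
// callable, states the result type. It is not how libcu++ implements it.
template <typename Ret, typename F>
struct proclaimed {
  F f;

  template <typename... Args>
  Ret operator()(Args&&... args) const {
    return f(std::forward<Args>(args)...);
  }
};

template <typename Ret, typename F>
proclaimed<Ret, std::decay_t<F>> proclaim(F&& f) {
  return {std::forward<F>(f)};
}

// Hypothetical helper mirroring the shape of the fix.
std::vector<int> transform_sizes(std::vector<int> const& sizes, int scale) {
  // Inner lambda: left as a plain lambda, with no wrapper of its own.
  auto inner = [scale](int size) { return size * scale; };

  // Outer lambda: the only callable that gets the return-type wrapper.
  auto outer = proclaim<int>([inner](int size) { return inner(size) + 1; });

  // The wrapper, not the lambda, determines the type seen by the caller.
  static_assert(std::is_same<decltype(outer(0)), int>::value,
                "outer is proclaimed to return int");

  std::vector<int> out;
  out.reserve(sizes.size());
  for (int s : sizes) { out.push_back(outer(s)); }
  return out;
}
```

Calling `transform_sizes({1, 2, 3}, 2)` runs each element through the wrapped outer callable, which in turn calls the unwrapped inner lambda. In the real code both lambdas are device lambdas, and the change amounts to dropping the wrapper around the inner one.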
According to the build log, the host compiler is GNU 11.2.1 and NVCC is 12.2, which is almost the same as on my local machine, yet I can't reproduce the issue locally. FYI: the issue showed up in a build system using this Docker image: https://github.com/NVIDIA/spark-rapids-jni/blob/branch-24.02/ci/Dockerfile.multi (but with the CUDA version changed to 12).
To clarify, the bug was triggered when
Describe the bug
Recently our nightly CI failed for CUDA 12.2 on an arm64 server with the following errors.