
[BUG] Pinned memory allocation within Parquet reader can be very slow #7600

Closed
jlowe opened this issue Mar 15, 2021 · 13 comments
Labels
bug Something isn't working cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code. Performance Performance related issue Spark Functionality that helps Spark RAPIDS

Comments

@jlowe
Member

jlowe commented Mar 15, 2021

Describe the bug
The libcudf Parquet reader performs pinned memory allocations, and in some environments (e.g., cloud and other virtualized environments) pinned memory allocation can be expensive in practice. Here's a screenshot of an Nsight Systems trace showing cudaHostAlloc taking 1.5 seconds on the first call. Subsequent calls are relatively cheap, likely because the OS has already paid the cost of rearranging memory and reuses that work.

[screenshot: Nsight Systems trace showing the first cudaHostAlloc taking ~1.5 seconds]

In this case the allocation was so slow that I'm not sure pinned memory was cost-effective overall compared to using pageable memory directly; that likely depends on how many times the application calls the Parquet reader.

This use-case was with the RAPIDS Accelerator for Apache Spark which often pre-allocates a pool of pinned memory up-front when it starts. If the Parquet reader had a way of reusing the pre-allocated pinned memory pool provided by the application then this slow allocation could be avoided.

Steps/Code to reproduce bug
The cost seems to be very much dependent upon the runtime environment. I've seen it most often in cloud-like environments. I suspect it could occur in a bare-metal environment as well if the memory was filled with buffers and page-cache, and the OS needed to rearrange pages to form a physically contiguous chunk.

Expected behavior
Parquet reader should not spend excessive amounts of time allocating memory.

@jlowe jlowe added bug Something isn't working Needs Triage Need team to review and classify libcudf Affects libcudf (C++/CUDA) code. cuIO cuIO issue labels Mar 15, 2021
@jlowe jlowe added Spark Functionality that helps Spark RAPIDS and removed Needs Triage Need team to review and classify labels Mar 15, 2021
@jrhemstad
Contributor

We can solve this by having RMM provide a pinned memory pool and exposing hooks in libcudf to plug in a pinned pool. That way users (e.g., Spark) can plug in their own pinned memory pool, or just use the default one.

@jlowe jlowe added the Performance Performance related issue label Mar 15, 2021
@abellina
Contributor

abellina commented Apr 6, 2021

Please note that I see a behavior difference between 0.19 and 0.18 regarding this.

In 0.18, I don't see the slow cudaHostAlloc at Parquet read time reported here. In 0.19, I also see a cuInit at Parquet read time that I do not see in 0.18, followed by that slow cudaHostAlloc. It seems we are paying the initialization cost later than we did in 0.18.

@abellina
Contributor

abellina commented Apr 7, 2021

OK, I think I may have found the culprit. For libcudf.so, it looks like we are linking the CUDA runtime statically. I'm confused about how this is happening, but if I pass -DCUDA_USE_STATIC_CUDA_RUNTIME=OFF, symbols like cuInit and the other CUDA runtime symbols go away. The extra cuInit call and the Parquet slowness go away as well.

A "fast" libcudf.so has no cuInit symbols for instance:

$ nm libcudf.so |grep cuInit
$ echo $?
1

A "slow" one has the symbols:

00000000036ec9a8 b _ZN6cudart12__fun_cuInitE
00000000029a2ac0 t _ZN6cudartL15__unimpl_cuInitEj
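The `echo $?` check above works because grep exits 0 when the pattern matches and 1 when it does not. A self-contained illustration (the symbol string below is copied from the "slow" listing; the second input is made up):

```shell
# grep's exit status distinguishes the two cases: 0 means cuInit symbols
# were found (statically linked cudart), 1 means none were found.
printf '_ZN6cudart12__fun_cuInitE\n' | grep -q cuInit; echo "status: $?"
printf 'no_cudart_symbols_here\n' | grep -q cuInit; echo "status: $?"
```

So a "fast" libcudf.so yields status 1 from `nm libcudf.so | grep cuInit`, and a "slow" one yields status 0.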

As far as I understand, CUDA_USE_STATIC_CUDA_RUNTIME is specific to FindCUDA and we shouldn't be using it (we're using FindCUDAToolkit, right?). Yet this option is removing the symbols for me.

It would be good to know if others can corroborate this. This PR looks to be where this behavior changed: 61091a0.

@abellina
Contributor

abellina commented Apr 7, 2021

One place that wants FindCUDA is ArrowCUDA (https://github.com/apache/arrow/blob/apache-arrow-1.0.1/cpp/src/arrow/gpu/CMakeLists.txt#L32). We normally statically link arrow, so we see this issue.

It seems that if we set CUDA_USE_STATIC_CUDA_RUNTIME=OFF in CMakeLists.txt, we can still get the static cudart via CUDA_STATIC_RUNTIME, so that seems viable. @jrhemstad any objections? @jlowe @nvdbaranec fyi.

@kkraus14
Collaborator

kkraus14 commented Apr 7, 2021

As far as I know, ArrowCUDA only uses the CUDA driver library (https://github.com/apache/arrow/search?q=%22include+cuda%22), but it looks like their CMake erroneously links the runtime as well 😕 (https://github.com/apache/arrow/blob/apache-arrow-1.0.1/cpp/src/arrow/gpu/CMakeLists.txt#L39-L56). I'm guessing that because we link to Arrow before CUDA in our CMake, we inherit Arrow's link options?

cc @trxcllnt @robertmaynard

@abellina
Contributor

abellina commented Apr 7, 2021

@kkraus14, no, this is an issue when we statically link against arrow_cuda, since it also statically links against cudart.

This "CUDA Libraries" line is printed from https://github.com/apache/arrow/blob/apache-arrow-1.0.1/cpp/src/arrow/gpu/CMakeLists.txt#L35. Thanks @jlowe for pointing out where the static link actually happens (https://github.com/apache/arrow/blob/apache-arrow-1.0.1/cpp/src/arrow/gpu/CMakeLists.txt#L56).

-DCUDA_USE_STATIC_CUDA_RUNTIME=OFF:

$ PARALLEL_LEVEL=12 cmake -GNinja -DCMAKE_CUDA_ARCHITECTURES="75" -DCMAKE_INSTALL_PREFIX=$CONDA_PREFIX -DARROW_STATIC_LIB=ON -DUSE_NVTX=0 -DRMM_LOGGING_LEVEL=OFF -DBUILD_BENCHMARKS=OFF -DBUILD_TESTS=OFF -DCUDF_USE_ARROW_STATIC=ON -DCUDA_USE_STATIC_CUDA_RUNTIME=OFF ..|grep CUDA\ Libraries
-- CUDA Libraries: /usr/local/cuda/lib64/libcudart.so

-DCUDA_USE_STATIC_CUDA_RUNTIME=ON (default):

$ PARALLEL_LEVEL=12 cmake -GNinja -DCMAKE_CUDA_ARCHITECTURES="75" -DCMAKE_INSTALL_PREFIX=$CONDA_PREFIX -DARROW_STATIC_LIB=ON -DUSE_NVTX=0 -DRMM_LOGGING_LEVEL=OFF -DBUILD_BENCHMARKS=OFF -DBUILD_TESTS=OFF -DCUDF_USE_ARROW_STATIC=ON -DCUDA_USE_STATIC_CUDA_RUNTIME=ON ..|grep CUDA\ Libraries
-- CUDA Libraries: /usr/local/cuda/lib64/libcudart_static.a;Threads::Threads;dl;/usr/lib/x86_64-linux-gnu/librt.so

@robertmaynard
Contributor

In 0.19 and above we forgot to specify a value for CMAKE_CUDA_RUNTIME_LIBRARY (https://cmake.org/cmake/help/latest/variable/CMAKE_CUDA_RUNTIME_LIBRARY.html) and therefore used the default (static) when compiling CUDA code. Correcting this will make cudf's CUDA_STATIC_RUNTIME variable work as desired when Arrow isn't in the picture.

If Arrow, when built statically, links the CUDA runtime statically, we also have to link the runtime statically, since you can't combine the dynamic and static versions in one program.
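The two control knobs Robert mentions can be illustrated with a hedged CMake fragment (the target name `cudf` here is illustrative, not the project's actual CMake): with the CUDA language enabled, the runtime flavor comes from CMAKE_CUDA_RUNTIME_LIBRARY (or the per-target CUDA_RUNTIME_LIBRARY property), and it has to agree with whichever cudart flavor is linked via FindCUDAToolkit.

```cmake
# Illustrative only: pick one runtime flavor and use it consistently.
set(CMAKE_CUDA_RUNTIME_LIBRARY Shared)  # default for CUDA-language targets

# Or per target (target name "cudf" is an example):
set_target_properties(cudf PROPERTIES CUDA_RUNTIME_LIBRARY Shared)

# When also using FindCUDAToolkit, link the matching flavor:
target_link_libraries(cudf PRIVATE CUDA::cudart)  # CUDA::cudart_static for Static
```

Mixing the two (e.g., `Shared` for CUDA-language compilation but `CUDA::cudart_static` from the toolkit, or a static cudart inherited from a statically linked dependency) is exactly the inconsistency described above.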

@abellina
Contributor

abellina commented Apr 7, 2021

@robertmaynard thanks, I just tried that but I am not sure that FindCUDA is using it. Perhaps I missed something:

-DCMAKE_CUDA_RUNTIME_LIBRARY=Shared:

$ PARALLEL_LEVEL=12 cmake -GNinja -DCMAKE_CUDA_ARCHITECTURES="75" -DCMAKE_INSTALL_PREFIX=$CONDA_PREFIX -DARROW_STATIC_LIB=ON -DUSE_NVTX=0 -DRMM_LOGGING_LEVEL=OFF -DBUILD_BENCHMARKS=OFF -DBUILD_TESTS=OFF -DCUDF_USE_ARROW_STATIC=ON -DCMAKE_CUDA_RUNTIME_LIBRARY=Shared ..|grep CUDA\ Libraries
-- CUDA Libraries: /usr/local/cuda/lib64/libcudart_static.a;Threads::Threads;dl;/usr/lib/x86_64-linux-gnu/librt.so

@robertmaynard
Contributor

> @robertmaynard thanks, I just tried that but I am not sure that FindCUDA is using it. Perhaps I missed something:
>
> -DCMAKE_CUDA_RUNTIME_LIBRARY=Shared:
>
> $ PARALLEL_LEVEL=12 cmake -GNinja -DCMAKE_CUDA_ARCHITECTURES="75" -DCMAKE_INSTALL_PREFIX=$CONDA_PREFIX -DARROW_STATIC_LIB=ON -DUSE_NVTX=0 -DRMM_LOGGING_LEVEL=OFF -DBUILD_BENCHMARKS=OFF -DBUILD_TESTS=OFF -DCUDF_USE_ARROW_STATIC=ON -DCMAKE_CUDA_RUNTIME_LIBRARY=Shared ..|grep CUDA\ Libraries
> -- CUDA Libraries: /usr/local/cuda/lib64/libcudart_static.a;Threads::Threads;dl;/usr/lib/x86_64-linux-gnu/librt.so

FindCUDA doesn't use CMAKE_CUDA_RUNTIME_LIBRARY, as it predates CUDA being a supported language in CMake (via enable_language(CUDA) or project(example LANGUAGES CXX CUDA)).

@abellina
Contributor

abellina commented Apr 7, 2021

Makes sense @robertmaynard.

So for cudf 0.19 it seems we should work around this, plus work on changes in Arrow (especially if it doesn't need cudart) for future cudf versions.

rapids-bot bot pushed a commit that referenced this issue Apr 7, 2021
CMake has two ways to control which CUDA runtime is linked to. This makes sure that the CUDA language controls and CUDAToolkit both target the same runtime.

Issue brought up in: #7600

Authors:
  - Robert Maynard (https://github.com/robertmaynard)

Approvers:
  - Alessandro Bellina (https://github.com/abellina)
  - Keith Kraus (https://github.com/kkraus14)

URL: #7887
rapids-bot bot pushed a commit that referenced this issue Apr 7, 2021
This PR does two things:
- It adds a check that will fail the build if it detects that CUDA runtime was linked statically. For now, that seems like a safe bet, and if we decide to start building with a static CUDA in the future, we should remove that check. 
- As part of investigation for #7600, libnvcomp was the last library that had a statically linked CUDA runtime, so this PR addresses that.

Authors:
  - Alessandro Bellina (https://github.com/abellina)

Approvers:
  - Jason Lowe (https://github.com/jlowe)

URL: #7896
@kkraus14
Collaborator

kkraus14 commented Apr 8, 2021

@jlowe @abellina is this resolved?

@abellina
Contributor

abellina commented Apr 8, 2021

@kkraus14 I retested my case with the latest nightly builds and cuInit is gone. All I see is a fast cudaHostAlloc (~1ms), and my parquet scan takes ~100ms (instead of 1+ second) so this fixes what I reported for 0.19.

@jlowe shared the original trace with me, and I verified that it had a cuInit before the long cudaHostAlloc:

[screenshot: trace showing cuInit immediately preceding the slow cudaHostAlloc]

Closing. Thanks @robertmaynard, @kkraus14, @jrhemstad, @jlowe and @nvdbaranec (who suspected this was static cudart).

@abellina abellina closed this as completed Apr 8, 2021
@kkraus14
Collaborator

kkraus14 commented Apr 9, 2021

To further close the loop here, a PR has been merged in Arrow so that, as of the 4.0.0 release, it will no longer link to libcudart: apache/arrow@5b5c058
