[BUG] Pinned memory allocation within Parquet reader can be very slow #7600
Comments
This is a problem we can solve by having RMM provide a pinned memory pool and exposing hooks in libcudf to plug a pinned pool in. That way users (e.g. Spark) can plug in their own pinned memory pool, or just use the default one.
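A minimal sketch of that idea, assuming a hypothetical hook (no such registration API existed in libcudf at the time of this issue): the application pays the `cudaHostAlloc` cost once up front, and later reads sub-allocate from that pinned block instead of calling `cudaHostAlloc` themselves.

```cpp
// Sketch only: a trivial bump-pointer pool carved out of a single up-front
// cudaHostAlloc call. The libcudf "hook" described in the comment above is
// hypothetical; this just illustrates paying the pinning cost once and
// sub-allocating afterwards.
#include <cuda_runtime_api.h>

#include <cstddef>
#include <new>
#include <stdexcept>

class pinned_pool {
 public:
  explicit pinned_pool(std::size_t bytes) : size_(bytes) {
    // The expensive pinned allocation happens exactly once, at startup.
    if (cudaHostAlloc(&base_, bytes, cudaHostAllocDefault) != cudaSuccess) {
      throw std::runtime_error("cudaHostAlloc failed");
    }
  }
  ~pinned_pool() { cudaFreeHost(base_); }

  // Hand out pieces of the pinned block; no cudaHostAlloc on the hot path.
  // A real pool would also handle alignment, frees, and growth.
  void* allocate(std::size_t bytes) {
    if (offset_ + bytes > size_) { throw std::bad_alloc(); }
    void* p = static_cast<char*>(base_) + offset_;
    offset_ += bytes;
    return p;
  }

 private:
  void* base_{nullptr};
  std::size_t size_{0};
  std::size_t offset_{0};
};
```

With a hook along those lines, the Parquet reader would take its host staging buffers from whatever pool the application registered, so Spark could hand libcudf the pinned pool it already pre-allocates at startup instead of triggering a fresh cudaHostAlloc inside the reader.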
Please note that I see a behavior difference in 0.19 as compared to 0.18 regarding this. In 0.18, I don't see the slow cudaHostAlloc call.
Ok, I think I may have found the culprit. A "fast" libcudf.so has no statically linked CUDA runtime symbols, while a "slow" one does have them. As far as I understand, that static CUDA runtime is what causes the slow allocation; it would be good to know if others can corroborate this. This PR looks to be where this behavior changed: 61091a0.
One place that wants … It seems if we set …
As far as I know …
@kkraus14, no, this is an issue when we statically link against arrow_cuda, since it is also statically linking against cudart. This …
In 0.19 and above we forgot to specify a value for … If Arrow, when built statically, uses the CUDA runtime statically, we also have to use the runtime statically, since you can't combine the dynamic and static versions of the runtime in one program.
@robertmaynard thanks, I just tried that but I am not sure that …
Makes sense @robertmaynard. So for cudf-0.19, it seems we should work around this + work on changes in arrow (especially if it doesn't need cudart) for future cudf versions.
CMake has two ways to control which CUDA runtime to link to. This makes sure that the CUDA language controls and the CUDAToolkit both target the same runtime.
Issue brought up in: #7600
Authors:
- Robert Maynard (https://github.com/robertmaynard)
Approvers:
- Alessandro Bellina (https://github.com/abellina)
- Keith Kraus (https://github.com/kkraus14)
URL: #7887
This PR does two things:
- It adds a check that will fail the build if it detects that the CUDA runtime was linked statically. For now, that seems like a safe bet, and if we decide to start building with a static CUDA runtime in the future, we should remove that check.
- As part of the investigation for #7600, libnvcomp was the last library that had a statically linked CUDA runtime, so this PR addresses that.
Authors:
- Alessandro Bellina (https://github.com/abellina)
Approvers:
- Jason Lowe (https://github.com/jlowe)
URL: #7896
@kkraus14 I retested my case with the latest nightly builds, and @jlowe shared the original trace with me; I verified that it had a statically linked CUDA runtime. Closing. Thanks @robertmaynard, @kkraus14, @jrhemstad, @jlowe and @nvdbaranec (who suspected this was static cudart).
To further close the loop here, a PR has been merged in Arrow so that, as of the 4.0.0 release, it will no longer link to libcudart: apache/arrow@5b5c058
Describe the bug
The libcudf Parquet reader performs pinned memory allocations, and in some environments (e.g. cloud and other virtualized environments) pinned memory allocations can be expensive in practice. Here's a screenshot of an Nsight Systems trace showing the cudaHostAlloc taking 1.5 seconds on the first call. Subsequent calls are relatively cheap, likely because the OS has already paid the cost of rearranging the memory and is reusing that work. In this case it was so slow that I'm not sure the use of pinned memory was cost-effective overall vs. using paged memory directly; that probably depends on how many times the application calls the Parquet reader.
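To check the first-call cost outside of the Parquet reader, here is a minimal standalone sketch (not taken from the issue; the 256 MiB size is an arbitrary illustration) that times two successive cudaHostAlloc calls of the same size:

```cpp
// Sketch: time two successive cudaHostAlloc calls of the same size. On
// affected systems the first call can be orders of magnitude slower than
// the second.
#include <cuda_runtime_api.h>

#include <chrono>
#include <cstddef>
#include <cstdio>

static double time_host_alloc_ms(std::size_t bytes) {
  void* p = nullptr;
  auto t0 = std::chrono::steady_clock::now();
  cudaError_t err = cudaHostAlloc(&p, bytes, cudaHostAllocDefault);
  auto t1 = std::chrono::steady_clock::now();
  if (err == cudaSuccess) { cudaFreeHost(p); }
  return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
  constexpr std::size_t bytes = 256ull * 1024 * 1024;  // arbitrary buffer size
  std::printf("first  cudaHostAlloc: %8.1f ms\n", time_host_alloc_ms(bytes));
  std::printf("second cudaHostAlloc: %8.1f ms\n", time_host_alloc_ms(bytes));
  return 0;
}
```

Built with nvcc (or any C++ compiler linked against libcudart), this should make it easy to check whether a given environment exhibits the expensive first allocation described above.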
This use case was with the RAPIDS Accelerator for Apache Spark, which often pre-allocates a pool of pinned memory up front when it starts. If the Parquet reader had a way of reusing the pre-allocated pinned memory pool provided by the application, this slow allocation could be avoided.
Steps/Code to reproduce bug
The cost seems to be very much dependent upon the runtime environment. I've seen it most often in cloud-like environments. I suspect it could occur in a bare-metal environment as well if the memory was filled with buffers and page-cache, and the OS needed to rearrange pages to form a physically contiguous chunk.
Expected behavior
Parquet reader should not spend excessive amounts of time allocating memory.