[BUG] Pinned memory allocation within Parquet reader can be very slow #7600
Comments
This is a problem we can solve by having RMM provide a pinned memory pool and exposing hooks in libcudf to plug a pinned pool in. That way users (e.g. Spark) can plug in their own pinned memory pool, or just use the default one.
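A minimal sketch of that idea, assuming a hypothetical hook (no such registration API existed in libcudf at the time of this issue): the application pays the `cudaHostAlloc` cost once up front, and later reads sub-allocate from that pinned block instead of calling `cudaHostAlloc` themselves.

```cpp
// Sketch only: a trivial bump-pointer pool carved out of a single up-front
// cudaHostAlloc call. The libcudf "hook" described in the comment above is
// hypothetical; this just illustrates paying the pinning cost once and
// sub-allocating afterwards.
#include <cuda_runtime_api.h>

#include <cstddef>
#include <new>
#include <stdexcept>

class pinned_pool {
 public:
  explicit pinned_pool(std::size_t bytes) : size_(bytes) {
    // The expensive pinned allocation happens exactly once, at startup.
    if (cudaHostAlloc(&base_, bytes, cudaHostAllocDefault) != cudaSuccess) {
      throw std::runtime_error("cudaHostAlloc failed");
    }
  }
  ~pinned_pool() { cudaFreeHost(base_); }

  // Hand out pieces of the pinned block; no cudaHostAlloc on the hot path.
  // A real pool would also handle alignment, frees, and growth.
  void* allocate(std::size_t bytes) {
    if (offset_ + bytes > size_) { throw std::bad_alloc(); }
    void* p = static_cast<char*>(base_) + offset_;
    offset_ += bytes;
    return p;
  }

 private:
  void* base_{nullptr};
  std::size_t size_{0};
  std::size_t offset_{0};
};
```

With a hook along those lines, the Parquet reader would take its host staging buffers from whatever pool the application registered, so Spark could hand libcudf the pinned pool it already pre-allocates at startup instead of triggering a fresh cudaHostAlloc inside the reader.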
Please note that I see a behavior difference in 0.19 as compared to 0.18 regarding this. In 0.18, I don't see the slow cudaHostAlloc call.
Ok, I think I may have found the culprit. A "fast" libcudf.so has no statically linked CUDA runtime symbols, while a "slow" one does have them. As far as I understand, that static CUDA runtime is what causes the slow allocation; it would be good to know if others can corroborate this. This PR looks to be where this behavior changed: 61091a0.
One place that wants … It seems if we set …
As far as I know …
@kkraus14, no, this is an issue when we statically link against arrow_cuda, since it is also statically linking against cudart. This …
In 0.19 and above we forgot to specify a value for … If Arrow, when built statically, uses the CUDA runtime statically, we also have to use the runtime statically, since you can't combine the dynamic and static versions of the runtime in one program.
@robertmaynard thanks, I just tried that but I am not sure that …
Makes sense @robertmaynard. So for cudf-0.19, it seems we should work around this + work on changes in arrow (especially if it doesn't need cudart) for future cudf versions.
CMake has two ways to control which CUDA runtime to link to. This makes sure that the CUDA language controls and the CUDAToolkit both target the same runtime.
Issue brought up in: #7600
Authors:
- Robert Maynard (https://github.com/robertmaynard)
Approvers:
- Alessandro Bellina (https://github.com/abellina)
- Keith Kraus (https://github.com/kkraus14)
URL: #7887
This PR does two things:
- It adds a check that will fail the build if it detects that the CUDA runtime was linked statically. For now, that seems like a safe bet, and if we decide to start building with a static CUDA runtime in the future, we should remove that check.
- As part of the investigation for #7600, libnvcomp was the last library that had a statically linked CUDA runtime, so this PR addresses that.
Authors:
- Alessandro Bellina (https://github.com/abellina)
Approvers:
- Jason Lowe (https://github.com/jlowe)
URL: #7896
@kkraus14 I retested my case with the latest nightly builds, and @jlowe shared the original trace with me; I verified that it had a statically linked CUDA runtime. Closing. Thanks @robertmaynard, @kkraus14, @jrhemstad, @jlowe and @nvdbaranec (who suspected this was static cudart).
To further close the loop here, a PR has been merged in Arrow so that, as of the 4.0.0 release, it will no longer link to libcudart: apache/arrow@5b5c058
Describe the bug
The libcudf Parquet reader performs pinned memory allocations, and in some environments (e.g. cloud and other virtualized environments) pinned memory allocations can be expensive in practice. Here's a screenshot of an Nsight Systems trace showing the cudaHostAlloc taking 1.5 seconds on the first call. Subsequent calls are relatively cheap, likely because the OS has already paid the cost of rearranging the memory and is reusing that work. In this case it was so slow that I'm not sure the use of pinned memory was cost-effective overall vs. using paged memory directly; that probably depends on how many times the application calls the Parquet reader.
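To check the first-call cost outside of the Parquet reader, here is a minimal standalone sketch (not taken from the issue; the 256 MiB size is an arbitrary illustration) that times two successive cudaHostAlloc calls of the same size:

```cpp
// Sketch: time two successive cudaHostAlloc calls of the same size. On
// affected systems the first call can be orders of magnitude slower than
// the second.
#include <cuda_runtime_api.h>

#include <chrono>
#include <cstddef>
#include <cstdio>

static double time_host_alloc_ms(std::size_t bytes) {
  void* p = nullptr;
  auto t0 = std::chrono::steady_clock::now();
  cudaError_t err = cudaHostAlloc(&p, bytes, cudaHostAllocDefault);
  auto t1 = std::chrono::steady_clock::now();
  if (err == cudaSuccess) { cudaFreeHost(p); }
  return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
  constexpr std::size_t bytes = 256ull * 1024 * 1024;  // arbitrary buffer size
  std::printf("first  cudaHostAlloc: %8.1f ms\n", time_host_alloc_ms(bytes));
  std::printf("second cudaHostAlloc: %8.1f ms\n", time_host_alloc_ms(bytes));
  return 0;
}
```

Built with nvcc (or any C++ compiler linked against libcudart), this should make it easy to check whether a given environment exhibits the expensive first allocation described above.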
This use case was with the RAPIDS Accelerator for Apache Spark, which often pre-allocates a pool of pinned memory up front when it starts. If the Parquet reader had a way of reusing the pre-allocated pinned memory pool provided by the application, this slow allocation could be avoided.
Steps/Code to reproduce bug
The cost seems to be very much dependent upon the runtime environment. I've seen it most often in cloud-like environments. I suspect it could occur in a bare-metal environment as well if the memory was filled with buffers and page-cache, and the OS needed to rearrange pages to form a physically contiguous chunk.
Expected behavior
Parquet reader should not spend excessive amounts of time allocating memory.