[FEA] Have a global pinned memory pool by default #15612

Closed
vuule opened this issue Apr 29, 2024 · 2 comments · Fixed by #15895
vuule commented Apr 29, 2024

Users outside of Spark-RAPIDS still use the default, non-pooled, host memory resource and thus pay the overhead of pinned memory allocations in hostdevice_vector and in any other place where pinned memory is used for faster data transfers.

Proposal: default to a memory resource backed by a small pinned pool. When the pool is exhausted, the resource should fall back to new pinned allocations, consistent with the old behavior once too much pinned memory is in use.

To ensure we don't impact CPU performance, the default size of the pool can be a set percentage of the total system memory. Pinning a small minority of system memory (~5%) should not have a negative impact.

Initially, only hostdevice_vector would use this resource but we can expand the pinned memory use in libcudf once a default pool resource is in place.

Details to consider:
- The pool should probably be created on first use; this avoids a duplicated pool if users set the resource before the first use.
- Switching the host resource should work at any point, even if it means two pools exist at the same time.
- Can the default pool be safely destroyed at exit? Streams can't be destroyed on exit; not sure whether the same applies to cudaFreeHost.

vuule added the feature request and Performance labels on Apr 29, 2024
vuule self-assigned this on Apr 29, 2024

vuule commented Apr 29, 2024

Measured memory use by hostdevice_vector in cuIO benchmarks as a percentage of peak device memory use. The pinned memory use is proportional to device memory use, so we can use the peak device memory use as a measure of how much pinned memory we would need if we used up all device memory.
The results show that we would never fall back to new pinned allocations with a pinned pool sized at 4% of device memory capacity. However, even at 0.5%, the pool can be used to allocate 90% of used pinned memory without additional allocations.
[image: measured hostdevice_vector pinned memory use as a percentage of peak device memory use]


vuule commented Apr 29, 2024

Benchmarking results:
[image: table of benchmarking results, relative throughput per benchmark and pool configuration]
"relative throughput" is the average ratio of the throughput with the custom resource and the throughput with default (pinned, non-pooled) resource.

- Benchmarks consistently show improvement with the pooled resource compared to plain pinned allocations.
- read_json shows a disproportionate improvement because of small benchmarks that are hugely impacted by a single pinned allocation.
- The data also show that small pools bring a performance improvement very similar to a pool large enough to never fall back to new allocations.
- Surprisingly, benchmarks also show that using pageable memory in hostdevice_vector is preferable to pinned (non-pooled) memory.

TODO: run benchmarks from #15585 because we expect to see higher impact in multi-threaded use cases.

rapids-bot pushed a commit that referenced this issue on May 20, 2024
…5665)

Issue #15612

Adds a pooled pinned memory resource that is created on first call to `get_host_memory_resource` or `set_host_memory_resource`.
The pool has a fixed size: 0.5% of the device memory capacity, limited to 100MB. At 100MB, the pool takes ~30ms to initialize. The size of the pool can be overridden with the environment variable `LIBCUDF_PINNED_POOL_SIZE`.
If an allocation cannot be served from the pool, a new pinned allocation is performed.
The allocator uses a stream from the global stream pool to initialize the pool and to perform synchronous operations (`allocate`/`deallocate`). Users of the resource don't need to be aware of this implementation detail because these operations synchronize before returning.

Authors:
  - Vukasin Milovanovic (https://github.com/vuule)
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Alessandro Bellina (https://github.com/abellina)
  - Jake Hemstad (https://github.com/jrhemstad)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #15665
rapids-bot pushed a commit that referenced this issue on Jun 12, 2024
closes #15612
Expanded the set of vector factories to cover pinned vectors. The functions return `cudf::detail::host_vector`, which uses a type-erased allocator, allowing us to utilize the runtime-configurable global pinned (previously host) resource.
The `pinned_host_vector` type has been removed, as it can only support non-pooled pinned allocations. Its use is now replaced with `cudf::detail::host_vector`.
Moved the global host (now pinned) resource out of cuIO and changed the type to host_device. User-specified resources are now required to allocate device-accessible memory. The name has been changed to pinned to reflect the new requirement.

Authors:
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - Alessandro Bellina (https://github.com/abellina)
  - Yunsong Wang (https://github.com/PointKernel)
  - Mark Harris (https://github.com/harrism)
  - David Wendt (https://github.com/davidwendt)

URL: #15895