[FEA] Have a global pinned memory pool by default #15612

Closed
vuule opened this issue Apr 29, 2024 · 2 comments · Fixed by #15895
vuule commented Apr 29, 2024

Users outside of Spark-RAPIDS still use the default, non-pooled, host memory resource and thus pay the overhead of pinned memory allocations in hostdevice_vector and in any other place where pinned memory is used for faster data transfers.

Proposal: default to a memory resource backed by a small pinned pool. When the pool is exhausted, the resource should fall back to new pinned allocations, consistent with the old behavior once too much pinned memory is in use.

To ensure we don't impact CPU performance, the default size of the pool can be a set percentage of the total system memory. Pinning a small minority of system memory (~5%) should not have a negative impact.

Initially, only hostdevice_vector would use this resource but we can expand the pinned memory use in libcudf once a default pool resource is in place.

Details to consider:
- The pool should probably be created on first use; this avoids a duplicated pool if users set the resource before the first use.
- Switching the host resource should work at any point, even if it means two pools exist at the same time.
- Can the default pool be safely destroyed at exit? Streams can't be destroyed on exit; not sure whether the same applies to cudaFreeHost.

vuule added the feature request and Performance labels on Apr 29, 2024
vuule self-assigned this on Apr 29, 2024

vuule commented Apr 29, 2024

Measured memory use by hostdevice_vector in cuIO benchmarks as a percentage of peak device memory use. The pinned memory use is proportional to device memory use, so we can use the peak device memory use as a measure of how much pinned memory we would need if we used up all device memory.
The results show that we would never fall back to new pinned allocations with a pinned pool sized at 4% of device memory capacity. However, even at 0.5%, the pool can be used to allocate 90% of used pinned memory without additional allocations.
[image: measured hostdevice_vector pinned memory use as a percentage of peak device memory use]


vuule commented Apr 29, 2024

Benchmarking results:
[image: table of benchmarking results, relative throughput per benchmark and pool configuration]
"relative throughput" is the average ratio of the throughput with the custom resource and the throughput with default (pinned, non-pooled) resource.

- Benchmarks consistently show improvement with the pooled resource compared to plain pinned allocations.
- read_json shows a disproportionate improvement because of small benchmarks that are hugely impacted by a single pinned allocation.
- The data also show that small pools bring a performance improvement very similar to a pool large enough to never fall back to new allocations.
- Surprisingly, benchmarks also show that using pageable memory in hostdevice_vector is preferable to pinned (non-pooled) memory.

TODO: run benchmarks from #15585 because we expect to see higher impact in multi-threaded use cases.

rapids-bot pushed a commit that referenced this issue on May 20, 2024
…5665)

Issue #15612

Adds a pooled pinned memory resource that is created on first call to `get_host_memory_resource` or `set_host_memory_resource`.
The pool has a fixed size: 0.5% of the device memory capacity, limited to 100MB. At 100MB, the pool takes ~30ms to initialize. The size of the pool can be overridden with the environment variable `LIBCUDF_PINNED_POOL_SIZE`.
If an allocation cannot be served from the pool, a new pinned allocation is performed.
The allocator uses a stream from the global stream pool to initialize the pool and to perform synchronous operations (`allocate`/`deallocate`). Users of the resource don't need to be aware of this implementation detail because these operations synchronize before returning.

Authors:
  - Vukasin Milovanovic (https://github.com/vuule)
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Alessandro Bellina (https://github.com/abellina)
  - Jake Hemstad (https://github.com/jrhemstad)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #15665
rapids-bot pushed a commit that referenced this issue on Jun 12, 2024
closes #15612
Expanded the set of vector factories to cover pinned vectors. The functions return `cudf::detail::host_vector`, which uses a type-erased allocator, allowing us to utilize the runtime-configurable global pinned (previously host) resource.
The `pinned_host_vector` type has been removed, as it can only support non-pooled pinned allocations. Its use is now replaced with `cudf::detail::host_vector`.
Moved the global host (now pinned) resource out of cuIO and changed the type to host_device. User-specified resources are now required to allocate device-accessible memory. The name has been changed to pinned to reflect the new requirement.

Authors:
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - Alessandro Bellina (https://github.com/abellina)
  - Yunsong Wang (https://github.com/PointKernel)
  - Mark Harris (https://github.com/harrism)
  - David Wendt (https://github.com/davidwendt)

URL: #15895