New benchmark compares concurrent throughput of device_vector and device_uvector (#981)

Adds a new benchmark in `device_uvector_benchmark.cpp` that measures a workflow using multiple streams and concurrent kernels interleaved with vector creation. The benchmark is parameterized on the type of the vector:

1. `thrust::device_vector` -- uses cudaMalloc allocation
2. `rmm::device_vector` -- uses RMM allocation
3. `rmm::device_uvector` -- uses RMM allocation and uninitialized vector

The benchmark uses the `cuda_async_memory_resource` so that cudaMallocAsync is used for allocation of the `rmm::` vector types.

The performance on V100 demonstrates that option 1 is slowest due to allocation bottlenecks. Option 2 alleviates these by using `cudaMallocFromPoolAsync`, but there is no concurrency among the kernels because `thrust::device_vector` synchronizes the default stream. Option 3 is fastest and achieves full concurrency (verified in `nsight-sys`).

```
-------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                      Time        CPU Iterations UserCounters...
-------------------------------------------------------------------------------------------------------------------------------------
BM_VectorWorkflow<thrust::device_vector<int32_t>>/100000/manual_time         242 us     267 us       2962 bytes_per_second=13.8375G/s
BM_VectorWorkflow<thrust::device_vector<int32_t>>/1000000/manual_time       1441 us    1465 us        472 bytes_per_second=23.273G/s
BM_VectorWorkflow<thrust::device_vector<int32_t>>/10000000/manual_time     10483 us   10498 us         68 bytes_per_second=31.9829G/s
BM_VectorWorkflow<thrust::device_vector<int32_t>>/100000000/manual_time    63583 us   63567 us         12 bytes_per_second=52.7303G/s
BM_VectorWorkflow<rmm::device_vector<int32_t>>/100000/manual_time           82.0 us     105 us       8181 bytes_per_second=40.8661G/s
BM_VectorWorkflow<rmm::device_vector<int32_t>>/1000000/manual_time           502 us     527 us       1357 bytes_per_second=66.8029G/s
BM_VectorWorkflow<rmm::device_vector<int32_t>>/10000000/manual_time         4714 us    4746 us        148 bytes_per_second=71.1222G/s
BM_VectorWorkflow<rmm::device_vector<int32_t>>/100000000/manual_time       46451 us   46478 us         13 bytes_per_second=72.1784G/s
BM_VectorWorkflow<rmm::device_uvector<int32_t>>/100000/manual_time          39.0 us    59.9 us      17970 bytes_per_second=85.8733G/s
BM_VectorWorkflow<rmm::device_uvector<int32_t>>/1000000/manual_time          135 us     159 us       5253 bytes_per_second=248.987G/s
BM_VectorWorkflow<rmm::device_uvector<int32_t>>/10000000/manual_time        1319 us    1351 us        516 bytes_per_second=254.169G/s
BM_VectorWorkflow<rmm::device_uvector<int32_t>>/100000000/manual_time      12841 us   12865 us         54 bytes_per_second=261.099G/s
```

Authors:
- Mark Harris (https://github.com/harrism)

Approvers:
- Jake Hemstad (https://github.com/jrhemstad)
- Conor Hoekstra (https://github.com/codereport)

URL: #981