Summary:

Prior to this diff, SSD-TBE used a pinned CPU buffer together with a GPU buffer for `evicted_rows` (the buffer for staging rows that are evicted from the L1 cache). It explicitly performed an asynchronous memory copy (via `cudaMemcpyAsync`) to transfer `evicted_rows` from device to host. Since the number of evicted rows is known only on the device, SSD-TBE overallocated both the CPU and GPU `evicted_rows` buffers and therefore transferred extra data during the device-to-host copy. This extra data could be large enough to make the memory copy a bottleneck of an execution.

This diff mitigates the problem by using a unified address buffer for `evicted_rows` and using a kernel (namely `masked_index_select`) to load/store data instead of a CUDA memory copy operation. This mechanism can avoid copying the extra data. However, the kernel-based copy can be less efficient (it might not fully saturate the available memory bandwidth) since it does not use the copy engine. Moreover, since it uses SMs for the copy, it can compete for SM resources with other compute when the operator is overlapped with it.

Differential Revision: D62114877
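The trade-off described in the summary can be illustrated with a small host-side sketch. This is not the FBGEMM kernel: the real `masked_index_select` is a CUDA operator whose signature is not shown in this commit message, so the function names, argument order, and the "skip negative indices" convention below are assumptions made only for illustration. The sketch contrasts a bulk copy, which moves the whole overallocated staging buffer, with a masked copy that moves only the first `count` valid rows.

```python
def bulk_copy(staging):
    """Model of the old path: copy the entire overallocated buffer.

    Returns the host copy and the number of rows transferred.
    """
    host = [row[:] for row in staging]
    return host, len(staging)


def masked_copy(staging, host, indices, count):
    """Model of the new path (hypothetical semantics, not the real kernel).

    For i < count, copy staging[indices[i]] into host[i]; entries with a
    negative index are treated as masked out and skipped. Returns the
    number of rows actually transferred.
    """
    rows_moved = 0
    for i in range(count):
        idx = indices[i]
        if idx < 0:
            continue  # masked-out slot: nothing to copy
        host[i] = staging[idx][:]
        rows_moved += 1
    return rows_moved


if __name__ == "__main__":
    capacity, dim = 8, 4  # overallocated capacity and row width (made up)
    staging = [[float(r)] * dim for r in range(capacity)]
    host = [[0.0] * dim for _ in range(capacity)]

    _, bulk_rows = bulk_copy(staging)
    masked_rows = masked_copy(staging, host, [5, 2, -1, 0, 0, 0, 0, 0], 3)
    print("rows moved by bulk copy:", bulk_rows)
    print("rows moved by masked copy:", masked_rows)
```

In this model the amount of data moved by the masked copy scales with the actual number of evicted rows rather than with the buffer capacity. On a GPU, that saving comes at the cost noted in the summary: the copy runs on SMs instead of the copy engine.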
1 parent 4f9e0d3, commit a32cd8a
Showing 1 changed file with 50 additions and 10 deletions.