Make evicted_rows a UVA buffer #3079
Conversation
This pull request was exported from Phabricator. Differential Revision: D62114877
Force-pushed a32cd8a to 4e1aefd
Summary: Pull Request resolved: pytorch#3079. X-link: facebookresearch/FBGEMM#173

Prior to this diff, SSD-TBE used a combination of a pinned CPU buffer and a GPU buffer for `evicted_rows` (the buffer for staging rows that are evicted from the L1 cache). It explicitly performed an asynchronous memory copy (via `cudaMemcpyAsync`) to transfer `evicted_rows` from device to host. Since the number of evicted rows is known only on the device, SSD-TBE overallocated the `evicted_rows` CPU and GPU buffers and therefore transferred extra data during the device-to-host copy. This extra data could be large and could make the memory copy an execution bottleneck.

This diff mitigates the problem by making `evicted_rows` a unified virtual address (UVA) buffer and using a kernel (namely `masked_index_select`) to load/store data instead of a CUDA memory copy operation. This avoids the extra memory copy. However, the copy can be less efficient (it might not fully saturate the available memory bandwidth) since it does not use the copy engine. Moreover, since the copy runs on SMs, it can compete for SM resources with other kernels when this operator is overlapped with other computation.

Reviewed By: q10

Differential Revision: D62114877
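To illustrate the idea, here is a minimal NumPy sketch of `masked_index_select`-style gather semantics. The function name matches the kernel mentioned in the summary, but the signature, shapes, and the direction of the copy are assumptions for illustration only, not FBGEMM's actual API: the point is that only the rows known (on the device) to have been evicted are moved into the staging buffer, rather than copying the whole overallocated buffer as `cudaMemcpyAsync` would.

```python
import numpy as np

def masked_index_select(out, indices, values, count):
    """Illustrative sketch (assumed semantics, not FBGEMM's kernel):
    out[i] = values[indices[i]] for the first `count` entries.
    `count` plays the role of the device-side evicted-row count, so
    rows past it are never touched and no extra data is moved."""
    for i in range(count):
        out[i] = values[indices[i]]
    return out

max_rows, dim = 8, 4                        # overallocated staging capacity
uva_buffer = np.zeros((max_rows, dim))      # stands in for the UVA evicted_rows buffer
cache = np.arange(max_rows * dim, dtype=float).reshape(max_rows, dim)

evicted = np.array([5, 2, 7])               # rows actually evicted from L1 this step
masked_index_select(uva_buffer, evicted, cache, len(evicted))

# Only 3 of the 8 preallocated rows were written; a device-to-host
# cudaMemcpyAsync of the whole buffer would have transferred all 8.
```

Because the buffer is unified-addressed, the gather kernel can write directly into host-visible memory, which is what removes the separate device-to-host copy in the diff's design.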
Force-pushed 4e1aefd to 67c04d4
Differential Revision: D62190777
Force-pushed 67c04d4 to ad1ec29
Force-pushed ad1ec29 to b059883
This pull request has been merged in 53d84ad.
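The tradeoff the summary describes can be put in rough numbers. All figures below are made-up assumptions for illustration (the PR does not state buffer sizes): the old path copies the whole overallocated buffer over PCIe, while the new path moves only the rows actually evicted.

```python
# Back-of-envelope comparison; every number here is an assumption.
max_rows = 1_000_000       # overallocated evicted_rows capacity (assumed)
evicted_rows = 12_000      # rows actually evicted this iteration (assumed)
row_bytes = 512            # bytes per cache row (assumed)

bytes_old = max_rows * row_bytes       # cudaMemcpyAsync of the full buffer
bytes_new = evicted_rows * row_bytes   # gather kernel moves exact rows only

print(bytes_old // bytes_new)          # roughly 83x less data moved
```

The flip side noted in the summary still applies: the gather runs on SMs rather than the copy engine, so the per-byte transfer may be slower and can contend with overlapped compute, but when the eviction count is far below the preallocated capacity the reduction in bytes moved dominates.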