
Make evicted_rows a UVA buffer #3079

Closed · wants to merge 2 commits

Conversation

@sryap (Contributor) commented Sep 4, 2024

Summary:
Prior to this diff, SSD-TBE used a combination of a pinned CPU buffer
and a GPU buffer for `evicted_rows` (the buffer for staging rows that
are evicted from the L1 cache). It explicitly performed an asynchronous
memory copy (via `cudaMemcpyAsync`) to transfer `evicted_rows` from
device to host. Since the number of evicted rows is known only on the
device, SSD-TBE overallocated the `evicted_rows` CPU and GPU buffers
and therefore transferred extra data during the device-to-host memory
copy. This extra data could be large enough to make the memory copy an
execution bottleneck.
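
To make the cost concrete, here is a minimal sketch of the pre-diff flow. The buffer names, sizes, and setup code are hypothetical stand-ins, not FBGEMM's actual implementation:

```cuda
#include <cuda_runtime.h>

// Hypothetical worst-case capacity: the true evicted-row count is known
// only on the device, so the host must allocate for the maximum.
constexpr size_t kMaxEvictedRows = 1 << 20;
constexpr size_t kRowDim = 128;
constexpr size_t kBytes = kMaxEvictedRows * kRowDim * sizeof(float);

float* evicted_rows_dev = nullptr;   // overallocated GPU staging buffer
float* evicted_rows_host = nullptr;  // overallocated pinned CPU buffer

void setup_buffers() {
  cudaMalloc((void**)&evicted_rows_dev, kBytes);
  cudaMallocHost((void**)&evicted_rows_host, kBytes);  // pinned, so the copy can be async
}

void flush_evicted_rows(cudaStream_t stream) {
  // The copy must span the whole overallocated buffer because the actual
  // row count lives in device memory; the unused tail is the extra data
  // that can turn this transfer into a bottleneck.
  cudaMemcpyAsync(evicted_rows_host, evicted_rows_dev, kBytes,
                  cudaMemcpyDeviceToHost, stream);
}
```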

This diff mitigates the problem by using a unified virtual address
(UVA) buffer for `evicted_rows` and a kernel (namely
`masked_index_select`) to load/store data instead of a CUDA memory
copy operation. This mechanism avoids the extra memory copy. However,
the transfer itself can be less efficient (it might not fully saturate
the available memory bandwidth) since it does not use the copy engine.
Moreover, since it uses SMs for the memory copy, it can compete for SM
resources with other kernels when the operator is overlapped with
other computation.
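
As a rough illustration of the new mechanism (the real `masked_index_select` lives in FBGEMM and its signature differs; the names below are hypothetical), the destination is host memory mapped into the GPU's unified address space, and a kernel writes only the rows that were actually evicted:

```cuda
#include <cuda_runtime.h>

// Host memory mapped into the device's address space: the GPU can store
// to it directly, so no separate device-to-host copy is required.
float* alloc_uva_buffer(size_t bytes) {
  float* buf = nullptr;
  cudaHostAlloc((void**)&buf, bytes, cudaHostAllocMapped);
  return buf;  // under UVA, the same pointer is valid on host and device
}

// Illustrative stand-in for masked_index_select: each block copies one
// evicted row, so only *num_evicted rows cross the bus, but the stores
// are issued by SMs rather than the copy engine.
__global__ void masked_index_select_sketch(
    float* uva_out,            // UVA destination buffer
    const float* cache,        // L1 cache rows on the device
    const int* evicted_slots,  // cache slot of each evicted row
    const int* num_evicted,    // actual count, known only on the device
    int row_dim) {
  for (int row = blockIdx.x; row < *num_evicted; row += gridDim.x) {
    const float* src = cache + (size_t)evicted_slots[row] * row_dim;
    float* dst = uva_out + (size_t)row * row_dim;
    for (int i = threadIdx.x; i < row_dim; i += blockDim.x) {
      dst[i] = src[i];
    }
  }
}

// Example launch on the same stream as the cache-eviction kernel:
//   masked_index_select_sketch<<<128, 256, 0, stream>>>(
//       uva_buf, cache, evicted_slots, num_evicted, row_dim);
```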

Differential Revision: D62114877

netlify bot commented Sep 4, 2024

Deploy Preview for pytorch-fbgemm-docs ready!

🔨 Latest commit: b059883
🔍 Latest deploy log: https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/66d94c18b20ebd000847e4a4
😎 Deploy Preview: https://deploy-preview-3079--pytorch-fbgemm-docs.netlify.app

@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D62114877

sryap added a commit to sryap/FBGEMM that referenced this pull request Sep 4, 2024

Summary: Pull Request resolved: pytorch#3079

X-link: facebookresearch/FBGEMM#173

Differential Revision: D62114877
Differential Revision: D62190777
sryap added a commit to sryap/FBGEMM that referenced this pull request Sep 5, 2024

Summary: Pull Request resolved: pytorch#3079

X-link: facebookresearch/FBGEMM#173

Reviewed By: q10

Differential Revision: D62114877
@facebook-github-bot (Contributor)

This pull request has been merged in 53d84ad.
