
Make evicted_rows a UVA buffer #3079

Closed · wants to merge 2 commits

Conversation

@sryap (Contributor) commented Sep 4, 2024

Summary:
Prior to this diff, SSD-TBE used a combination of a pinned CPU buffer
and a GPU buffer for `evicted_rows` (the buffer for staging rows that
are evicted from the L1 cache). It explicitly performed an asynchronous
memory copy (via `cudaMemcpyAsync`) to transfer `evicted_rows` from
device to host. Since the number of evicted rows is known only on the
device, SSD-TBE overallocated the `evicted_rows` CPU and GPU buffers
and therefore transferred extra data during the device-to-host memory
copy. This extra data could be large enough to make the memory copy an
execution bottleneck.
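
To make the cost concrete, here is a minimal sketch of the pre-diff flow. The buffer names, sizes, and setup code are hypothetical stand-ins, not FBGEMM's actual implementation:

```cuda
#include <cuda_runtime.h>

// Hypothetical worst-case capacity: the true evicted-row count is known
// only on the device, so the host must allocate for the maximum.
constexpr size_t kMaxEvictedRows = 1 << 20;
constexpr size_t kRowDim = 128;
constexpr size_t kBytes = kMaxEvictedRows * kRowDim * sizeof(float);

float* evicted_rows_dev = nullptr;   // overallocated GPU staging buffer
float* evicted_rows_host = nullptr;  // overallocated pinned CPU buffer

void setup_buffers() {
  cudaMalloc((void**)&evicted_rows_dev, kBytes);
  cudaMallocHost((void**)&evicted_rows_host, kBytes);  // pinned, so the copy can be async
}

void flush_evicted_rows(cudaStream_t stream) {
  // The copy must span the whole overallocated buffer because the actual
  // row count lives in device memory; the unused tail is the extra data
  // that can turn this transfer into a bottleneck.
  cudaMemcpyAsync(evicted_rows_host, evicted_rows_dev, kBytes,
                  cudaMemcpyDeviceToHost, stream);
}
```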

This diff mitigates the problem by using a unified virtual address
(UVA) buffer for `evicted_rows` and a kernel (namely
`masked_index_select`) to load/store data instead of a CUDA memory
copy operation. This mechanism avoids the extra memory copy. However,
the transfer itself can be less efficient (it might not fully saturate
the available memory bandwidth) since it does not use the copy engine.
Moreover, since it uses SMs for the memory copy, it can compete for SM
resources with other kernels when the operator is overlapped with
other computation.
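
As a rough illustration of the new mechanism (the real `masked_index_select` lives in FBGEMM and its signature differs; the names below are hypothetical), the destination is host memory mapped into the GPU's unified address space, and a kernel writes only the rows that were actually evicted:

```cuda
#include <cuda_runtime.h>

// Host memory mapped into the device's address space: the GPU can store
// to it directly, so no separate device-to-host copy is required.
float* alloc_uva_buffer(size_t bytes) {
  float* buf = nullptr;
  cudaHostAlloc((void**)&buf, bytes, cudaHostAllocMapped);
  return buf;  // under UVA, the same pointer is valid on host and device
}

// Illustrative stand-in for masked_index_select: each block copies one
// evicted row, so only *num_evicted rows cross the bus, but the stores
// are issued by SMs rather than the copy engine.
__global__ void masked_index_select_sketch(
    float* uva_out,            // UVA destination buffer
    const float* cache,        // L1 cache rows on the device
    const int* evicted_slots,  // cache slot of each evicted row
    const int* num_evicted,    // actual count, known only on the device
    int row_dim) {
  for (int row = blockIdx.x; row < *num_evicted; row += gridDim.x) {
    const float* src = cache + (size_t)evicted_slots[row] * row_dim;
    float* dst = uva_out + (size_t)row * row_dim;
    for (int i = threadIdx.x; i < row_dim; i += blockDim.x) {
      dst[i] = src[i];
    }
  }
}

// Example launch on the same stream as the cache-eviction kernel:
//   masked_index_select_sketch<<<128, 256, 0, stream>>>(
//       uva_buf, cache, evicted_slots, num_evicted, row_dim);
```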

Differential Revision: D62114877

netlify bot commented Sep 4, 2024

Deploy Preview for pytorch-fbgemm-docs ready!

🔨 Latest commit: b059883
🔍 Latest deploy log: https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/66d94c18b20ebd000847e4a4
😎 Deploy Preview: https://deploy-preview-3079--pytorch-fbgemm-docs.netlify.app

@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D62114877

sryap added a commit to sryap/FBGEMM that referenced this pull request Sep 4, 2024

Summary: Pull Request resolved: pytorch#3079

X-link: facebookresearch/FBGEMM#173

Differential Revision: D62114877
Differential Revision: D62190777
sryap added a commit to sryap/FBGEMM that referenced this pull request Sep 5, 2024

Summary: Pull Request resolved: pytorch#3079

X-link: facebookresearch/FBGEMM#173

Reviewed By: q10

Differential Revision: D62114877
@facebook-github-bot (Contributor)

This pull request has been merged in 53d84ad.
