
Add cache conflict miss support (backend) #2596

Closed
wants to merge 1 commit into from

Conversation

@sryap (Contributor) commented May 16, 2024

Differential Revision: D55998215

@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D55998215

netlify bot commented May 16, 2024

Deploy Preview for pytorch-fbgemm-docs ready!

🔨 Latest commit: 3d84b25
🔍 Latest deploy log: https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/664d9809b35ad30008cec06f
😎 Deploy Preview: https://deploy-preview-2596--pytorch-fbgemm-docs.netlify.app

sryap added a commit to sryap/FBGEMM that referenced this pull request May 20, 2024

Summary: Pull Request resolved: pytorch#2596

Differential Revision: D55998215

sryap added a commit to sryap/FBGEMM that referenced this pull request May 22, 2024
Summary:
Pull Request resolved: pytorch#2596

Prior to this diff, SSD TBE lacked support for the conflict cache miss
scenario. It operated under the assumption that the cache, located in
GPU memory, was sufficiently large to hold all prefetched data from
SSD. In the event of a conflict cache miss, the behavior of SSD TBE
would be unpredictable (it could either fail or potentially access
illegal memory). Note that a conflict cache miss happens when an
embedding row is absent in the cache, and after being fetched from
SSD, it cannot be inserted into the cache due to capacity constraints
or associativity limitations.
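
To make the miss taxonomy concrete, here is a minimal sketch of how a lookup in a set-associative software cache is classified as a hit, an insertable miss, or a conflict miss. The set count, way count, and all names are illustrative assumptions, not FBGEMM's actual cache layout:

```python
import torch

# Toy set-associative cache directory; geometry is illustrative only.
NUM_SETS = 1024   # number of cache sets
ASSOC = 32        # ways (slots) per set

# cache_index[s][w] holds the embedding-row id stored in set s, way w;
# -1 marks an empty way.
cache_index = torch.full((NUM_SETS, ASSOC), -1, dtype=torch.int64)

def classify_lookup(row_id: int) -> str:
    """Classify a row lookup as a hit, an insertable miss, or a conflict miss."""
    cache_set = row_id % NUM_SETS          # the only set this row may occupy
    ways = cache_index[cache_set]
    if bool((ways == row_id).any()):
        return "hit"
    if bool((ways == -1).any()):
        # Row is absent, but a free way exists: after fetching from SSD,
        # it can be inserted into the cache.
        return "insertable miss"
    # Row is absent and every way in its set is occupied by other rows:
    # even after fetching from SSD it cannot be inserted -> conflict miss.
    return "conflict miss"
```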

This diff introduces support for conflict cache misses by storing rows
that cannot be inserted into the cache due to conflicts in a scratch
pad, which is a temporary GPU tensor. In the case where rows are
missed from the cache, TBE kernels can access the scratch pad.
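
As a rough illustration of what "kernels can access the scratch pad" means, the following sketch gathers each requested row from the cache when it hit and from the GPU scratch pad when it was a conflict miss. The function and parameter names are hypothetical, not FBGEMM's kernel API:

```python
import torch

def gather_rows(cache_locations: torch.Tensor,
                cache_weights: torch.Tensor,
                scratch_pad_locations: torch.Tensor,
                scratch_pad: torch.Tensor) -> torch.Tensor:
    """Gather one embedding row per lookup.

    cache_locations[i] >= 0  -> lookup i hit; the row lives at that cache slot.
    cache_locations[i] == -1 -> conflict miss; scratch_pad_locations[i]
                                points into the GPU scratch pad instead.
    """
    out = torch.empty(cache_locations.numel(), cache_weights.shape[1],
                      dtype=cache_weights.dtype, device=cache_weights.device)
    hit = cache_locations >= 0
    out[hit] = cache_weights[cache_locations[hit]]
    out[~hit] = scratch_pad[scratch_pad_locations[~hit]]
    return out
```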

Prior to this diff, during the SSD prefetch stage, any row that
missed the cache and required fetching from SSD would first be fetched
into a CPU scratch pad and then transferred to GPU. Rows that could be
inserted into the cache would subsequently be copied from the GPU
scratch pad into the cache. If conflict misses occurred, the prefetch
behavior was unpredictable. With this diff, conflict-missed rows
are now retained in the scratch pad, which is kept alive until the
current iteration completes. Throughout the forward and backward +
optimizer stages of TBE, the cache and the scratch pad are used
equivalently. However, after the backward + optimizer step completes,
rows in the scratch pad are flushed back to SSD, unlike rows residing
in the cache, which are retained for future use (see the diagram below
for more details).

 {F1645878181}
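
For clarity, here is a hedged end-to-end sketch of the lifecycle described above. `ssd_read`, `ssd_write`, and `try_insert_into_cache` are hypothetical stand-ins for the real SSD I/O and cache-insertion ops, not FBGEMM APIs:

```python
import torch

def prefetch(miss_ids, ssd_read, try_insert_into_cache):
    # 1) Fetch all missed rows from SSD into a CPU buffer, then move the
    #    whole buffer to GPU -- this is the GPU scratch pad.
    cpu_rows = ssd_read(miss_ids)                        # CPU tensor
    scratch_pad = cpu_rows.to("cuda", non_blocking=True)
    # 2) Copy the rows that fit into the cache; `inserted` is a boolean
    #    mask marking which rows were successfully inserted.
    inserted = try_insert_into_cache(miss_ids, scratch_pad)
    # 3) Conflict-missed rows simply stay in the scratch pad, which must
    #    be kept alive until the current iteration completes.
    return scratch_pad, inserted

def after_backward_and_optimizer(scratch_pad, miss_ids, inserted, ssd_write):
    # Rows in the scratch pad were updated in place by the optimizer, so
    # flush them back to SSD. Rows inserted into the cache are NOT
    # evicted; they stay resident for future iterations.
    conflict = ~inserted
    ssd_write(miss_ids[conflict], scratch_pad[conflict].cpu())
```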

Differential Revision: D55998215

joshuadeng pushed a commit to joshuadeng/FBGEMM that referenced this pull request May 23, 2024
joshuadeng pushed a commit to joshuadeng/FBGEMM that referenced this pull request May 24, 2024
@facebook-github-bot (Contributor)

This pull request has been merged in db4d379.
