Enable pipeline prefetching #2963


sryap

@sryap sryap commented Aug 9, 2024

Summary:
This diff enables pipeline cache prefetching for SSD-TBE. This allows
prefetch for the next iteration's batch to be carried out while the
computation of the current batch is going on.
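The overlap described above can be sketched in plain Python. This is only an illustration of the pipelining idea under simplifying assumptions: `prefetch`, `compute`, and `run_pipelined` are hypothetical stand-ins, and a worker thread plays the role of the prefetch stream. It is not the SSD-TBE implementation.

```python
from concurrent.futures import ThreadPoolExecutor


def prefetch(batch):
    # Hypothetical stand-in for moving the rows a batch needs into the cache.
    return {"batch": batch, "rows": [x * 10 for x in batch]}


def compute(prefetched):
    # Hypothetical stand-in for the forward/backward pass over prefetched rows.
    return sum(prefetched["rows"])


def run_pipelined(batches):
    """Compute batch i while the prefetch for batch i+1 runs in the background."""
    results = []
    if not batches:
        return results
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(prefetch, batches[0])
        for i in range(len(batches)):
            ready = pending.result()  # wait for the prefetch of iteration i
            if i + 1 < len(batches):
                # Kick off the prefetch of iteration i+1 before computing iteration i.
                pending = pool.submit(prefetch, batches[i + 1])
            results.append(compute(ready))
    return results
```

Without pipelining, each iteration would pay for prefetch and compute back to back; here the prefetch for the next batch is already in flight while the current batch is being computed.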

We have done the following to guarantee cache consistency when
pipeline prefetching is enabled:

(1) Enable cache line locking (implemented in D46172802, D47638502,
D60812956) to ensure that cache lines are not prematurely evicted by
the prefetch when the previous iteration's computation is not
complete.
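To make the locking idea concrete, here is a toy model, assuming a simple per-row lock counter and first-inserted eviction order. The `LockingCache` class and its methods are hypothetical and are not taken from the referenced diffs; the point is only that eviction skips lines that are still in use.

```python
class LockingCache:
    """Toy cache whose eviction skips rows that are currently locked."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.rows = {}        # row id -> data, kept in insertion order
        self.lock_count = {}  # row id -> number of in-flight users

    def lock(self, row_id):
        self.lock_count[row_id] = self.lock_count.get(row_id, 0) + 1

    def unlock(self, row_id):
        self.lock_count[row_id] -= 1
        if self.lock_count[row_id] == 0:
            del self.lock_count[row_id]

    def insert(self, row_id, data):
        """Insert a row; return False if the cache is full of locked rows."""
        if row_id in self.rows:
            self.rows[row_id] = data
            return True
        if len(self.rows) >= self.capacity:
            # Pick the oldest row that nobody has locked.
            victim = next((r for r in self.rows if r not in self.lock_count), None)
            if victim is None:
                return False  # every line is locked; the row must go elsewhere
            del self.rows[victim]
        self.rows[row_id] = data
        return True
```

When `insert` returns `False`, the caller has to place the row somewhere other than the cache, which is the role the scratch pad plays in step (2).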

(2) Look up the L1 cache, the previous iteration's scratch pad (let's call
it SP(i-1)), and the SSD/L2 cache. Move rows from SSD/L2 and/or SP(i-1) to
either L1 or the current iteration's scratch pad (let's call it
SP(i)). Then update the row pointers of the previous iteration's
indices based on the new locations, i.e., L1 or SP(i). A detailed
explanation of the process is shown in the figure below:

{F1796865736}
https://internalfb.com/excalidraw/EX264315
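Since the figure is not visible outside the internal link, the following is a rough sketch of step (2) using plain dictionaries. Everything here is an assumption made for illustration: the function `prefetch_rows`, the capacity-based choice between L1 and SP(i), and the pointer representation are hypothetical and simplify whatever the actual kernels do.

```python
def prefetch_rows(indices, l1, l1_capacity, sp_prev, backing, prev_ptrs):
    """Place each requested row in L1 or SP(i) and return (sp_cur, cur_ptrs).

    l1, sp_prev, backing: dicts mapping row index -> row data
    prev_ptrs: dict mapping row index -> location used by iteration i-1
    """
    sp_cur = {}    # scratch pad of the current iteration, SP(i)
    cur_ptrs = {}  # row index -> (location name, row index)
    for idx in indices:
        if idx in cur_ptrs:
            continue  # duplicate index, already placed
        if idx in l1:
            cur_ptrs[idx] = ("L1", idx)
            continue
        # L1 miss: prefer the copy in SP(i-1), otherwise read from SSD/L2.
        if idx in sp_prev:
            row = sp_prev.pop(idx)
        else:
            row = backing[idx]
        if len(l1) < l1_capacity:
            l1[idx] = row
            location = ("L1", idx)
        else:
            sp_cur[idx] = row
            location = ("SP(i)", idx)
        cur_ptrs[idx] = location
        # The previous iteration still refers to this row, so repoint it.
        if idx in prev_ptrs:
            prev_ptrs[idx] = location
    return sp_cur, cur_ptrs
```

The pointer update at the end is what keeps the previous iteration consistent: a row that moved out of SP(i-1) is still reachable by iteration i-1 through its new location.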

(3) Ensure proper synchronization between streams and events

  • Ensure that prefetch of iteration i is complete before backward TBE
    of iteration i-1
  • Ensure that prefetch of iteration i+1 starts after the backward TBE
    of iteration i is complete
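Each constraint above has the form "operation A must finish before operation B starts". As a CPU-side analogy only, the sketch below enforces one such dependency between two threads with `threading.Event`; on the GPU the corresponding tools would be CUDA events recorded on one stream and waited on by another. The helper `run_with_dependency` and the operation labels are hypothetical.

```python
import threading


def run_with_dependency(producer, consumer):
    """Run two tasks on separate threads; `consumer` starts only after `producer` ends."""
    done = threading.Event()
    log = []

    def producer_task():
        log.append(f"{producer} start")
        log.append(f"{producer} end")
        done.set()  # analogous to recording an event on the producer's stream

    def consumer_task():
        done.wait()  # analogous to making the consumer's stream wait on that event
        log.append(f"{consumer} start")
        log.append(f"{consumer} end")

    # Start the consumer first to show that the event, not start order, decides.
    threads = [
        threading.Thread(target=consumer_task),
        threading.Thread(target=producer_task),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

For example, `run_with_dependency("prefetch(i)", "backward_tbe(i-1)")` always logs the prefetch entries before the backward entries, regardless of which thread is scheduled first.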

Usage:

import torch

# Initialize the module with prefetch_pipeline=True
# prefetch_stream is the CUDA stream for prefetching (optional)
prefetch_stream = torch.cuda.Stream()
emb = SSDTableBatchedEmbeddingBags(
    embedding_specs=...,
    prefetch_pipeline=True,
    # It is recommended to set the stream priority to low
    prefetch_stream=prefetch_stream,
).cuda()

# Stream on which the forward pass runs (assumed here to be the current stream)
forward_stream = torch.cuda.current_stream()

# When calling prefetch, pass the forward stream if using prefetch_stream
# so that TBE records tensors on streams properly
with torch.cuda.stream(prefetch_stream):
    emb.prefetch(
        indices,
        offsets,
        forward_stream=forward_stream,
    )

Differential Revision: D60727327

netlify bot commented Aug 9, 2024

Deploy Preview for pytorch-fbgemm-docs ready!

Name Link
🔨 Latest commit bf45992
🔍 Latest deploy log https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/66bd414cc41ed40008baf35e
😎 Deploy Preview https://deploy-preview-2963--pytorch-fbgemm-docs.netlify.app

@facebook-github-bot

This pull request was exported from Phabricator. Differential Revision: D60727327

sryap added a commit to sryap/FBGEMM that referenced this pull request Aug 13, 2024
Summary:
Pull Request resolved: pytorch#2963

This diff enables pipeline cache prefetching for SSD-TBE.  This allows
prefetch for the next iteration's batch to be carried out while the
computation of the current batch is going on.

We have done the following to guarantee cache consistency when
pipeline prefetching is enabled:

(1) Enable cache line locking (implemented in D46172802, D47638502,
D60812956) to ensure that cache lines are not prematurely evicted by
the prefetch when the previous iteration's computation is not
complete.

(2) Look up the L1 cache, the previous iteration's scratch pad (let's call
it SP(i-1)), and the SSD/L2 cache. Move rows from SSD/L2 and/or SP(i-1) to
either L1 or the current iteration's scratch pad (let's call it
SP(i)). Then update the row pointers of the previous iteration's
indices based on the new locations, i.e., L1 or SP(i). A detailed
explanation of the process is shown in the figure below:

{F1802341461}
https://internalfb.com/excalidraw/EX264315

(3) Ensure proper synchronization between streams and events
- Ensure that prefetch of iteration i is complete before backward TBE
  of iteration i-1
- Ensure that prefetch of iteration i+1 starts after the backward TBE
  of iteration i is complete

The following is how prefetch operators run on GPU streams/CPU:

{F1802798301}

**Usage:**

```
import torch

# Initialize the module with prefetch_pipeline=True
# prefetch_stream is the CUDA stream for prefetching (optional)
prefetch_stream = torch.cuda.Stream()
emb = SSDTableBatchedEmbeddingBags(
    embedding_specs=...,
    prefetch_pipeline=True,
    # It is recommended to set the stream priority to low
    prefetch_stream=prefetch_stream,
).cuda()

# Stream on which the forward pass runs (assumed here to be the current stream)
forward_stream = torch.cuda.current_stream()

# When calling prefetch, pass the forward stream if using prefetch_stream
# so that TBE records tensors on streams properly
with torch.cuda.stream(prefetch_stream):
    emb.prefetch(
        indices,
        offsets,
        forward_stream=forward_stream,
    )
```

Differential Revision: D60727327

sryap added a commit to sryap/FBGEMM that referenced this pull request Aug 14, 2024
Summary:
Pull Request resolved: pytorch#2963

This diff enables pipeline cache prefetching for SSD-TBE.  This allows
prefetch for the next iteration's batch to be carried out while the
computation of the current batch is going on.

We have done the following to guarantee cache consistency when
pipeline prefetching is enabled:

(1) Enable cache line locking (implemented in D46172802, D47638502,
D60812956) to ensure that cache lines are not prematurely evicted by
the prefetch when the previous iteration's computation is not
complete.

(2) Look up the L1 cache, the previous iteration's scratch pad (let's call
it SP(i-1)), and the SSD/L2 cache. Move rows from SSD/L2 and/or SP(i-1) to
either L1 or the current iteration's scratch pad (let's call it
SP(i)). Then update the row pointers of the previous iteration's
indices based on the new locations, i.e., L1 or SP(i). A detailed
explanation of the process is shown in the figure below:

{F1802341461}
https://internalfb.com/excalidraw/EX264315

(3) Ensure proper synchronization between streams and events
- Ensure that prefetch of iteration i is complete before backward TBE
  of iteration i-1
- Ensure that prefetch of iteration i+1 starts after the backward TBE
  of iteration i is complete

The following is how prefetch operators run on GPU streams/CPU:

{F1802798301}

**Usage:**

```
import torch

# Initialize the module with prefetch_pipeline=True
emb = SSDTableBatchedEmbeddingBags(
    embedding_specs=...,
    prefetch_pipeline=True,
).cuda()

# Streams used below (assumed setup; not part of the original snippet)
prefetch_stream = torch.cuda.Stream()
forward_stream = torch.cuda.current_stream()

# When calling prefetch, make sure to pass the forward stream if using
# prefetch_stream so that TBE records tensors on streams properly
with torch.cuda.stream(prefetch_stream):
    emb.prefetch(
        indices,
        offsets,
        forward_stream=forward_stream,
    )
```

Differential Revision: D60727327
sarunya and others added 2 commits August 14, 2024 16:27
Differential Revision: D61294287
@facebook-github-bot

This pull request has been merged in e73b1f8.
