Enable pipeline prefetching #2963

Closed

sryap wants to merge 2 commits into pytorch:main from sryap:export-D60727327

Contributor

sryap commented Aug 9, 2024

Summary:
This diff enables pipeline cache prefetching for SSD-TBE. This allows
prefetch for the next iteration's batch to be carried out while the
computation of the current batch is going on.

We have done the following to guarantee cache consistency when
pipeline prefetching is enabled:

(1) Enable cache line locking (implemented in D46172802, D47638502,
D60812956) to ensure that cache lines are not prematurely evicted by
the prefetch when the previous iteration's computation is not
complete.
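
As a rough illustration of the locking idea in (1), here is a minimal toy model in plain Python. It is not the FBGEMM implementation: the class `LockedCache`, its methods, and the oldest-first eviction order are all hypothetical and chosen for illustration only. The sketch shows a prefetch being refused a cache line while every line is still locked by an in-flight iteration.

```python
class LockedCache:
    """Toy cache with per-line lock counters (hypothetical, not FBGEMM code)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = {}       # row id -> value, insertion-ordered (oldest first)
        self.lock_count = {}  # row id -> number of in-flight iterations using it

    def insert(self, row, value, lock=True):
        """Insert a row, evicting the oldest unlocked line if the cache is full.

        Returns False when every line is locked, so nothing can be evicted.
        """
        if row not in self.lines and len(self.lines) >= self.capacity:
            victim = next((r for r in self.lines if self.lock_count[r] == 0), None)
            if victim is None:
                return False
            del self.lines[victim]
            del self.lock_count[victim]
        self.lines[row] = value
        self.lock_count[row] = self.lock_count.get(row, 0) + (1 if lock else 0)
        return True

    def unlock(self, row):
        """Release one lock once the iteration that used `row` has finished."""
        self.lock_count[row] -= 1


cache = LockedCache(capacity=2)
cache.insert(10, "a")               # rows used by iteration i-1, locked
cache.insert(11, "b")
first_try = cache.insert(12, "c")   # prefetch for iteration i: all lines locked
cache.unlock(10)                    # iteration i-1 is done with row 10
second_try = cache.insert(12, "c")  # now row 10 can be evicted
```

In this toy model a refused insert simply returns `False`; where such a row goes instead is outside the sketch.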

(2) Look up the L1 cache, the previous iteration's scratch pad (let's call
it SP(i-1)), and the SSD/L2 cache. Move rows from SSD/L2 and/or SP(i-1) to
either L1 or the current iteration's scratch pad (let's call it
SP(i)). Then update the row pointers of the previous iteration's
indices to reflect the new locations, i.e., L1 or SP(i). A detailed
explanation of the process is shown in the figure below:

{F1796865736}
https://internalfb.com/excalidraw/EX264315
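
Since the figure is internal, the lookup-and-move procedure in (2) can be approximated with a small dictionary-based model. This is only a sketch under stated assumptions: the function `prefetch_rows`, the dict representation of L1, SP(i-1), and SSD/L2, the string "pointers", and the rule of filling L1 before spilling to SP(i) are all hypothetical and not taken from the FBGEMM code.

```python
def prefetch_rows(indices, l1, l1_capacity, sp_prev, backing_store, prev_ptrs):
    """Toy model of one prefetch step for iteration i.

    l1, sp_prev, backing_store: dicts mapping row id -> value, standing in for
    the L1 cache, SP(i-1), and SSD/L2. prev_ptrs maps the previous iteration's
    row ids to a location name. Returns (sp_cur, cur_ptrs).
    """
    sp_cur = {}    # SP(i): scratch pad of the current iteration
    cur_ptrs = {}  # row id -> location name for the current iteration
    for row in dict.fromkeys(indices):  # unique indices, order preserved
        if row in l1:
            cur_ptrs[row] = "L1"
            continue
        # L1 miss: take the row from SP(i-1) if it is there, else from SSD/L2.
        # (Assumption: SP(i-1) is checked first.)
        value = sp_prev.pop(row) if row in sp_prev else backing_store[row]
        if len(l1) < l1_capacity:
            l1[row] = value
            location = "L1"
        else:
            sp_cur[row] = value
            location = "SP(i)"
        cur_ptrs[row] = location
        # Rows shared with the previous iteration get their pointer updated,
        # so iteration i-1 finds the row at its new location.
        if row in prev_ptrs:
            prev_ptrs[row] = location
    return sp_cur, cur_ptrs


l1 = {1: "r1"}
sp_prev = {2: "r2"}
ssd = {1: "old1", 2: "old2", 3: "r3"}
prev_ptrs = {1: "L1", 2: "SP(i-1)"}
sp_cur, cur_ptrs = prefetch_rows([1, 2, 3], l1, 2, sp_prev, ssd, prev_ptrs)
```

In this example row 2 moves from SP(i-1) into L1 and the previous iteration's pointer for it is rewritten, while row 3 comes from SSD/L2 and lands in SP(i) because the toy L1 is full.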

(3) Ensure proper synchronizations between streams and events
- Ensure that prefetch of iteration i is complete before backward TBE
  of iteration i-1
- Ensure that prefetch of iteration i+1 starts after the backward TBE
  of iteration i is complete
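
Both constraints above have the form "X must finish before Y starts". On CUDA streams this kind of ordering is commonly expressed by recording an event after X and making Y's stream wait on that event. The CPU-only analogy below uses `threading.Event` from the standard library so it runs without a GPU; the helper `run_ordered` and the operation labels are hypothetical, and the sketch does not reproduce the actual stream and event layout of this diff.

```python
import threading


def run_ordered(first, second):
    """Run two operations on separate threads so that `second` starts only
    after `first` has finished. Returns the log of start/end markers."""
    log = []
    done = threading.Event()  # plays the role of a recorded CUDA event

    def producer():
        log.append(f"{first}:start")
        log.append(f"{first}:end")
        done.set()   # analogous to recording the event after the work

    def consumer():
        done.wait()  # analogous to a stream waiting on the event
        log.append(f"{second}:start")
        log.append(f"{second}:end")

    # Start the consumer first to show that the wait, not the launch order,
    # enforces the ordering.
    threads = [threading.Thread(target=consumer), threading.Thread(target=producer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log


order = run_ordered("backward_tbe(i)", "prefetch(i+1)")
```

The consumer is launched first on purpose: the ordering comes from the wait on the event, which is the property the stream synchronization in (3) relies on.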

Usage:

# Initialize the module with prefetch_pipeline=True
# prefetch_stream is the CUDA stream for prefetching (optional)
# It is recommended to set the stream priority to low
prefetch_stream = torch.cuda.Stream()
emb = SSDTableBatchedEmbeddingBags(
    embedding_specs=...,
    prefetch_pipeline=True,
    prefetch_stream=prefetch_stream,
).cuda()

# When calling prefetch, make sure to pass the forward stream if using
# prefetch_stream so that TBE records tensors on streams properly
with torch.cuda.stream(prefetch_stream):
    emb.prefetch(
        indices,
        offsets,
        forward_stream=forward_stream,
    )

Differential Revision: D60727327

facebook-github-bot added the cla signed label

netlify bot commented Aug 9, 2024

✅ Deploy Preview for pytorch-fbgemm-docs ready!

Name	Link
🔨 Latest commit	`bf45992`
🔍 Latest deploy log	https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/66bd414cc41ed40008baf35e
😎 Deploy Preview	https://deploy-preview-2963--pytorch-fbgemm-docs.netlify.app

Contributor

facebook-github-bot commented Aug 9, 2024

This pull request was exported from Phabricator. Differential Revision: D60727327

facebook-github-bot added the fb-exported label

Contributor

facebook-github-bot commented Aug 13, 2024

This pull request was exported from Phabricator. Differential Revision: D60727327

sryap added a commit to sryap/FBGEMM that referenced this pull request


          Enable pipeline prefetching (pytorch#2963)

6f761f8

Summary:
Pull Request resolved: pytorch#2963

This diff enables pipeline cache prefetching for SSD-TBE.  This allows
prefetch for the next iteration's batch to be carried out while the
computation of the current batch is going on.

We have done the following to guarantee cache consistency when
pipeline prefetching is enabled:

(1) Enable cache line locking (implemented in D46172802, D47638502,
D60812956) to ensure that cache lines are not prematurely evicted by
the prefetch when the previous iteration's computation is not
complete.

(2) Look up the L1 cache, the previous iteration's scratch pad (let's call
it SP(i-1)), and the SSD/L2 cache. Move rows from SSD/L2 and/or SP(i-1) to
either L1 or the current iteration's scratch pad (let's call it
SP(i)). Then update the row pointers of the previous iteration's
indices to reflect the new locations, i.e., L1 or SP(i). A detailed
explanation of the process is shown in the figure below:

{F1802341461}
https://internalfb.com/excalidraw/EX264315

(3) Ensure proper synchronizations between streams and events
- Ensure that prefetch of iteration i is complete before backward TBE
  of iteration i-1
- Ensure that prefetch of iteration i+1 starts after the backward TBE
  of iteration i is complete

The following is how prefetch operators run on GPU streams/CPU:

{F1802798301}

**Usage:**

```
# Initialize the module with prefetch_pipeline=True
# prefetch_stream is the CUDA stream for prefetching (optional)
# It is recommended to set the stream priority to low
prefetch_stream = torch.cuda.Stream()
emb = SSDTableBatchedEmbeddingBags(
    embedding_specs=...,
    prefetch_pipeline=True,
    prefetch_stream=prefetch_stream,
).cuda()

# When calling prefetch, make sure to pass the forward stream if using
# prefetch_stream so that TBE records tensors on streams properly
with torch.cuda.stream(prefetch_stream):
    emb.prefetch(
        indices,
        offsets,
        forward_stream=forward_stream,
    )
```

Differential Revision: D60727327

sryap force-pushed the export-D60727327 branch from 8b46b99 to 6f761f8 Compare

August 13, 2024 01:21

Contributor

facebook-github-bot commented Aug 14, 2024

This pull request was exported from Phabricator. Differential Revision: D60727327

sryap force-pushed the export-D60727327 branch from 6f761f8 to bbf2b1b Compare

August 14, 2024 20:01

sryap added a commit to sryap/FBGEMM that referenced this pull request


          Enable pipeline prefetching (pytorch#2963)

bbf2b1b

Summary:
Pull Request resolved: pytorch#2963

This diff enables pipeline cache prefetching for SSD-TBE.  This allows
prefetch for the next iteration's batch to be carried out while the
computation of the current batch is going on.

We have done the following to guarantee cache consistency when
pipeline prefetching is enabled:

(1) Enable cache line locking (implemented in D46172802, D47638502,
D60812956) to ensure that cache lines are not prematurely evicted by
the prefetch when the previous iteration's computation is not
complete.

(2) Look up the L1 cache, the previous iteration's scratch pad (let's call
it SP(i-1)), and the SSD/L2 cache. Move rows from SSD/L2 and/or SP(i-1) to
either L1 or the current iteration's scratch pad (let's call it
SP(i)). Then update the row pointers of the previous iteration's
indices to reflect the new locations, i.e., L1 or SP(i). A detailed
explanation of the process is shown in the figure below:

{F1802341461}
https://internalfb.com/excalidraw/EX264315

(3) Ensure proper synchronizations between streams and events
- Ensure that prefetch of iteration i is complete before backward TBE
  of iteration i-1
- Ensure that prefetch of iteration i+1 starts after the backward TBE
  of iteration i is complete

The following is how prefetch operators run on GPU streams/CPU:

{F1802798301}

**Usage:**

```
# Initialize the module with prefetch_pipeline=True
emb = SSDTableBatchedEmbeddingBags(
    embedding_specs=...,
    prefetch_pipeline=True,
).cuda()

# When calling prefetch, make sure to pass the forward stream if using
# prefetch_stream so that TBE records tensors on streams properly
with torch.cuda.stream(prefetch_stream):
    emb.prefetch(
        indices,
        offsets,
        forward_stream=forward_stream
    )
```

Differential Revision: D60727327

Contributor

facebook-github-bot commented Aug 14, 2024

This pull request was exported from Phabricator. Differential Revision: D60727327

sryap added a commit to sryap/FBGEMM that referenced this pull request


          Enable pipeline prefetching (pytorch#2963)

07dd860

sryap force-pushed the export-D60727327 branch from bbf2b1b to 07dd860 Compare

August 14, 2024 20:06

sarunya and others added 2 commits

August 14, 2024 16:27


          Fix get_unique_indices_v2 registration

82d78bf

Differential Revision: D61294287


          Enable pipeline prefetching (pytorch#2963)

bf45992

Contributor

facebook-github-bot commented Aug 14, 2024

This pull request was exported from Phabricator. Differential Revision: D60727327

sryap force-pushed the export-D60727327 branch from 07dd860 to bf45992 Compare

August 14, 2024 23:44

facebook-github-bot closed this in

e73b1f8

facebook-github-bot added the Merged label

Contributor

facebook-github-bot commented Aug 16, 2024

This pull request has been merged in e73b1f8.

Labels

cla signed fb-exported Merged