
Fix semi-sync race conditions and optimize memory usage #2598

Closed
wants to merge 1 commit into from

Conversation

sarckk
Member

@sarckk sarckk commented Nov 27, 2024

Summary:
This PR introduces a set of fixes and optimizations for TorchRec's semi-synchronous training pipeline (where embeddings are prefetched in parallel with the backward pass):

  • Fixes memory-safety issues that caused CUDA Illegal Memory Access errors and NaNs during training, especially at high memory utilization. The fix calls record_stream on tensors allocated in one CUDA stream and used in another (for example, embedding tensors that are looked up in an embedding stream but later accessed in the main stream).
  • Optimizes memory allocations to reduce the memory fragmentation observed at high memory utilization, which caused significant performance degradation due to expensive defragmentation calls (thanks to @che-sh for the initial observation and analysis). Optimizations include:
    • Freeing context objects and cached module outputs as early as possible to save memory
    • Moving small tensor allocations earlier to minimize fragmentation
    • Using a single stream per embedding module (instead of even and odd streams): memory allocations by PyTorch's CUDACachingAllocator are associated with streams, so freed memory blocks in one stream cannot be reused by other streams. Using more streams effectively decreases the memory available to each stream, making defragmentation more likely.

This is joint work with @che-sh (optimizations to reduce memory fragmentation) and @dstaay-fb (record_stream fixes for embeddings)
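The record_stream fix can be illustrated with a minimal sketch. This is not TorchRec's actual pipeline code; the function name and tensor shapes are made up for illustration, and the lookup is simulated with `torch.randn`:

```python
import torch

def lookup_on_side_stream(side_stream: "torch.cuda.Stream") -> torch.Tensor:
    """Allocate a tensor on a side (embedding) stream and mark it as
    also used by the current (main) stream via record_stream."""
    with torch.cuda.stream(side_stream):
        # Stand-in for an embedding lookup performed on the side stream.
        emb = torch.randn(1024, 64, device="cuda")
    # Without this call, the caching allocator may hand emb's memory to a
    # new allocation as soon as the last Python reference drops, even
    # though work queued on side_stream (or reads from the main stream)
    # may still be pending -- the source of the illegal-memory-access
    # errors and NaNs described above.
    emb.record_stream(torch.cuda.current_stream())
    return emb

if torch.cuda.is_available():
    emb = lookup_on_side_stream(torch.cuda.Stream())
    torch.cuda.synchronize()
```

The call only tells the allocator about the cross-stream use; ordering between the streams still has to be established separately (e.g. with events), as in the pipeline itself.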

Differential Revision: D64220706

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Nov 27, 2024
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D64220706

sarckk added a commit to sarckk/torchrec that referenced this pull request Dec 5, 2024
Summary:

This PR introduces a set of fixes and optimizations for TorchRec's semi-synchronous training pipeline (where embeddings are prefetched in parallel with the backward pass):
- Fixes memory-safety issues that caused CUDA Illegal Memory Access errors and NaNs during training, especially at high memory utilization. The fix calls [`record_stream`](https://pytorch.org/docs/stable/generated/torch.Tensor.record_stream.html) on tensors allocated in one CUDA stream and used in another (for example, embedding tensors that are looked up in an embedding stream but later accessed in the main stream).
- Optimizes memory allocations to reduce the memory fragmentation observed at high memory utilization, which caused significant performance degradation due to expensive defragmentation calls (thanks to che-sh for the initial observation and analysis). Optimizations include:
  - Freeing context objects and cached module outputs as early as possible to save memory
  - Moving small tensor allocations earlier to minimize fragmentation
  - Using a single stream per embedding module (instead of even and odd streams): memory allocations by PyTorch's CUDACachingAllocator are associated with streams, so freed memory blocks in one stream cannot be reused by other streams. Using more streams effectively decreases the memory available to each stream, making defragmentation more likely.

This is joint work with che-sh (optimizations to reduce memory fragmentation) and dstaay-fb (record_stream fixes for embeddings)

Reviewed By: che-sh

Differential Revision: D64220706

sarckk added a commit to sarckk/torchrec that referenced this pull request Dec 10, 2024

sarckk added a commit to sarckk/torchrec that referenced this pull request Dec 10, 2024

sarckk added a commit to sarckk/torchrec that referenced this pull request Dec 10, 2024

sarckk added a commit to sarckk/torchrec that referenced this pull request Dec 10, 2024

sarckk added a commit to sarckk/torchrec that referenced this pull request Dec 10, 2024

sarckk added a commit to sarckk/torchrec that referenced this pull request Dec 11, 2024

sarckk added a commit to sarckk/torchrec that referenced this pull request Dec 11, 2024

sarckk added a commit to sarckk/torchrec that referenced this pull request Dec 11, 2024

sarckk added a commit to sarckk/torchrec that referenced this pull request Dec 11, 2024

sarckk added a commit to sarckk/torchrec that referenced this pull request Dec 12, 2024

sarckk added a commit to sarckk/torchrec that referenced this pull request Dec 13, 2024
Summary:
pytorch#2598 causes failures when running on cpu:

```
RuntimeError: unknown parameter type
```

This is because `record_stream` doesn't accept CPU streams. This diff forward-fixes the issue.

Differential Revision: D67171994
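A hedged sketch of the shape of such a forward fix (the helper name is hypothetical, not the actual TorchRec code): guard the `record_stream` call so it becomes a no-op unless both the tensor and the stream are CUDA-backed.

```python
import torch

def maybe_record_stream(tensor: torch.Tensor, stream) -> None:
    """Hypothetical helper: record_stream is CUDA-only, so skip the call
    for CPU tensors/streams instead of raising
    'RuntimeError: unknown parameter type'."""
    if tensor.is_cuda and isinstance(stream, torch.cuda.Stream):
        tensor.record_stream(stream)

# On CPU this is a harmless no-op:
t = torch.ones(4)
maybe_record_stream(t, None)
```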
facebook-github-bot pushed a commit that referenced this pull request Dec 13, 2024
Summary:
Pull Request resolved: #2634

#2598 causes failures when running on cpu:

```
RuntimeError: unknown parameter type
```

This is because `record_stream` doesn't accept CPU streams. This diff forward-fixes the issue.

Differential Revision: D67171994

fbshipit-source-id: a9ffddfcd978d343a58fb039a4fcd89c336ceee6
sarckk added a commit to sarckk/torchrec that referenced this pull request Dec 17, 2024
Summary: pytorch#2598 (D64220706) causes failures when using other accelerators that do not support CUDA. This change makes the stream contexts hardware-agnostic.

Differential Revision: D67363141
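One possible shape for a hardware-agnostic stream context is sketched below. This is an illustration, not the actual TorchRec change: the helper name is made up, and it assumes CUDA devices get a real stream context while other device types (e.g. CPU) fall back to a null context.

```python
import contextlib
import torch

def stream_context(device: torch.device):
    """Hypothetical device-agnostic stream context: a real CUDA stream
    context on CUDA devices, a no-op context elsewhere (e.g. CPU)."""
    if device.type == "cuda" and torch.cuda.is_available():
        return torch.cuda.stream(torch.cuda.Stream(device=device))
    return contextlib.nullcontext()

# The same calling code then works uniformly on CPU and GPU:
with stream_context(torch.device("cpu")):
    x = torch.ones(2) + 1
```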
sarckk added a commit to sarckk/torchrec that referenced this pull request Dec 18, 2024
sarckk added a commit to sarckk/torchrec that referenced this pull request Dec 18, 2024
sarckk added a commit to sarckk/torchrec that referenced this pull request Dec 18, 2024
facebook-github-bot pushed a commit that referenced this pull request Dec 18, 2024
Summary:
Pull Request resolved: #2644

#2598 (D64220706) causes failures when using other accelerators that do not support CUDA. This change makes the stream contexts hardware-agnostic.

Reviewed By: hpnhxxwn, iamzainhuda

Differential Revision: D67363141

fbshipit-source-id: fc2c6fec1dcbbe15f0385e299b666207d2d9a8f5