Fix semi-sync race conditions and optimize memory usage #2598
Conversation
This pull request was exported from Phabricator. Differential Revision: D64220706
Summary: This PR introduces a set of fixes and optimizations for TorchRec's semi-synchronous training pipeline (where embeddings are fetched in advance, in parallel with the backward pass):

- Fixes memory safety issues that cause CUDA Illegal Memory Access errors and NaNs during training, especially at high memory utilization. The fix calls [`record_stream`](https://pytorch.org/docs/stable/generated/torch.Tensor.record_stream.html) for tensors allocated in one CUDA stream and used in another (for example, embedding tensors that are looked up in an embedding stream but later accessed in the main stream); see the sketch after this list.
- Optimizes memory allocations to reduce the memory fragmentation observed at high memory utilization, which causes significant performance degradation due to expensive defragmentation calls (thanks to che-sh for the initial observation and analysis). Optimizations include:
  - Freeing context objects and cached module outputs as early as possible to save memory
  - Moving small tensor allocations earlier to minimize fragmentation
  - Using a single stream per embedding module (instead of even and odd streams): memory allocations by PyTorch's CUDACachingAllocator are associated with streams, so freed memory blocks in one stream cannot be reused by other streams. Using more streams effectively decreases the memory available to each stream, making defragmentation more likely.

This is joint work with che-sh (optimizations to reduce memory fragmentation) and dstaay-fb (record_stream fixes for embeddings)

Reviewed By: che-sh

Differential Revision: D64220706
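For context, the cross-stream hazard fixed by `record_stream` can be illustrated with a minimal, hypothetical sketch (not the actual TorchRec pipeline code): a tensor produced on a side stream must be recorded against the stream that later consumes it, otherwise the caching allocator may recycle its memory while the consumer is still using it.

```python
import torch

# Minimal sketch of the record_stream pattern described above; the stream
# and tensor names are illustrative, not TorchRec internals.
main_stream = torch.cuda.current_stream()
embedding_stream = torch.cuda.Stream()

with torch.cuda.stream(embedding_stream):
    # The "embedding lookup" runs on the side stream, so this tensor's
    # memory is owned by the caching allocator on embedding_stream.
    embeddings = torch.randn(1024, 128, device="cuda")

# The consuming stream waits for the lookup to finish...
main_stream.wait_stream(embedding_stream)

# ...and the tensor is recorded against the consuming stream, so the
# allocator will not hand its memory back to embedding_stream while the
# main stream may still be reading it.
embeddings.record_stream(main_stream)

loss = embeddings.sum()  # used on the main stream
```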
Summary: Pull Request resolved: #2634. #2598 causes failures when running on CPU:

```
RuntimeError: unknown parameter type
```

This is because `record_stream` does not accept CPU streams. This diff forward-fixes the issue.

Differential Revision: D67171994
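A hedged sketch of what such a forward fix can look like (the helper name is made up, not the code landed in #2634): guard the `record_stream` call so it only runs for CUDA tensors and CUDA streams.

```python
import torch

def record_stream_if_cuda(tensor: torch.Tensor, stream) -> None:
    # Hypothetical guard: torch.Tensor.record_stream only accepts CUDA
    # streams, so skip the call for CPU tensors and CPU-only runs.
    if tensor.is_cuda and isinstance(stream, torch.cuda.Stream):
        tensor.record_stream(stream)
```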
Summary: Pull Request resolved: #2644. #2598 (D64220706) causes failures when using other accelerators that do not support CUDA. This makes the stream contexts hardware agnostic.

Reviewed By: hpnhxxwn, iamzainhuda

Differential Revision: D67363141
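A possible shape of a hardware-agnostic stream context (purely illustrative; the function name and CPU fallback are assumptions, not the code landed in #2644):

```python
import contextlib
import torch

def stream_context(stream):
    # Hypothetical device-agnostic wrapper: when there is no stream (for
    # example on CPU-only runs), fall back to a no-op context; otherwise
    # enter the CUDA stream context. Other accelerator backends would
    # supply their own stream context here.
    if stream is None:
        return contextlib.nullcontext()
    return torch.cuda.stream(stream)

# The same pipeline code then works with or without a CUDA device.
stream = torch.cuda.Stream() if torch.cuda.is_available() else None
with stream_context(stream):
    x = torch.ones(4)
```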