Split up jagged_tensor_ops.cu for build perf (#1753)
This pull request was exported from Phabricator. Differential Revision: D45021218
Summary:
Pull Request resolved: pytorch#1753

Splitting up `jagged_tensor_ops.cu` reduces build times on a test system from 179s to 124s.

This change moves most functions in `jagged_tensor_ops.cu` into their own files. It moves the inline function `asynchronous_complete_cumsum` into its own implementation file and makes it no longer inline. A macro `FBGEMM_DISPATCH_TO_CUDA_STANDALONE` is introduced which allows per-file registration of operators, avoiding the need to include a long list of headers. `SharedMemory` is removed from `common.cuh`; we instead rely on the one in `fbgemm_cuda_utils.cuh`. The `const` qualifier has also been added in a few places.

Reviewed By: simpkins

Differential Revision: D45021218

fbshipit-source-id: b42b97bfee0f71320353c8dc269aaeef90b82ba5
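The inline-to-implementation move described above can be sketched as follows. This is a toy stand-in, not the real code: the actual `asynchronous_complete_cumsum` operates on `at::Tensor` and runs on the GPU, so the signature below is a simplified assumption used only to show the header/implementation split.

```cpp
// Sketch: moving a function body out of a shared header. When the body is
// inline in a header, every .cu file including it recompiles the body;
// after the split, the header carries only the declaration and the body
// compiles once in its own implementation file.
#include <cassert>
#include <cstdint>
#include <numeric>
#include <vector>

// --- header (e.g. common.cuh after the change): declaration only.
std::vector<int64_t> asynchronous_complete_cumsum(
    const std::vector<int64_t>& lengths);

// --- implementation file: the single definition.
// A "complete" cumsum has N+1 entries, starting with 0.
std::vector<int64_t> asynchronous_complete_cumsum(
    const std::vector<int64_t>& lengths) {
  std::vector<int64_t> out(lengths.size() + 1, 0);
  std::partial_sum(lengths.begin(), lengths.end(), out.begin() + 1);
  return out;
}
```

Besides avoiding recompilation, making the function non-inline means its object code exists once instead of being duplicated into every translation unit that uses it.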
This pull request has been merged in 3f2eaa8.
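The per-file registration idea behind `FBGEMM_DISPATCH_TO_CUDA_STANDALONE` follows the common static-registration pattern. The sketch below uses a hypothetical toy registry to stay self-contained; the real macro registers kernels with PyTorch's dispatcher, and the file and macro names here are illustrative assumptions, not FBGEMM's actual definitions.

```cpp
// Toy illustration of per-file operator registration via a macro.
#include <functional>
#include <map>
#include <string>

using OpFn = std::function<int(int)>;

// Function-local static avoids static-initialization-order issues across files.
std::map<std::string, OpFn>& op_registry() {
  static std::map<std::string, OpFn> registry;
  return registry;
}

// The macro defines a static object whose initializer runs at load time, so
// each implementation file registers its own operator without a central
// header that must #include and list every op.
#define DISPATCH_TO_CUDA_STANDALONE(name, fn)  \
  static const bool registered_##fn = [] {     \
    op_registry()[name] = (fn);                \
    return true;                               \
  }()

// In a hypothetical per-op file such as jagged_dense_elementwise_add.cu:
int jagged_dense_elementwise_add(int x) { return x + 1; }
DISPATCH_TO_CUDA_STANDALONE(
    "jagged_dense_elementwise_add", jagged_dense_elementwise_add);
```

Because registration happens as a side effect of each translation unit's static initializers, splitting one large .cu file into many small ones needs no central bookkeeping, which is what makes the split cheap to maintain.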