
Split up jagged_tensor_ops.cu for build perf #1753

Closed
wants to merge 1 commit

Conversation

r-barnes
Contributor

@r-barnes r-barnes commented May 5, 2023

Summary:
Splitting up `jagged_tensor_ops.cu` reduces build times on a test system from 179 s to 124 s.

This change moves most functions in `jagged_tensor_ops.cu` into their own files.

It moves the inline function `asynchronous_complete_cumsum` into its own implementation file and makes it non-inline.

A macro, `FBGEMM_DISPATCH_TO_CUDA_STANDALONE`, is introduced to allow per-file registration of operators, avoiding the need to pull in a large set of headers.

`SharedMemory` is removed from `common.cuh`; we instead rely on the one in `fbgemm_cuda_utils.cuh`.
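`SharedMemory` is presumably the usual workaround for templated kernels not being able to declare `extern __shared__` arrays of a dependent type directly; keeping one copy in `fbgemm_cuda_utils.cuh` avoids duplicate definitions once the split files include both headers. The classic form of the pattern, as a sketch (the kernel and symbol names here are illustrative, not FBGEMM's):

```cuda
#include <cstdint>

// Templated kernels cannot write `extern __shared__ T buf[];` directly,
// because each instantiation would redeclare the same symbol with a
// different type. The standard workaround is explicit specializations
// that each alias a distinctly named shared-memory symbol.
template <typename T>
struct SharedMemory;

template <>
struct SharedMemory<float> {
  __device__ float* getPointer() {
    extern __shared__ float s_float[];
    return s_float;
  }
};

template <>
struct SharedMemory<int64_t> {
  __device__ int64_t* getPointer() {
    extern __shared__ int64_t s_int64[];
    return s_int64;
  }
};

template <typename T>
__global__ void fill_kernel(T* out, int n) {
  SharedMemory<T> smem;
  T* buf = smem.getPointer();  // dynamic shared memory, sized at launch
  int i = threadIdx.x;
  if (i < n) {
    buf[i] = static_cast<T>(i);
    out[i] = buf[i];
  }
}
```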

The `const` qualifier was also added in a few places, since it is good practice.

Reviewed By: simpkins

Differential Revision: D45021218

@netlify

netlify bot commented May 5, 2023

Deploy Preview for pytorch-fbgemm-docs canceled.

Latest commit: aa6d046
Latest deploy log: https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/645d6aff573760000991c832

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D45021218

r-barnes added a commit to r-barnes/FBGEMM that referenced this pull request May 5, 2023
@r-barnes r-barnes force-pushed the export-D45021218 branch from 0d2ba4a to 5ebe662 Compare May 5, 2023 22:19

r-barnes added a commit to r-barnes/FBGEMM that referenced this pull request May 8, 2023
@r-barnes r-barnes force-pushed the export-D45021218 branch from 5ebe662 to cbe13ee Compare May 8, 2023 18:08

r-barnes added a commit to r-barnes/FBGEMM that referenced this pull request May 8, 2023
@r-barnes r-barnes force-pushed the export-D45021218 branch from cbe13ee to 61cc293 Compare May 8, 2023 18:43
r-barnes added a commit to r-barnes/FBGEMM that referenced this pull request May 8, 2023
@r-barnes r-barnes force-pushed the export-D45021218 branch from 61cc293 to 4e8d76e Compare May 8, 2023 18:48

r-barnes added a commit to r-barnes/FBGEMM that referenced this pull request May 8, 2023
@r-barnes r-barnes force-pushed the export-D45021218 branch from 4e8d76e to 9cf9431 Compare May 8, 2023 22:52

r-barnes added a commit to r-barnes/FBGEMM that referenced this pull request May 8, 2023
@r-barnes r-barnes force-pushed the export-D45021218 branch from 9cf9431 to a3f3b81 Compare May 8, 2023 22:58

r-barnes added a commit to r-barnes/FBGEMM that referenced this pull request May 8, 2023
@r-barnes r-barnes force-pushed the export-D45021218 branch from a3f3b81 to d4d91d7 Compare May 8, 2023 23:04

r-barnes added a commit to r-barnes/FBGEMM that referenced this pull request May 9, 2023
@r-barnes r-barnes force-pushed the export-D45021218 branch from d4d91d7 to 3fede11 Compare May 9, 2023 07:07

r-barnes added a commit to r-barnes/FBGEMM that referenced this pull request May 9, 2023
@r-barnes r-barnes force-pushed the export-D45021218 branch from 3fede11 to e64f327 Compare May 9, 2023 07:16
@r-barnes r-barnes force-pushed the export-D45021218 branch from e64f327 to d5fd794 Compare May 9, 2023 07:22

r-barnes added a commit to r-barnes/FBGEMM that referenced this pull request May 9, 2023
@r-barnes r-barnes force-pushed the export-D45021218 branch from d5fd794 to cdafce6 Compare May 9, 2023 23:22

r-barnes added a commit to r-barnes/FBGEMM that referenced this pull request May 9, 2023
@r-barnes r-barnes force-pushed the export-D45021218 branch from cdafce6 to 9efd0b3 Compare May 9, 2023 23:28

r-barnes added a commit to r-barnes/FBGEMM that referenced this pull request May 9, 2023
@r-barnes r-barnes force-pushed the export-D45021218 branch from 9efd0b3 to 37d2042 Compare May 9, 2023 23:34

r-barnes added a commit to r-barnes/FBGEMM that referenced this pull request May 10, 2023
@r-barnes r-barnes force-pushed the export-D45021218 branch from 37d2042 to fb70771 Compare May 10, 2023 18:36

r-barnes added a commit to r-barnes/FBGEMM that referenced this pull request May 11, 2023
@r-barnes r-barnes force-pushed the export-D45021218 branch from fb70771 to 238cf92 Compare May 11, 2023 18:11

r-barnes added a commit to r-barnes/FBGEMM that referenced this pull request May 11, 2023
@r-barnes r-barnes force-pushed the export-D45021218 branch from 238cf92 to d958c53 Compare May 11, 2023 22:13

r-barnes added a commit to r-barnes/FBGEMM that referenced this pull request May 11, 2023
@r-barnes r-barnes force-pushed the export-D45021218 branch from d958c53 to 8a9c171 Compare May 11, 2023 22:18

@r-barnes r-barnes force-pushed the export-D45021218 branch from 8a9c171 to aa6d046 Compare May 11, 2023 22:23

@facebook-github-bot
Contributor

This pull request has been merged in 3f2eaa8.
