Split up jagged_tensor_ops.cu for build perf (#1753)
This pull request was exported from Phabricator. Differential Revision: D45021218
Summary:
Pull Request resolved: pytorch#1753

Splitting up `jagged_tensor_ops.cu` reduces build times on a test system from 179s to 124s.

This change moves most functions in `jagged_tensor_ops.cu` into their own files. It moves the inline function `asynchronous_complete_cumsum` into its own implementation file and makes it no longer inline. A macro `FBGEMM_DISPATCH_TO_CUDA_STANDALONE` is introduced which allows per-file registration of operators, avoiding the need to include a long list of headers. `SharedMemory` is removed from `common.cuh`; we instead rely on the one in `fbgemm_cuda_utils.cuh`. The `const` qualifier has also been added in a few places.

Reviewed By: simpkins

Differential Revision: D45021218

fbshipit-source-id: b42b97bfee0f71320353c8dc269aaeef90b82ba5
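The inline-to-implementation move described above can be sketched as follows. This is a toy stand-in, not the real code: the actual `asynchronous_complete_cumsum` operates on `at::Tensor` and runs on the GPU, so the signature below is a simplified assumption used only to show the header/implementation split.

```cpp
// Sketch: moving a function body out of a shared header. When the body is
// inline in a header, every .cu file including it recompiles the body;
// after the split, the header carries only the declaration and the body
// compiles once in its own implementation file.
#include <cassert>
#include <cstdint>
#include <numeric>
#include <vector>

// --- header (e.g. common.cuh after the change): declaration only.
std::vector<int64_t> asynchronous_complete_cumsum(
    const std::vector<int64_t>& lengths);

// --- implementation file: the single definition.
// A "complete" cumsum has N+1 entries, starting with 0.
std::vector<int64_t> asynchronous_complete_cumsum(
    const std::vector<int64_t>& lengths) {
  std::vector<int64_t> out(lengths.size() + 1, 0);
  std::partial_sum(lengths.begin(), lengths.end(), out.begin() + 1);
  return out;
}
```

Besides avoiding recompilation, making the function non-inline means its object code exists once instead of being duplicated into every translation unit that uses it.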
This pull request has been merged in 3f2eaa8.
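The per-file registration idea behind `FBGEMM_DISPATCH_TO_CUDA_STANDALONE` follows the common static-registration pattern. The sketch below uses a hypothetical toy registry to stay self-contained; the real macro registers kernels with PyTorch's dispatcher, and the file and macro names here are illustrative assumptions, not FBGEMM's actual definitions.

```cpp
// Toy illustration of per-file operator registration via a macro.
#include <functional>
#include <map>
#include <string>

using OpFn = std::function<int(int)>;

// Function-local static avoids static-initialization-order issues across files.
std::map<std::string, OpFn>& op_registry() {
  static std::map<std::string, OpFn> registry;
  return registry;
}

// The macro defines a static object whose initializer runs at load time, so
// each implementation file registers its own operator without a central
// header that must #include and list every op.
#define DISPATCH_TO_CUDA_STANDALONE(name, fn)  \
  static const bool registered_##fn = [] {     \
    op_registry()[name] = (fn);                \
    return true;                               \
  }()

// In a hypothetical per-op file such as jagged_dense_elementwise_add.cu:
int jagged_dense_elementwise_add(int x) { return x + 1; }
DISPATCH_TO_CUDA_STANDALONE(
    "jagged_dense_elementwise_add", jagged_dense_elementwise_add);
```

Because registration happens as a side effect of each translation unit's static initializers, splitting one large .cu file into many small ones needs no central bookkeeping, which is what makes the split cheap to maintain.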