FP8 kernel development and enablement. #2866
Closed
Conversation
This pull request was exported from Phabricator. Differential Revision: D59240214
manishucsd added a commit to manishucsd/FBGEMM that referenced this pull request on Jul 19, 2024
Summary: Pull Request resolved: pytorch#2866

This diff creates a library of CUTLASS kernels, starting with FP8 tensorwise scaled GEMMs, as cutlass_extensions to be used within Meta. CUTLASS extensions v2 is being built with the intention of consolidating all of Meta's CUTLASS use cases within cutlass_extensions.

# Itemized Summary

a) Isolate each kernel instance in its own file within a separate folder `fbcode/deeplearning/fbgemm/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/fp8_gemm_with_tensorwise`.
- These kernel instances follow the OSS device-level API. We will have a similar structure for rowwise and tensorwise, following CUTLASS's device-side 3.x API. This will allow us to upstream the kernels and code changes to NVIDIA/CUTLASS.
- Speeds up compilation times.
- Eventually, we will auto-generate these instances the same way as in OSS using our own version of `generator.py`, but instances can also be instantiated manually to check one-off kernels.

b) Create a common calling-interface class `cutlass_extensions::GemmOperationWrapper3x<GemmKernel>`, built on the interface provided by `cutlass::library::Operation`.
- Allows us to initialize, update arguments for, and run every CUTLASS kernel through a common interface.
- Creates a runtime description object `cutlass_extensions::GemmDescription` of the kernel's compile-time parameters.
- Allows library code built on top of CUTLASS kernels to create a table of the various operations (see operation_table.h/cu).
- Separates the CUDA kernel template, host-side code, host-side call, and heuristics.

c) An operation table that holds the various GEMM operations, keyed by `cutlass_extensions::GemmFunctionalKey` and `cutlass_extensions::GemmPerformanceKey`.
- The structure is used to select the functionally matching operation map, then to select an operation from that map using a `cutlass_extensions::GemmPerformanceKey`.
d) The main driver code for f8f8bf16 tensorwise is in `fbcode/deeplearning/fbgemm/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f8f8bf16.cu`. This file contains:
- A separate operation-specific heuristics class that selects an instance of `cutlass_extensions::GemmOperationWrapper3x<GemmKernel>` based on the problem shape (M, N, K).
- Code that uses the common interface to initialize and call the underlying FP8 GemmKernel with tensorwise scaling.
- This separation lets us add new instances (manually for now, auto-generated next) in (a) and use them in (d) through the common `cutlass_extensions::GemmOperationWrapper3x`.

# Next Steps (planned for upcoming diffs)

- Auto-generate kernel instances.
- Apply the same structure to `f8f8bf16_rowwise`, `f8f8bf16_blockwise`, and the other kernels in the legacy `fbcode/deeplearning/fbgemm/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions.cu`. I will need help here if we want to move all the kernels to the new structure.
- Give some thought to creating `f8f8bf16_profile`, which would run all operations matching a `GemmFunctionalKey` and build a data structure mapping important problem shapes to a `GemmPerformanceKey`, usable at runtime from within `Fp8GemmTensorWiseHeuristics::_strict_heurstics`.

# More Notes

- We are using some structures from the `cutlass::library` namespace, which has components for properly building a library out of CUTLASS kernels; the structures that need changes live in `cutlass_extensions`. We spell out the namespaces rather than putting `using` declarations at the top of the file, so it is clear which structures come from `cutlass::library` and which from `cutlass_extensions`. This will be helpful if and when we upstream `cutlass_extensions` components to NVIDIA/CUTLASS.
# Some Performance Runs

Performance results: [sweep on llama shapes](https://docs.google.com/spreadsheets/d/1DaP0H2MGPo3A07gbuHIgRqCzFAzU5D3_QsXbxg9eisU/edit?gid=1195193097#gid=1195193097), [sweep on random shapes](https://docs.google.com/spreadsheets/d/1DaP0H2MGPo3A07gbuHIgRqCzFAzU5D3_QsXbxg9eisU/edit?gid=1268977823#gid=1268977823). Performance is equivalent, as this diff does not change the heuristics, the number of kernels, or the kernel types; it lays the groundwork for making all of those changes in upcoming diffs.

Differential Revision: D59240214
manishucsd force-pushed the export-D59240214 branch from 26ca504 to e978472 on July 19, 2024 at 21:15
manishucsd force-pushed the export-D59240214 branch from e978472 to 55c4f37 on July 19, 2024 at 21:51
This pull request has been merged in commit 3634f22.