FP8 kernel development and enablement. #2866
Closed
Conversation
This pull request was exported from Phabricator. Differential Revision: D59240214
manishucsd added a commit to manishucsd/FBGEMM that referenced this pull request on Jul 19, 2024
Summary: Pull Request resolved: pytorch#2866

This diff creates a library of CUTLASS kernels, starting with FP8 tensorwise scaled GEMMs, as cutlass_extensions to be used within Meta. CUTLASS extensions v2 is being built with the intention of consolidating all of Meta's CUTLASS use cases within cutlass_extensions.

# Itemized Summary

a) Isolate each kernel instance in its own file within a separate folder `fbcode/deeplearning/fbgemm/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/fp8_gemm_with_tensorwise`.
- These kernel instances follow the OSS device-level API. We will have a similar structure for rowwise and tensorwise, following CUTLASS's device-side 3.x API. This will allow us to upstream the kernels and code changes to NVIDIA/CUTLASS.
- Speeds up compilation times.
- Eventually, we will auto-generate these instances the same way as in OSS using our own version of `generator.py`, but instances can also be instantiated manually to check one-off kernels.

b) Create a common calling-interface class `cutlass_extensions::GemmOperationWrapper3x<GemmKernel>`, built on the interface provided by `cutlass::library::Operation`.
- Allows us to initialize, update arguments for, and run every CUTLASS kernel through a common interface.
- Creates a runtime description object `cutlass_extensions::GemmDescription` of the kernel's compile-time parameters.
- Allows library code built on top of CUTLASS kernels to create a table of the various operations (see operation_table.h/cu).
- Separates the CUDA kernel template, host-side code, host-side call, and heuristics.

c) An operation table that holds the various GEMM operations, keyed by `cutlass_extensions::GemmFunctionalKey` and `cutlass_extensions::GemmPerformanceKey`.
- The structure is used to select the functionally matching operation map, then to select an operation from that map using a `cutlass_extensions::GemmPerformanceKey`.
d) The main driver code for f8f8bf16 tensorwise is in `fbcode/deeplearning/fbgemm/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f8f8bf16.cu`. This file contains:
- A separate operation-specific heuristics class that selects an instance of `cutlass_extensions::GemmOperationWrapper3x<GemmKernel>` based on the problem shape (M, N, K).
- Code that uses the common interface to initialize and call the underlying FP8 GemmKernel with tensorwise scaling.
- This separation lets us add new instances (manually for now, auto-generated next) in (a) and use them in (d) through the common `cutlass_extensions::GemmOperationWrapper3x`.

# Next Steps (planned for upcoming diffs)

- Auto-generate kernel instances.
- Apply the same structure to `f8f8bf16_rowwise`, `f8f8bf16_blockwise`, and the other kernels in the legacy `fbcode/deeplearning/fbgemm/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions.cu`. I will need help here if we want to move all the kernels to the new structure.
- Give some thought to creating `f8f8bf16_profile`, which would run all operations matching a `GemmFunctionalKey` and build a data structure mapping important problem shapes to a `GemmPerformanceKey`, usable at runtime from within `Fp8GemmTensorWiseHeuristics::_strict_heurstics`.

# More Notes

- We are using some structures from the `cutlass::library` namespace, which has components for properly building a library out of CUTLASS kernels; the structures that need changes live in `cutlass_extensions`. We spell out the namespaces rather than putting `using` declarations at the top of the file, so it is clear which structures come from `cutlass::library` and which from `cutlass_extensions`. This will be helpful if and when we upstream `cutlass_extensions` components to NVIDIA/CUTLASS.
# Some Performance Runs

Performance results: [sweep on llama shapes](https://docs.google.com/spreadsheets/d/1DaP0H2MGPo3A07gbuHIgRqCzFAzU5D3_QsXbxg9eisU/edit?gid=1195193097#gid=1195193097), [sweep on random shapes](https://docs.google.com/spreadsheets/d/1DaP0H2MGPo3A07gbuHIgRqCzFAzU5D3_QsXbxg9eisU/edit?gid=1268977823#gid=1268977823). Performance is equivalent, as this diff does not change the heuristics, the number of kernels, or the kernel types; it lays the groundwork for making all of those changes in upcoming diffs.

Differential Revision: D59240214
manishucsd force-pushed the export-D59240214 branch from 26ca504 to e978472 on July 19, 2024 at 21:15
manishucsd force-pushed the export-D59240214 branch from e978472 to 55c4f37 on July 19, 2024 at 21:51
This pull request has been merged in commit 3634f22.