
FP8 kernel development and enablement. #2866

Closed

Conversation

@manishucsd
Contributor

commented Jul 19, 2024

Summary:
This diff creates a library of CUTLASS kernels, starting with FP8 tensorwise scaled GEMMs, as cutlass_extensions to be used within Meta. CUTLASS extensions v2 is being built with the intention of consolidating all of Meta's CUTLASS use cases within cutlass_extensions.

Differential Revision: D59240214


netlify bot commented Jul 19, 2024

Deploy Preview for pytorch-fbgemm-docs ready!

| Name | Link |
|------|------|
| 🔨 Latest commit | 55c4f37 |
| 🔍 Latest deploy log | https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/669adfd3cc9d66000860b495 |
| 😎 Deploy Preview | https://deploy-preview-2866--pytorch-fbgemm-docs.netlify.app |

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D59240214

manishucsd added a commit to manishucsd/FBGEMM that referenced this pull request Jul 19, 2024
Summary:
Pull Request resolved: pytorch#2866

This diff creates a library of CUTLASS kernels, starting with FP8 tensorwise scaled GEMMs, as cutlass_extensions to be used within Meta. CUTLASS extensions v2 is being built with the intention of consolidating all of Meta's CUTLASS use cases within cutlass_extensions.

# Itemized Summary
a) Isolate each kernel instance in its own file within a separate folder `fbcode/deeplearning/fbgemm/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/fp8_gemm_with_tensorwise`.

  - These kernel instances follow the OSS device-level API. We will have a similar structure for rowwise and tensorwise and follow CUTLASS's device-side 3.x API. This will allow us to upstream the kernels and code changes to NVIDIA/CUTLASS.
  - Speeds up compilation times.
  - Eventually, we will auto-generate these instances the same way as in OSS using our own version of `generator.py`, but instances can also be manually instantiated to check one-off kernels. (A sketch of one such instance follows this list.)
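
For context, a minimal sketch of what one such isolated instance can look like against CUTLASS's device-side 3.x builder API (the file name, tile shape, and cluster shape below are illustrative, not the actual instances in this diff):

```cpp
// fp8_gemm_with_tensorwise/<one_instance>.cu  (illustrative file name)
#include "cutlass/cutlass.h"
#include "cutlass/numeric_types.h"
#include "cutlass/epilogue/collective/collective_builder.hpp"
#include "cutlass/gemm/collective/collective_builder.hpp"
#include "cutlass/gemm/device/gemm_universal_adapter.h"
#include "cutlass/gemm/kernel/gemm_universal.hpp"

using namespace cute;

// FP8 (e4m3) x FP8 (e4m3) -> BF16, FP32 accumulation, TN layout.
using ElementA           = cutlass::float_e4m3_t;
using ElementB           = cutlass::float_e4m3_t;
using ElementD           = cutlass::bfloat16_t;
using ElementAccumulator = float;

using TileShape    = Shape<_128, _128, _128>;  // illustrative CTA tile
using ClusterShape = Shape<_2, _1, _1>;        // illustrative cluster shape

// Build the epilogue first so the mainloop can carve out its shared memory.
using CollectiveEpilogue = typename cutlass::epilogue::collective::CollectiveBuilder<
    cutlass::arch::Sm90, cutlass::arch::OpClassTensorOp,
    TileShape, ClusterShape,
    cutlass::epilogue::collective::EpilogueTileAuto,
    ElementAccumulator, ElementAccumulator,
    cutlass::bfloat16_t, cutlass::layout::RowMajor, 8,  // C
    ElementD, cutlass::layout::RowMajor, 8,             // D
    cutlass::epilogue::collective::EpilogueScheduleAuto>::CollectiveOp;

using CollectiveMainloop = typename cutlass::gemm::collective::CollectiveBuilder<
    cutlass::arch::Sm90, cutlass::arch::OpClassTensorOp,
    ElementA, cutlass::layout::RowMajor, 16,     // A, 128-bit aligned
    ElementB, cutlass::layout::ColumnMajor, 16,  // B, 128-bit aligned
    ElementAccumulator,
    TileShape, ClusterShape,
    cutlass::gemm::collective::StageCountAutoCarveout<
        static_cast<int>(sizeof(typename CollectiveEpilogue::SharedStorage))>,
    cutlass::gemm::collective::KernelScheduleAuto>::CollectiveOp;

using GemmKernel = cutlass::gemm::kernel::GemmUniversal<
    Shape<int, int, int, int>,  // problem shape (M, N, K, L)
    CollectiveMainloop,
    CollectiveEpilogue>;

using Gemm = cutlass::gemm::device::GemmUniversalAdapter<GemmKernel>;
```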

b) Create a common calling interface, class `cutlass_extensions::GemmOperationWrapper3x<GemmKernel>`, built on the interface provided by `cutlass::library::Operation`.
   - Allows us to initialize, update arguments, and run every CUTLASS kernel through a common interface.
   - Creates a runtime description object `cutlass_extensions::GemmDescription` of the kernel's compile-time parameters.
   - Allows library code built on top of CUTLASS kernels to create a table of various operations (see operation_table.h/cu).
   - Separates the CUDA kernel template, host-side code, host-side call, and heuristics. (A sketch of the wrapper follows this list.)
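
The wrapper type-erases the compile-time `GemmKernel` behind the virtual interface of `cutlass::library::Operation`. A rough header-level sketch (the method set follows `cutlass::library::Operation`; the wrapper internals here are assumed, not the diff's actual code):

```cpp
#include <cuda_runtime.h>

#include "cutlass/cutlass.h"
#include "cutlass/gemm/device/gemm_universal_adapter.h"
#include "cutlass/library/library.h"

namespace cutlass_extensions {

// One wrapper instantiation per kernel instance from (a); all of them are
// reachable through the common cutlass::library::Operation base class.
template <typename GemmKernel>
class GemmOperationWrapper3x : public cutlass::library::Operation {
 public:
  using Gemm = cutlass::gemm::device::GemmUniversalAdapter<GemmKernel>;

  // Runtime description of the kernel's compile-time parameters
  // (dtypes, tile shape, cluster shape, schedule, ...).
  cutlass::library::OperationDescription const& description() const override {
    return description_;
  }

  // Declarations only; bodies would forward to Gemm::can_implement /
  // Gemm::initialize / Gemm::run.
  cutlass::Status can_implement(void const* configuration,
                                void const* arguments) const override;
  uint64_t get_host_workspace_size(void const* configuration) const override;
  uint64_t get_device_workspace_size(void const* configuration,
                                     void const* arguments) const override;
  cutlass::Status initialize(void const* configuration, void* host_workspace,
                             void* device_workspace,
                             cudaStream_t stream) const override;
  cutlass::Status run(void const* arguments, void* host_workspace,
                      void* device_workspace, cudaStream_t stream) const override;

 private:
  // Stands in for the diff's cutlass_extensions::GemmDescription.
  cutlass::library::GemmDescription description_;
};

}  // namespace cutlass_extensions
```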

c) An operation table to hold various GEMM operations, keyed by `cutlass_extensions::GemmFunctionalKey` and `cutlass_extensions::GemmPerformanceKey`.
   - The structure is used to select the functionally matching operation map, followed by selecting an operation from that map using a `cutlass_extensions::GemmPerformanceKey`. (A sketch of the two-level table follows.)
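
Conceptually, the table is a two-level map. A hypothetical sketch of its shape (the key fields below are illustrative; the diff's actual keys carry more state):

```cpp
#include <map>
#include <tuple>

#include "cutlass/library/library.h"

namespace cutlass_extensions {

// "What" the kernel computes: dtypes (plus, in practice, layouts,
// accumulator type, scaling mode, ...).
struct GemmFunctionalKey {
  cutlass::library::NumericTypeID element_a;
  cutlass::library::NumericTypeID element_b;
  cutlass::library::NumericTypeID element_d;

  bool operator<(GemmFunctionalKey const& rhs) const {
    return std::tie(element_a, element_b, element_d) <
           std::tie(rhs.element_a, rhs.element_b, rhs.element_d);
  }
};

// "How" the kernel computes it: CTA tile and cluster shape
// (plus, in practice, stage count, schedule, ...).
struct GemmPerformanceKey {
  int cta_m, cta_n, cta_k;   // CTA tile shape
  int cluster_m, cluster_n;  // thread-block cluster shape

  bool operator<(GemmPerformanceKey const& rhs) const {
    return std::tie(cta_m, cta_n, cta_k, cluster_m, cluster_n) <
           std::tie(rhs.cta_m, rhs.cta_n, rhs.cta_k, rhs.cluster_m,
                    rhs.cluster_n);
  }
};

// Two-level lookup: functional key -> (performance key -> operation).
using GemmOperationMap =
    std::map<GemmPerformanceKey, cutlass::library::Operation const*>;
using GemmOperationTable = std::map<GemmFunctionalKey, GemmOperationMap>;

}  // namespace cutlass_extensions
```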

d) The main driver code for f8f8bf16 tensorwise is in `fbcode/deeplearning/fbgemm/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f8f8bf16.cu`. This file has the following:
 - A separate operation-specific heuristic class to select an instance of `cutlass_extensions::GemmOperationWrapper3x<GemmKernel>` based on the problem shape (M, N, K).
 - Uses the common interface to initialize and call the underlying FP8 GemmKernel with tensorwise scaling.
 - This separation allows us to add new instances (manually for now, auto-generated next) in (a) and use them in (d) through the common `cutlass_extensions::GemmOperationWrapper3x`. (A sketch of the driver flow follows this list.)
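
The driver flow then reads roughly as follows (a hypothetical sketch reusing the table types above; `pick_variant` and its thresholds are stand-ins for the diff's `Fp8GemmTensorWiseHeuristics`, and error handling is elided):

```cpp
#include <cuda_runtime.h>

#include "cutlass/library/library.h"

// Stand-in for Fp8GemmTensorWiseHeuristics: map (M, N, K) to a performance
// key. Thresholds and tile choices here are illustrative only.
cutlass_extensions::GemmPerformanceKey pick_variant(int M, int N, int K) {
  if (M <= 64) {
    return {64, 128, 128, 1, 1};  // small-M: smaller CTA tile
  }
  return {128, 128, 128, 2, 1};   // default variant
}

// f8f8bf16-style driver: functional match, then heuristic pick, then the
// common initialize/run interface from (b).
cutlass::Status run_f8f8bf16_tensorwise(
    cutlass_extensions::GemmOperationTable const& table,
    cutlass_extensions::GemmFunctionalKey const& functional_key,
    int M, int N, int K,
    void const* configuration, void const* arguments,
    void* host_workspace, void* device_workspace, cudaStream_t stream) {
  // 1) Select the map of operations that can compute f8f8bf16 tensorwise.
  auto map_it = table.find(functional_key);
  if (map_it == table.end()) {
    return cutlass::Status::kErrorNotSupported;
  }

  // 2) Select one operation from that map for this problem shape.
  auto op_it = map_it->second.find(pick_variant(M, N, K));
  if (op_it == map_it->second.end()) {
    return cutlass::Status::kErrorNotSupported;
  }

  // 3) Initialize and launch through the common Operation interface.
  cutlass::library::Operation const* op = op_it->second;
  cutlass::Status status =
      op->initialize(configuration, host_workspace, device_workspace, stream);
  if (status != cutlass::Status::kSuccess) {
    return status;
  }
  return op->run(arguments, host_workspace, device_workspace, stream);
}
```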

# Next Steps (planned for the next diffs)
- Auto-generate kernel instances.
- Apply the same structure to `f8f8bf16_rowwise`, `f8f8bf16_blockwise`, and other kernels in the legacy `fbcode/deeplearning/fbgemm/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions.cu`. I will need help here if we want to move all the kernels to the new structure.
- Give some thought to creating `f8f8bf16_profile`, which runs all the operations that match a `GemmFunctionalKey` and creates a data structure mapping important problem shapes to `GemmPerformanceKey`, to be used at runtime from within `Fp8GemmTensorWiseHeuristics::_strict_heurstics`. (One hypothetical shape of that structure is sketched below.)
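
For the profile step, one hypothetical shape for the generated structure (names assumed, not in this diff):

```cpp
#include <map>
#include <tuple>

// Hypothetical offline-profiling output: for each important problem shape,
// the fastest variant found by sweeping every operation under one
// GemmFunctionalKey. _strict_heurstics could consult this at runtime.
using ProblemShapeMNK = std::tuple<int, int, int>;
using ProfiledBestVariants =
    std::map<ProblemShapeMNK, cutlass_extensions::GemmPerformanceKey>;
```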

# More notes
 - We are using some structures from the `cutlass::library` namespace, which has components for properly building a library out of CUTLASS kernels; the structures that need changes live in `cutlass_extensions`. We have spelled out the namespaces instead of putting `using` declarations at the top of the file, so it is clear which structures come from `cutlass::library` and which from `cutlass_extensions`. This will be helpful if and when we upstream `cutlass_extensions` components to NVIDIA/CUTLASS.

# Some performance runs
Performance results: [sweep on llama shapes](https://docs.google.com/spreadsheets/d/1DaP0H2MGPo3A07gbuHIgRqCzFAzU5D3_QsXbxg9eisU/edit?gid=1195193097#gid=1195193097), [sweeps on random shapes](https://docs.google.com/spreadsheets/d/1DaP0H2MGPo3A07gbuHIgRqCzFAzU5D3_QsXbxg9eisU/edit?gid=1268977823#gid=1268977823). Performance is equivalent, since this diff doesn't change the heuristics, the number of kernels, or the kernel types; it only restructures the code so that all of these changes can be made in the upcoming diffs.

Differential Revision: D59240214

@facebook-github-bot
Contributor

This pull request has been merged in 3634f22.
