
Add BF16 in FP8 quantize ops #1961

Closed · wants to merge 1 commit
Commits on Aug 24, 2023

  1. Add BF16 in FP8 quantize ops (pytorch#1961)

    Summary:
    Pull Request resolved: pytorch#1961
    
    - Added output_dtype to select half, bfloat16, or float as the
      output type of the dequantization functions; it is currently an
      integer code defined by Sparse_dtype (float: 0, half: 1,
      bfloat16: 5).
    - Added type conversion in the quant and dequant kernels, using the
      native CUDA/HIP half-to-float conversion functions and writing
      the conversions out explicitly (see the sketch after this
      summary).
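
    The diff itself is not shown here, so the following is a minimal sketch of
    what an explicit, dtype-dispatched dequantization could look like. Only the
    Sparse_dtype codes (float: 0, half: 1, bfloat16: 5) and the use of native
    CUDA/HIP conversion intrinsics come from the summary above; the kernel
    name, signature, and row-wise scaling layout are illustrative assumptions,
    not the actual FBGEMM kernel.

```cuda
// Hypothetical sketch of a dequantize kernel with a selectable output dtype.
// The Sparse_dtype codes below are from the commit summary; everything else
// (name, signature, row-wise scale layout) is assumed for illustration.
#include <cstdint>
#include <cuda_fp16.h>
#include <cuda_bf16.h>

__global__ void dequantize_fp8_sketch(
    const uint8_t* __restrict__ input,
    const float* __restrict__ scales, // one scale per row (assumed layout)
    void* __restrict__ output,
    int output_dtype, // Sparse_dtype code: float: 0, half: 1, bfloat16: 5
    int ncols) {
  const int row = blockIdx.x;
  const float scale = scales[row];
  for (int col = threadIdx.x; col < ncols; col += blockDim.x) {
    const int idx = row * ncols + col;
    // Do the arithmetic in float, then convert to the requested output
    // type explicitly with the native CUDA conversion intrinsics.
    const float val = static_cast<float>(input[idx]) * scale;
    if (output_dtype == 0) {
      static_cast<float*>(output)[idx] = val;
    } else if (output_dtype == 1) {
      static_cast<__half*>(output)[idx] = __float2half(val);
    } else if (output_dtype == 5) {
      static_cast<__nv_bfloat16*>(output)[idx] = __float2bfloat16(val);
    }
  }
}
```

    A quantize kernel would perform the inverse dispatch, converting each
    input element to float first (e.g., via __half2float or __bfloat162float)
    before scaling and packing.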
    
    Reviewed By: jianyuh
    
    Differential Revision: D47904459
    
    fbshipit-source-id: d48da0fc7b0b158c46628952a7c7ec8e1aa502df
    sryap authored and facebook-github-bot committed Aug 24, 2023
    Commit 19fb8e1