
Add BF16 in FP8 quantize ops #1961

Closed · wants to merge 1 commit
Commits on Aug 24, 2023

  1. Add BF16 in FP8 quantize ops (pytorch#1961)

    Summary:
    Pull Request resolved: pytorch#1961
    
    - Added output_dtype to select half, bfloat16, or float as the
      output type of the dequantization functions; it is currently an
      integer code defined by Sparse_dtype (float: 0, half: 1,
      bfloat16: 5).
    - Added type conversion in the quant and dequant kernels, using the
      native CUDA/HIP half-to-float conversion functions and writing
      the conversions out explicitly (see the sketch after this
      summary).
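
    The diff itself is not shown here, so the following is a minimal sketch of
    what an explicit, dtype-dispatched dequantization could look like. Only the
    Sparse_dtype codes (float: 0, half: 1, bfloat16: 5) and the use of native
    CUDA/HIP conversion intrinsics come from the summary above; the kernel
    name, signature, and row-wise scaling layout are illustrative assumptions,
    not the actual FBGEMM kernel.

```cuda
// Hypothetical sketch of a dequantize kernel with a selectable output dtype.
// The Sparse_dtype codes below are from the commit summary; everything else
// (name, signature, row-wise scale layout) is assumed for illustration.
#include <cstdint>
#include <cuda_fp16.h>
#include <cuda_bf16.h>

__global__ void dequantize_fp8_sketch(
    const uint8_t* __restrict__ input,
    const float* __restrict__ scales, // one scale per row (assumed layout)
    void* __restrict__ output,
    int output_dtype, // Sparse_dtype code: float: 0, half: 1, bfloat16: 5
    int ncols) {
  const int row = blockIdx.x;
  const float scale = scales[row];
  for (int col = threadIdx.x; col < ncols; col += blockDim.x) {
    const int idx = row * ncols + col;
    // Do the arithmetic in float, then convert to the requested output
    // type explicitly with the native CUDA conversion intrinsics.
    const float val = static_cast<float>(input[idx]) * scale;
    if (output_dtype == 0) {
      static_cast<float*>(output)[idx] = val;
    } else if (output_dtype == 1) {
      static_cast<__half*>(output)[idx] = __float2half(val);
    } else if (output_dtype == 5) {
      static_cast<__nv_bfloat16*>(output)[idx] = __float2bfloat16(val);
    }
  }
}
```

    A quantize kernel would perform the inverse dispatch, converting each
    input element to float first (e.g., via __half2float or __bfloat162float)
    before scaling and packing.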
    
    Reviewed By: jianyuh
    
    Differential Revision: D47904459
    
    fbshipit-source-id: d48da0fc7b0b158c46628952a7c7ec8e1aa502df
    sryap authored and facebook-github-bot committed Aug 24, 2023
    Commit 19fb8e1