[QST] Question about CUTLASS int8 * int8 = int32 kernel #1163
Comments
Hello @alexsamardzic, could you please help me answer this question?
Hello @manishucsd, could you please help me?
Q1) Are you looking for
Q2) I think you are asking for GEMM-GEMM fusion with de-quantization and scaling involved? Ideally, if the output of the previous GEMM is quantized, you should keep it quantized until it reaches registers (closer to the math units) for performance reasons, from a pure GEMM perspective, but you might need some scaling (additional operations) to handle accuracy. The scaling can also be fused into the mainloop of the second GEMM. Can you list out your sequence of operations with data types and problem shapes? Hopefully we can provide insights once we have more details of what you are trying to implement.
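For concreteness, here is a minimal sketch of the kind of fused scaling described above, assuming the CUTLASS 2.x device-level API and an SM80 target; the tile shapes are illustrative rather than tuned, and this is not a drop-in kernel from this thread:

```cpp
// Sketch: int8 x int8 GEMM accumulating in int32, with a float
// LinearCombination epilogue so the de-quantization scale is fused
// into the GEMM instead of being a separate pass over memory.
#include "cutlass/gemm/device/gemm.h"
#include "cutlass/epilogue/thread/linear_combination.h"

using GemmInt8 = cutlass::gemm::device::Gemm<
    int8_t, cutlass::layout::RowMajor,        // A (quantized)
    int8_t, cutlass::layout::ColumnMajor,     // B (quantized)
    float,  cutlass::layout::RowMajor,        // D (de-quantized output)
    int32_t,                                  // accumulator type
    cutlass::arch::OpClassTensorOp,
    cutlass::arch::Sm80,
    cutlass::gemm::GemmShape<128, 128, 64>,   // threadblock tile
    cutlass::gemm::GemmShape<64, 64, 64>,     // warp tile
    cutlass::gemm::GemmShape<16, 8, 32>,      // tensor core instruction
    cutlass::epilogue::thread::LinearCombination<
        float, 4, int32_t, float>>;           // D = alpha * acc + beta * C

// Launching with alpha = scale_A * scale_B and beta = 0 performs the
// de-quantization in the epilogue: D_fp32 = (scale_A * scale_B) * acc_int32.
```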
I'm sorry, I'm not particularly clear on the meaning of "GEMM-GEMM fusion with de-quantization and scaling involved." If you could clarify this, I would greatly appreciate it. Additionally, if you could guide me on how to implement fp32 to int8 in the latest version of CUTLASS for the code I've listed, I would be very thankful.
I have read some blog posts that explain how to make it work. Please keep this issue open until I reply, thanks.
What is your question?
Hello, I am very interested in INT8 matrix multiplication in CUTLASS, but I have encountered some confusion while trying to use it. Let's assume we have a matrix A multiplied by a matrix B, where both original matrices are in fp32.
My first question: if I quantize both matrix A and matrix B to INT8 (with scale factors scale A and scale B) and use the int8 * int8 = int8 API in CUTLASS, then when I want to de-quantize the INT8 result matrix back to fp32, I simply multiply the INT8 result by scale A and scale B. However, if I use int8 * int8 = int32, how should I de-quantize it?
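For reference, a minimal sketch of the de-quantization arithmetic for the int32 case, assuming symmetric per-tensor quantization (A ≈ scale_A · A_q, B ≈ scale_B · B_q); the function name here is hypothetical, not a CUTLASS API:

```cpp
#include <cstdint>

// De-quantize one element of an int8 x int8 = int32 GEMM result.
// The int32 accumulator is the exact integer dot product A_q . B_q,
// so recovering fp32 is a single multiply by both scales.
float dequantize(int32_t acc, float scale_A, float scale_B) {
    return scale_A * scale_B * static_cast<float>(acc);
}
```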
My second question is: Suppose I have a series of consecutive matrix multiplications, where the output of the previous matrix multiplication is the input for the next one. Do I need to quantize and de-quantize before and after each matrix multiplication, or should I only quantize before the first matrix multiplication and de-quantize after the last matrix multiplication? I would greatly appreciate your assistance in answering these questions.
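As a rough sketch of the second alternative, where the data stays in INT8 between GEMMs: the int32 output of one GEMM is re-quantized with the next layer's input scale. This helper is hypothetical, not a CUTLASS API, and in practice this step would typically be fused into each GEMM's epilogue:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Re-quantize one int32 accumulator element into the int8 input of the
// next GEMM: de-quantize with scale_A * scale_B, then quantize with the
// next layer's input scale (scale_out), saturating to the int8 range.
int8_t requantize(int32_t acc, float scale_A, float scale_B, float scale_out) {
    float v = (scale_A * scale_B / scale_out) * static_cast<float>(acc);
    v = std::round(v);
    v = std::min(127.0f, std::max(-128.0f, v));  // saturate to [-128, 127]
    return static_cast<int8_t>(v);
}
```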