[QST] Question about CUTLASS int8 * int8 = int32 kernel #1163
Comments
Hello @alexsamardzic, could you please help me answer this question?
Hello @manishucsd, could you please help me?
Q1) Are you looking for
Q2) I think you are asking for GEMM-GEMM fusion with de-quantization and scaling involved? Ideally, if the output of the previous GEMM is quantized, you should keep it quantized until it reaches registers (closer to the math units) for performance reasons, from a pure GEMM perspective, but you might need some scaling (additional operations) to handle accuracy. The scaling can also be fused into the mainloop of the second GEMM. Can you list out your sequence of operations with data types and problem shapes? Hopefully we can provide insights once we have more details of what you are trying to implement.
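For concreteness, here is a minimal sketch of the kind of fused scaling described above, assuming the CUTLASS 2.x device-level API and an SM80 target; the tile shapes are illustrative rather than tuned, and this is not a drop-in kernel from this thread:

```cpp
// Sketch: int8 x int8 GEMM accumulating in int32, with a float
// LinearCombination epilogue so the de-quantization scale is fused
// into the GEMM instead of being a separate pass over memory.
#include "cutlass/gemm/device/gemm.h"
#include "cutlass/epilogue/thread/linear_combination.h"

using GemmInt8 = cutlass::gemm::device::Gemm<
    int8_t, cutlass::layout::RowMajor,        // A (quantized)
    int8_t, cutlass::layout::ColumnMajor,     // B (quantized)
    float,  cutlass::layout::RowMajor,        // D (de-quantized output)
    int32_t,                                  // accumulator type
    cutlass::arch::OpClassTensorOp,
    cutlass::arch::Sm80,
    cutlass::gemm::GemmShape<128, 128, 64>,   // threadblock tile
    cutlass::gemm::GemmShape<64, 64, 64>,     // warp tile
    cutlass::gemm::GemmShape<16, 8, 32>,      // tensor core instruction
    cutlass::epilogue::thread::LinearCombination<
        float, 4, int32_t, float>>;           // D = alpha * acc + beta * C

// Launching with alpha = scale_A * scale_B and beta = 0 performs the
// de-quantization in the epilogue: D_fp32 = (scale_A * scale_B) * acc_int32.
```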
I'm sorry, I'm not particularly clear on the meaning of "GEMM-GEMM fusion with de-quantization and scaling involved." If you could clarify this, I would greatly appreciate it. Additionally, if you could guide me on how to implement fp32 to int8 in the latest version of CUTLASS for the code I've listed, I would be very thankful.
I have read some blog posts that explain how to make it work. Please keep this issue open until I reply, thanks.
What is your question?
Hello, I am very interested in INT8 matrix multiplication in CUTLASS, but I have encountered some confusion while trying to use it. Let's assume we have a matrix A multiplied by a matrix B, where both original matrices are in fp32.
My first question: if I quantize both matrix A and matrix B to INT8 (with scale factors scale A and scale B) and use the int8 * int8 = int8 API in CUTLASS, then when I want to de-quantize the INT8 result matrix back to fp32, I simply multiply the INT8 result by scale A and scale B. However, if I use int8 * int8 = int32, how should I de-quantize it?
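For reference, a minimal sketch of the de-quantization arithmetic for the int32 case, assuming symmetric per-tensor quantization (A ≈ scale_A · A_q, B ≈ scale_B · B_q); the function name here is hypothetical, not a CUTLASS API:

```cpp
#include <cstdint>

// De-quantize one element of an int8 x int8 = int32 GEMM result.
// The int32 accumulator is the exact integer dot product A_q . B_q,
// so recovering fp32 is a single multiply by both scales.
float dequantize(int32_t acc, float scale_A, float scale_B) {
    return scale_A * scale_B * static_cast<float>(acc);
}
```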
My second question is: Suppose I have a series of consecutive matrix multiplications, where the output of the previous matrix multiplication is the input for the next one. Do I need to quantize and de-quantize before and after each matrix multiplication, or should I only quantize before the first matrix multiplication and de-quantize after the last matrix multiplication? I would greatly appreciate your assistance in answering these questions.
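As a rough sketch of the second alternative, where the data stays in INT8 between GEMMs: the int32 output of one GEMM is re-quantized with the next layer's input scale. This helper is hypothetical, not a CUTLASS API, and in practice this step would typically be fused into each GEMM's epilogue:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Re-quantize one int32 accumulator element into the int8 input of the
// next GEMM: de-quantize with scale_A * scale_B, then quantize with the
// next layer's input scale (scale_out), saturating to the int8 range.
int8_t requantize(int32_t acc, float scale_A, float scale_B, float scale_out) {
    float v = (scale_A * scale_B / scale_out) * static_cast<float>(acc);
    v = std::round(v);
    v = std::min(127.0f, std::max(-128.0f, v));  // saturate to [-128, 127]
    return static_cast<int8_t>(v);
}
```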