[QST] question about cutlass int8 *int8 = int32 kernel #1163

Closed
zwshan opened this issue Oct 31, 2023 · 5 comments
Labels: question


zwshan commented Oct 31, 2023

What is your question?
Hello, I am very interested in INT8 matrix multiplication in Cutlass, but I have run into some confusion while trying to use it. Let's assume we have a matrix A multiplied by a matrix B, where both original matrices are in fp32.

My first question is: If I quantize both matrix A and matrix B to INT8 (with scale factors, scale A and scale B), and I use the int8 * int8 = int8 API in Cutlass, then when I want to de-quantize the INT8 result matrix back to fp32, I simply need to multiply the INT8 result by scale A and scale B. However, if I use int8 * int8 = int32, how should I de-quantize it?
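My guess for the int32 case, sketched in plain C++ (the function name is mine, not a CUTLASS API), is that de-quantization is still a single multiply by both scales per element, but I would like to confirm:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch (plain C++, not CUTLASS): if A_int8 = round(A_fp32 / scaleA) and
// B_int8 = round(B_fp32 / scaleB), then the int32 accumulator approximates
// C_fp32 / (scaleA * scaleB), so de-quantization is one multiply per element.
std::vector<float> dequantize_int32(const std::vector<int32_t>& c_int32,
                                    float scaleA, float scaleB) {
  std::vector<float> c_fp32(c_int32.size());
  for (std::size_t i = 0; i < c_int32.size(); ++i) {
    c_fp32[i] = static_cast<float>(c_int32[i]) * scaleA * scaleB;
  }
  return c_fp32;
}
```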

My second question is: Suppose I have a series of consecutive matrix multiplications, where the output of the previous matrix multiplication is the input for the next one. Do I need to quantize and de-quantize before and after each matrix multiplication, or should I only quantize before the first matrix multiplication and de-quantize after the last matrix multiplication? I would greatly appreciate your assistance in answering these questions.


zwshan commented Oct 31, 2023

Hello @alexsamardzic, could you please help me answer these questions?


zwshan commented Oct 31, 2023

Hello @manishucsd, could you please help me?

manishucsd (Contributor) commented

Q1) Are you looking for int32_t to float de-quantization? If that is what you are looking for, you can find all the conversions in numeric_conversion.h.
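For example, a minimal host-side sketch of that int32_t → float path, assuming per-tensor scales (the converter comes from numeric_conversion.h; the scaling multiply is the standard de-quantization step, not a specific CUTLASS API):

```cpp
#include "cutlass/numeric_conversion.h"
#include <cstdint>

// De-quantize one int32 accumulator element: convert int32 -> float with the
// CUTLASS converter, then apply the two per-tensor quantization scales.
float dequantize_element(int32_t acc, float scaleA, float scaleB) {
  cutlass::NumericConverter<float, int32_t> to_float;
  return to_float(acc) * scaleA * scaleB;
}
```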

Q2) I think you are asking about GEMM-GEMM fusion with de-quantization and scaling involved?

Ideally, if the output of the previous GEMM is quantized, you should keep it quantized until it reaches registers (closer to the math units) for performance reasons, from a pure GEMM perspective, but you might need some scaling (additional operations) to handle accuracy. The scaling can also be fused into the mainloop of the second GEMM.
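As a rough plain-C++ illustration of that scaling step between two quantized GEMMs (a sketch assuming per-tensor scales; scaleC is a hypothetical scale chosen for the second GEMM's int8 input, and in practice this math would be fused into the first GEMM's epilogue or the second GEMM's mainloop):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Re-quantize the int32 output of GEMM1 to int8 so it can feed GEMM2 directly:
// scaleA * scaleB de-quantizes the accumulator to fp32 magnitude, and dividing
// by scaleC re-quantizes it into the int8 range chosen for GEMM2's input.
int8_t requantize(int32_t acc, float scaleA, float scaleB, float scaleC) {
  float x = static_cast<float>(acc) * (scaleA * scaleB / scaleC);
  x = std::min(127.0f, std::max(-128.0f, std::round(x)));
  return static_cast<int8_t>(x);
}
```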

Can you list out your sequence of operations with data types and problem shapes? Hopefully we can provide insights once we have more details of what you're trying to implement.


zwshan commented Nov 2, 2023

> Q2) I think you are asking about GEMM-GEMM fusion with de-quantization and scaling involved?
>
> Ideally, if the output of the previous GEMM is quantized, you should keep it quantized until it reaches registers (closer to the math units) for performance reasons, from a pure GEMM perspective, but you might need some scaling (additional operations) to handle accuracy. The scaling can also be fused into the mainloop of the second GEMM.

I'm sorry, I'm not entirely clear on what "GEMM-GEMM fusion with de-quantization and scaling involved" means. If you could clarify this, I would greatly appreciate it. Additionally, if you could guide me on how to implement fp32-to-int8 quantization in the latest version of Cutlass for the code I've listed, I would be very thankful.
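For reference, what I mean by fp32 to int8 is standard symmetric per-tensor quantization, roughly like this (plain C++, the names are mine):

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Symmetric per-tensor quantization: scale = max|x| / 127, then
// q = clamp(round(x / scale), -127, 127). Returns the int8 data together
// with the scale needed later for de-quantization.
std::pair<std::vector<int8_t>, float> quantize_fp32_to_int8(
    const std::vector<float>& x) {
  float max_abs = 0.0f;
  for (float v : x) max_abs = std::max(max_abs, std::fabs(v));
  float scale = (max_abs > 0.0f) ? max_abs / 127.0f : 1.0f;
  std::vector<int8_t> q(x.size());
  for (std::size_t i = 0; i < x.size(); ++i) {
    float r = std::round(x[i] / scale);
    q[i] = static_cast<int8_t>(std::min(127.0f, std::max(-127.0f, r)));
  }
  return {q, scale};
}
```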


zwshan commented Nov 3, 2023

I have read some blog posts that explain how to make this work. Please keep this issue open until I post a reply, thanks.
