[QST] How to use splitk in cutlass python? #1295
Comments
Good catch. This is currently missing from the Python interface.
If I want to use split-K in Python, is it feasible to write the C++ code first, compile it into a dynamic library, and then load that library with Python's ctypes? Or do you have a better suggestion? A rough sketch of what I have in mind is below.
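Something like the following, where the library name libsplitk_gemm.so, the exported C function run_splitk_gemm, and its argument order are all placeholders for a wrapper I would write around a CUTLASS split-K kernel (e.g. cutlass::gemm::device::GemmSplitKParallel), not anything that ships with CUTLASS today:

```python
import ctypes
import cupy as cp

# Hypothetical shared library built from a C wrapper around a CUTLASS split-K GEMM.
lib = ctypes.CDLL("./libsplitk_gemm.so")
lib.run_splitk_gemm.argtypes = [
    ctypes.c_void_p, ctypes.c_void_p, ctypes.c_void_p,  # A, B, D device pointers
    ctypes.c_int, ctypes.c_int, ctypes.c_int,           # m, n, k
    ctypes.c_int,                                       # split_k_slices
]
lib.run_splitk_gemm.restype = ctypes.c_int

m, n, k = 128, 128, 16384
A = cp.random.rand(m, k).astype(cp.float16)
B = cp.random.rand(k, n).astype(cp.float16)
D = cp.zeros((m, n), dtype=cp.float16)

# Pass raw CuPy device pointers to the wrapper; split_k_slices=8 is only an example value.
status = lib.run_splitk_gemm(A.data.ptr, B.data.ptr, D.data.ptr, m, n, k, 8)
assert status == 0
```

The downside is that all of the build and pointer-marshalling details are on me, which is why I'm asking whether there is a better route.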
My suggestion would be to use a tool like pybind11 to create Python-C++ bindings for calling into a CUTLASS C++ kernel from Python. For example, you can take a look at how PyTorch CUDA extensions can be created for a CUTLASS kernel by following this unit test. You could follow a similar pattern by pasting your desired CUTLASS C++ kernel into an extension of your own.
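A minimal sketch of that pattern using PyTorch's JIT extension loader; the source file splitk_gemm_ext.cu, the bound function name splitk_gemm, and the include path are assumptions about a wrapper you would write yourself around the CUTLASS kernel:

```python
import torch
from torch.utils.cpp_extension import load

# JIT-compile a C++/CUDA wrapper that launches the CUTLASS split-K kernel and
# exposes it to Python via PYBIND11_MODULE (the sources and paths are placeholders).
splitk_ext = load(
    name="splitk_gemm_ext",
    sources=["splitk_gemm_ext.cu"],
    extra_include_paths=["/path/to/cutlass/include"],
    extra_cuda_cflags=["-std=c++17"],
    verbose=True,
)

m, n, k = 128, 128, 16384
A = torch.randn(m, k, device="cuda", dtype=torch.float16)
B = torch.randn(k, n, device="cuda", dtype=torch.float16)

# Hypothetical binding: how split_k_slices is passed (or hard-coded) depends on
# how you write the wrapper.
D = splitk_ext.splitk_gemm(A, B, 8)
```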
This issue has been labeled.
What is your question?
Background: A100, nvidia-cutlass 3.3.0.0
When k is much larger than m and n, I want to find a CUTLASS kernel that runs faster than cupy.dot, because cupy.dot does not reach the peak performance of the A100. My problem size is m=n=128, k=16384. I used the CUTLASS profiler to find a potentially faster kernel, but I found that there is no way to specify split_k_mode and split_k_slices in cutlass.op.Gemm.
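For context, this is roughly what I can express through the Python interface today (a sketch based on the public CUTLASS Python GEMM examples; exact keyword names may differ between versions). Nothing here corresponds to split_k_mode or split_k_slices:

```python
import numpy as np
import cutlass

m, n, k = 128, 128, 16384
A = np.random.rand(m, k).astype(np.float16)
B = np.random.rand(k, n).astype(np.float16)
C = np.zeros((m, n), dtype=np.float16)
D = np.zeros((m, n), dtype=np.float16)

# Plan and run a GEMM through the Python interface. There is currently no
# keyword for choosing a split-K mode or the number of split-K slices.
plan = cutlass.op.Gemm(
    element=np.float16,
    layout=cutlass.LayoutType.RowMajor,
    element_accumulator=np.float32,
)
plan.run(A, B, C, D)
```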