[metal] Use grid-stride loop to implement listgen kernels #682

Conversation
Thanks! Looks good to me!
Based on my CUDA experience, for block size I usually use 64. For example, on a GPU with 20 SMs, the dim would be ...
My understanding is that the number of blocks an SM supports depends on the size of that block, and #blocks * block_size is probably close to the number of cores an SM has..? Anyway, I couldn't find the corresponding hardware concepts in Metal, although I'm pretty sure it uses a very similar architecture. Let me just use ...
Sorry about the delayed reply. About max #blocks per SM, please search "Maximum number of resident blocks per multiprocessor" in https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capabilities :-)
Right, as long as the behavior under that number of threads is well-defined, we can just pick the config with the empirically fastest performance :-)
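(For concreteness, here is a minimal host-side sketch of the sizing heuristic being discussed. All the constants — 20 SMs, 16 blocks per SM — are hypothetical placeholders, not values taken from this PR:)

```cpp
// Hypothetical sizing sketch: fix a block size, derive the grid size
// from the SM count, and let the grid-stride loop absorb any mismatch
// with the element count. All constants here are illustrative.
constexpr int kBlockSize = 64;    // threads per block (the "64" above)
constexpr int kNumSMs = 20;       // e.g., a GPU with 20 SMs
constexpr int kBlocksPerSM = 16;  // assumed cap; see the CUDA docs table
const int grid_size = kNumSMs * kBlocksPerSM;    // #blocks to launch
const int num_threads = grid_size * kBlockSize;  // total threads in grid
```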
Right, I see. I saw your block size was ...

BTW, I think I found a bug in Metal's implementation. Here:

taichi/taichi/backends/metal/shaders/runtime_utils.metal.h, lines 69 to 71 in 30701fc
This can be problematic if the index is within the POT size, but goes beyond the actual size (e.g. ...). Will make a fix later.
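(A hedged sketch of the failure mode, with made-up numbers — the actual list layout in runtime_utils.metal.h may differ:)

```cpp
// Illustrative only: a list whose capacity is rounded up to a power of
// two (POT) can accept indices that are < the POT capacity but >= the
// actual number of live elements.
const int num_active = 5;    // actual size of the list
const int pot_capacity = 8;  // capacity rounded up to the next POT
const int i = 6;             // passes an (i < pot_capacity) check...
// ...but slot 6 holds no live element, so reading it yields garbage.
```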
It's true that if each thread uses too many registers, or each block uses too much shared memory, the concurrency (core occupancy) will be limited. As long as we reach full core occupancy, then #parallel warps running simultaneously on an SM * 32 = #cores per SM. Note that it could be the case that #parallel warps running simultaneously on an SM << #concurrent warps on that SM, in which case the warp scheduler decides which subset of the concurrent warps runs each clock cycle.
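(To make the arithmetic concrete — the core and warp counts below are illustrative, not tied to any particular GPU:)

```cpp
// Suppose an SM has 128 cores and a warp has 32 lanes. At full
// occupancy, 128 / 32 = 4 warps execute in parallel per clock cycle,
// even though the SM may hold many more concurrent (resident) warps,
// e.g. 48; the warp scheduler picks which 4 issue each cycle.
const int cores_per_sm = 128;
const int warp_size = 32;
const int parallel_warps = cores_per_sm / warp_size;  // = 4
const int resident_warps = 48;  // concurrent warps on the SM (example)
```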
This is actually fine. I'll explain this in greater detail in a documentation PR.
I still found one edge case in Metal, where I forgot to bound the end to be ...

Sent out #691
Also added KernelAttributes::debug_string() method, making the PR a bit noisy. These are the files that deserve attention:

- taichi/backends/metal/shaders/runtime_kernels.metal.h
- taichi/codegen/codegen_metal.cpp

Note that I just put a limit of #threads to be 1024 x 1024. Does this make sense (i.e. is there a fancier way to decide this limit)?

I used taichi_sparse.py as a rough benchmark. The FPS went from 35 to 40. I guess that's because each thread now only handles a few child slots. (Previously each thread had to handle all the slots.)

Related issue = #678
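(For readers unfamiliar with the pattern, here is a minimal sketch of a grid-stride loop in Metal Shading Language. The kernel name, buffer layout, and per-slot work are hypothetical; the actual listgen kernels live in runtime_kernels.metal.h:)

```cpp
#include <metal_stdlib>
using namespace metal;

// Hypothetical grid-stride kernel: each thread starts at its global
// index and strides by the total thread count, so a fixed-size launch
// (e.g. capped at 1024 x 1024 threads) covers any number of slots,
// with each thread handling only a few of them.
kernel void listgen_sketch(device int *slots [[buffer(0)]],
                           constant int &num_slots [[buffer(1)]],
                           const uint tid [[thread_position_in_grid]],
                           const uint grid_size [[threads_per_grid]]) {
  for (uint i = tid; i < (uint)num_slots; i += grid_size) {
    slots[i] = (int)i;  // placeholder for the real per-slot work
  }
}
```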