
[metal] Use grid-stride loop to implement listgen kernels #682

Merged
k-ye merged 4 commits into taichi-dev:master from k-ye:stride on Mar 31, 2020

Conversation

k-ye
Member

@k-ye k-ye commented Mar 30, 2020

Also added a KernelAttributes::debug_string() method, which makes the PR a bit noisy. These are the files that deserve attention:

  • taichi/backends/metal/shaders/runtime_kernels.metal.h
  • taichi/codegen/codegen_metal.cpp

Note that I just capped #threads at 1024 x 1024. Does this make sense (i.e. is there a fancier way to decide this limit)?


I used taichi_sparse.py as a rough benchmark. The FPS went from 35 to 40. I guess that's because each thread now only handles a few child slots. (Previously each thread had to handle all the slots.)
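The grid-stride pattern can be sketched as follows (illustrative host-side C++ that simulates the GPU threads with an outer loop; the names are hypothetical, not the actual Metal kernel):

```cpp
#include <cassert>
#include <vector>

// Grid-stride loop sketch: each (simulated) thread starts at its global
// id and strides by the total thread count, so when num_items exceeds
// num_threads every thread handles only a few slots instead of one
// thread handling all of them.
std::vector<int> slots_per_thread(int num_items, int num_threads) {
  std::vector<int> handled(num_threads, 0);
  for (int tid = 0; tid < num_threads; ++tid) {          // simulated threads
    for (int i = tid; i < num_items; i += num_threads) { // grid-stride loop
      ++handled[tid];                                    // process slot i
    }
  }
  return handled;
}
```

With 10 items and 4 threads, the work splits as 3/3/2/2 rather than one thread doing all 10.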

Related issue = #678


Also added KernelAttributes::debug_string() method
@k-ye k-ye requested a review from yuanming-hu March 30, 2020 13:08
Member

@yuanming-hu yuanming-hu left a comment


Thanks! Looks good to me!

@yuanming-hu
Member

Note that I just capped #threads at 1024 x 1024. Does this make sense (i.e. is there a fancier way to decide this limit)?

Based on my CUDA experience, 1024x1024 sounds a little too big to me. For the grid size, I usually query how many streaming multiprocessors (SMs) the GPU has, then multiply that number by 32. This is because in CUDA each SM can hold at most 16-32 blocks (work groups).

For block size, I usually use 64.

For example, on a GPU with 20 SM, the dim would be 640x64.
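That heuristic can be sketched as below (hedged C++; in real CUDA code `num_sms` would come from a device query such as `cudaDeviceGetAttribute` with `cudaDevAttrMultiProcessorCount` — the helper name here is made up):

```cpp
#include <cassert>

// Launch-dimension heuristic described above:
//   grid size  = #SMs * 32  (each SM holds at most ~16-32 blocks)
//   block size = 64          (a fixed, empirically chosen work-group size)
struct LaunchDims {
  int grid;
  int block;
};

LaunchDims pick_dims(int num_sms) {
  return LaunchDims{num_sms * 32, 64};
}
```

On a 20-SM GPU this yields a 640x64 launch, matching the example above.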

@k-ye
Member Author

k-ye commented Mar 31, 2020

then multiply that number by 32. This is because in CUDA each SM can hold at most 16-32 blocks (work groups).

My understanding is that the number of blocks an SM supports depends on the size of the block, and that #blocks * block_size is probably close to the number of cores an SM has?

Anyway, I couldn't find the corresponding hardware concepts in Metal, although I'm pretty sure it uses a very similar architecture. Let me just use 64 * 1024 = 65536. I feel that as long as the number is within a reasonable range, things should be fine.

@k-ye k-ye merged commit dd173df into taichi-dev:master Mar 31, 2020
@k-ye k-ye deleted the stride branch March 31, 2020 11:38
@yuanming-hu
Member

yuanming-hu commented Mar 31, 2020

My understanding is that the number of blocks an SM supports depends on the size of the block, and that #blocks * block_size is probably close to the number of cores an SM has?

Sorry about the delayed reply. About max #blocks per SM, please search "Maximum number of resident blocks per multiprocessor" in https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capabilities :-)

#blocks * block_size can be way bigger than #cores per SM (which is usually 32-96, if I remember correctly). GPUs need many blocks on a single SM so that they can run in a pipelined manner to hide instruction latency. Note that CUDA schedules at the granularity of warps (32 threads), and a block can contain multiple warps. This allows many warps to be in flight to hide latency.
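As a back-of-the-envelope sketch of that arithmetic (illustrative numbers, not queried from real hardware):

```cpp
#include <cassert>

// With 64-thread blocks and, say, 16 resident blocks per SM, an SM holds
// 16 * 2 = 32 resident warps of 32 threads each, i.e. 1024 resident
// threads -- far more than the 32-96 cores per SM mentioned above. The
// surplus warps are what the scheduler swaps among to hide latency.
int resident_warps(int block_size, int resident_blocks) {
  int warps_per_block = (block_size + 31) / 32;  // a warp is 32 threads
  return warps_per_block * resident_blocks;
}
```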

Anyway, I couldn't find the corresponding hardware concepts in Metal, although I'm pretty sure it uses a very similar architecture. Let me just use 64 * 1024 = 65536. I feel that as long as the number is within a reasonable range, things should be fine.

Right, as long as the behavior under that number of threads is well-defined, we can just pick the config with empirically fastest performance :-)

@k-ye
Member Author

k-ye commented Apr 1, 2020

Right, I see. I saw your block size was 64 and thought you intended to set it close to a warp so that an SM could support more blocks. But I guess an SM can support a fixed number of blocks, regardless of the block size? Internally, as you mentioned, it breaks a block down into warps as the basic scheduling unit... So maybe it's better to say that the number of warps that can run truly concurrently is fixed? (Maybe not, because while the warp size is 32, each thread may require a different amount of memory, registers, etc.)


BTW, I think I found a bug in Metal's implementation. Here, for dense SNodes, is_active always returns true:

if (meta.type == SNodeMeta::Root || meta.type == SNodeMeta::Dense) {
  return true;
}

This can be problematic if the index is within the POT (power-of-two) size but goes beyond the actual size (e.g. actual_size = 9, num_(pot_)slots = 16). Then during listgen it may cause an overflow, because the memory of ListManager is allocated according to the actual_size...

Will make a fix later.
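A minimal sketch of the POT rounding in question (hypothetical helper, not Taichi's actual code):

```cpp
#include <cassert>

// Dense SNode slot counts are rounded up to the next power of two (POT).
// So with actual_size = 9, num_pot_slots = 16, and indices 9..15 are
// valid POT slots that nonetheless fall outside the actual extent --
// exactly the range where an always-true is_active becomes dangerous.
int next_pot(int n) {
  int p = 1;
  while (p < n) p <<= 1;
  return p;
}
```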

@yuanming-hu
Member

Right, I see. I saw your block size was 64 and thought you intended to set it close to a warp so that an SM could support more blocks. But I guess an SM can support a fixed number of blocks, regardless of the block size? Internally, as you mentioned, it breaks a block down into warps as the basic scheduling unit... So maybe it's better to say that the number of warps that can run truly concurrently is fixed? (Maybe not, because while the warp size is 32, each thread may require a different amount of memory, registers, etc.)

It's true that if each thread uses too many registers, or a block uses too much shared memory, the concurrency (core occupancy) will be limited. As long as we reach full core occupancy, #parallel warps running simultaneously on an SM * 32 = #cores per SM. Note that it could be the case that #parallel warps running simultaneously on an SM << #concurrent warps resident on the SM, in which case the warp scheduler picks a subset of all concurrent warps to run each clock cycle.

BTW, I think I found a bug in Metal's implementation. Here, for dense SNodes, is_active always returns true:

if (meta.type == SNodeMeta::Root || meta.type == SNodeMeta::Dense) {
  return true;
}

This can be problematic if the index is within the POT (power-of-two) size but goes beyond the actual size (e.g. actual_size = 9, num_(pot_)slots = 16). Then during listgen it may cause an overflow, because the memory of ListManager is allocated according to the actual_size...

Will make a fix later.

This is actually fine. I'll explain this in greater detail in the documentation as a PR.

@k-ye
Member Author

k-ye commented Apr 1, 2020

This is actually fine. I'll explain this in greater detail in the documentation as a PR.

I still found one edge case in Metal, where I forgot to bound the end to min(begin + range, child_list->max_num_elems):

for (int ii = begin; ii < (begin + range); ++ii) {

Sent out #691
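The fix amounts to clamping each thread's loop end (sketch only; the field name follows the snippet quoted above, and whether it is the exact Metal-side identifier is an assumption):

```cpp
#include <algorithm>
#include <cassert>

// Clamp the end of a thread's grid-stride range so that the last thread
// does not walk past the list's capacity (max_num_elems). Without the
// clamp, a loop `for (ii = begin; ii < begin + range; ++ii)` can read
// past the end of the ListManager's allocation.
int clamp_end(int begin, int range, int max_num_elems) {
  return std::min(begin + range, max_num_elems);
}
```

For instance, with begin = 12, range = 8, and max_num_elems = 16, the unclamped end would be 20, while the clamped end is 16.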
