
[metal] Use grid-stride loop to implement listgen kernels #682

Merged
k-ye merged 4 commits into taichi-dev:master from k-ye:stride on Mar 31, 2020

Conversation

k-ye
Member

@k-ye k-ye commented Mar 30, 2020

Also added a KernelAttributes::debug_string() method, which makes the PR a bit noisy. These are the files that deserve attention:

  • taichi/backends/metal/shaders/runtime_kernels.metal.h
  • taichi/codegen/codegen_metal.cpp

Note that I just capped #threads at 1024 x 1024. Does this make sense (i.e. is there a fancier way to decide this limit)?


I used taichi_sparse.py as a rough benchmark. The FPS went from 35 to 40. I guess that's because each thread now only handles a few child slots. (Previously each thread had to handle all the slots.)
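The grid-stride pattern can be sketched as follows (illustrative host-side C++ that simulates the GPU threads with an outer loop; the names are hypothetical, not the actual Metal kernel):

```cpp
#include <cassert>
#include <vector>

// Grid-stride loop sketch: each (simulated) thread starts at its global
// id and strides by the total thread count, so when num_items exceeds
// num_threads every thread handles only a few slots instead of one
// thread handling all of them.
std::vector<int> slots_per_thread(int num_items, int num_threads) {
  std::vector<int> handled(num_threads, 0);
  for (int tid = 0; tid < num_threads; ++tid) {          // simulated threads
    for (int i = tid; i < num_items; i += num_threads) { // grid-stride loop
      ++handled[tid];                                    // process slot i
    }
  }
  return handled;
}
```

With 10 items and 4 threads, the work splits as 3/3/2/2 rather than one thread doing all 10.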

Related issue = #678


Also added KernelAttributes::debug_string() method
@k-ye k-ye requested a review from yuanming-hu March 30, 2020 13:08
Member

@yuanming-hu yuanming-hu left a comment


Thanks! Looks good to me!

@yuanming-hu
Member

Note that I just capped #threads at 1024 x 1024. Does this make sense (i.e. is there a fancier way to decide this limit)?

Based on my CUDA experience, 1024x1024 sounds a little too big to me. For the grid size, I usually query how many streaming multiprocessors (SMs) the GPU has, then multiply that number by 32. This is because in CUDA each SM can hold at most 16-32 blocks (work groups).

For block size, I usually use 64.

For example, on a GPU with 20 SM, the dim would be 640x64.
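That heuristic can be sketched as below (hedged C++; in real CUDA code `num_sms` would come from a device query such as `cudaDeviceGetAttribute` with `cudaDevAttrMultiProcessorCount` — the helper name here is made up):

```cpp
#include <cassert>

// Launch-dimension heuristic described above:
//   grid size  = #SMs * 32  (each SM holds at most ~16-32 blocks)
//   block size = 64          (a fixed, empirically chosen work-group size)
struct LaunchDims {
  int grid;
  int block;
};

LaunchDims pick_dims(int num_sms) {
  return LaunchDims{num_sms * 32, 64};
}
```

On a 20-SM GPU this yields a 640x64 launch, matching the example above.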

@k-ye
Member Author

k-ye commented Mar 31, 2020

then multiply that number by 32. This is because in CUDA each SM can hold at most 16-32 blocks (work groups).

My understanding is that the number of blocks an SM supports depends on the size of the block, and that #blocks * block_size is probably close to the number of cores an SM has?

Anyway, I couldn't find the corresponding hardware concepts in Metal, although I'm pretty sure it uses a very similar architecture. Let me just use 64 * 1024 = 65536. I feel that as long as the number is within a reasonable range, things should be fine.

@k-ye k-ye merged commit dd173df into taichi-dev:master Mar 31, 2020
@k-ye k-ye deleted the stride branch March 31, 2020 11:38
@yuanming-hu
Member

yuanming-hu commented Mar 31, 2020

My understanding is that the number of blocks an SM supports depends on the size of the block, and that #blocks * block_size is probably close to the number of cores an SM has?

Sorry about the delayed reply. About max #blocks per SM, please search "Maximum number of resident blocks per multiprocessor" in https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capabilities :-)

#blocks * block_size can be way bigger than #cores per SM (which is usually 32-96, if I remember correctly). GPUs need many blocks on a single SM so that they can run in a pipelined manner to hide instruction latency. Note that CUDA schedules at the granularity of warps (32 threads), and a block can contain multiple warps. This allows many warps to be in flight to hide latency.
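As a back-of-the-envelope sketch of that arithmetic (illustrative numbers, not queried from real hardware):

```cpp
#include <cassert>

// With 64-thread blocks and, say, 16 resident blocks per SM, an SM holds
// 16 * 2 = 32 resident warps of 32 threads each, i.e. 1024 resident
// threads -- far more than the 32-96 cores per SM mentioned above. The
// surplus warps are what the scheduler swaps among to hide latency.
int resident_warps(int block_size, int resident_blocks) {
  int warps_per_block = (block_size + 31) / 32;  // a warp is 32 threads
  return warps_per_block * resident_blocks;
}
```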

Anyway, I couldn't find the corresponding hardware concepts in Metal, although I'm pretty sure it uses a very similar architecture. Let me just use 64 * 1024 = 65536. I feel that as long as the number is within a reasonable range, things should be fine.

Right, as long as the behavior under that number of threads is well-defined, we can just pick the config with empirically fastest performance :-)

@k-ye
Member Author

k-ye commented Apr 1, 2020

Right, I see. I saw your block size was 64 and thought you intended to set it close to a warp so that an SM could support more blocks. But I guess an SM can support a fixed number of blocks, regardless of the block size? Internally, as you mentioned, it breaks a block down into warps as the basic scheduling unit... So maybe it's better to say that the number of warps that can run truly concurrently is fixed? (Maybe not, because while the warp size is 32, each thread may require a different amount of memory, registers, etc.)


BTW, I think I found a bug in Metal's implementation. Here, for dense SNodes, is_active always returns true:

if (meta.type == SNodeMeta::Root || meta.type == SNodeMeta::Dense) {
  return true;
}

This can be problematic if the index is within the POT (power-of-two) size but goes beyond the actual size (e.g. actual_size = 9, num_(pot_)slots = 16). Then during listgen it may cause an overflow, because the memory of ListManager is allocated according to the actual_size...

Will make a fix later.
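A minimal sketch of the POT rounding in question (hypothetical helper, not Taichi's actual code):

```cpp
#include <cassert>

// Dense SNode slot counts are rounded up to the next power of two (POT).
// So with actual_size = 9, num_pot_slots = 16, and indices 9..15 are
// valid POT slots that nonetheless fall outside the actual extent --
// exactly the range where an always-true is_active becomes dangerous.
int next_pot(int n) {
  int p = 1;
  while (p < n) p <<= 1;
  return p;
}
```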

@yuanming-hu
Member

Right, I see. I saw your block size was 64 and thought you intended to set it close to a warp so that an SM could support more blocks. But I guess an SM can support a fixed number of blocks, regardless of the block size? Internally, as you mentioned, it breaks a block down into warps as the basic scheduling unit... So maybe it's better to say that the number of warps that can run truly concurrently is fixed? (Maybe not, because while the warp size is 32, each thread may require a different amount of memory, registers, etc.)

It's true that if each thread uses too many registers, or a block uses too much shared memory, the concurrency (core occupancy) will be limited. As long as we reach full core occupancy, #parallel warps running simultaneously on an SM * 32 = #cores per SM. Note that it could be the case that #parallel warps running simultaneously on an SM << #concurrent warps resident on the SM, in which case the warp scheduler picks a subset of all concurrent warps to run each clock cycle.

BTW, I think I found a bug in Metal's implementation. Here, for dense SNodes, is_active always returns true:

if (meta.type == SNodeMeta::Root || meta.type == SNodeMeta::Dense) {
  return true;
}

This can be problematic if the index is within the POT (power-of-two) size but goes beyond the actual size (e.g. actual_size = 9, num_(pot_)slots = 16). Then during listgen it may cause an overflow, because the memory of ListManager is allocated according to the actual_size...

Will make a fix later.

This is actually fine. I'll explain this in greater detail in the documentation as a PR.

@k-ye
Member Author

k-ye commented Apr 1, 2020

This is actually fine. I'll explain this in greater detail in the documentation as a PR.

I still found one edge case in Metal, where I forgot to bound the end to min(begin + range, child_list->max_num_elems):

for (int ii = begin; ii < (begin + range); ++ii) {

Sent out #691
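The fix amounts to clamping each thread's loop end (sketch only; the field name follows the snippet quoted above, and whether it is the exact Metal-side identifier is an assumption):

```cpp
#include <algorithm>
#include <cassert>

// Clamp the end of a thread's grid-stride range so that the last thread
// does not walk past the list's capacity (max_num_elems). Without the
// clamp, a loop `for (ii = begin; ii < begin + range; ++ii)` can read
// past the end of the ListManager's allocation.
int clamp_end(int begin, int range, int max_num_elems) {
  return std::min(begin + range, max_num_elems);
}
```

For instance, with begin = 12, range = 8, and max_num_elems = 16, the unclamped end would be 20, while the clamped end is 16.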
