[Perf] [metal] Support TLS and SIMD group reduction for range-for kernels #1358

Merged
k-ye merged 3 commits into taichi-dev:master from the tls branch on Jul 1, 2020

Conversation

@k-ye (Member) commented on Jun 29, 2020

  • misc/benchmark_reduction.py: no perf difference using i32, so I switched to f32 and had to reduce the tensor size from 1024*1024*1024 (4 GB no longer fits into i32...) to 1024*1024*16. The reduction duration went from ~7 s to ~0.2 s (see the sketch after this list for the kind of reduction being benchmarked).
  • mpm_lagrangian_force.py got about a 4x improvement...
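For context, the benchmark boils down to a global sum over an f32 field using a range-for kernel. Below is a minimal sketch of that kind of reduction; this is not the exact misc/benchmark_reduction.py script, and the field names and the ti.field API used here are assumptions for illustration.

import taichi as ti

ti.init(arch=ti.metal)

# Field size from the f32 setup described above; names are illustrative only.
N = 1024 * 1024 * 16
a = ti.field(dtype=ti.f32, shape=N)
total = ti.field(dtype=ti.f32, shape=())

@ti.kernel
def reduce():
    for i in range(N):            # a range-for kernel, the target of this PR
        total[None] += a[i]       # without TLS/SIMD, each += becomes a global atomic add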

This PR isn't small, but a big part of it comes from the fact that we cannot just enable SIMD groups by default: they might not be available on OS X <= 10.14.

Apple claims that SIMD groups are supported in compute kernels since Metal Shading Language 2.0 (released with OS X 10.13), but my test showed that we have to compile at MSL >= 2.1 (https://developer.apple.com/documentation/metal/mtllanguageversion/version2_1).
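In other words, the backend has to gate SIMD group usage on the MSL version it compiles against instead of enabling it unconditionally. A rough sketch of that decision in Python, for illustration only (the actual backend is C++ and passes an enable_simdgroup flag around, as discussed below; the function name here is made up):

# Hedged sketch of the gating described above; not the actual Taichi Metal backend API.
def can_use_simdgroup(msl_version):
    # Apple documents SIMD group support since MSL 2.0, but in practice the
    # [[thread_index_in_simdgroup]] attribute only compiled at MSL >= 2.1.
    return msl_version >= (2, 1)

assert not can_use_simdgroup((2, 0))   # MSL 2.0 -> fall back to plain atomics
assert can_use_simdgroup((2, 1))       # MSL 2.1 -> emit simd_sum + per-simdgroup atomic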


Example:

  • SIMD on
kernel void mtl_k0001_reduce_c6_0_0(
    device byte* root_addr [[buffer(0)]],
    device byte* global_tmps_addr [[buffer(1)]],
    device byte* runtime_addr [[buffer(2)]],
    device byte* print_addr [[buffer(3)]],
    const uint ugrid_size_ [[threads_per_grid]],
    const uint utid_in_simdg_ [[thread_index_in_simdgroup]],  // <-- available since MSL 2.1
    const uint utid_ [[thread_position_in_grid]]) {
  // range_for, range known at compile time
  const int total_elems = 1048576;
  const int begin_ = utid_ + 0;
  const int end_ = total_elems + 0;
  // TLS prologue
  int32_t tls_bufi32_[1];
  thread char* tls_buffer_ = reinterpret_cast<thread char*>(tls_bufi32_);
    thread auto* tmp65 = reinterpret_cast<thread float*>(tls_buffer_ + 0);
    constexpr float tmp66 = 0.0;
    *tmp65 = tmp66;
  for (int ii = begin_; ii < end_; ii += ugrid_size_) {
    mtl_k0001_reduce_c6_0_0_func(root_addr, global_tmps_addr, runtime_addr, print_addr, tls_buffer_, ii);
  }
  {  // TLS epilogue
    thread auto* tmp69 = reinterpret_cast<thread float*>(tls_buffer_ + 0);
    float tmp70 = *tmp69;
    S0 tmp83(root_addr);
    constexpr int32_t tmp99 = 0;
    S0_ch tmp85 = tmp83.children(tmp99);
    S3 tmp86 = tmp85.get1();
    S3_ch tmp88 = tmp86.children(tmp99);
    device float* tmp89 = tmp88.get0().val;
    const auto tmp70_simd_val_ = simd_sum(tmp70);  // <-- SIMD reduction
    if (utid_in_simdg_ == 0) {
      const float tmp72 = fatomic_fetch_add(tmp89, tmp70_simd_val_);
    }
  }
}
  • SIMD off
kernel void mtl_k0001_reduce_c6_0_0(
    device byte* root_addr [[buffer(0)]],
    device byte* global_tmps_addr [[buffer(1)]],
    device byte* runtime_addr [[buffer(2)]],
    device byte* print_addr [[buffer(3)]],
    const uint ugrid_size_ [[threads_per_grid]],
    const uint utid_ [[thread_position_in_grid]]) {
  // range_for, range known at compile time
  const int total_elems = 1048576;
  const int begin_ = utid_ + 0;
  const int end_ = total_elems + 0;
  // TLS prologue
  int32_t tls_bufi32_[1];
  thread char* tls_buffer_ = reinterpret_cast<thread char*>(tls_bufi32_);
    thread auto* tmp65 = reinterpret_cast<thread float*>(tls_buffer_ + 0);
    constexpr float tmp66 = 0.0;
    *tmp65 = tmp66;
  for (int ii = begin_; ii < end_; ii += ugrid_size_) {
    mtl_k0001_reduce_c6_0_0_func(root_addr, global_tmps_addr, runtime_addr, print_addr, tls_buffer_, ii);
  }
  {  // TLS epilogue
    thread auto* tmp69 = reinterpret_cast<thread float*>(tls_buffer_ + 0);
    float tmp70 = *tmp69;
    S0 tmp83(root_addr);
    constexpr int32_t tmp99 = 0;
    S0_ch tmp85 = tmp83.children(tmp99);
    S3 tmp86 = tmp85.get1();
    S3_ch tmp88 = tmp86.children(tmp99);
    device float* tmp89 = tmp88.get0().val;
    const float tmp72 = fatomic_fetch_add(tmp89, tmp70);
  }
}
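To spell out the difference between the two epilogues: with SIMD off, every thread issues its own fatomic_fetch_add; with SIMD on, simd_sum combines the partial sums within one SIMD group and only the thread with utid_in_simdg_ == 0 touches global memory, so the number of atomics shrinks by roughly the SIMD group width. A conceptual Python model of the two epilogues follows; it is illustrative only, not generated code, and it assumes a 32-wide SIMD group (typical for Apple GPUs, not guaranteed on all hardware).

SIMD_WIDTH = 32  # assumed SIMD group width

def epilogue_simd_off(tls_partials):
    # One fatomic_fetch_add per thread.
    return sum(tls_partials), len(tls_partials)             # (result, #atomic adds)

def epilogue_simd_on(tls_partials):
    # simd_sum within each SIMD group, then a single atomic add from lane 0.
    result, atomics = 0.0, 0
    for g in range(0, len(tls_partials), SIMD_WIDTH):
        result += sum(tls_partials[g:g + SIMD_WIDTH])       # lane 0's atomic add
        atomics += 1
    return result, atomics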

Related issue = #576



@codecov (bot) commented on Jun 29, 2020

Codecov Report

Merging #1358 into master will not change coverage.
The diff coverage is n/a.


@@           Coverage Diff           @@
##           master    #1358   +/-   ##
=======================================
  Coverage   85.57%   85.57%           
=======================================
  Files          19       19           
  Lines        3368     3368           
  Branches      623      623           
=======================================
  Hits         2882     2882           
  Misses        356      356           
  Partials      130      130           


@yuanming-hu (Member) left a comment

Awesome! Btw, do you observe any performance difference when you turn on/off SIMD reduction in the epilogue?

@k-ye (Member, Author) commented on Jun 30, 2020

Btw, do you observe any performance difference when you turn on/off SIMD reduction in the epilogue?

Ah yep. The numbers I gave were basically comparing SIMD on vs. off. TBH, TLS itself didn't help much here, since it seems I haven't implemented grid-strided loops for range-for loops yet...

@yuanming-hu (Member) commented

Btw, do you observe any performance difference when you turn on/off SIMD reduction in the epilogue?

Ah yep. The numbers I gave were basically comparing SIMD on vs. off. TBH, TLS itself didn't help much here, since it seems I haven't implemented grid-strided loops for range-for loops yet...

Ah, I see. I thought the numbers were no TLS vs. TLS. It's interesting to see that SIMD gives such a big improvement; on CUDA the improvement was small. I guess we will need both: if your grid size is small, grid-strided loops can significantly reduce the number of atomics needed.
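As a back-of-the-envelope illustration of that point (a sketch, not generated code; the 65536-thread cap is the number tried in the benchmark later in this thread, and the 32-wide SIMD group is an assumption): with a grid-stride loop each thread covers several iterations, accumulates them in TLS, and contributes only one partial sum at the end, so capping the grid size directly caps the number of atomics.

def atomics_needed(total_elems, num_threads, simd_width=32, simd_on=True):
    # TLS: one partial sum per thread, no matter how many iterations it strides over.
    per_thread = min(num_threads, total_elems)
    # SIMD reduction: only lane 0 of each SIMD group issues the atomic add.
    return per_thread // simd_width if simd_on else per_thread

total = 1024 * 1024 * 16
print(atomics_needed(total, num_threads=total))      # 524288: one thread per element
print(atomics_needed(total, num_threads=65536))      # 2048: capped grid + grid-stride + TLS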

@k-ye (Member, Author) commented on Jun 30, 2020

Good point! I will try switching to that later... (Also, I probably still need some time to clean up this PR, to simplify how enable_simdgroup is passed around...)

@k-ye merged commit 6100ee2 into taichi-dev:master on Jul 1, 2020
@k-ye deleted the tls branch on Jul 1, 2020
@k-ye mentioned this pull request on Jul 2, 2020
@k-ye (Member, Author) commented on Jul 2, 2020

@yuanming-hu FYI, I ran benchmark_reduction.py again with the number of threads capped at 65536. Even with SIMD off, one reduce() step took about 0.17 s; with SIMD on, it was about 0.023 s... Note that both numbers exclude the first run, which has to do the JIT compilation...

@FantasyVR mentioned this pull request on Jul 4, 2020