[Perf] [metal] Support TLS and SIMD group reduction for range-for kernels #1358

Merged
k-ye merged 3 commits into taichi-dev:master from the tls branch on Jul 1, 2020

Conversation

@k-ye (Member) commented on Jun 29, 2020

  • misc/benchmark_reduction.py: no perf difference using i32, so I switched to f32 and had to reduce the tensor size from 1024*1024*1024 (4 GB no longer fits into i32...) to 1024*1024*16. The reduction duration went from ~7 s to ~0.2 s (see the sketch after this list for the kind of reduction being benchmarked).
  • mpm_lagrangian_force.py got about a 4x improvement...
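For context, the benchmark boils down to a global sum over an f32 field using a range-for kernel. Below is a minimal sketch of that kind of reduction; this is not the exact misc/benchmark_reduction.py script, and the field names and the ti.field API used here are assumptions for illustration.

import taichi as ti

ti.init(arch=ti.metal)

# Field size from the f32 setup described above; names are illustrative only.
N = 1024 * 1024 * 16
a = ti.field(dtype=ti.f32, shape=N)
total = ti.field(dtype=ti.f32, shape=())

@ti.kernel
def reduce():
    for i in range(N):            # a range-for kernel, the target of this PR
        total[None] += a[i]       # without TLS/SIMD, each += becomes a global atomic add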

This PR isn't small, but a big part of it comes from the fact that we cannot just enable SIMD groups by default: they might not be available on OS X <= 10.14.

Apple claims that SIMD groups are supported in compute kernels since Metal Shading Language 2.0 (released with OS X 10.13), but my test showed that we have to compile at MSL >= 2.1 (https://developer.apple.com/documentation/metal/mtllanguageversion/version2_1).
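In other words, the backend has to gate SIMD group usage on the MSL version it compiles against instead of enabling it unconditionally. A rough sketch of that decision in Python, for illustration only (the actual backend is C++ and passes an enable_simdgroup flag around, as discussed below; the function name here is made up):

# Hedged sketch of the gating described above; not the actual Taichi Metal backend API.
def can_use_simdgroup(msl_version):
    # Apple documents SIMD group support since MSL 2.0, but in practice the
    # [[thread_index_in_simdgroup]] attribute only compiled at MSL >= 2.1.
    return msl_version >= (2, 1)

assert not can_use_simdgroup((2, 0))   # MSL 2.0 -> fall back to plain atomics
assert can_use_simdgroup((2, 1))       # MSL 2.1 -> emit simd_sum + per-simdgroup atomic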


Example:

  • SIMD on
kernel void mtl_k0001_reduce_c6_0_0(
    device byte* root_addr [[buffer(0)]],
    device byte* global_tmps_addr [[buffer(1)]],
    device byte* runtime_addr [[buffer(2)]],
    device byte* print_addr [[buffer(3)]],
    const uint ugrid_size_ [[threads_per_grid]],
    const uint utid_in_simdg_ [[thread_index_in_simdgroup]],  // <-- available since MSL 2.1
    const uint utid_ [[thread_position_in_grid]]) {
  // range_for, range known at compile time
  const int total_elems = 1048576;
  const int begin_ = utid_ + 0;
  const int end_ = total_elems + 0;
  // TLS prologue
  int32_t tls_bufi32_[1];
  thread char* tls_buffer_ = reinterpret_cast<thread char*>(tls_bufi32_);
    thread auto* tmp65 = reinterpret_cast<thread float*>(tls_buffer_ + 0);
    constexpr float tmp66 = 0.0;
    *tmp65 = tmp66;
  for (int ii = begin_; ii < end_; ii += ugrid_size_) {
    mtl_k0001_reduce_c6_0_0_func(root_addr, global_tmps_addr, runtime_addr, print_addr, tls_buffer_, ii);
  }
  {  // TLS epilogue
    thread auto* tmp69 = reinterpret_cast<thread float*>(tls_buffer_ + 0);
    float tmp70 = *tmp69;
    S0 tmp83(root_addr);
    constexpr int32_t tmp99 = 0;
    S0_ch tmp85 = tmp83.children(tmp99);
    S3 tmp86 = tmp85.get1();
    S3_ch tmp88 = tmp86.children(tmp99);
    device float* tmp89 = tmp88.get0().val;
    const auto tmp70_simd_val_ = simd_sum(tmp70);  // <-- SIMD reduction
    if (utid_in_simdg_ == 0) {
      const float tmp72 = fatomic_fetch_add(tmp89, tmp70_simd_val_);
    }
  }
}
  • SIMD off
kernel void mtl_k0001_reduce_c6_0_0(
    device byte* root_addr [[buffer(0)]],
    device byte* global_tmps_addr [[buffer(1)]],
    device byte* runtime_addr [[buffer(2)]],
    device byte* print_addr [[buffer(3)]],
    const uint ugrid_size_ [[threads_per_grid]],
    const uint utid_ [[thread_position_in_grid]]) {
  // range_for, range known at compile time
  const int total_elems = 1048576;
  const int begin_ = utid_ + 0;
  const int end_ = total_elems + 0;
  // TLS prologue
  int32_t tls_bufi32_[1];
  thread char* tls_buffer_ = reinterpret_cast<thread char*>(tls_bufi32_);
    thread auto* tmp65 = reinterpret_cast<thread float*>(tls_buffer_ + 0);
    constexpr float tmp66 = 0.0;
    *tmp65 = tmp66;
  for (int ii = begin_; ii < end_; ii += ugrid_size_) {
    mtl_k0001_reduce_c6_0_0_func(root_addr, global_tmps_addr, runtime_addr, print_addr, tls_buffer_, ii);
  }
  {  // TLS epilogue
    thread auto* tmp69 = reinterpret_cast<thread float*>(tls_buffer_ + 0);
    float tmp70 = *tmp69;
    S0 tmp83(root_addr);
    constexpr int32_t tmp99 = 0;
    S0_ch tmp85 = tmp83.children(tmp99);
    S3 tmp86 = tmp85.get1();
    S3_ch tmp88 = tmp86.children(tmp99);
    device float* tmp89 = tmp88.get0().val;
    const float tmp72 = fatomic_fetch_add(tmp89, tmp70);
  }
}
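To spell out the difference between the two epilogues: with SIMD off, every thread issues its own fatomic_fetch_add; with SIMD on, simd_sum combines the partial sums within one SIMD group and only the thread with utid_in_simdg_ == 0 touches global memory, so the number of atomics shrinks by roughly the SIMD group width. A conceptual Python model of the two epilogues follows; it is illustrative only, not generated code, and it assumes a 32-wide SIMD group (typical for Apple GPUs, not guaranteed on all hardware).

SIMD_WIDTH = 32  # assumed SIMD group width

def epilogue_simd_off(tls_partials):
    # One fatomic_fetch_add per thread.
    return sum(tls_partials), len(tls_partials)             # (result, #atomic adds)

def epilogue_simd_on(tls_partials):
    # simd_sum within each SIMD group, then a single atomic add from lane 0.
    result, atomics = 0.0, 0
    for g in range(0, len(tls_partials), SIMD_WIDTH):
        result += sum(tls_partials[g:g + SIMD_WIDTH])       # lane 0's atomic add
        atomics += 1
    return result, atomics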

Related issue = #576



@codecov (bot) commented on Jun 29, 2020

Codecov Report

Merging #1358 into master will not change coverage.
The diff coverage is n/a.


@@           Coverage Diff           @@
##           master    #1358   +/-   ##
=======================================
  Coverage   85.57%   85.57%           
=======================================
  Files          19       19           
  Lines        3368     3368           
  Branches      623      623           
=======================================
  Hits         2882     2882           
  Misses        356      356           
  Partials      130      130           


@yuanming-hu (Member) left a comment

Awesome! Btw, do you observe any performance difference when you turn on/off SIMD reduction in the epilogue?

@k-ye (Member, Author) commented on Jun 30, 2020

Btw, do you observe any performance difference when you turn on/off SIMD reduction in the epilogue?

Ah yep. The numbers I gave were basically comparing SIMD on vs. off. TBH, TLS itself didn't help much here, since it seems I haven't implemented grid-strided loops for range-for loops yet...

@yuanming-hu (Member) commented

Btw, do you observe any performance difference when you turn on/off SIMD reduction in the epilogue?

Ah yep. The numbers I gave were basically comparing SIMD on vs. off. TBH, TLS itself didn't help much here, since it seems I haven't implemented grid-strided loops for range-for loops yet...

Ah, I see. I thought the numbers were no TLS vs. TLS. It's interesting to see that SIMD gives such a big improvement; on CUDA the improvement was small. I guess we will need both: if your grid size is small, grid-strided loops can significantly reduce the number of atomics needed.
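As a back-of-the-envelope illustration of that point (a sketch, not generated code; the 65536-thread cap is the number tried in the benchmark later in this thread, and the 32-wide SIMD group is an assumption): with a grid-stride loop each thread covers several iterations, accumulates them in TLS, and contributes only one partial sum at the end, so capping the grid size directly caps the number of atomics.

def atomics_needed(total_elems, num_threads, simd_width=32, simd_on=True):
    # TLS: one partial sum per thread, no matter how many iterations it strides over.
    per_thread = min(num_threads, total_elems)
    # SIMD reduction: only lane 0 of each SIMD group issues the atomic add.
    return per_thread // simd_width if simd_on else per_thread

total = 1024 * 1024 * 16
print(atomics_needed(total, num_threads=total))      # 524288: one thread per element
print(atomics_needed(total, num_threads=65536))      # 2048: capped grid + grid-stride + TLS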

@k-ye (Member, Author) commented on Jun 30, 2020

Good point! I will try switching to that later... (Also, I probably still need some time to clean up this PR, to simplify how enable_simdgroup is passed around...)

@k-ye merged commit 6100ee2 into taichi-dev:master on Jul 1, 2020
@k-ye deleted the tls branch on Jul 1, 2020
@k-ye mentioned this pull request on Jul 2, 2020
@k-ye (Member, Author) commented on Jul 2, 2020

@yuanming-hu FYI, I ran benchmark_reduction.py again with the number of threads capped at 65536. Even with SIMD off, one reduce() step took about 0.17 s; with SIMD on, it was about 0.023 s... Note that both numbers exclude the first run, which has to do the JIT compilation...

@FantasyVR mentioned this pull request on Jul 4, 2020