[Metal] Support fast_math, preps for saturating_grid_dim #1443

Merged: 2 commits into taichi-dev:master from the fm branch, Jul 14, 2020

Conversation

@k-ye (Member) commented Jul 9, 2020

  • fast_math helped mostly with compute-bound tasks. E.g.
    • sdf_renderer: 25 -> 43 sps
    • cornell_box: 95 -> ~130 sps
  • As you can tell, enabling fast_math broke a few tests. Given that these tests were doing a 3.0 / 3.0, I'm not too surprised by that, but I had to turn off fast_math for them.
  • @yuanming-hu IIUC, saturating_grid_dim can be used to limit the total number of threads? I'd like to do that once users can actually set it in ti.init().

Related issue = #935



codecov bot commented Jul 9, 2020

Codecov Report

Merging #1443 into master will increase coverage by 0.27%.
The diff coverage is n/a.


@@            Coverage Diff             @@
##           master    #1443      +/-   ##
==========================================
+ Coverage   66.41%   66.69%   +0.27%     
==========================================
  Files          38       38              
  Lines        5291     5305      +14     
  Branches      951      948       -3     
==========================================
+ Hits         3514     3538      +24     
+ Misses       1613     1607       -6     
+ Partials      164      160       -4     
Impacted Files Coverage Δ
python/taichi/misc/gui.py 25.00% <0.00%> (-0.23%) ⬇️
python/taichi/core/util.py 21.72% <0.00%> (+0.53%) ⬆️
python/taichi/lang/__init__.py 78.66% <0.00%> (+3.66%) ⬆️


@k-ye (Member Author) commented Jul 9, 2020

FYI, I think the ring artifacts were also due to fast math. Now it looks like this with max_ray_depth=1...

[Screenshot: rendering output with max_ray_depth=1, 2020-07-09]

@archibate (Collaborator) left a comment

LGTM + 1 question + 1 nit.

@@ -10,7 +10,7 @@ def _c_mod(a, b):

@pytest.mark.parametrize('lhs_is_mat,rhs_is_mat', [(True, True), (True, False),
(False, True)])
-@ti.all_archs
+@ti.all_archs_with(fast_math=False)
@archibate (Collaborator):

Does pytest.approx with larger rel= and abs= solve your issue?

@k-ye (Member Author), Jul 10, 2020:

Nope, it failed at //, which translates to floor(y / z). y[1][1] and z[1][1] are both 3, so I guess the division becomes something like 0.99999, which gets floored to 0.0 instead of 1.0. I just tested by setting y to a slightly higher value, e.g. 3.0001, and it worked. But I think this kind of rounding error is expected when enabling fast math, so it's not a big deal? Maybe we should also disable fast math in tests by default.
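To make the failure mode concrete, here is a minimal C++ sketch of the arithmetic involved (illustrative only; the helper name and values are hypothetical, not Taichi code). The floor-division lowers to floor(y / z), and fast math may replace the division with a cheaper approximation whose result can land just below the exact quotient:

#include <cmath>
#include <cstdio>

// Hypothetical helper mirroring how `y // z` behaves: floor(y / z).
float floor_div(float y, float z) {
  return std::floor(y / z);
}

int main() {
  // With exact IEEE division, 3.0f / 3.0f == 1.0f, so this prints 1.0.
  // Under fast math the compiler may emit y * approx_reciprocal(z) instead,
  // and a quotient like 0.99999f then floors to 0.0f, which is what broke
  // the tests.
  std::printf("%f\n", floor_div(3.0f, 3.0f));
  return 0;
}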

Comment on lines -136 to -145
int num_threads_per_group = 0;
// Sometimes it is helpful to limit the maximum GPU block dim for the
// kernels. E.g., when you are generating iPhone shaders on a Mac.
const int prescribed_block_dim =
    (std::size_t)get_current_program().config.max_block_dim;
if (prescribed_block_dim != 0) {
  num_threads_per_group = std::min(native_block_dim, prescribed_block_dim);
} else {
  num_threads_per_group = native_block_dim;
}
@archibate (Collaborator):

OFT? (i.e., is this change off-topic for this PR?)

@k-ye (Member Author):

Not really :) These are the necessary changes for supporting saturating_grid_dim.

int msl_version) {
auto source_str = mac::wrap_string_as_ns_string(source);

id options = clscall("MTLCompileOptions", "alloc");
options = call(options, "init");
auto options_cleanup = wrap_as_nsobj_unique_ptr(options);
call(options, "setFastMathEnabled:", false);
call(options, "setFastMathEnabled:", fast_math);
@archibate (Collaborator):

Cool! Not sure how fast_math works. IIUC, is this the same as specifying precision mediump float; in OpenGL?

@k-ye (Member Author):

They are both for optimizing performance, but take different approaches. From what I can tell, precision mediump float reduces the number of bits used to represent a float; Metal has a similar concept, half (16 bits). Fast math, on the other hand, reduces the instructions in the computation and produces an approximate result.
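A rough sketch of the distinction, in Metal-Shading-Language-style C++ (illustrative only; the function names are made up): half shrinks the representation itself, while fast math keeps 32-bit floats but lets the compiler use approximate instruction sequences.

// Reduced-precision storage, analogous to GLSL's `precision mediump float;`:
// every value and operation here is 16-bit.
half scale_storage(half a, half b) {
  return a * b;
}

// Fast math: values stay 32-bit, but with setFastMathEnabled:/-ffast-math the
// compiler may rewrite the division as a multiply by an approximate
// reciprocal, reassociate expressions, and assume no NaN/Inf, so the result
// can differ slightly from the IEEE-exact one.
float scale_instructions(float a, float b) {
  return a / b;
}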

@k-ye k-ye requested a review from archibate July 10, 2020 11:42
@yuanming-hu (Member) left a comment

Cool!

@yuanming-hu IIUC, saturating_grid_dim can be used to limit the total number of threads? I'd like to do that once users can actually set it in ti.init().

That's used to limit the maximum number of groups (blocks on CUDA). On CUDA if we use grid-stride loops, there's no need to allocate more than 16/32 blocks per streaming multiprocessor. Otherwise the blocks will be queued to bind to available SMs.

config.saturating_grid_dim = num_SMs * 32;

Total number of threads is grid_dim x block_dim. Maybe that's a slightly different metric? :-)
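For context, a minimal grid-stride loop looks like the CUDA C++ sketch below (illustrative only; the kernel and variable names are made up). Because each thread strides over the whole range, capping the grid at roughly num_SMs * 32 blocks keeps every SM busy without queuing excess blocks:

__global__ void scale(int n, float a, float *x) {
  // Each thread handles indices i, i + stride, i + 2*stride, ...
  const int stride = gridDim.x * blockDim.x;
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
    x[i] *= a;
  }
}

// Launch with a saturated grid, e.g.:
//   scale<<<num_SMs * 32, block_dim>>>(n, a, x);
// grid_dim * block_dim threads cover any n, however large.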

FYI, I think the ring artifacts were also due to fast math. Now it looks like this with max_ray_depth=1...

We might need to take a closer look at this later - maybe using a larger eps or something. Note that although a larger max_ray_depth may seemingly get rid of this artifact, it can still affect convergence/accuracy.

@k-ye (Member Author) commented Jul 10, 2020

That's used to limit the maximum number of groups (blocks on CUDA). On CUDA if we use grid-stride loops, there's no need to allocate more than 16/32 blocks per streaming multiprocessor. Otherwise the blocks will be queued to bind to available SMs.

Ah I see. It seems like users aren't given the option to turn off grid-stride loops? That is, if saturating_grid_dim = 0, then it gets set to num_SMs * 32... For some reason, sdf_renderer.py does run faster without grid-stride loops, so I guess the default params would work a bit differently on Metal.

Note that although a larger max_ray_depth may seemingly get rid of this artifact, it can still affect convergence/accuracy.

Ack. But if I turn off fast math and still use max_ray_depth=1, the artifacts are gone (maybe not completely, I still seem to see a round shape when zoomed in...). See below:

[Screenshot: same scene with fast math off and max_ray_depth=1, 2020-07-11]

@yuanming-hu (Member):

Ah I see. It seems like users aren't given the option to turn off grid-stride loops? That is, if saturating_grid_dim = 0, then it gets set to num_SMs * 32... For some reason, sdf_renderer.py does run faster without grid-stride loops, so I guess the default params would work a bit differently on Metal.

It's interesting that sdf_renderer.py runs faster without grid-stride loops. Maybe not using them gives you better load balancing? I'm not sure what happens if CUDA doesn't use grid-stride loops :-) The other motivation for grid-stride loops is that when the bounds of the loop depend on the previous kernel, they let CUDA launch the kernel without waiting for the previous kernel to finish, since you don't need to know how many iterations are needed at launch time.

Feel free to do any specialization for Metal if that's beneficial to performance.
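A small variant of the same sketch (again CUDA C++, hypothetical names) showing the point above about not waiting for the previous kernel: the loop bound is read from device memory inside the kernel, so the launch does not depend on the producer's result being visible on the host.

__global__ void consume(const int *n_ptr, float *data) {
  // n was written by a previous kernel; the host never needs to read it
  // back before launching this one.
  const int n = *n_ptr;
  const int stride = gridDim.x * blockDim.x;
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
    data[i] *= 2.0f;
  }
}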

Ack. But if I turn off fast math and still use max_ray_depth=1, the artifacts are gone (maybe not completely, I still seem to see a round shape when zoomed in...). See below:

Interesting... I guess it's something related to numerical precision. The ring artifacts sometimes happen when you have sqrt or 1/sqrt function calls that need to be high-precision.

@k-ye (Member Author) commented Jul 11, 2020

grid-stride loops let CUDA launch the kernel without waiting for the previous kernel to finish

Yep, that I remember and have done for Metal as well :)

emit("// range_for, range known at runtime");
begin_expr = stmt->const_begin
? std::to_string(stmt->begin_value)
: inject_load_global_tmp(stmt->begin_offset);
const auto end_expr = stmt->const_end
? std::to_string(stmt->end_value)
: inject_load_global_tmp(stmt->end_offset);
emit("const int {} = {} - {};", total_elems_name, end_expr, begin_expr);
ka.num_threads = kMaxNumThreadsGridStrideLoop;


I still feel like fast math broke some example on Mac, but I couldn't find it now. Let me try a few more before merging this.

@archibate (Collaborator) left a comment

Thanks! I didn't go through it very carefully, but given @yuanming-hu's approval, let's merge this for now so it doesn't conflict with #1474.

@k-ye (Member Author) commented Jul 12, 2020

Thanks! I didn't go through it very carefully, but given @yuanming-hu's approval, let's merge this for now so it doesn't conflict with #1474.

Thanks! I've been working on #1480 this whole weekend and didn't get the time to re-check this. Let me verify if there's any example broken by fast math in the next few days.


Tested the examples again, but I couldn't find any that were broken. Merging.

@k-ye k-ye merged commit d756cbd into taichi-dev:master Jul 14, 2020
@k-ye k-ye deleted the fm branch July 14, 2020 10:27
@FantasyVR FantasyVR mentioned this pull request Jul 15, 2020