[Metal] Support fast_math, preps for saturating_grid_dim #1443

Merged: 2 commits into taichi-dev:master from the fm branch, Jul 14, 2020

Conversation

@k-ye (Member) commented Jul 9, 2020

  • fast_math helped mostly with compute-bound tasks. E.g.
    • sdf_renderer: 25 -> 43 sps
    • cornell_box: 95 -> ~130 sps
  • As you can tell, enabling fast_math broke a few tests. Given that these tests were doing a 3.0 / 3.0, I'm not too surprised by that, but I had to turn off fast_math for them.
  • @yuanming-hu IIUC, saturating_grid_dim can be used to limit the total number of threads? I'd like to do that once users can actually set it in ti.init().

Related issue = #935



codecov bot commented Jul 9, 2020

Codecov Report

Merging #1443 into master will increase coverage by 0.27%.
The diff coverage is n/a.


@@            Coverage Diff             @@
##           master    #1443      +/-   ##
==========================================
+ Coverage   66.41%   66.69%   +0.27%     
==========================================
  Files          38       38              
  Lines        5291     5305      +14     
  Branches      951      948       -3     
==========================================
+ Hits         3514     3538      +24     
+ Misses       1613     1607       -6     
+ Partials      164      160       -4     
Impacted Files Coverage Δ
python/taichi/misc/gui.py 25.00% <0.00%> (-0.23%) ⬇️
python/taichi/core/util.py 21.72% <0.00%> (+0.53%) ⬆️
python/taichi/lang/__init__.py 78.66% <0.00%> (+3.66%) ⬆️


@k-ye (Member Author) commented Jul 9, 2020

FYI, I think the ring artifacts were also due to fast math. Now it looks like this with max_ray_depth=1...

[Screenshot: rendering output with max_ray_depth=1, 2020-07-09]

@archibate (Collaborator) left a comment

LGTM + 1 question + 1 nit.

@@ -10,7 +10,7 @@ def _c_mod(a, b):

@pytest.mark.parametrize('lhs_is_mat,rhs_is_mat', [(True, True), (True, False),
(False, True)])
-@ti.all_archs
+@ti.all_archs_with(fast_math=False)
@archibate (Collaborator):

Does pytest.approx with larger rel= and abs= solve your issue?

@k-ye (Member Author), Jul 10, 2020:

Nope, it failed at //, which translates to floor(y / z). y[1][1] and z[1][1] are both 3, so I guess the division becomes something like 0.99999, which gets floored to 0.0 instead of 1.0. I just tested by setting y to a slightly higher value, e.g. 3.0001, and it worked. But I think this kind of rounding error is expected when enabling fast math, so it's not a big deal? Maybe we should also disable fast math in tests by default.
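To make the failure mode concrete, here is a minimal C++ sketch of the arithmetic involved (illustrative only; the helper name and values are hypothetical, not Taichi code). The floor-division lowers to floor(y / z), and fast math may replace the division with a cheaper approximation whose result can land just below the exact quotient:

#include <cmath>
#include <cstdio>

// Hypothetical helper mirroring how `y // z` behaves: floor(y / z).
float floor_div(float y, float z) {
  return std::floor(y / z);
}

int main() {
  // With exact IEEE division, 3.0f / 3.0f == 1.0f, so this prints 1.0.
  // Under fast math the compiler may emit y * approx_reciprocal(z) instead,
  // and a quotient like 0.99999f then floors to 0.0f, which is what broke
  // the tests.
  std::printf("%f\n", floor_div(3.0f, 3.0f));
  return 0;
}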

Comment on lines -136 to -145
int num_threads_per_group = 0;
// Sometimes it is helpful to limit the maximum GPU block dim for the
// kernels. E.g., when you are generating iPhone shaders on a Mac.
const int prescribed_block_dim =
    (std::size_t)get_current_program().config.max_block_dim;
if (prescribed_block_dim != 0) {
  num_threads_per_group = std::min(native_block_dim, prescribed_block_dim);
} else {
  num_threads_per_group = native_block_dim;
}
@archibate (Collaborator):

OFT? (i.e., is this change off-topic for this PR?)

@k-ye (Member Author):

Not really :) These are the necessary changes for supporting saturating_grid_dim.

int msl_version) {
auto source_str = mac::wrap_string_as_ns_string(source);

id options = clscall("MTLCompileOptions", "alloc");
options = call(options, "init");
auto options_cleanup = wrap_as_nsobj_unique_ptr(options);
call(options, "setFastMathEnabled:", false);
call(options, "setFastMathEnabled:", fast_math);
@archibate (Collaborator):

Cool! Not sure how fast_math works. IIUC, is this the same as specifying precision mediump float; in OpenGL?

@k-ye (Member Author):

They are both for optimizing performance, but take different approaches. From what I can tell, precision mediump float reduces the number of bits used to represent a float; Metal has a similar concept, half (16 bits). Fast math, on the other hand, reduces the instructions in the computation and produces an approximate result.
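A rough sketch of the distinction, in Metal-Shading-Language-style C++ (illustrative only; the function names are made up): half shrinks the representation itself, while fast math keeps 32-bit floats but lets the compiler use approximate instruction sequences.

// Reduced-precision storage, analogous to GLSL's `precision mediump float;`:
// every value and operation here is 16-bit.
half scale_storage(half a, half b) {
  return a * b;
}

// Fast math: values stay 32-bit, but with setFastMathEnabled:/-ffast-math the
// compiler may rewrite the division as a multiply by an approximate
// reciprocal, reassociate expressions, and assume no NaN/Inf, so the result
// can differ slightly from the IEEE-exact one.
float scale_instructions(float a, float b) {
  return a / b;
}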

@k-ye k-ye requested a review from archibate July 10, 2020 11:42
@yuanming-hu (Member) left a comment

Cool!

@yuanming-hu IIUC, saturating_grid_dim can be used to limit the total number of threads? I'd like to do that once users can actually set it in ti.init().

That's used to limit the maximum number of groups (blocks on CUDA). On CUDA if we use grid-stride loops, there's no need to allocate more than 16/32 blocks per streaming multiprocessor. Otherwise the blocks will be queued to bind to available SMs.

config.saturating_grid_dim = num_SMs * 32;

Total number of threads is grid_dim x block_dim. Maybe that's a slightly different metric? :-)
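For context, a minimal grid-stride loop looks like the CUDA C++ sketch below (illustrative only; the kernel and variable names are made up). Because each thread strides over the whole range, capping the grid at roughly num_SMs * 32 blocks keeps every SM busy without queuing excess blocks:

__global__ void scale(int n, float a, float *x) {
  // Each thread handles indices i, i + stride, i + 2*stride, ...
  const int stride = gridDim.x * blockDim.x;
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
    x[i] *= a;
  }
}

// Launch with a saturated grid, e.g.:
//   scale<<<num_SMs * 32, block_dim>>>(n, a, x);
// grid_dim * block_dim threads cover any n, however large.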

FYI, I think the ring artifacts were also due to fast math. Now it looks like this with max_ray_depth=1...

We might need to take a closer look at this later - maybe using a larger eps or something. Note that although a larger max_ray_depth may seemingly get rid of this artifact, it can still affect convergence/accuracy.

@k-ye (Member Author) commented Jul 10, 2020

That's used to limit the maximum number of groups (blocks on CUDA). On CUDA if we use grid-stride loops, there's no need to allocate more than 16/32 blocks per streaming multiprocessor. Otherwise the blocks will be queued to bind to available SMs.

Ah I see. It seems like users aren't given the option to turn off grid-stride loops? That is, if saturating_grid_dim = 0, then it gets set to num_SMs * 32... For some reason, sdf_renderer.py does run faster without grid-stride loops, so I guess the default params would work a bit differently on Metal.

Note that although a larger max_ray_depth may seemingly get rid of this artifact, it can still affect convergence/accuracy.

Ack. But if I turn off fast math and still use max_ray_depth=1, the artifacts are gone (maybe not completely, I still seem to see a round shape when zoomed in...). See below:

[Screenshot: same scene with fast math off and max_ray_depth=1, 2020-07-11]

@yuanming-hu (Member):

Ah I see. It seems like users aren't given the option to turn off grid-stride loops? That is, if saturating_grid_dim = 0, then it gets set to num_SMs * 32... For some reason, sdf_renderer.py does run faster without grid-stride loops, so I guess the default params would work a bit differently on Metal.

It's interesting that sdf_renderer.py runs faster without grid-stride loops. Maybe not using them gives you better load balancing? I'm not sure what happens if CUDA doesn't use grid-stride loops :-) The other motivation for grid-stride loops is that when the bounds of the loop depend on the previous kernel, they let CUDA launch the kernel without waiting for the previous kernel to finish, since you don't need to know how many iterations are needed at launch time.

Feel free to do any specialization for Metal if that's beneficial to performance.
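A small variant of the same sketch (again CUDA C++, hypothetical names) showing the point above about not waiting for the previous kernel: the loop bound is read from device memory inside the kernel, so the launch does not depend on the producer's result being visible on the host.

__global__ void consume(const int *n_ptr, float *data) {
  // n was written by a previous kernel; the host never needs to read it
  // back before launching this one.
  const int n = *n_ptr;
  const int stride = gridDim.x * blockDim.x;
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
    data[i] *= 2.0f;
  }
}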

Ack. But if I turn off fast math and still use max_ray_depth=1, the artifacts are gone (maybe not completely, I still seem to see a round shape when zoomed in...). See below:

Interesting... I guess it's something related to numerical precision. The ring artifacts sometimes happen when you have sqrt or 1/sqrt function calls that need to be high-precision.

@k-ye (Member Author) commented Jul 11, 2020

grid-stride loops let CUDA launch the kernel without waiting for the previous kernel to finish

Yep, that I remember and have done for Metal as well :)

emit("// range_for, range known at runtime");
begin_expr = stmt->const_begin
? std::to_string(stmt->begin_value)
: inject_load_global_tmp(stmt->begin_offset);
const auto end_expr = stmt->const_end
? std::to_string(stmt->end_value)
: inject_load_global_tmp(stmt->end_offset);
emit("const int {} = {} - {};", total_elems_name, end_expr, begin_expr);
ka.num_threads = kMaxNumThreadsGridStrideLoop;


I still feel like fast math broke some example on Mac, but I couldn't find it now. Let me try a few more before merging this.

@archibate (Collaborator) left a comment

Thanks! I didn't go through it very carefully, but given @yuanming-hu's approval, let's merge this for now so it doesn't conflict with #1474.

@k-ye (Member Author) commented Jul 12, 2020

Thanks! I didn't go through it very carefully, but given @yuanming-hu's approval, let's merge this for now so it doesn't conflict with #1474.

Thanks! I've been working on #1480 this whole weekend and didn't get the time to re-check this. Let me verify if there's any example broken by fast math in the next few days.


Tested the examples again, but I couldn't find any that were broken. Merging.

@k-ye k-ye merged commit d756cbd into taichi-dev:master Jul 14, 2020
@k-ye k-ye deleted the fm branch July 14, 2020 10:27
@FantasyVR FantasyVR mentioned this pull request Jul 15, 2020