[Metal] Support fast_math, preps for saturating_grid_dim #1443
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master    #1443      +/-   ##
==========================================
+ Coverage   66.41%   66.69%   +0.27%
==========================================
  Files          38       38
  Lines        5291     5305      +14
  Branches      951      948       -3
==========================================
+ Hits         3514     3538      +24
+ Misses       1613     1607       -6
+ Partials      164      160       -4
Continue to review full report at Codecov.
LGTM + 1 question + 1 nit.
@@ -10,7 +10,7 @@ def _c_mod(a, b):

 @pytest.mark.parametrize('lhs_is_mat,rhs_is_mat', [(True, True), (True, False),
                                                    (False, True)])
-@ti.all_archs
+@ti.all_archs_with(fast_math=False)
Does pytest.approx with larger rel= and abs= solve your issue?
Nope, it failed at //, which translates to floor(y / z). For y[1][1] and z[1][1], they are both 3, so I guess the division becomes something like 0.99999, which gets floored to 0.0 instead of 1.0. I just tested by setting y to a slightly higher value, e.g. 3.0001, and it worked. But I think this kind of rounding error is expected when enabling fast math, so it's not a big deal? Maybe we should also disable fast math in tests by default.
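For context, here is a minimal Python sketch of why a looser tolerance doesn't help once floor() is involved (the 0.9999995 value is hypothetical, standing in for a fast-math quotient of 3.0 / 3.0):

```python
import math
import pytest

approx_div = 0.9999995  # hypothetical fast-math result of 3.0 / 3.0

# A looser tolerance happily accepts the raw quotient...
assert approx_div == pytest.approx(1.0, rel=1e-4)

# ...but // lowers to floor(y / z), so the tiny error becomes a full unit,
# and no reasonable rel=/abs= on the final value can absorb an off-by-one.
assert math.floor(approx_div) == 0  # the exact result would be 1
```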
  int num_threads_per_group = 0;
  // Sometimes it is helpful to limit the maximum GPU block dim for the
  // kernels. E.g., when you are generating iPhone shaders on a Mac.
  const int prescribed_block_dim =
      (std::size_t)get_current_program().config.max_block_dim;
  if (prescribed_block_dim != 0) {
    num_threads_per_group = std::min(native_block_dim, prescribed_block_dim);
  } else {
    num_threads_per_group = native_block_dim;
  }
OFT?
Not really :) These are the necessary changes for supporting saturating_grid_dim.
      int msl_version) {
    auto source_str = mac::wrap_string_as_ns_string(source);

    id options = clscall("MTLCompileOptions", "alloc");
    options = call(options, "init");
    auto options_cleanup = wrap_as_nsobj_unique_ptr(options);
-   call(options, "setFastMathEnabled:", false);
+   call(options, "setFastMathEnabled:", fast_math);
Cool! Not sure about how fast_math works. IIUC, is this the same as specifying precision mediump float; in OpenGL?
They are both for optimizing performance, but take different approaches. From what I can tell, precision mediump float reduces the number of bits used to represent a float; Metal has a similar concept, half (16 bits). Fast math, on the other hand, reduces the instructions in the computation to produce an approximated result.
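As a rough illustration of the difference (a Python/NumPy sketch, not Metal or GLSL): reduced precision stores each value in fewer bits, while fast math keeps full-width floats but lets the compiler emit cheaper, approximate instruction sequences.

```python
import numpy as np

# mediump / Metal "half" style: fewer bits per value, so the trailing digits
# are lost even though each operation is rounded correctly for its format.
print(np.float32(1.0) / np.float32(3.0))  # ~0.33333334 (float32)
print(np.float16(1.0) / np.float16(3.0))  # ~0.3333     (float16 / "half")

# Fast math is orthogonal: values stay float32, but the compiler may replace
# an exact division with multiplication by an approximate reciprocal (and
# reassociate or skip special-case handling), so e.g. 3.0 / 3.0 can land a
# hair below 1.0.
```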
Cool!
@yuanming-hu IIUC, saturating_grid_dims can be used to limit the total number of threads? I'd like to do that once users can actually set it in ti.init()..
That's used to limit the maximum number of groups (blocks on CUDA). On CUDA if we use grid-stride loops, there's no need to allocate more than 16/32 blocks per streaming multiprocessor. Otherwise the blocks will be queued to bind to available SMs.
taichi/taichi/program/program.cpp, line 180 in f515747:
    config.saturating_grid_dim = num_SMs * 32;
Total number of threads is grid_dim x block_dim. Maybe that's a slightly different metric? :-)
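To make the grid-stride idea concrete, here is a minimal Python sketch (plain Python, not Taichi or CUDA; the numbers are illustrative): with a capped grid, the same threads cover extra elements by striding rather than by launching more blocks.

```python
def grid_stride_sum(data, grid_dim, block_dim):
    # Total "threads" launched; with grid-stride loops this can stay small
    # (e.g. saturating_grid_dim ~ num_SMs * 32 blocks) regardless of len(data).
    total_threads = grid_dim * block_dim
    out = 0.0
    for tid in range(total_threads):                    # simulate each thread
        for i in range(tid, len(data), total_threads):  # the grid-stride loop
            out += data[i]
    return out

# 32 simulated threads cover 1000 elements by striding, not by adding blocks.
print(grid_stride_sum([1.0] * 1000, grid_dim=4, block_dim=8))  # 1000.0
```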
FYI, I think the ring artifacts were also due to fast math. Now it looks like this with max_ray_depth=1...
We might need to take a closer look into this later - maybe using a larger eps or something. Note that although a larger max_ray_depth may seemingly get rid of this artifact, it can still affect convergence/accuracy.
Ah I see. It seems like users aren't given the option to turn off grid-strided loops? That is, if ...

Ack. But if I turn off fast math, and still use ...

It's interesting that without grid-stride loops ... Feel free to do any specialization for Metal if that's beneficial to performance.

Interesting... I guess it's something related to numerical precision. The ring artifacts sometimes happen when you have ...

Yep, that I remember and have done for Metal as well :) (see taichi/taichi/backends/metal/codegen_metal.cpp, lines 846 to 854 in 0d2e34f)

I still feel like fast math broke some example on mac, but couldn't find it now. Let me try a few more before merging this.
Thanks! Didn't review very carefully. Given @yuanming-hu's approval, let's merge this for now to avoid conflicts with #1474.
Thanks! I've been working on #1480 this whole weekend and didn't get the time to re-check this. Let me verify whether any example is broken by fast math in the next few days.

Tested the examples again, but I couldn't find any that's broken. Merging.
fast_math helped mostly with computation-bounded tasks. E.g. ...

fast_math broke a few tests. Given that these tests were doing a 3.0 / 3.0, I'm not too surprised by that. But I had to turn off fast_math.

Related issue = #935