In function `runCublasTF32` the comment is misleading/incomplete. Based on the cuBLAS docs, the effect of `CUBLAS_COMPUTE_32F_FAST_TF32` is that cuBLAS uses reduced-precision TF32 math on tensor cores for a faster GEMM. Based on the documentation of the `wmma` ops in the CUDA programming guide, the input is converted with `__float_to_tf32` to a `float` of numerically reduced TF32 precision.
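For concreteness, here is a minimal sketch of the kind of call this compute type belongs in (the helper name and the column-major, non-transposed layout are my assumptions, not the repo's code): the matrices stay plain FP32 in memory, and the compute type alone opts in to the TF32 rounding.

```cuda
// Sketch: FP32-in/FP32-out GEMM where CUBLAS_COMPUTE_32F_FAST_TF32
// permits cuBLAS to round the inputs to TF32 (10-bit mantissa) for the
// tensor-core multiply while accumulating in FP32.
#include <cublas_v2.h>

cublasStatus_t gemmTF32(cublasHandle_t handle, int M, int N, int K,
                        const float *A, const float *B, float *C) {
  float alpha = 1.0f, beta = 0.0f;
  return cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, M, N, K,
                      &alpha,
                      A, CUDA_R_32F, M,   // FP32 input, lda = M
                      B, CUDA_R_32F, K,   // FP32 input, ldb = K
                      &beta,
                      C, CUDA_R_32F, M,   // FP32 output, ldc = M
                      CUBLAS_COMPUTE_32F_FAST_TF32,  // TF32 tensor-core math
                      CUBLAS_GEMM_DEFAULT);
}
```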
As tensor cores in recent architectures support fp64 natively, I am curious what the performance benefit of using them is over plain fp64 CUDA computation.
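For reference, the fp64 tensor-core path in question is exposed directly through `wmma` on sm_80 and newer via a double-precision 8x8x4 shape. A minimal single-tile sketch (illustrative only, not benchmarked):

```cuda
// Sketch: one warp computing an 8x8 fp64 tile D = A*B + C on the fp64
// tensor cores (requires sm_80+, i.e. compile with -arch=sm_80).
// Launch with a single warp (32 threads) for this lone tile.
#include <mma.h>
using namespace nvcuda;

__global__ void dmmaTile(const double *A,  // 8x4, row-major
                         const double *B,  // 4x8, column-major
                         const double *C,  // 8x8, row-major
                         double *D) {      // 8x8, row-major
  wmma::fragment<wmma::matrix_a, 8, 8, 4, double, wmma::row_major> a;
  wmma::fragment<wmma::matrix_b, 8, 8, 4, double, wmma::col_major> b;
  wmma::fragment<wmma::accumulator, 8, 8, 4, double> acc;
  wmma::load_matrix_sync(a, A, 4);                         // ld of A tile
  wmma::load_matrix_sync(b, B, 4);                         // ld of B tile
  wmma::load_matrix_sync(acc, C, 8, wmma::mem_row_major);  // ld of C tile
  wmma::mma_sync(acc, a, b, acc);  // runs on the fp64 (DMMA) tensor cores
  wmma::store_matrix_sync(D, acc, 8, wmma::mem_row_major);
}
```

The plain-CUDA fp64 baseline would be the ordinary FMA path in a regular kernel; NVIDIA's A100 datasheet, for instance, quotes roughly 2x the fp64 throughput via tensor cores compared with the regular fp64 pipeline.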
SGEMM_CUDA/src/runner.cu, line 145 in 60cba6f