In example 48_hopper_warp_specialized_gemm, I've observed that the way the input blocks are initialized can lead to substantial differences in measured GFLOPs.
Example 48 calls BlockFillRandomUniform with bits=0 when initializing the input blocks. Under this configuration, the fractional bits of every generated value are truncated to zero, and the example reports 326,225 GFLOPs.
After changing the bits parameter to -1 (do not truncate the data), the measurement drops to 304,768 GFLOPs, about 6.6% lower.
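To make the two settings concrete, here is a minimal sketch of the kind of truncation the bits parameter is understood to apply. `truncate_fraction` is a hypothetical helper written for this illustration, not a CUTLASS function, and it assumes that bits is the number of fractional bits kept, with a negative value disabling truncation. Please check the CUTLASS sources for the actual behaviour.

```cpp
#include <cstdint>

// Hypothetical helper for illustration only (not part of CUTLASS).
// Assumption: `bits` is the number of fractional bits to keep; the rest
// are truncated toward zero. A negative `bits` leaves the value unchanged.
double truncate_fraction(double value, int bits) {
  if (bits < 0) {
    return value;  // bits = -1: keep the random value as generated
  }
  // Scale up, drop everything after the binary point, scale back down.
  const double scale = static_cast<double>(std::int64_t{1} << bits);
  return static_cast<double>(static_cast<std::int64_t>(value * scale)) / scale;
}
```

Under this assumption, bits=0 turns every random input into an integer-valued float (for example 3.7 becomes 3.0), while bits=-1 keeps the full random bit pattern.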
Profiling Insights:
Memory Throughput: Using ncu for profiling reveals that the truncated data (bits=0) exhibits higher L2 cache and DRAM throughput compared to the non-truncated data (bits=-1).
SM Efficiency: The compute efficiency of the SM remains consistent across both initialization methods.
This behavior is not limited to specific input sizes or tile configurations. Similar patterns are observed across various configurations, indicating a potential systemic issue.
Question
Is this result expected? Why does the distribution of input data (truncated vs. non-truncated) affect data transfer speeds and overall performance?
yuukidach changed the title from "[QST] Significant GFLOPs Variations Due to Different Input Initialization behaviours in CUTLASS" to "[QST] Significant GFLOPs variations due to different input initialization behaviours" on Dec 20, 2024
Hi @thakkarV, thanks for the confirmation. Would you mind explaining in more detail how the truncated bits affect GPU memory transfers? Does this mean that data is actually compressed when it is transferred through GPU memory?