[QST] Significant GFLOPs variations due to different input initialization behaviours #2004

yuukidach · 2024-12-20T17:22:19Z

What is your question?

env:

GPU: NVIDIA H100 80GB HBM3
CUDA Version: 12.3
cuDNN Version: 90201
CUTLASS commit: e1cd8c7

Background

In example 48_hopper_warp_specialized_gemm, I've observed that the way the input blocks are initialized can lead to substantial differences in measured GFLOPs.

Example 48 calls BlockFillRandomUniform with bits=0 when initializing the input blocks. Under this configuration, the fractional bits of every generated value are truncated to zero, so the inputs are integer-valued and most of their mantissa bits are zero. With this setting the example reports 326,225 GFLOPs.
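To make the effect of the bits parameter on the stored bit patterns concrete, here is a minimal Python sketch. It is not the CUTLASS implementation: `fill_uniform` is a hypothetical stand-in that assumes a non-negative `bits` keeps that many fractional bits (emulated by rounding to a multiple of 2^-bits), and the range [-8, 8] is chosen only for illustration. It compares how many mantissa bits are set in the float32 encoding of each value.

```python
import random
import struct


def float32_bits(x: float) -> int:
    """Return the IEEE 754 binary32 bit pattern of x as an unsigned int."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def mantissa_popcount(x: float) -> int:
    """Count the set bits in the 23-bit mantissa field of x stored as float32."""
    return bin(float32_bits(x) & 0x7FFFFF).count("1")


def fill_uniform(n, lo, hi, bits, seed=0):
    """Hypothetical stand-in for a uniform random fill (assumption, not CUTLASS).

    bits < 0  : keep full precision
    bits >= 0 : keep only `bits` fractional bits (rounding to a multiple of 2**-bits)
    """
    rng = random.Random(seed)
    values = [rng.uniform(lo, hi) for _ in range(n)]
    if bits >= 0:
        scale = 1 << bits
        values = [round(v * scale) / scale for v in values]
    return values


truncated = fill_uniform(10_000, -8.0, 8.0, bits=0)
full = fill_uniform(10_000, -8.0, 8.0, bits=-1)

avg_truncated = sum(map(mantissa_popcount, truncated)) / len(truncated)
avg_full = sum(map(mantissa_popcount, full)) / len(full)

print(f"average set mantissa bits, bits=0 : {avg_truncated:.2f}")
print(f"average set mantissa bits, bits=-1: {avg_full:.2f}")
```

Under these assumptions the integer-valued inputs carry at most two set mantissa bits each, while the full-precision inputs have roughly half of their mantissa bits set.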

After changing the bits parameter to -1 (no truncation), the measured GFLOPs drops to 304,768.
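For scale, the gap between the two measurements can be worked out as below. This is only a sketch: `gemm_gflops` assumes the usual convention of counting 2·M·N·K floating-point operations per GEMM, and the problem size passed to it is hypothetical. The two GFLOPs figures are the ones reported in this issue.

```python
def gemm_gflops(m: int, n: int, k: int, runtime_s: float) -> float:
    """GFLOP/s for one GEMM, counting 2*M*N*K floating-point operations."""
    return 2.0 * m * n * k / 1.0e9 / runtime_s


def relative_drop(baseline: float, changed: float) -> float:
    """Fractional decrease of `changed` relative to `baseline`."""
    return (baseline - changed) / baseline


# Figures reported in this issue.
gflops_bits0 = 326_225.0      # bits=0  (truncated inputs)
gflops_bits_neg1 = 304_768.0  # bits=-1 (non-truncated inputs)

drop = relative_drop(gflops_bits0, gflops_bits_neg1)
print(f"relative drop: {drop:.1%}")  # prints "relative drop: 6.6%"

# Hypothetical problem size and runtime, only to show the formula in use.
print(gemm_gflops(1000, 1000, 1000, 1.0))  # prints 2.0
```

So the non-truncated inputs run about 6.6% slower in this measurement.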

Profiling Insights:

Memory Throughput: Using ncu for profiling reveals that the truncated data (bits=0) exhibits higher L2 cache and DRAM throughput compared to the non-truncated data (bits=-1).
SM Efficiency: The compute efficiency of the SM remains consistent across both initialization methods.

This behavior is not limited to specific input sizes or tile configurations. Similar patterns are observed across various configurations, indicating a potential systemic issue.

Question

Is this result expected? Why does the distribution of input data (truncated vs. non-truncated) affect data transfer speeds and overall performance?


thakkarV · 2024-12-20T18:34:04Z

Yes this is expected.

yuukidach · 2024-12-21T09:38:37Z

Hi @thakkarV, thanks for the confirmation. Would you mind explaining more about how the truncated bits affect GPU memory transfers? Does this mean that data is actually compressed while it is transferred through GPU memory?

thakkarV · 2024-12-21T11:35:14Z

https://www.thonking.ai/p/strangely-matrix-multiplications

yuukidach · 2024-12-23T04:22:48Z

Thanks for the post. I never thought it would be related to the flipping of transistors.
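As a rough illustration of the "flipping" idea, the sketch below counts how many bit positions change between consecutive float32 values in a data stream. This is only a crude proxy for switching activity, not a power model of any GPU, and all names in it are hypothetical. It compares full-precision random values, the same values rounded to integers, and all zeros.

```python
import random
import struct


def float32_bits(x: float) -> int:
    """Return the IEEE 754 binary32 bit pattern of x as an unsigned int."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def toggles(values) -> int:
    """Total number of bit positions that differ between consecutive float32 values."""
    bits = [float32_bits(v) for v in values]
    return sum(bin(a ^ b).count("1") for a, b in zip(bits, bits[1:]))


rng = random.Random(0)
full = [rng.uniform(-8.0, 8.0) for _ in range(10_000)]  # full-precision values
integer_valued = [float(round(v)) for v in full]        # fractional bits removed
zeros = [0.0] * len(full)                               # fully predictable data

for name, data in [("full", full), ("integer", integer_valued), ("zeros", zeros)]:
    print(f"{name:8s} bit toggles per element: {toggles(data) / (len(data) - 1):.2f}")
```

In this proxy the integer-valued stream toggles far fewer bits per element than the full-precision stream, and the all-zero stream toggles none.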

yuukidach added the "? - Needs Triage" and "question" labels Dec 20, 2024

yuukidach changed the title ~~[QST] Significant GFLOPs Variations Due to Different Input Initialization behaviours in CUTLASS~~ [QST] Significant GFLOPs variations due to different input initialization behaviours Dec 20, 2024

thakkarV closed this as completed Dec 23, 2024
