
[QST] Significant GFLOPs variations due to different input initialization behaviours #2004

Closed
yuukidach opened this issue Dec 20, 2024 · 4 comments

Comments

@yuukidach

What is your question?

env:

  • GPU: NVIDIA H100 80GB HBM3
  • CUDA Version: 12.3
  • cuDNN Version: 90201
  • CUTLASS commit: e1cd8c7

Background

In example 48_hopper_warp_specialized_gemm, I've observed that the way the input blocks are initialized can lead to substantial differences in measured GFLOPs.

Example 48 calls BlockFillRandomUniform with bits=0 when initializing the input blocks. Under this configuration, all floating-point mantissas are truncated to zero, and the example yields a GFLOPs measurement of 326,225.
(screenshot: speed with bits=0)

After changing the bits parameter to -1 (i.e., do not truncate the data), the GFLOPs drops to 304,768.
(screenshot: speed with bits=-1)
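For context, the initialization path in question looks roughly like the sketch below. This is a minimal sketch, assuming the BlockFillRandomUniform overload (pointer, capacity, seed, max, min, bits) from cutlass/util/reference/device/tensor_fill.h; the scope values here are illustrative rather than the exact ones used by example 48.

```cpp
// Minimal sketch of the input initialization, assuming the
// BlockFillRandomUniform overload (pointer, capacity, seed, max, min, bits)
// from cutlass/util/reference/device/tensor_fill.h.
#include "cutlass/util/device_memory.h"
#include "cutlass/util/reference/device/tensor_fill.h"

template <class Element>
void initialize_block(cutlass::DeviceAllocation<Element> &block,
                      uint64_t seed, int bits) {
  Element scope_max = Element(8);   // illustrative range, not the exact
  Element scope_min = Element(-8);  // values from example 48

  // bits = 0  -> mantissas truncated to zero (the faster case above)
  // bits = -1 -> full random mantissas       (the slower case above)
  cutlass::reference::device::BlockFillRandomUniform(
      block.get(), block.size(), seed, scope_max, scope_min, bits);
}
```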

Profiling Insights:

  • Memory Throughput: Using ncu for profiling reveals that the truncated data (bits=0) exhibits higher L2 cache and DRAM throughput compared to the non-truncated data (bits=-1).
  • SM Efficiency: The compute efficiency of the SM remains consistent across both initialization methods.

This behavior is not limited to specific input sizes or tile configurations. Similar patterns are observed across various configurations, indicating a potential systemic issue.

Question

Is this result expected? Why does the distribution of input data (truncated vs. non-truncated) affect data transfer speeds and overall performance?

@yuukidach yuukidach changed the title [QST] Significant GFLOPs Variations Due to Different Input Initialization behaviours in CUTLASS [QST] Significant GFLOPs variations due to different input initialization behaviours Dec 20, 2024
@thakkarV
Collaborator

Yes, this is expected.

@yuukidach
Author

Hi @thakkarV, thanks for the confirmation. Would you mind explaining a bit more about how the truncated bits affect GPU memory transfers? Does this mean there is actually a data compression step during the transfer of data in GPU memory?

@thakkarV
Collaborator

https://www.thonking.ai/p/strangely-matrix-multiplications

@yuukidach
Author

Thanks for the post. Never thought it would be related to the flipping of transistors.
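For future readers, here is a minimal host-side illustration of the mechanism as I understand it from the post. It assumes nothing about CUTLASS itself: it just shows that zeroing the mantissa bits leaves far fewer bits toggling between consecutive values, which means less switching (dynamic) power on the memory datapath and more headroom before the GPU throttles.

```cpp
// Illustrative only: count how many bits toggle between consecutive random
// 16-bit payloads, with and without the 10 FP16 mantissa bits zeroed out.
#include <bitset>
#include <cstdint>
#include <iostream>
#include <random>

int main() {
  std::mt19937 rng(2023);
  std::uniform_int_distribution<unsigned> dist(0, 0xFFFF);

  constexpr uint16_t kMantissaMask = 0x03FF;  // low 10 bits of an FP16 value
  uint64_t flips_full = 0, flips_trunc = 0;
  uint16_t prev_full = 0, prev_trunc = 0;

  for (int i = 0; i < (1 << 20); ++i) {
    uint16_t v = static_cast<uint16_t>(dist(rng));            // full mantissa
    uint16_t t = v & static_cast<uint16_t>(~kMantissaMask);   // mantissa -> 0

    flips_full  += std::bitset<16>(v ^ prev_full).count();
    flips_trunc += std::bitset<16>(t ^ prev_trunc).count();
    prev_full = v;
    prev_trunc = t;
  }

  // The truncated stream toggles far fewer bits on average than the
  // full-mantissa stream.
  std::cout << "bit flips, full mantissa:      " << flips_full  << "\n"
            << "bit flips, truncated mantissa: " << flips_trunc << "\n";
}
```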
