Memory access patterns become increasingly critical on larger-scale devices. This PR explores optimising the attention window mask by changing the unidirectional condition

`q_idx - kv_idx < sliding_window_num_blocks`

to the bidirectional condition

`abs(q_idx - kv_idx) < sliding_window_num_blocks`

The bidirectional window mask creates a band-diagonal pattern instead of a triangular one, giving more regular memory access that better utilizes modern GPU architectures and improves training speed without compromising model performance. A sketch of the two conditions follows below.
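For reference, here is a minimal sketch of the two mask conditions written as FlexAttention-style `mask_mod` functions. The surrounding setup (window size, sequence lengths, and the `create_block_mask` call) is illustrative only and not taken from the actual diff; in the real model this window term would typically be combined with the other mask terms (e.g. causal/document masks) before the block mask is built.

```python
import torch
from torch.nn.attention.flex_attention import create_block_mask

# Hypothetical window size; the real value comes from the training config.
sliding_window_num_blocks = 8

def unidirectional_window_mask(b, h, q_idx, kv_idx):
    # Original pattern: one-sided distance check, producing a triangular band.
    return q_idx - kv_idx < sliding_window_num_blocks

def bidirectional_window_mask(b, h, q_idx, kv_idx):
    # Proposed pattern: symmetric band around the diagonal
    # (band-diagonal instead of triangular).
    return torch.abs(q_idx - kv_idx) < sliding_window_num_blocks

# Illustrative usage (requires PyTorch >= 2.5 for flex_attention):
block_mask = create_block_mask(
    bidirectional_window_mask, B=None, H=None,
    Q_LEN=4096, KV_LEN=4096, device="cuda",
)
```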
Since no H100 is available on my side, I tested on A100 GPUs in 1x, 2x, and 8x configurations, running 15 times for each setting (bidirectional pattern vs. the 121024_MFUTweaks baseline from @YouJiacheng). The results showed consistent improvements (all records submitted):
- 8x A100 configuration (most significant improvement)
- 2x A100 configuration
- 1x A100 configuration
It would be interesting to see how this works on 8x H100; I would greatly appreciate it if anyone could help with that!