Bidirectional Window Tweaks #60

Open
LinglongQian wants to merge 1 commit into master
Conversation

@LinglongQian commented Dec 22, 2024

Memory access patterns become increasingly critical on larger-scale devices. This PR explores optimising the attention window mask implementation by changing the unidirectional condition q_idx - kv_idx < sliding_window_num_blocks to the bidirectional condition abs(q_idx - kv_idx) < sliding_window_num_blocks. This modification produces more regular memory access patterns that better utilise modern GPU architectures, improving training speed without compromising model performance.

The bidirectional window mask creates a band-diagonal pattern instead of a triangular pattern, resulting in (see the sketch after this list):

  1. More regular memory access patterns
  2. Better GPU memory bandwidth utilization
  3. More efficient parallel computation

(figure: window mask comparison)
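Below is a minimal sketch (not the repo's actual FlexAttention block-mask builder) contrasting the two window masks with hypothetical block counts; it materialises them densely and also checks that the two variants coincide once AND-ed with the causal mask, which is the point raised in the discussion below.

```python
import torch

# Hypothetical sizes for illustration only; the actual run uses the repo's
# own block counts and its sliding_window_num_blocks schedule.
num_blocks = 16
sliding_window_num_blocks = 4

q_idx = torch.arange(num_blocks)[:, None]   # query block index (column vector)
kv_idx = torch.arange(num_blocks)[None, :]  # key/value block index (row vector)

causal_bm = q_idx >= kv_idx                                        # lower-triangular
window_bm_uni = (q_idx - kv_idx) < sliding_window_num_blocks       # original: unidirectional
window_bm_bi = (q_idx - kv_idx).abs() < sliding_window_num_blocks  # this PR: band-diagonal

# After AND-ing with the causal mask, both variants admit exactly the same
# (q_idx, kv_idx) pairs; only the standalone window mask differs in sparsity.
assert torch.equal(causal_bm & window_bm_uni, causal_bm & window_bm_bi)
```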

Since I have no H100 available, I tested on A100 GPUs in three configurations (1x, 2x, and 8x), with 15 runs per setting (bidirectional pattern vs. the 121024_MFUTweaks baseline from @YouJiacheng), and saw consistent improvements (all records submitted):

8x A100 Configuration (Most Significant Improvement)

  • Validation Loss: 3.2787 (bidirectional) vs 3.2788 (baseline)
  • Training Time: 12.2910 min vs 12.3380 min
  • Step Average: 501.67 ms vs 503.59 ms
  • Variance remains stable across all metrics

(figure: Train Time Comparison, 8x A100)

2x A100 Configuration

  • Validation Loss: 3.2787 vs 3.2785
  • Training Time: 39.5508 min vs 39.5650 min
  • Step Average: 1614.31 ms vs 1614.89 ms

(figure: Train Time Comparison, 2x A100)

1x A100 Configuration

  • Validation Loss: 3.2785 vs 3.2790
  • Training Time: 75.1865 min vs 75.4922 min
  • Step Average: 3068.83 ms vs 3081.31 ms

(figure: Train Time Comparison, 1x A100)

It would be interesting to see how this works on 8x H100; I would greatly appreciate it if anyone could help test it!

@YouJiacheng (Contributor) commented Dec 22, 2024

I'm confused: after the logical AND with the causal mask, these two masks should be the same...
but it's possible that it's faster to create window_bm this way.

@YouJiacheng (Contributor)
I saw no improvement over this record: https://x.com/YouJiacheng/status/1868938024731787640

@LinglongQian (Author)
Thanks for the quick check; that's quite interesting! The two variants only differ before the AND with the causal mask: the bidirectional window constraint makes the standalone window_bm more sparse.

@LinglongQian (Author)
Hi @YouJiacheng

After checking, a possible explanation lies in the NVIDIA Ampere (A100) architecture's support for Structured Sparsity, which targets efficient memory access, significant acceleration, and easy recovery of accuracy. This may allow the A100 to benefit even when the sparse optimisation path is not explicitly triggered, in particular because it handles partial sparsity well: the general hardware architecture can still exploit sparsity to reduce computational load and improve memory efficiency.

While the Hopper (H100) architecture inherits Structured Sparsity support, it places greater emphasis on the newly introduced Transformer Engine, which is optimised for floating-point operations and excels at dense Transformer workloads. This shift in focus means that when sparsity does not strictly follow the 2:4 structured format, the H100 may treat the computation as dense, diminishing the potential benefits of sparsity and overshadowing any marginal gains in cases where the sparse optimisation path is not fully utilised.

With PyTorch already supporting semi-structured (2:4) sparsity to accelerate neural network training, it might be interesting to explore this direction further to improve training efficiency.
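As a rough sketch of that direction (illustrative, not part of this PR): PyTorch ships a prototype torch.sparse.to_sparse_semi_structured API for 2:4 sparse weights. The shapes, the pruning heuristic, and the dtype/device requirements below are assumptions; they depend on the PyTorch version and on the GPU's sparse tensor core support.

```python
import torch
from torch.sparse import to_sparse_semi_structured

# Illustrative only: prune a linear layer's weight to a 2:4 pattern (two
# non-zeros in every contiguous group of four along the input dimension)
# and convert it to the prototype semi-structured sparse format.
# Assumes a CUDA GPU with sparse tensor cores (Ampere or later) and fp16.
torch.manual_seed(0)
linear = torch.nn.Linear(128, 128, bias=False).half().cuda().eval()

w = linear.weight.detach()
groups = w.reshape(w.shape[0], -1, 4)                          # (out, in/4, 4)
keep = groups.abs().argsort(dim=-1, descending=True)[..., :2]  # keep 2 largest per group
mask = torch.zeros_like(groups, dtype=torch.bool).scatter_(-1, keep, True)
pruned = (groups * mask).reshape_as(w)

linear.weight = torch.nn.Parameter(to_sparse_semi_structured(pruned))

x = torch.randn(64, 128, dtype=torch.float16, device="cuda")
y = linear(x)  # the matmul can now dispatch to 2:4 sparse kernels
```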
