Bidirectional Window Tweaks #60

Open
LinglongQian wants to merge 1 commit into master
Conversation

@LinglongQian commented Dec 22, 2024

Memory access patterns become increasingly critical on larger-scale devices. This PR explores optimising the attention window mask implementation by changing the unidirectional condition q_idx - kv_idx < sliding_window_num_blocks to the bidirectional condition abs(q_idx - kv_idx) < sliding_window_num_blocks. This modification produces more regular memory access patterns that better utilise modern GPU architectures, improving training speed without compromising model performance.

The bidirectional window mask creates a band-diagonal pattern instead of a triangular pattern, resulting in (see the sketch after this list):

  1. More regular memory access patterns
  2. Better GPU memory bandwidth utilization
  3. More efficient parallel computation

(figure: window mask comparison)
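Below is a minimal sketch (not the repo's actual FlexAttention block-mask builder) contrasting the two window masks with hypothetical block counts; it materialises them densely and also checks that the two variants coincide once AND-ed with the causal mask, which is the point raised in the discussion below.

```python
import torch

# Hypothetical sizes for illustration only; the actual run uses the repo's
# own block counts and its sliding_window_num_blocks schedule.
num_blocks = 16
sliding_window_num_blocks = 4

q_idx = torch.arange(num_blocks)[:, None]   # query block index (column vector)
kv_idx = torch.arange(num_blocks)[None, :]  # key/value block index (row vector)

causal_bm = q_idx >= kv_idx                                        # lower-triangular
window_bm_uni = (q_idx - kv_idx) < sliding_window_num_blocks       # original: unidirectional
window_bm_bi = (q_idx - kv_idx).abs() < sliding_window_num_blocks  # this PR: band-diagonal

# After AND-ing with the causal mask, both variants admit exactly the same
# (q_idx, kv_idx) pairs; only the standalone window mask differs in sparsity.
assert torch.equal(causal_bm & window_bm_uni, causal_bm & window_bm_bi)
```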

Since I have no H100 available, I tested on A100 GPUs in three configurations (1x, 2x, and 8x), with 15 runs per setting (bidirectional pattern vs. the 121024_MFUTweaks baseline from @YouJiacheng), and saw consistent improvements (all records submitted):

8x A100 Configuration (Most Significant Improvement)

  • Validation Loss: 3.2787 (bidirectional) vs 3.2788 (baseline)
  • Training Time: 12.2910 min vs 12.3380 min
  • Step Average: 501.67 ms vs 503.59 ms
  • Variance remains stable across all metrics

(figure: Train Time Comparison, 8x A100)

2x A100 Configuration

  • Validation Loss: 3.2787 vs 3.2785
  • Training Time: 39.5508 min vs 39.5650 min
  • Step Average: 1614.31 ms vs 1614.89 ms

(figure: Train Time Comparison, 2x A100)

1x A100 Configuration

  • Validation Loss: 3.2785 vs 3.2790
  • Training Time: 75.1865 min vs 75.4922 min
  • Step Average: 3068.83 ms vs 3081.31 ms

(figure: Train Time Comparison, 1x A100)

It would be interesting to see how this works on 8x H100; I would greatly appreciate it if anyone could help test it!

@YouJiacheng (Contributor) commented Dec 22, 2024

I'm confused: after the logical AND with the causal mask, these two masks should be the same...
but it's possible that it's faster to create window_bm this way.

@YouJiacheng (Contributor)
I saw no improvement over this record: https://x.com/YouJiacheng/status/1868938024731787640

@LinglongQian (Author)
Thanks for the quick check; that's quite interesting! The two variants only differ before the AND with the causal mask: the bidirectional window constraint makes the standalone window_bm more sparse.

@LinglongQian (Author)
Hi @YouJiacheng

After checking, a possible explanation lies in the NVIDIA Ampere (A100) architecture's support for Structured Sparsity, which targets efficient memory access, significant acceleration, and easy recovery of accuracy. This may allow the A100 to benefit even when the sparse optimisation path is not explicitly triggered, in particular because it handles partial sparsity well: the general hardware architecture can still exploit sparsity to reduce computational load and improve memory efficiency.

While the Hopper (H100) architecture inherits Structured Sparsity support, it places greater emphasis on the newly introduced Transformer Engine, which is optimised for floating-point operations and excels at dense Transformer workloads. This shift in focus means that when sparsity does not strictly follow the 2:4 structured format, the H100 may treat the computation as dense, diminishing the potential benefits of sparsity and overshadowing any marginal gains in cases where the sparse optimisation path is not fully utilised.

With PyTorch already supporting semi-structured (2:4) sparsity to accelerate neural network training, it might be interesting to explore this direction further to improve training efficiency.
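As a rough sketch of that direction (illustrative, not part of this PR): PyTorch ships a prototype torch.sparse.to_sparse_semi_structured API for 2:4 sparse weights. The shapes, the pruning heuristic, and the dtype/device requirements below are assumptions; they depend on the PyTorch version and on the GPU's sparse tensor core support.

```python
import torch
from torch.sparse import to_sparse_semi_structured

# Illustrative only: prune a linear layer's weight to a 2:4 pattern (two
# non-zeros in every contiguous group of four along the input dimension)
# and convert it to the prototype semi-structured sparse format.
# Assumes a CUDA GPU with sparse tensor cores (Ampere or later) and fp16.
torch.manual_seed(0)
linear = torch.nn.Linear(128, 128, bias=False).half().cuda().eval()

w = linear.weight.detach()
groups = w.reshape(w.shape[0], -1, 4)                          # (out, in/4, 4)
keep = groups.abs().argsort(dim=-1, descending=True)[..., :2]  # keep 2 largest per group
mask = torch.zeros_like(groups, dtype=torch.bool).scatter_(-1, keep, True)
pruned = (groups * mask).reshape_as(w)

linear.weight = torch.nn.Parameter(to_sparse_semi_structured(pruned))

x = torch.randn(64, 128, dtype=torch.float16, device="cuda")
y = linear(x)  # the matmul can now dispatch to 2:4 sparse kernels
```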
