Fix potential random layout inconsistency issues in sparse attention modules #534
Conversation
There are two changes made to the SparseSelfAttention module in this PR (@arashashari): …point; 2) Add a broadcast of the layout at the beginning to ensure that different processes have a consistent layout during distributed training.
@@ -22,7 +23,8 @@ def __init__(
     # SparsityConfig parameters needs to be set accordingly
     sparsity_config=SparsityConfig(num_heads=4),
     key_padding_mask_mode='add',
-    attn_mask_mode='mul'):
+    attn_mask_mode='mul',
+    max_seq_length=2048):
Could you please add a docstring for the new parameter as well?
Sure, just added.
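For context, the requested documentation sits alongside the existing constructor arguments. A minimal sketch of what such an entry could look like follows; the wording and the argument descriptions here are illustrative, not the exact text added in the PR:

```python
import torch.nn as nn


class SparseSelfAttention(nn.Module):
    """Implements an efficient sparse self-attention layer.

    Arguments (excerpt, illustrative wording):
        sparsity_config: required: a SparsityConfig object describing the
            block-sparse attention pattern.
        key_padding_mask_mode: optional: 'add' or 'mul'; how the key padding
            mask is applied to the attention scores.
        attn_mask_mode: optional: 'add' or 'mul'; how the attention mask is
            applied to the attention scores.
        max_seq_length: optional: an integer giving the maximum sequence
            length the pre-built master sparsity layout must cover;
            defaults to 2048.
    """
```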
def get_layout(self, L):
    # if layout is never synchronized across GPUs, broadcast the layout from global rank 0
    if self._need_layout_synchronization and dist.is_initialized():
        dist.broadcast(self.master_layout, src=0)
This might break with model parallelism (e.g., megatron-style or pipeline parallelism). However, it might be tricky to get the correct process group and rank inside the op since we can't easily communicate with the deepspeed engine to get this info here. /cc @ShadenSmith, @samyam
That's a good point @jeffra. I think we want to only broadcast along the data parallel group, similar to our weight initialization? But getting the group is tricky, as you pointed out. We could add a data_parallel_group=None parameter to the constructor and, if present, broadcast along that torch.distributed group? It'll be up to the modeling side of things to ensure that the data parallel group is created/provided. Alternatively, I think we'd need a reference to the training engine.
Yep, that makes sense. The training engine hasn't been created yet at this point, so that's a bit tricky. However, for now let's just have the data_parallel_group passed into the constructor and use it if it's not None in this broadcast. That at least keeps the option open.
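As a concrete sketch of this direction, assuming a data_parallel_group keyword is threaded into the module; the class name and attribute handling below are stand-ins for illustration, not DeepSpeed's actual implementation:

```python
import torch
import torch.distributed as dist


class LayoutHolder:
    """Stand-in for SparseSelfAttention's layout bookkeeping."""

    def __init__(self, master_layout: torch.Tensor, data_parallel_group=None):
        self.master_layout = master_layout              # block-sparsity layout tensor
        self.data_parallel_group = data_parallel_group  # optional torch.distributed group
        self._need_layout_synchronization = True

    def get_layout(self, L):
        # Broadcast the layout once so every replica builds identical sparse kernels.
        if self._need_layout_synchronization and dist.is_initialized():
            if self.data_parallel_group is not None:
                # Restrict the broadcast to the caller-provided data-parallel group.
                # Note: src is still a *global* rank; choosing the right one is the
                # broadcast_src_rank question discussed below.
                dist.broadcast(self.master_layout, src=0, group=self.data_parallel_group)
            else:
                # Fall back to the default (world) group, as in the current PR.
                dist.broadcast(self.master_layout, src=0)
            self._need_layout_synchronization = False
        # ... slice the first L blocks of the master layout as usual ...
        return self.master_layout
```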
If we have multiple data_parallel_groups (i.e., in a model-parallel scenario), does that mean we would also need to pass in the source rank to broadcast from within that process group? Do you think we would also need an optional broadcast_src_rank argument in the constructor?
Yep, we could also add the broadcast_src_rank parameter. This just means the caller has to do this translation instead of us, which sounds fine.
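To illustrate that caller-side translation, here is a sketch using a hypothetical Megatron-style rank layout; the group construction is an assumption about the model-parallel setup, not something DeepSpeed provides here, and it assumes torch.distributed has already been initialized:

```python
import torch.distributed as dist

# Hypothetical setup: consecutive blocks of `mp_size` ranks form one
# model-parallel group; ranks sharing a position within their block form
# one data-parallel group.
mp_size = 2
world_size = dist.get_world_size()
rank = dist.get_rank()

data_parallel_group = None
broadcast_src_rank = 0
for mp_rank in range(mp_size):
    ranks = list(range(mp_rank, world_size, mp_size))
    group = dist.new_group(ranks)   # every process must create every group
    if rank in ranks:
        data_parallel_group = group
        # The first member of this data-parallel group, expressed as a global
        # rank, is what would be passed as broadcast_src_rank.
        broadcast_src_rank = ranks[0]

# Hypothetical constructor call once both arguments exist:
# attn = SparseSelfAttention(sparsity_config=SparsityConfig(num_heads=4),
#                            data_parallel_group=data_parallel_group,
#                            broadcast_src_rank=broadcast_src_rank)
```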