[BUG] The NCCL timed out while using the zero3 model. How can I solve this problem? #5066
@tohtana can you help me???
same here
Found the potential cause: during training, some experts don't see any tokens and therefore produce no gradients, so all other processes get stuck. After feeding a fake gradient to the experts that don't see any tokens, training goes smoothly.
Could you please provide an example of how to feed a fake gradient to the experts? Much appreciated! @hanxiaotian
Something like the modification below in the HF Mixtral implementation:

    for expert_idx in range(self.num_experts):
        expert_layer = self.experts[expert_idx]
        idx, top_x = torch.where(expert_mask[expert_idx])

        if top_x.shape[0] == 0:
            if self.training:
                # This expert received no tokens: push a single zeroed token
                # through it so its parameters still take part in backward
                # (with zero-valued gradients) and every rank issues the same
                # gradient collectives.
                top_x_ = torch.zeros(1, dtype=torch.int32, device=hidden_states.device)
                top_x_list = top_x_.tolist()
                current_state = hidden_states[None, top_x_list].reshape(-1, hidden_dim)
                fake_state = expert_layer(current_state * 0)
                final_hidden_states.index_add_(
                    0, top_x_, fake_state.to(hidden_states.dtype)
                )
            continue

        # in torch it is faster to index using lists than torch tensors
        top_x_list = top_x.tolist()
        idx_list = idx.tolist()

        # Index the correct hidden states and compute the expert hidden state for
        # the current expert. We need to make sure to multiply the output hidden
        # states by `routing_weights` on the corresponding tokens (top-1 and top-2).
        current_state = hidden_states[None, top_x_list].reshape(-1, hidden_dim)
        current_hidden_states = (
            expert_layer(current_state)
            * routing_weights[top_x_list, idx_list, None]
        )

        # However `index_add_` only supports torch tensors for indexing, so we
        # use the `top_x` tensor here.
        final_hidden_states.index_add_(
            0, top_x, current_hidden_states.to(hidden_states.dtype)
        )

Hope this can help.
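(Editor's note, not from this thread.) A related trick sometimes used for unrouted experts is to add a zero-weighted sum of the expert's parameters to the output instead of running a fake forward pass. A minimal sketch, assuming `expert_layer` is an ordinary `nn.Module` with at least one parameter; the helper name `touch_unused_expert` is made up for illustration:

    import torch
    import torch.nn as nn

    def touch_unused_expert(output: torch.Tensor, expert_layer: nn.Module) -> torch.Tensor:
        # Add a zero-valued contribution from every parameter of the expert so
        # the expert still appears in the autograd graph (and therefore in the
        # ZeRO-3 gradient collectives) even when it received no tokens this step.
        dummy = sum(p.sum() for p in expert_layer.parameters()) * 0.0
        return output + dummy.to(output.dtype)

Either variant leaves the forward result numerically unchanged (the Mixtral expert MLP has no biases, so a zero input yields a zero output) while ensuring every rank runs backward through every expert, which is what prevents the hang.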
This comment is very very gorgeous! God bless you! |
The NCCL timed out while using the zero3 model. How can I solve this problem?
I built on the large Mixtral 8x7B model using the Llama architecture, augmenting it with multi-modal capabilities for video and audio.
The architecture of my model is as follows:
After initializing the model, I have already called:

    deepspeed.utils.set_z3_leaf_modules(model, [MixtralSparseMoeBlock])
    print('model z3_leaf_model is ', deepspeed.utils.get_z3_leaf_modules(model))

The printed result is as follows:
The training process is as follows:
Scenario 1: When I train with DeepSpeed ZeRO-3 and the training data contains only images, there are no issues and training proceeds normally.
Scenario 2: When I train with DeepSpeed ZeRO-3 and the training data contains both images and videos, training gets stuck after 270 steps and eventually hits an NCCL timeout.
The error message is as follows:
While NCCL was stuck, I captured the point at which the Python process was hanging:
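(Editor's note.) For readers reproducing the leaf-module setup mentioned above, here is a minimal self-contained sketch; the checkpoint name is a placeholder and the import path assumes the Hugging Face transformers Mixtral implementation:

    import deepspeed
    from transformers import AutoModelForCausalLM
    from transformers.models.mixtral.modeling_mixtral import MixtralSparseMoeBlock

    # Placeholder checkpoint; any Mixtral-style MoE model applies.
    model = AutoModelForCausalLM.from_pretrained("mistralai/Mixtral-8x7B-v0.1")

    # Mark the sparse-MoE block as a ZeRO-3 leaf module so DeepSpeed gathers its
    # parameters as one unit instead of hooking each expert submodule separately.
    deepspeed.utils.set_z3_leaf_modules(model, [MixtralSparseMoeBlock])
    print("z3 leaf modules:", deepspeed.utils.get_z3_leaf_modules(model))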