
Error in load_importance_loss #167

Open
Luodian opened this issue Jul 6, 2022 · 7 comments
Labels
enhancement New feature or request

Comments


Luodian commented Jul 6, 2022

Hi, I get the following error when using load_importance_loss (the code works fine when using gshard_loss). Does anyone have an idea about it?

The error log (from one rank/node) is below:

[4]:
  time      : 2022-07-06_11:47:24
  host      : SG-IDC1-10-51-2-36
  rank      : 4 (local_rank: 4)
  exitcode  : 1 (pid: 55010)
  error_file: /tmp/torchelastic_kuhg0qco/none_62gucqgc/attempt_0/4/error.json
  traceback : Traceback (most recent call last):
    File "/mnt/lustre/bli/anaconda3/envs/scale/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
      return forward_call(*input, **kwargs)
    File "/mnt/lustre/bli/projects/Pretraining-DG/mae/models_moe_mae.py", line 75, in forward
      x_temp = self.mlp(self.norm2(x))
    File "/mnt/lustre/bli/anaconda3/envs/scale/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
      return forward_call(*input, **kwargs)
    File "/mnt/lustre/bli/.local/lib/python3.9/site-packages/tutel/impls/moe_layer.py", line 231, in forward
      logits_dtype, (crit, l_aux) = routing()
    File "/mnt/lustre/bli/.local/lib/python3.9/site-packages/tutel/impls/moe_layer.py", line 218, in routing
      return logits.dtype, extract_critical(scores,
    File "/mnt/lustre/bli/.local/lib/python3.9/site-packages/tutel/impls/fast_dispatch.py", line 150, in extract_critical
      l_loss = loss_fn(scores, topk_indices) if loss_fn is not None else None
    File "/mnt/lustre/bli/.local/lib/python3.9/site-packages/tutel/impls/moe_layer.py", line 215, in <lambda>
      _loss_fn = lambda gates, topk_ids: losses.load_importance_loss(
    File "/mnt/lustre/bli/.local/lib/python3.9/site-packages/tutel/impls/losses.py", line 41, in load_importance_loss
      l_load = load_loss(scores_wo_noise, topk_logits, num_global_experts, gate_noise)
    File "/mnt/lustre/bli/.local/lib/python3.9/site-packages/tutel/impls/losses.py", line 23, in load_loss
      normal = Normal(
    File "/mnt/lustre/bli/anaconda3/envs/scale/lib/python3.9/site-packages/torch/distributions/normal.py", line 54, in __init__
      super(Normal, self).__init__(batch_shape, validate_args=validate_args)
    File "/mnt/lustre/bli/anaconda3/envs/scale/lib/python3.9/site-packages/torch/distributions/distribution.py", line 55, in __init__
      raise ValueError(
  ValueError: Expected parameter scale (Tensor of shape (1,)) of distribution Normal(loc: tensor([0.], device='cuda:4'), scale: tensor([0.], device='cuda:4')) to satisfy the constraint GreaterThan(lower_bound=0.0), but found invalid values:
  tensor([0.], device='cuda:4')
ghostplant (Contributor) commented

@zeliu98

Luodian (Author) commented Jul 21, 2022

Hi, I think this happens because the default gate_noise value used by load_importance_loss is 0.0.

And if we do

normal = Normal(0, 0.0)

which is strange: why would we have a normal distribution with zero variance? And it returns

*** ValueError: Expected parameter scale (Tensor of shape ()) of distribution Normal(loc: 0.0, scale: 0.0) to satisfy the constraint GreaterThan(lower_bound=0.0), but found invalid values:
0.0

Luodian (Author) commented Jul 21, 2022

If I preset gate_noise to 1.0, the code runs without problems, but I am not sure whether it is numerically correct:

gate_type={'type': 'top', 'k': 2, 'fp32_gate': False, 'gate_noise': 1.0, },
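
For context, this is roughly where that gate_type dict goes when constructing the layer. A minimal sketch in the style of the tutel README; model_dim, the expert count, and the hidden size below are placeholder values, not anything from this run:

# Sketch of a tutel MoE layer with gate_noise set explicitly (placeholder sizes).
import torch
import torch.nn.functional as F
from tutel import moe as tutel_moe

moe = tutel_moe.moe_layer(
    gate_type={'type': 'top', 'k': 2, 'fp32_gate': False, 'gate_noise': 1.0},
    model_dim=768,                       # placeholder embedding size
    experts={
        'type': 'ffn',
        'count_per_node': 2,             # placeholder number of local experts
        'hidden_size_per_expert': 3072,  # placeholder FFN hidden size
        'activation_fn': lambda x: F.gelu(x),
    },
)
x = torch.randn(4, 196, 768)             # (batch, tokens, model_dim)
y = moe(x)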

zeliu98 (Contributor) commented Jul 21, 2022

Hi @Luodian, yes, you need to set gate_noise > 0 for load_importance_loss. You can find the reasoning in Appendix A: Load-Balancing Loss of the original paper (https://arxiv.org/pdf/1701.06538.pdf).
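
For reference, a rough sketch of what the load loss computes and why the noise scale matters. This is a simplified reconstruction from the traceback above, not the exact tutel implementation; the scale expression, threshold choice, and epsilon are assumptions:

import torch
from torch.distributions.normal import Normal

def load_loss_sketch(scores_wo_noise, topk_logits, num_global_experts, gate_noise):
    # The loss asks: how likely is each expert to still be selected if the
    # gating noise were re-sampled? That probability is a Normal CDF whose
    # scale comes from gate_noise, so gate_noise == 0 gives Normal(0, 0)
    # and triggers the GreaterThan(lower_bound=0.0) ValueError above.
    normal = Normal(
        torch.tensor([0.0], device=scores_wo_noise.device),
        torch.tensor([gate_noise / num_global_experts], device=scores_wo_noise.device),  # assumed scaling
    )
    threshold = topk_logits.min(dim=-1, keepdim=True).values   # smallest selected noisy logit per token
    prob_selected = normal.cdf(scores_wo_noise - threshold)    # P(expert stays in the top-k)
    load = prob_selected.sum(dim=0)                            # expected load per expert
    # Squared coefficient of variation, as in Shazeer et al. (2017), Appendix A.
    return load.var() / (load.mean() ** 2 + 1e-10)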

ghostplant (Contributor) commented Aug 1, 2022

@zeliu98 We need to add an assertion with a clear message to avoid opaque errors like this.
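
For illustration, an assertion along these lines would surface the requirement immediately (a sketch of the kind of check being proposed, not the actual commit):

# Hypothetical early check in load_importance_loss: fail with an actionable
# message instead of the opaque Normal(scale=0) ValueError.
assert gate_noise > 0, (
    "load_importance_loss requires gate_noise > 0, e.g. "
    "gate_type={'type': 'top', 'k': 2, 'gate_noise': 1.0}"
)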

And thanks for the information, @Luodian!

ghostplant added the enhancement (New feature or request) label on Aug 1, 2022
Luodian (Author) commented Aug 1, 2022

Yep, and I also found an issue when using the cosine projector.

It seems that in cosine_top.py line 31 there should be a .cuda() or .to(device) call to make sure the tensors are on the same device:

logit_scale = torch.clamp(self.temperature, max=torch.log(torch.tensor(1. / 0.01)).cuda()).exp()
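
For illustration, a device-agnostic variant of that line (a sketch, assuming self.temperature is a parameter living on the module's device):

# Build the clamp bound on the same device as the temperature parameter
# instead of hard-coding .cuda(), so CPU and multi-GPU runs both work.
max_scale = torch.log(torch.tensor(1. / 0.01, device=self.temperature.device))
logit_scale = torch.clamp(self.temperature, max=max_scale).exp()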

ghostplant added a commit that referenced this issue Aug 1, 2022
ghostplant (Contributor) commented

We have added a gate_noise assertion and device cast in the latest commit. Thanks for pointing out this bug.

ghostplant added a commit that referenced this issue Jan 11, 2025