Taking <Pad> as a regular token could make the model only learn the <Pad> information? #50
Comments
I have the same issue. I modified the code to run with PyTorch Lightning, but for me as well it learned only pads.
I am running the QQP experiments and changed the loss computation in the training code, adding a mask so that only target-sentence tokens contribute: `loss_mask = [0]*(len(src)+1) + [1]*len(trg) + [0]*pad_length`. The suffix "with_loss_mask" means the loss is computed only over tokens of the target sentence: `terms["loss"] = terms["mse_with_loss_mask"] + terms["decoder_nll_with_loss_mask"] + tT_loss_with_loss_mask`. Here is my result (a sketch of the mask construction follows after the table):
--------------------------------------------
| decoder_nll | 7.04e-09 |
| decoder_nll_q0 | 1.75e-08 |
| decoder_nll_q1 | 1.36e-08 |
| decoder_nll_q2 | 1.16e-08 |
| decoder_nll_q3 | 2.7e-09 |
| decoder_nll_with_loss_mask | 2.56e-08 |
| decoder_nll_with_loss_mask_q0 | 5.69e-08 |
| decoder_nll_with_loss_mask_q1 | 6.13e-08 |
| decoder_nll_with_loss_mask_q2 | 3.49e-08 |
| decoder_nll_with_loss_mask_q3 | 9.4e-09 |
| grad_norm | 0.0356 |
| loss | 0.00671 |
| loss_q0 | 0.00704 |
| loss_q1 | 0.00685 |
| loss_q2 | 0.00674 |
| loss_q3 | 0.00663 |
| mse | 1.5 |
| mse_q0 | 3.58 |
| mse_q1 | 2.92 |
| mse_q2 | 2.24 |
| mse_q3 | 0.699 |
| mse_with_loss_mask | 0.00671 |
| mse_with_loss_mask_q0 | 0.00704 |
| mse_with_loss_mask_q1 | 0.00685 |
| mse_with_loss_mask_q2 | 0.00674 |
| mse_with_loss_mask_q3 | 0.00663 |
| nll | 51.2 |
| nll_q0 | 115 |
| nll_q1 | 95.9 |
| nll_q2 | 77.8 |
| nll_q3 | 25.1 |
| nll_with_loss_mask | 1.11 |
| nll_with_loss_mask_q0 | 0.0114 |
| nll_with_loss_mask_q1 | 0.14 |
| nll_with_loss_mask_q2 | 0.608 |
| nll_with_loss_mask_q3 | 1.62 |
| samples | 9.8e+08 |
--------------------------------------------
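For concreteness, here is a minimal sketch of how such a mask could be built and applied to a per-token loss. The names `src`, `trg`, `pad_length`, and `per_token_loss` are illustrative assumptions, not the repo's actual variables:

```python
import torch

def build_loss_mask(src, trg, pad_length):
    # 0 for the source tokens (+1 for the separator token), 1 for the
    # target tokens, 0 for padding -- mirrors the mask in the comment above.
    mask = [0] * (len(src) + 1) + [1] * len(trg) + [0] * pad_length
    return torch.tensor(mask, dtype=torch.float32)

# Hypothetical use with a per-token loss of shape (seq_len,):
#   masked = (per_token_loss * mask).sum() / mask.sum().clamp(min=1.0)
```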
Here is an example of the generated texts: the model doesn't generate PAD, but it still can't generate the expected text.
When I use the original loss (without the loss mask), the loss doesn't become very small, but the generated texts become much better.
The only other difference between the two experiments (with vs. without the loss mask) is the number of training steps. Maybe we just need to train for more steps and set a proper learning rate.
Did you only modify the trg's loss during training?
When I use the original loss (without the loss mask), I did not modify any code. The model trained with the loss mask did not perform well; maybe I need to train for more steps? I hope someone can give me some advice.
Did you modify p_sample() at the end? I find that if we change the seq_len, too many pads can seriously affect the results.
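One simple post-processing option (not DiffuSeq's p_sample(), just an illustrative helper) is to trim the decoded sequence at the first pad token:

```python
def strip_pads(token_ids, pad_id):
    # Drop everything from the first pad onward. `token_ids` is a plain
    # list of ints; `pad_id` is whatever id the tokenizer assigns to <pad>.
    return token_ids[:token_ids.index(pad_id)] if pad_id in token_ids else token_ids
```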
Hello! Could you please show me your modified `training_losses_seq2seq`?
It seems that DiffuSeq calculates its loss on both the x and y parts: #25 (comment). This contradicts the paper, but after training, meaningful texts are still generated.
@swave-demo Hi, this is a good point. Let me explain. The input mask plays two roles: (a) it keeps the x input part un-noised; (b) it masks out the MSE loss of the x part. In this repo, we implement (a) but not (b). In our follow-up work DoT, which fine-tunes current diffusion LMs in DiffuSeq style, we implement both (a) and (b). So it is suggested to mask out the MSE of the x part; you can also try it in DiffuSeq (training on seq2seq data from scratch), as sketched below.

Then why does the current version of DiffuSeq still work? Because we still mask the input of x and keep it un-noised, so for the x part the model only learns to repeat the input text, which is easy to learn and quite different from the y part's learning, where the model needs to recover the noised text to the clean text. In the end, the MSE loss of the x part does not contribute much to the denoiser's training; if we had to claim a contribution, I believe it affects the word-embedding updates, at least when the model is trained from scratch rather than fine-tuned from existing LMs.
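As a concrete illustration of point (b), here is a minimal sketch of masking the MSE so that only the y part drives the denoiser loss. The names `model_out`, `x_start`, and `y_mask` are assumptions for illustration, not identifiers from the DiffuSeq code:

```python
import torch

def masked_mse(model_out, x_start, y_mask):
    """MSE over the y (target) part only.

    model_out, x_start: (batch, seq_len, dim) tensors.
    y_mask: (batch, seq_len) float tensor, 1.0 on y-part positions and
    0.0 on the x part (and padding). All names are illustrative.
    """
    per_token = ((model_out - x_start) ** 2).mean(dim=-1)   # (batch, seq_len)
    # Average only over y positions; the clamp avoids division by zero
    # for degenerate examples with an empty mask.
    denom = y_mask.sum(dim=-1).clamp(min=1.0)
    return (per_token * y_mask).sum(dim=-1) / denom         # (batch,)
```

With this change, the x part still conditions the model (it remains un-noised in the input), but repeating the source text no longer earns any loss reduction, which matches roles (a) and (b) as described above.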
Thanks, your explanation really helps me understand.
Hi,
In my project, I discovered that when <Pad> is taken as a regular token, the diffusion model usually learns the <Pad> information. In other words, the model tends to predict the <Pad> token instead of other words during generation.
How can I avoid this issue?