Taken <Pad> as a regular token could make model only learn the <Pad> information? #50

Open
ylgao1 opened this issue Jun 5, 2023 · 12 comments


@ylgao1

ylgao1 commented Jun 5, 2023

Hi
In my project, I found that when <Pad> is treated as a regular token, the diffusion model usually learns only the <Pad> information. In other words, the model tends to predict the <Pad> token instead of other words during generation.
How can I avoid this issue?

@ylgao1 ylgao1 changed the title Taken <Pad> as a regular token could letting model learn the <Pad> information Taken <Pad> as a regular token could make model only learn the <Pad> information? Jun 5, 2023
@summmeer
Collaborator

summmeer commented Jun 7, 2023

Hi,
In our experience, sufficient training avoids this situation. Another option is to omit the <pad> token's loss from the computation in the training code. Either approach should work.
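
The second option could look roughly like the sketch below. It is not the repo's training code: `pad_masked_mean` and `PAD_ID = 0` are hypothetical names and values chosen for illustration, and the per-token losses are plain floats rather than tensors.

```python
from typing import Sequence

PAD_ID = 0  # assumption: the id of the <pad> token in the vocabulary


def pad_masked_mean(token_losses: Sequence[float],
                    input_ids: Sequence[int],
                    pad_id: int = PAD_ID) -> float:
    """Average per-token losses over non-pad positions only."""
    kept = [loss for loss, tok in zip(token_losses, input_ids) if tok != pad_id]
    # Guard against an all-pad sequence so we never divide by zero.
    return sum(kept) / len(kept) if kept else 0.0


ids = [101, 7, 8, 102, 0, 0]             # last two positions are padding
losses = [0.5, 1.0, 1.5, 1.0, 9.0, 9.0]  # large losses on the pad positions
print(pad_masked_mean(losses, ids))      # 1.0, the pad losses are ignored
```

With this kind of reduction the pad positions contribute no gradient, so predicting <pad> everywhere no longer lowers the loss.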

@Zoigin

Zoigin commented Aug 2, 2023

Hi,
Did you also find that the loss did not decrease and that the decoded output is all 'PAD', so an empty string is generated? Also, have you resolved this issue, and if so, how?

@golankai

I have the same issue. I modified the code to run with PyTorch Lightning, but for me too the model learned only pads.
I am running the QQP experiments.

@xiaotingxuan

xiaotingxuan commented Sep 1, 2023

I am running the QQP experiments and have changed the loss computation in the training code.
When I create the dataset, I add a 'loss_mask':

loss_mask = ([0]*(len(src)+1) + [1]*len(trg) + [0] * pad_length)
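
For readers who want to try this, here is a minimal sketch of how such a mask could be built for one example. `build_loss_mask` and `seq_len` are hypothetical names, and the layout (source tokens, one separator slot, target tokens, then padding) is an assumption read off the `+1` in the line above, not the actual dataset code.

```python
from typing import List


def build_loss_mask(src: List[int], trg: List[int], seq_len: int) -> List[int]:
    """Return 1 for target positions and 0 for source, separator, and padding."""
    used = len(src) + 1 + len(trg)  # +1 for the separator slot after src
    if used > seq_len:
        raise ValueError("src + separator + trg does not fit in seq_len")
    pad_length = seq_len - used
    return [0] * (len(src) + 1) + [1] * len(trg) + [0] * pad_length


mask = build_loss_mask(src=[101, 5, 6, 102], trg=[101, 7, 102], seq_len=12)
print(mask)  # [0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0]
```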

Here are my results. The suffix "with_loss_mask" means the loss is computed only over tokens in the target sentence.

terms["loss"] = terms["mse_with_loss_mask"] + terms["decoder_nll_with_loss_mask"] + tT_loss_with_loss_mask
        
--------------------------------------------
| decoder_nll                   | 7.04e-09 |
| decoder_nll_q0                | 1.75e-08 |
| decoder_nll_q1                | 1.36e-08 |
| decoder_nll_q2                | 1.16e-08 |
| decoder_nll_q3                | 2.7e-09  |
| decoder_nll_with_loss_mask    | 2.56e-08 |
| decoder_nll_with_loss_mask_q0 | 5.69e-08 |
| decoder_nll_with_loss_mask_q1 | 6.13e-08 |
| decoder_nll_with_loss_mask_q2 | 3.49e-08 |
| decoder_nll_with_loss_mask_q3 | 9.4e-09  |
| grad_norm                     | 0.0356   |
| loss                          | 0.00671  |
| loss_q0                       | 0.00704  |
| loss_q1                       | 0.00685  |
| loss_q2                       | 0.00674  |
| loss_q3                       | 0.00663  |
| mse                           | 1.5      |
| mse_q0                        | 3.58     |
| mse_q1                        | 2.92     |
| mse_q2                        | 2.24     |
| mse_q3                        | 0.699    |
| mse_with_loss_mask            | 0.00671  |
| mse_with_loss_mask_q0         | 0.00704  |
| mse_with_loss_mask_q1         | 0.00685  |
| mse_with_loss_mask_q2         | 0.00674  |
| mse_with_loss_mask_q3         | 0.00663  |
| nll                           | 51.2     |
| nll_q0                        | 115      |
| nll_q1                        | 95.9     |
| nll_q2                        | 77.8     |
| nll_q3                        | 25.1     |
| nll_with_loss_mask            | 1.11     |
| nll_with_loss_mask_q0         | 0.0114   |
| nll_with_loss_mask_q1         | 0.14     |
| nll_with_loss_mask_q2         | 0.608    |
| nll_with_loss_mask_q3         | 1.62     |
| samples                       | 9.8e+08  |
--------------------------------------------

Here is an example of the generated text. The model doesn't generate PAD, but it still can't generate the expected text.
It seems really hard for me to train the diffusion model sufficiently.

{"recover": "[CLS] \u201d \u201d cap cap rather a safely \u201d / and you \u201d \u201d safely projections rather cap legitimate \u201d. \u201d \u201d projections \u201d up \u201d, cap i rather the time rather cap bother legitimate i \u201d rather i projections legitimate for legitimate investing safely safely face invalid rather legitimate legitimate legitimate a innovative safely cap 88 88 such bother projections through present working \u201d ended starting 5ven why the welcomed daily on \u201d un husky [ various bother welcomed projections scrap quo it legitimate besides \u201d requires boost legitimate legitimate alwayss legitimate legitimate'recommended", "reference": "[CLS] i'm a triple capricorn ( sun, moon and ascendant in capricorn ) what does this say about me? [SEP]", "source": "[CLS] astrology : i am a capricorn sun cap moon and cap rising... what does that say about me? [SEP] [SEP]"}

@xiaotingxuan

When I use the original loss (without the loss mask), I get the following result:

-----------------------------
| decoder_nll    | 1.27e-05 |
| decoder_nll_q0 | 1.68e-05 |
| decoder_nll_q1 | 1.55e-05 |
| decoder_nll_q2 | 1.36e-05 |
| decoder_nll_q3 | 8.48e-06 |
| grad_norm      | 0.0651   |
| loss           | 0.0185   |
| loss_q0        | 0.0189   |
| loss_q1        | 0.0189   |
| loss_q2        | 0.0187   |
| loss_q3        | 0.0181   |
| mse            | 0.0185   |
| mse_q0         | 0.0189   |
| mse_q1         | 0.0188   |
| mse_q2         | 0.0187   |
| mse_q3         | 0.0181   |
| nll            | 0.206    |
| nll_q0         | 0.0191   |
| nll_q1         | 0.0648   |
| nll_q2         | 0.147    |
| nll_q3         | 0.412    |
| samples        | 8.18e+08 |

The loss doesn't become very small but the generated texts become much better

{"recover": "[CLS] what was your first sexual experience sexual like? [SEP]", "reference": "[CLS] what was your first sexual experience? [SEP]", "source": "[CLS] what was your first sexual experience like? [SEP] [SEP]"}
{"recover": "[CLS] what would trump win for presidency current s international with students an master or an on master f1 visa? [SEP]", "reference": "[CLS] how will a trump presidency affect the students presently in us or planning to study in us? [SEP]", "source": "[CLS] what would a trump presidency mean for current international master \u2019 s students on an f1 visa? [SEP] [SEP]"}
{"recover": "[CLS] what is manipulation manipulation on aren mean of look? [SEP]", "reference": "[CLS] what does manipulation means? [SEP]", "source": "[CLS] what does manipulation mean? [SEP] [SEP]"}
{"recover": "[CLS] why did so many questions on quora that be just can a answered on google google? [SEP]", "reference": "[CLS] why do people ask quora questions which can be answered easily by google? [SEP]", "source": "[CLS] why are so many quora users posting questions that are readily answered on google? [SEP] [SEP]"}

The only other difference between the two experiments above (with and without the loss mask) is the number of training steps:
with the loss mask, I trained for 15,000 steps;
without the loss mask, I trained for 25,000 steps.

Maybe we just need to train for more steps and set a proper learning rate.

@bansky-cl

@xiaotingxuan Did you modify only the trg loss computed during training in gaussian_diffusion.py? Have you also modified p_sample(), which also needs to use the mask during inference?

@xiaotingxuan

xiaotingxuan commented Sep 18, 2023

When I use the original loss (without the loss mask), I do not modify any code.
When I modify the trg loss (with the loss mask), I add a "loss_mask" to the dataset, so the new dataset has three elements: {input_ids, mask, loss_mask}. The p_sample() function uses 'mask' (I did not modify this function); 'loss_mask' is used only for calculating the loss.

The model trained with the loss mask did not perform well. Maybe I need to train for more steps? I hope someone can give me some advice.

@zkzhou126

Did you end up modifying p_sample()? I find that if we change seq_len, too many pads can seriously hurt the results.
I don't know whether modifying only the trg loss during training in gaussian_diffusion.py is enough.
If you solve this problem, please let me know. Thanks!

@zkzhou126

@xiaotingxuan Hello! Could you please show me your modified 'training_losses_seq2seq'?

@swave-demo

It seems that DiffuSeq calculates its loss over both the x and y parts: #25 (comment). This contradicts the paper, but after training, meaningful text is generated.
Maybe training with or without the mask is not so important to DiffuSeq's performance?

@summmeer
Collaborator

summmeer commented Jun 3, 2024

@swave-demo Hi, this is a good point. Let me explain. The input mask plays two roles: (a) it keeps the x (input) part un-noised; (b) it masks out the MSE loss of the x part. In this repo, we implement (a) but not (b). In our follow-up work DoT, which fine-tunes existing diffusion LMs in the DiffuSeq style, we implement both (a) and (b). So we suggest masking out the MSE of the x part. You can also try this in DiffuSeq (training on seq2seq data from scratch).

Then why does the current version of DiffuSeq still work? Because we still mask the input of x and keep it un-noised, so for the x part the model only learns to repeat the input text. That is easy to learn and quite different from the y part, where the model needs to recover the clean text from the noised text. In the end, the MSE loss of the x part does not contribute much to training the denoiser. If it contributes anything, I believe it affects the word embedding updates, at least when the model is trained from scratch rather than fine-tuned from an existing LM.
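
As a rough illustration of roles (a) and (b), and not the repo's or DoT's actual code, the toy sketch below uses one scalar per position instead of an embedding vector and plain additive Gaussian noise instead of the real diffusion schedule. The function names and the convention that 0 marks the x part and 1 marks the y part are assumptions made for this example.

```python
import random
from typing import List


def noise_target_only(x_start: List[float], input_mask: List[int],
                      noise_scale: float, rng: random.Random) -> List[float]:
    """Role (a): add noise only where input_mask == 1; x positions stay clean."""
    return [
        v + noise_scale * rng.gauss(0.0, 1.0) if m == 1 else v
        for v, m in zip(x_start, input_mask)
    ]


def masked_mse(pred: List[float], target: List[float],
               input_mask: List[int]) -> float:
    """Role (b): average squared error over positions with input_mask == 1."""
    errs = [(p - t) ** 2 for p, t, m in zip(pred, target, input_mask) if m == 1]
    return sum(errs) / len(errs) if errs else 0.0


rng = random.Random(0)
x_start = [0.1, 0.2, 0.3, 0.4, 0.5]  # toy scalar "embeddings", one per position
input_mask = [0, 0, 1, 1, 1]         # assumption: 0 = x part, 1 = y part

x_t = noise_target_only(x_start, input_mask, noise_scale=1.0, rng=rng)
print(x_t[:2] == x_start[:2])        # True, the x part is left un-noised

pred = [9.0, 9.0, 0.3, 0.4, 0.7]     # deliberately wrong on the x part
print(masked_mse(pred, x_start, input_mask))  # errors on the x part are ignored
```

Dropping the `if m == 1` filter in `masked_mse` gives the behaviour described for this repo, where the x part still enters the MSE.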

@swave-demo

Thanks, your explanation really helps me understand.


8 participants