cannot reproduce comparable results in the paper #5

Open
felix-yuxiang opened this issue Jul 18, 2023 · 16 comments

Comments

@felix-yuxiang

felix-yuxiang commented Jul 18, 2023

Hi, I ran the code on a single NVIDIA GeForce RTX 3090 GPU with the given config file, as listed in the paper. My reproduced result, shown below, differs significantly from the results in README.md. Could you help me identify what the issue might be? Could you also provide more information on how to train a model that matches the performance of the pre-trained model you provide in /checkpoints? Any help would be appreciated.

image

@Pranav-chib

I also get a similar result when reproducing the training.
Capture

@fangzl123

I think it's due to the hyperparameter settings. The paper says "With a frozen denoising module, we then train the leapfrog initializer for 200 epochs with an initial learning rate of 10^-4, decaying by 0.9 every 32 epochs", but the default led_augment.yml does not use these values.

After setting them according to the paper, I got:
image
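
The schedule quoted from the paper (initial learning rate 10^-4, multiplied by 0.9 every 32 epochs, 200 epochs in total) can be sketched in plain Python. The function name and its defaults are illustrative assumptions and are not taken from the LED repository or from led_augment.yml; the sketch also assumes epochs are counted from 0.

```python
def step_decay_lr(epoch, base_lr=1e-4, gamma=0.9, step_size=32):
    """Learning rate at a 0-indexed epoch under step decay.

    The rate is multiplied by `gamma` once every `step_size` epochs.
    """
    return base_lr * gamma ** (epoch // step_size)


if __name__ == "__main__":
    # Print the rate at a few epochs of a 200-epoch run.
    for epoch in (0, 31, 32, 64, 199):
        print(f"epoch {epoch:3d}: lr = {step_decay_lr(epoch):.3e}")
```

In PyTorch, the same shape of schedule is what `torch.optim.lr_scheduler.StepLR(optimizer, step_size=32, gamma=0.9)` produces when `scheduler.step()` is called once per epoch. Whether the repository's config keys map onto these arguments one-to-one is something to verify in the training script.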

@Frank-Star-fn

@fangzl123 Hello, did you use the pre-trained model provided by the authors for the first-stage diffusion model, or did you train the first stage yourself following the settings in the paper?

@fangzl123

@Frank-Star-fn Hi, I used the provided pre-trained model for the first stage.

@fangzl123

fangzl123 commented Nov 28, 2023 via email

@Pranav-chib

I think 0.83/1.69 was the only reproduced result

@felix-yuxiang

I am now able to reproduce their stage-one and LED stage-two results. The answer from @woyoudian2gou helped me a lot. But I would say it requires a non-trivial amount of engineering work to tune this well.

@Pranav-chib

Could you share some insight with us? It would be helpful.

@kkk00714

@felix-yuxiang Yes, and the whole implementation is difficult to explain. I think the original author may have used a different method to get the pre-trained model.

@ZY-Ren

ZY-Ren commented May 15, 2024

@woyoudian2gou Hi, I have applied the hyperparameter settings you mentioned, but I still can't get a reasonable result. Could you share your config.yml with us? Thank you very much.

@kkk00714

See https://github.com/MediaBrain-SJTU/LED/issues/6

@ZY-Ren

ZY-Ren commented May 16, 2024

@kkk00714 Thank you for your prompt reply. I would also like to know the hyperparameters for the stage 2 training. Could you share them? I would appreciate it.

@kkk00714

The stage 2 hyperparameters are the same as in the author's original implementation (batch size = 10, lr = 1e-4...).

@Nighttell

@kkk00714
Hello, I have reproduced the first and second stages of training following your method. Although the final results are similar to those given in README.md, the loss at the last epoch is:
[2024-08-05 04:31:00] Epoch: 99 Loss: 5.839539 Loss Dist.: 5.320199 Loss Uncertainty: 0.519340
I think the loss is too high. Is this normal?
The test-set results at the 100th and 75th epochs are nearly identical:
[Initialization Model] Trainable/Total: 4634721/4634721
./results/led_augment/try2/models/model_0100.p
--ADE(1s): 0.1775 --FDE(1s): 0.2671
--ADE(2s): 0.3698 --FDE(2s): 0.5612
--ADE(3s): 0.5799 --FDE(3s): 0.8323
--ADE(4s): 0.8115 --FDE(4s): 1.1440
[Core Denoising Model] Trainable/Total: 6568720/6568720
[Initialization Model] Trainable/Total: 4634721/4634721
./results/led_augment/try2/models/model_0075.p
--ADE(1s): 0.1774 --FDE(1s): 0.2671
--ADE(2s): 0.3698 --FDE(2s): 0.5612
--ADE(3s): 0.5799 --FDE(3s): 0.8323
--ADE(4s): 0.8115 --FDE(4s): 1.1437
Is this due to the batch size? I trained on a 4070 Ti 12G with the batch size set to 6. (I tried both the default settings and the parameters given in the paper, but the results are not significantly different.)
I look forward to your reply. Thanks!

@kkk00714

kkk00714 commented Aug 7, 2024

@Nighttell Your loss is normal: the author multiplies loss_dist by a coefficient to give it a greater weight than loss_uncertainty. You can check this in the paper or the code.
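
The weighting described here can be sketched as follows. The function name and the `dist_weight` value are hypothetical placeholders, not taken from the LED repository. The sketch assumes the logged "Loss Dist." already includes the coefficient, which is consistent with the quoted log line, where 5.320199 + 0.519340 = 5.839539.

```python
def total_loss(loss_dist_raw, loss_uncertainty, dist_weight):
    """Return (total, weighted_dist) for a weighted two-term loss."""
    weighted_dist = dist_weight * loss_dist_raw
    return weighted_dist + loss_uncertainty, weighted_dist


# Values copied from the training log quoted in this thread.
logged_total, logged_dist, logged_unc = 5.839539, 5.320199, 0.519340

# The two logged terms add up to the logged total.
assert abs((logged_dist + logged_unc) - logged_total) < 1e-6

# Hypothetical weight of 10.0, chosen only for illustration;
# check the repository's training code for the actual coefficient.
total, weighted = total_loss(logged_dist / 10.0, logged_unc, dist_weight=10.0)
print(f"weighted dist = {weighted:.6f}, total = {total:.6f}")
```

Because the distance term is scaled up before it is summed, a large total loss on its own does not indicate a problem; the ADE/FDE numbers are the more meaningful check.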

@Nighttell

@kkk00714 Thank you for your answer. This is the first time I have encountered this way of weighting a loss.
