Any plan to evaluation code? #9

Open
ILOFI opened this issue Jun 14, 2024 · 15 comments
Comments

@ILOFI

ILOFI commented Jun 14, 2024

Thank you very much for your exciting work. Do you have any plans to release the evaluation code corresponding to Table 2?

@Little-Podi
Collaborator

Little-Podi commented Jun 14, 2024

No problem. I will clean and share the evaluation code later (probably after the CVPR conference week).

Our results are obtained on the whole nuScenes validation set, which includes 5369 samples in total. The evaluation code for FID uses the FrechetInceptionDistance module of torchmetrics, and the evaluation code for FVD is modified from LVDM. I hope this helps if you want to run the evaluation on your own before our release.
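For readers who want a reference point before the official code is released: FID and FVD both reduce to the Fréchet distance between two Gaussians fitted to feature vectors (Inception features for FID, I3D features for FVD). The sketch below is not the authors' script and is not the torchmetrics or LVDM implementation; it only illustrates the underlying formula with NumPy. Feature extraction is left out, and the function and variable names are hypothetical.

```python
import numpy as np


def frechet_distance(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """Frechet distance between Gaussians fitted to two (N, D) feature sets."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma_r = np.cov(feats_real, rowvar=False)
    sigma_g = np.cov(feats_gen, rowvar=False)

    # Tr((sigma_r @ sigma_g)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of sigma_r @ sigma_g. They are real and non-negative in exact
    # arithmetic, so numerical noise is clipped before taking the root.
    eigvals = np.linalg.eigvals(sigma_r @ sigma_g)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r) + np.trace(sigma_g) - 2.0 * tr_sqrt)


# Toy usage with random stand-in features (real features would come from the
# Inception or I3D network).
rng = np.random.default_rng(0)
real = rng.normal(size=(500, 8))
fake = rng.normal(loc=0.5, size=(500, 8))
print(frechet_distance(real, fake))
```

Because the statistics are estimated from samples, the score depends on how many clips and frames are included, which is one reason to evaluate on the full validation set.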

@ILOFI
Author

ILOFI commented Jun 14, 2024

Thank you very much for your helpful reply, looking forward to your further updates.

@ABaldrati

Hi @Little-Podi,

First of all, thank you for your excellent work and insightful paper!

I'm attempting to replicate the results presented in Table 2 of your paper. Could you please clarify if you employed any form of action control when generating the videos, or if you used just the pre-trained video model without any action control?

Additionally, could you share any tips or common pitfalls to watch out for that might affect the replication of the results in this table?

@Little-Podi
Collaborator

Hi @ABaldrati, thanks for your interest.

> Could you please clarify if you employed any form of action control when generating the videos, or if you used just the pre-trained video model without any action control?

The results in Table 2 are evaluated in action-free mode.

> Additionally, could you share any tips or common pitfalls to watch out for that might affect the replication of the results in this table?

There is nothing special about our evaluation. Just make sure you evaluate on all samples from the nuScenes validation set, which may take days with a single GPU.

@ABaldrati

Hi @Little-Podi,

Thank you immensely for your availability and prompt response!

Before I start running the inference (as it takes quite some time), I just want to ensure that all the hyperparameters are set correctly. Specifically:

  • --n_rounds = 1
  • --n_frames = 25
  • --n_conds = 3? The default value in the sample.py script is 1, but I believe you used 3, correct? Please correct me if I'm wrong.
  • --cfg_scale = 2.5

I apologize for any inconvenience, but I want to be absolutely certain everything is set correctly.

Thank you again for your help!

@Little-Podi
Collaborator

Little-Podi commented Jun 14, 2024

Exactly, but I think I used --n_conds = 1 for evaluation. Using --n_conds = 3 may lead to similar or better results, much like the effect of using ground-truth action controls. You can also turn off random sampling by passing the --rand_gen argument, so that the script automatically goes through all validation samples. Also, remember to take all predicted frames into account during the evaluation.
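The settings agreed on so far can be collected in one place. The snippet below is only a sketch: the flag names and values come from this thread, while the "--flag value" syntax, the script path, and the helper name build_command are assumptions that should be checked against the actual sample.py before launching a multi-day run.

```python
import shlex

# Flag names and values are taken from the discussion above; the CLI syntax
# and the script location are assumptions.
eval_settings = {
    "--n_rounds": 1,
    "--n_frames": 25,
    "--n_conds": 1,
    "--cfg_scale": 2.5,
}


def build_command(script: str = "sample.py") -> list[str]:
    """Assemble the (assumed) sampling command as an argument list."""
    cmd = ["python", script]
    for flag, value in eval_settings.items():
        cmd += [flag, str(value)]
    # Per the reply above, passing --rand_gen turns random sampling off so the
    # script walks through every validation sample.
    cmd.append("--rand_gen")
    return cmd


print(shlex.join(build_command()))
```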

@ABaldrati

Thanks so much for your quick response and for your availability. I really appreciate you making the code open-source and releasing the weights.

Thanks again for your help!

@Little-Podi
Collaborator

No worries, feel free to contact us if you have any further questions.

@ABaldrati

Hi @Little-Podi,

First of all, thank you very much for your support.

I've successfully generated all the videos for the nuScenes validation set and can replicate the FID numbers reported in Table 2, achieving even slightly lower numbers. However, I'm having difficulty replicating the FVD numbers. Could you please provide more details on the specific parameters you used for computing the FVD? For instance, the resolution, number of frames, resizing strategy, and any other relevant details would be extremely helpful.

Thank you again for your help!

@Little-Podi
Collaborator

Hi @ABaldrati, thanks for your feedback. Sorry for the late reply. I have returned from CVPR, but I still have lots of things to deal with in the following days.

> Could you please provide more details on the specific parameters you used for computing the FVD?

All 25 frames in each clip are used for calculating FVD. I just checked our evaluation script. The frames are resized to (256, 448) when the generated images are loaded, and are finally resized to (224, 224) before being passed to the I3D model. I don't remember why we resize twice, but I will check.

> I'm having difficulty replicating the FVD numbers.

May I ask what FID and FVD scores you got? In fact, we continued to tune the checkpoint for a few more iterations under the phase2_stage2 setting before its release. I didn't retest its metrics, but I think they should be close. I will verify this later to decide whether it is necessary to provide the older checkpoint. Based on the few samples I have seen, I think the current checkpoint is better perceptually.

@ABaldrati

Hi @Little-Podi,

Thank you for your response!

Could you please clarify if the resizing from (256, 448) to (224, 224) is done using non-proportional resizing or a center crop?

For reference, I obtained an FID score of 6.7, which is very close to the 6.9 reported in the paper, indicating that our results are comparable. However, my FVD score is 139, which is significantly different, leading me to believe there might be an issue with my evaluation script.

Thanks again!

@Little-Podi
Collaborator

Little-Podi commented Jun 26, 2024

> Could you please clarify if the resizing from (256, 448) to (224, 224) is done using non-proportional resizing or a center crop?

Oh, now I remember why we resize in two separate steps. We apply center cropping before resizing from (576, 1024) to (256, 448) with the Pillow package. We don't crop when resizing from (256, 448) to (224, 224) with F.interpolate. Did you evaluate on all 5369 video clips? The FVD score seems too high. I will retest the checkpoint and also provide the cleaned-up evaluation code later.
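A quick check of the numbers shows why a crop is involved at all: 1024/576 is about 1.778 (16:9), while 448/256 is exactly 1.75 (7:4), so the two sizes are close but not proportional. The sketch below computes the largest centered crop with the target aspect ratio. It assumes the crop trims the longer dimension symmetrically before resizing; the authors' preprocessing may order or round these steps differently, and the helper name is hypothetical.

```python
def center_crop_box(src_h: int, src_w: int, dst_h: int, dst_w: int) -> tuple[int, int, int, int]:
    """Return (left, top, right, bottom) of the largest centered crop of the
    source that has the destination aspect ratio."""
    if src_w * dst_h > src_h * dst_w:
        # Source is wider than the target: keep full height, trim the width.
        crop_h, crop_w = src_h, round(src_h * dst_w / dst_h)
    else:
        # Source is taller than (or matches) the target: keep full width.
        crop_h, crop_w = round(src_w * dst_h / dst_w), src_w
    left = (src_w - crop_w) // 2
    top = (src_h - crop_h) // 2
    return left, top, left + crop_w, top + crop_h


box = center_crop_box(576, 1024, 256, 448)
print(box)  # (8, 0, 1016, 576): 8 pixels are trimmed from each side
```

The returned tuple follows the (left, upper, right, lower) order that Pillow's Image.crop expects, and a subsequent Image.resize call would take the size as (width, height), i.e. (448, 256).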

@ABaldrati

Hi @Little-Podi,

Thank you for the clarification.

I’m a bit confused about the center cropping before resizing from (576, 1024) to (256, 448) since the proportions seem to be maintained in this step. Could you please provide a brief step-by-step description of each resizing step?

Yes, I have evaluated on all 5369 video clips.

Thank you for your assistance!

@Little-Podi
Collaborator

Little-Podi commented Jun 26, 2024

> I’m a bit confused about the center cropping before resizing from (576, 1024) to (256, 448) since the proportions seem to be maintained in this step. Could you please provide a brief step-by-step description of each resizing step?

The aspect ratios are almost the same, but some pixels will leak without center cropping. The implementation is identical to our data preprocessing here. The latter resizing step looks like this:

output_frames = F.interpolate(input_frames, size=(224, 224), mode="bilinear", align_corners=False)
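To make the one-liner above easier to try in isolation, here is a minimal, self-contained sketch of that second resizing step. The random tensor stands in for one generated clip that has already been loaded at (256, 448); the [0, 1] value range and the final (B, C, T, H, W) layout are assumptions about what an I3D-based FVD implementation expects, not details confirmed in this thread.

```python
import torch
import torch.nn.functional as F

# Hypothetical stand-in for one clip: 25 RGB frames laid out as (T, C, H, W).
input_frames = torch.rand(25, 3, 256, 448)

# Non-proportional bilinear resize, as in the line above: the 7:4 frame is
# squeezed into a square rather than cropped.
output_frames = F.interpolate(input_frames, size=(224, 224), mode="bilinear", align_corners=False)

# Assumed layout for the video feature extractor: (B, C, T, H, W).
clip = output_frames.permute(1, 0, 2, 3).unsqueeze(0)
print(output_frames.shape, clip.shape)
```

Since the resize is not proportional, objects appear horizontally compressed in the (224, 224) frames. Both the real and the generated clips need to go through the same path for the comparison to be fair.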

@ABaldrati

Hi @Little-Podi,

Thank you for your availability and the detailed information.

Despite following the provided details, I still can't replicate the FVD results. I'll wait for the release of the evaluation code.

Thanks again, and great work on the project!
