Revise VideoCLIP tutorial for pytorch docs #242

Closed · wants to merge 3 commits

Conversation

sophiazhi (Contributor):

Summary:
Revise the VideoCLIP tutorial notebook to better adhere to the guidelines for official PyTorch tutorials (internal wiki).

Test plan:
Run the notebook.

Note that VS Code cannot render video/audio in notebooks (see issue) and GitHub doesn't display the videos in output cells. Embedded videos can be played in Google Colab (link to this tutorial in Colab) or in JupyterLab.

To view the notebook in JupyterLab, install JupyterLab, register your torch-multimodal conda env as a kernel, and launch JupyterLab:

(base) conda install -c conda-forge jupyterlab
(base) conda activate torch-multimodal
(torch-multimodal) conda install ipykernel
(torch-multimodal) ipython kernel install --user --name=torch-multimodal
(torch-multimodal) conda deactivate
(base) jupyter lab

@facebook-github-bot added the CLA Signed label (this label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) on Aug 3, 2022.
"cell_type": "markdown",
"metadata": {},
"source": [
"Here we instantiate the VideoCLIP model and load weights from finetuning on the MUGEN dataset. We can set `text_pretrained=False, video_pretrained=False` as those flags will load weights from pretraining the encoders on different datasets (see [source](https://github.com/facebookresearch/multimodal/blob/main/examples/mugen/retrieval/video_clip.py) for more details)."
Contributor:

You mean "load pretrained weights and finetune on the MUGEN dataset"?
The weights are not the result of finetuning on the MUGEN dataset, right?

Contributor (Author):

The weights are the result of finetuning on the MUGEN dataset. There is no additional training/finetuning in this notebook.

Contributor:

Do you mean to load the pretrained text encoder and video encoder?
Should we set text_pretrained=True and the same for video?

text_pretrained (bool): whether to use a pretrained text encoder or not.
Defaults to ``True``.
text_trainable (bool): whether the text encoder's weights should be trainable.
Defaults to ``True``. Ignored if ``text_pretrained`` is ``False``.

Contributor:

"Here we instantiate the VideoCLIP model and load weights from finetuning on the MUGEN dataset. We can set text_pretrained=False, video_pretrained=False as those flags will load weights from pretraining the encoders on different datasets

Do you mean MUGEN has fine tuned the weights and here we are just loading their version of the weights?

Contributor (Author):

By default, those flags load the weights from pretraining the text encoder on Wikipedia and pretraining the video encoder on Kinetics400. Typically, those flags would be set to True when we want to finetune VideoCLIP on a new dataset, such as the MUGEN dataset; for example, those flags are used in the train.py script.

In the case of this notebook, I want to display the model's predictions on the MUGEN dataset, so I manually load a different set of weights, which are hosted on AWS and have already been finetuned on the MUGEN dataset by the MUGEN authors.
The text_pretrained and video_pretrained flags are set to False here because they cannot load one large weights file (e.g., the MUGEN-finetuned weights) that includes the text encoder, video encoder, and both projection modules; they only support loading one weights file for the text encoder and one weights file for the video encoder.

I will edit the explanation in the notebook to add that those flags are used for finetuning, not for evaluation.
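To make the distinction concrete, a short sketch of the two paths described above, again assuming the `videoclip()` builder and a placeholder checkpoint path; the training loop and data loading are omitted:

```python
import torch

from examples.mugen.retrieval.video_clip import videoclip  # assumed builder name/location

# Finetuning path (as in train.py): the flags pull in the encoders' pretraining
# weights (text encoder pretrained on Wikipedia, video encoder on Kinetics400),
# and the whole model is then finetuned on a new dataset such as MUGEN.
finetune_model = videoclip(text_pretrained=True, video_pretrained=True)

# Evaluation path (this notebook): skip the per-encoder pretraining weights and
# load a single MUGEN-finetuned checkpoint covering both encoders and both
# projection modules.
eval_model = videoclip(text_pretrained=False, video_pretrained=False)
eval_model.load_state_dict(
    torch.load("videoclip_mugen_finetuned.pth", map_location="cpu")  # placeholder path
)
eval_model.eval()
```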

Contributor:

> so I manually load in a different set of weights, which are on AWS and have been finetuned on the MUGEN dataset already by the MUGEN authors.

Are your manual weights cropped from the original weights file? If the user only has the original (presumably large) file, will they be able to repro what you have here? Or do they need to crop the file themselves?

@langong347 (Contributor) left a comment:

Rename this to "tutorial.ipynb", as "evaluation" may be confused with the actual eval loop.

@codecov-commenter commented Aug 17, 2022:

Codecov Report

Merging #242 (39e4738) into main (5457c30) will decrease coverage by 0.71%.
The diff coverage is 96.00%.

@@            Coverage Diff             @@
##             main     #242      +/-   ##
==========================================
- Coverage   92.91%   92.19%   -0.72%     
==========================================
  Files          47       53       +6     
  Lines        2809     3191     +382     
==========================================
+ Hits         2610     2942     +332     
- Misses        199      249      +50     
| Impacted Files | Coverage Δ |
| --- | --- |
| torchmultimodal/utils/attention.py | 86.66% <ø> (ø) |
| torchmultimodal/models/gpt.py | 97.76% <95.78%> (ø) |
| torchmultimodal/modules/layers/attention.py | 97.08% <100.00%> (ø) |
| torchmultimodal/utils/common.py | 91.42% <100.00%> (+0.25%) ⬆️ |
| torchmultimodal/modules/losses/flava.py | 94.24% <0.00%> (-1.58%) ⬇️ |
| torchmultimodal/models/video_vqvae.py | 96.87% <0.00%> (-0.63%) ⬇️ |
| torchmultimodal/models/vqvae.py | 100.00% <0.00%> (ø) |
| torchmultimodal/models/mdetr.py | |
| ...hmultimodal/modules/encoders/albef_text_encoder.py | |
| torchmultimodal/models/clip.py | |
| ... and 31 more | |


@sophiazhi marked this pull request as ready for review on August 17, 2022 19:54.
@facebook-github-bot (Contributor):

@sophiazhi has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.


@ankitade deleted the szhi-videoclip_eval_notebook branch on December 7, 2022 18:47.