Revise VideoCLIP tutorial for pytorch docs #242

Closed · wants to merge 3 commits

Conversation

sophiazhi (Contributor):

Summary:
Revise the VideoCLIP tutorial notebook to better adhere to the guidelines for official PyTorch tutorials (internal wiki).

Test plan:
Run the notebook.

Note that VS Code cannot render video/audio in notebooks (see issue) and GitHub doesn't display the videos in output cells. Embedded videos can be played in Google Colab (link to this tutorial in Colab) or in JupyterLab.

To view the notebook in JupyterLab, install JupyterLab, register your torch-multimodal conda env as a kernel, and launch JupyterLab:

(base) conda install -c conda-forge jupyterlab
(base) conda activate torch-multimodal
(torch-multimodal) conda install ipykernel
(torch-multimodal) ipython kernel install --user --name=torch-multimodal
(torch-multimodal) conda deactivate
(base) jupyter lab

@facebook-github-bot added the CLA Signed label (this label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) on Aug 3, 2022.
"cell_type": "markdown",
"metadata": {},
"source": [
"Here we instantiate the VideoCLIP model and load weights from finetuning on the MUGEN dataset. We can set `text_pretrained=False, video_pretrained=False` as those flags will load weights from pretraining the encoders on different datasets (see [source](https://github.com/facebookresearch/multimodal/blob/main/examples/mugen/retrieval/video_clip.py) for more details)."
Contributor:

You mean "load pretrained weights and finetune on the MUGEN dataset"?
The weights are not the result of finetuning on the MUGEN dataset, right?

Contributor (Author):

The weights are the result of finetuning on the MUGEN dataset. There is no additional training/finetuning in this notebook.

Contributor:

Do you mean to load the pretrained text encoder and video encoder?
Should we set text_pretrained=True and the same for video?

text_pretrained (bool): whether to use a pretrained text encoder or not.
Defaults to ``True``.
text_trainable (bool): whether the text encoder's weights should be trainable.
Defaults to ``True``. Ignored if ``text_pretrained`` is ``False``.

Contributor:

"Here we instantiate the VideoCLIP model and load weights from finetuning on the MUGEN dataset. We can set text_pretrained=False, video_pretrained=False as those flags will load weights from pretraining the encoders on different datasets

Do you mean MUGEN has fine tuned the weights and here we are just loading their version of the weights?

Contributor (Author):

By default, those flags load the weights from pretraining the text encoder on Wikipedia and pretraining the video encoder on Kinetics400. Typically, those flags would be set to True when we want to finetune VideoCLIP on a new dataset, such as the MUGEN dataset; for example, those flags are used in the train.py script.

In the case of this notebook, I want to display the model's predictions on the MUGEN dataset, so I manually load a different set of weights, which are hosted on AWS and have already been finetuned on the MUGEN dataset by the MUGEN authors.
The text_pretrained and video_pretrained flags are set to False here because they cannot load one large weights file (e.g., the MUGEN-finetuned weights) that includes the text encoder, video encoder, and both projection modules; they only support loading one weights file for the text encoder and one weights file for the video encoder.

I will edit the explanation in the notebook to add that those flags are used for finetuning, not for evaluation.
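To make the distinction concrete, a short sketch of the two paths described above, again assuming the `videoclip()` builder and a placeholder checkpoint path; the training loop and data loading are omitted:

```python
import torch

from examples.mugen.retrieval.video_clip import videoclip  # assumed builder name/location

# Finetuning path (as in train.py): the flags pull in the encoders' pretraining
# weights (text encoder pretrained on Wikipedia, video encoder on Kinetics400),
# and the whole model is then finetuned on a new dataset such as MUGEN.
finetune_model = videoclip(text_pretrained=True, video_pretrained=True)

# Evaluation path (this notebook): skip the per-encoder pretraining weights and
# load a single MUGEN-finetuned checkpoint covering both encoders and both
# projection modules.
eval_model = videoclip(text_pretrained=False, video_pretrained=False)
eval_model.load_state_dict(
    torch.load("videoclip_mugen_finetuned.pth", map_location="cpu")  # placeholder path
)
eval_model.eval()
```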

Contributor:

> so I manually load in a different set of weights, which are on AWS and have been finetuned on the MUGEN dataset already by the MUGEN authors.

Are your manual weights cropped from the original weights file? If the user only has the original (presumably large) file, will they be able to repro what you have here? Or do they need to crop the file themselves?

@langong347 (Contributor) left a comment:

Rename this to "tutorial.ipynb", as "evaluation" may be confused with the actual eval loop.

@codecov-commenter commented Aug 17, 2022:

Codecov Report

Merging #242 (39e4738) into main (5457c30) will decrease coverage by 0.71%.
The diff coverage is 96.00%.

@@            Coverage Diff             @@
##             main     #242      +/-   ##
==========================================
- Coverage   92.91%   92.19%   -0.72%     
==========================================
  Files          47       53       +6     
  Lines        2809     3191     +382     
==========================================
+ Hits         2610     2942     +332     
- Misses        199      249      +50     
| Impacted Files | Coverage Δ |
| --- | --- |
| torchmultimodal/utils/attention.py | 86.66% <ø> (ø) |
| torchmultimodal/models/gpt.py | 97.76% <95.78%> (ø) |
| torchmultimodal/modules/layers/attention.py | 97.08% <100.00%> (ø) |
| torchmultimodal/utils/common.py | 91.42% <100.00%> (+0.25%) ⬆️ |
| torchmultimodal/modules/losses/flava.py | 94.24% <0.00%> (-1.58%) ⬇️ |
| torchmultimodal/models/video_vqvae.py | 96.87% <0.00%> (-0.63%) ⬇️ |
| torchmultimodal/models/vqvae.py | 100.00% <0.00%> (ø) |
| torchmultimodal/models/mdetr.py | |
| ...hmultimodal/modules/encoders/albef_text_encoder.py | |
| torchmultimodal/models/clip.py | |
| ... and 31 more | |


@sophiazhi marked this pull request as ready for review on August 17, 2022 19:54.
@facebook-github-bot (Contributor):

@sophiazhi has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.


@ankitade deleted the szhi-videoclip_eval_notebook branch on December 7, 2022 18:47.