Revise VideoCLIP tutorial for pytorch docs #242
Conversation
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Here we instantiate the VideoCLIP model and load weights from finetuning on the MUGEN dataset. We can set `text_pretrained=False, video_pretrained=False` as those flags will load weights from pretraining the encoders on different datasets (see [source](https://github.com/facebookresearch/multimodal/blob/main/examples/mugen/retrieval/video_clip.py) for more details)." |
You mean "load pre-trained weights and finetune on the MUGEN dataset"?
The weights are not the result of finetuning on the MUGEN dataset, right?
The weights are the result of finetuning on the MUGEN dataset. There is no additional training/finetuning in this notebook.
Do you mean to load the pretrained text encoder and video encoder? Should we set `text_pretrained=True` and the same for video?
multimodal/examples/mugen/retrieval/video_clip.py
Lines 151 to 154 in 1ca663b
```
text_pretrained (bool): whether to use a pretrained text encoder or not.
    Defaults to ``True``.
text_trainable (bool): whether the text encoder's weights should be trainable.
    Defaults to ``True``. Ignored if ``text_pretrained`` is ``False``.
```
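For contrast, a hedged sketch of the configuration this docstring describes, assuming the same `videoclip` builder: the defaults load pretrained encoder weights and keep them trainable, which is the setup finetuning on a new dataset would want.

```python
from examples.mugen.retrieval.video_clip import videoclip

model = videoclip(
    text_pretrained=True,   # load pretrained text encoder weights
    text_trainable=True,    # keep them trainable; ignored if text_pretrained=False
    video_pretrained=True,  # likewise for the video encoder
)
```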
"Here we instantiate the VideoCLIP model and load weights from finetuning on the MUGEN dataset. We can set
text_pretrained=False, video_pretrained=False
as those flags will load weights from pretraining the encoders on different datasets
Do you mean MUGEN has fine tuned the weights and here we are just loading their version of the weights?
By default, those flags load the weights from pretraining the text encoder on Wikipedia and pretraining the video encoder on Kinetics400. Typically, those flags would be set to `True` when we want to finetune VideoCLIP on a new dataset, such as the MUGEN dataset; for example, those flags are used in the `train.py` script.

In the case of this notebook, I want to display the model's predictions on the MUGEN dataset, so I manually load in a different set of weights, which are on AWS and have been finetuned on the MUGEN dataset already by the MUGEN authors.

The `text_pretrained`, `video_pretrained` flags are set to `False` here because they cannot load one large weights file (e.g., the MUGEN-finetuned weights) that includes the text encoder, video encoder, and both projection modules. They only support loading one weights file for the text encoder and one weights file for the video encoder.

I will edit the explanation in the notebook to add that those flags are used for finetuning, not for evaluation.
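A sketch of the manual loading described above, assuming the checkpoint (URL hypothetical here) is a single plain state dict covering the text encoder, video encoder, and both projection modules:

```python
import torch
from examples.mugen.retrieval.video_clip import videoclip

# Hypothetical URL; the real MUGEN-finetuned checkpoint is hosted on AWS.
CHECKPOINT_URL = "https://example-bucket.s3.amazonaws.com/mugen_videoclip.pt"

# Build the model without encoder-pretraining weights, then load the
# combined finetuned checkpoint into the whole model at once.
model = videoclip(text_pretrained=False, video_pretrained=False)
state_dict = torch.hub.load_state_dict_from_url(CHECKPOINT_URL, map_location="cpu")
model.load_state_dict(state_dict)  # assumes the file is a plain state dict
model.eval()  # evaluation only; no further finetuning happens in the notebook
```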
so I manually load in a different set of weights, which are on AWS and have been finetuned on the MUGEN dataset already by the MUGEN authors.
Are your manual weights cropped from the original weights file? If the user only has the original (presumably large) file, will they be able to repro what you have here? Or do they need to crop the file themselves?
Rename this to "tutorial.ipynb", as "evaluation" may be confused with the actual eval loop.
Codecov Report
```diff
@@            Coverage Diff             @@
##             main     #242      +/-   ##
==========================================
- Coverage   92.91%   92.19%   -0.72%
==========================================
  Files          47       53       +6
  Lines        2809     3191     +382
==========================================
+ Hits         2610     2942     +332
- Misses        199      249      +50
```
@sophiazhi has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Summary:
Revise the VideoCLIP tutorial notebook to better adhere to the guidelines for official PyTorch tutorials (internal wiki).
Test plan:
Run notebook
Note that VS Code cannot render video/audio in notebooks (see issue) and GitHub doesn't display the videos in output cells. Embedded videos can be played in Google Colab (link to this tutorial in colab) or in JupyterLab.
To install JupyterLab, convert your `torch-multimodal` conda env into a kernel, and launch JupyterLab:
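A sketch of those steps, assuming the conda env is named `torch-multimodal`:

```bash
conda activate torch-multimodal
pip install jupyterlab ipykernel
# Register the env as a notebook kernel, then launch JupyterLab.
python -m ipykernel install --user --name torch-multimodal
jupyter lab
```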