# Dataset

We understand video data is challenging to download and process. For videos, we provide preprocessing scripts under `scripts/video_feature_extractor` (deeply adapted from https://github.com/antoine77340/video_feature_extractor); for text, we provide pre-tokenization scripts under `scripts/text_token_extractor`.

## S3D Feature Extraction

We use pre-trained S3D for video feature extraction. Please place the models at `pretrained_models/s3d_dict.npy` and `pretrained_models/s3d_howto100m.pth`.
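
As a quick sanity check that the files are in place, both artifacts can be inspected directly; a minimal sketch, assuming the `.npy` file is a numpy-pickled array and the `.pth` file is a standard PyTorch checkpoint:

```python
# Minimal sketch: verify the pretrained S3D artifacts load from the paths above.
import numpy as np
import torch

token_dict = np.load("pretrained_models/s3d_dict.npy", allow_pickle=True)
state_dict = torch.load("pretrained_models/s3d_howto100m.pth", map_location="cpu")
print(len(token_dict), "dictionary entries,", len(state_dict), "weight tensors")
```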

We implement a `PathBuilder` to automatically track video ids and map source video paths to their feature locations (you may need `conda install -c anaconda pandas`). Decoding may require `pip install ffmpeg-python`.
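
As an illustration of the bookkeeping this amounts to (hypothetical names; the real `PathBuilder` lives under `scripts/video_feature_extractor`):

```python
# Hypothetical sketch of PathBuilder-style bookkeeping: one row per video,
# mapping its id and source path to where its features will be written.
import pandas as pd

rows = [
    {"video_id": "abc123", "video_path": "videos/abc123.mp4", "feature_path": "features/abc123.npy"},
    {"video_id": "def456", "video_path": "videos/def456.mp4", "feature_path": "features/def456.npy"},
]
manifest = pd.DataFrame(rows)
manifest.to_csv("feature_manifest.csv", index=False)  # illustrative manifest file
```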

## Howto100M

Howto100M is a large-scale video pre-training dataset. You may download the videos yourself and run our preprocessing scripts.

Our preprocessing differs from existing papers in several key ways: (1) we use `raw_caption.json` instead of `caption.json` to keep the text supervision purely self-supervised (`caption.json` has stop words manually removed); (2) we remove partially duplicated texts that were originally designed for real-time readability (see `mmpt/processors/dedupprocessor.py` and the sketch below); (3) we shard video/text features using `ShardedTensor` in `mmpt/utils/shardedtensor.py` for fast loading during training (faster than `h5py`).
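
To make (2) concrete, here is a toy sketch of the overlap removal, assuming each caption may repeat the tail of the previous one as live subtitles often do; `mmpt/processors/dedupprocessor.py` is the authoritative implementation:

```python
# Toy sketch of the dedup idea: strip the longest prefix of each new caption
# that matches a suffix of the previous caption. Illustrative only.
def dedup(captions):
    cleaned, prev = [], ""
    for text in captions:
        k = 0
        for i in range(min(len(prev), len(text)), 0, -1):
            if prev.endswith(text[:i]):
                k = i
                break
        cleaned.append(text[k:].strip())
        prev = text
    return [c for c in cleaned if c]

print(dedup(["hello and welcome", "welcome to the show", "the show begins"]))
# -> ['hello and welcome', 'to the show', 'begins']
```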

### Steps

#### video

To extract video features, edit and run `bash scripts/video_feature_extractor/how2/s3d.sh`. (Consider running this on multiple machines; by default, we store features in fp16 to save space and speed up training.)
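
The fp16 convention is just a cast at save time; a minimal sketch, assuming one feature array per video (the actual script handles decoding and batched GPU inference):

```python
# Minimal sketch of the fp16 storage convention, one .npy file per video.
import os
import numpy as np

os.makedirs("features", exist_ok=True)
features = np.random.randn(120, 512).astype(np.float32)  # placeholder: 120 clips x 512 dims
np.save("features/abc123.npy", features.astype(np.float16))  # fp16 halves the disk footprint
```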

Split the available video ids into `data/how2/how2_s3d_train.lst` and `data/how2/how2_s3d_val.lst`.
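
A hedged sketch of producing the two lists; the input file name and the 95/5 ratio are illustrative assumptions, only the two output names come from the step above:

```python
# Hedged sketch: split video ids into the train/val lists named above.
# "available_ids.txt" and the 95/5 ratio are illustrative assumptions.
import random

with open("data/how2/available_ids.txt") as f:
    ids = [line.strip() for line in f if line.strip()]

random.seed(0)
random.shuffle(ids)
cut = int(len(ids) * 0.95)
with open("data/how2/how2_s3d_train.lst", "w") as f:
    f.write("\n".join(ids[:cut]) + "\n")
with open("data/how2/how2_s3d_val.lst", "w") as f:
    f.write("\n".join(ids[cut:]) + "\n")
```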

Lastly, pack video features into `ShardedTensor` using `python scripts/video_feature_extractor/shard_feature.py`.
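
For intuition on why sharding loads faster, here is a conceptual sketch (names are illustrative; see `mmpt/utils/shardedtensor.py` for the real class): variable-length per-video features are concatenated into one flat array plus an offset index, so reading a shard is a single sequential read.

```python
# Conceptual sketch of sharded storage: concatenate variable-length features
# into one flat array plus an offset index; lookup is an O(1) slice with no
# per-file open, which is why it beats many small files or h5py.
import numpy as np

feats = [np.ones((n, 512), dtype=np.float16) for n in (100, 80, 120)]
flat = np.concatenate(feats, axis=0)
offsets = np.cumsum([0] + [f.shape[0] for f in feats])

def get(i):
    return flat[offsets[i]:offsets[i + 1]]

assert get(1).shape == (80, 512)
```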

#### text

Clean captions using `python -m mmpt.processors.dedupprocessor`.

Tokenize the deduplicated captions `data/how2/raw_caption_dedup.pkl` into sharded numpy arrays:

```bash
python scripts/text_token_extractor/pretokenization.py scripts/text_token_extractor/configs/bert-base-uncased.yaml
```
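
A minimal sketch of what this step produces, assuming the HuggingFace `bert-base-uncased` tokenizer (matching the config name above); the repo's script additionally packs the arrays into shards:

```python
# Minimal sketch of pretokenization output: caption strings -> int token ids.
import numpy as np
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
captions = ["add the onions to the pan", "stir until golden brown"]
token_ids = [
    np.array(tokenizer.encode(c, add_special_tokens=False), dtype=np.int32)
    for c in captions
]
print([t.shape for t in token_ids])  # one int array per caption
```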

## Youcook, MSRVTT etc.

We use the versions of Youcook and MSRVTT that come with Howto100M and MILNCE. Please download the data to `data/youcook` and `data/msrvtt` accordingly; you can also check `projects/task/youcook.yaml`, `projects/task/vtt.yaml`, etc. for details. We extract features for Youcook and MSRVTT similarly to the first step for Howto100M, but we read text directly from the metadata and perform on-the-fly tokenization, as sketched below.
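
A hypothetical sketch of on-the-fly tokenization, assuming metadata provides (video id, caption) pairs; all names here are illustrative, not the repo's actual dataset classes:

```python
# Hypothetical sketch of on-the-fly tokenization: tokenize the caption at
# __getitem__ time instead of pre-extracting tokens as done for Howto100M.
from transformers import AutoTokenizer

class MetaTextDataset:
    def __init__(self, meta):
        self.meta = meta  # list of (video_id, caption) pairs from metadata
        self.tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

    def __getitem__(self, idx):
        video_id, caption = self.meta[idx]
        return video_id, self.tokenizer.encode(caption, add_special_tokens=True)

ds = MetaTextDataset([("youcook_0001", "slice the tomatoes")])
print(ds[0])
```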