Commit fa78f82 (parent 5f401a4): 153 changed files with 15,475 additions and 7 deletions.
@@ -0,0 +1,13 @@
# FEATURE ZOO

Here, we provide strong features for temporal action localization on HACS and Epic-Kitchens-100.

| dataset | model | resolution | features | classification | average mAP |
| ------- | ----- | ---------- | -------- | -------------- | ----------- |
| EK100 | ViViT Fact. Enc.-B16x2 | 32 x 2 | [features]() | [classification]() | 18.30 (A) |
| EK100 | TAda2D | 8 x 8 | [features]() | [classification]() | 13.18 |
| HACS | TAda2D | 8 x 8 | [features]() | - | 32.3 |

Annotations used for temporal action localization with our codebase can be found [here]().

Pre-trained localization models using these features can be found in [MODEL_ZOO.md](MODEL_ZOO.md).
@@ -0,0 +1,64 @@
# Guidelines for pytorch-video-understanding

## Installation

Requirements:
- Python>=3.6
- torch>=1.5
- torchvision (version corresponding to torch)
- simplejson==3.11.1
- decord>=0.6.0
- pyyaml
- einops
- oss2
- psutil
- tqdm
- pandas

Optional requirements:
- fvcore (for FLOPs calculation)
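
The packages above can typically be installed with pip. A minimal sketch, assuming an existing Python>=3.6 environment (install torch and torchvision following the official instructions for your CUDA version; these commands are illustrative, not the codebase's official setup script):
```
# install PyTorch first (pick the build matching your CUDA version), then the remaining requirements
pip install "torch>=1.5" torchvision
pip install simplejson==3.11.1 "decord>=0.6.0" pyyaml einops oss2 psutil tqdm pandas

# optional, for FLOPs calculation
pip install fvcore
```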

## Data preparation

For all datasets available in `datasets/base`, the name of each dataset list is specified in the `_get_dataset_list_name` function.
Here we provide a table summarizing the list file names and formats of these datasets.

| dataset | split | list file name | format |
| ------- | ----- | -------------- | ------ |
| epic-kitchens-100 | train | EPIC_100_train.csv | as downloaded |
| epic-kitchens-100 | val | EPIC_100_validation.csv | as downloaded |
| epic-kitchens-100 | test | EPIC_100_test_timestamps.csv | as downloaded |
| hmdb51 | train/val | hmdb51_train_list.txt/hmdb51_val_list.txt | "video_path, supervised_label" |
| imagenet | train/val | imagenet_train.txt/imagenet_val.txt | "image_path, supervised_label" |
| kinetics 400 | train/val | kinetics400_train_list.txt/kinetics400_val_list.txt | "video_path, supervised_label" |
| ssv2 | train | something-something-v2-train-with-label.json | json file with "label_idx" specifying the class and "id" specifying the name |
| ssv2 | val | something-something-v2-val-with-label.json | json file with "label_idx" specifying the class and "id" specifying the name |
| ucf101 | train/val | ucf101_train_list.txt/ucf101_val_list.txt | "video_path, supervised_label" |
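
As a purely illustrative example of the text-list format, a hypothetical entry of `kinetics400_train_list.txt` pairs a video path with its integer class label (the exact delimiter and path layout should follow the format column above and your local data directory):
```
abseiling/video_0001.mp4, 0
zumba/video_0137.mp4, 399
```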

For the Epic-Kitchens features, the file name is specified in the respective configs in `configs/projects/epic-kitchen-tal`.

## Running

The entry point for all runs is `runs/run.py`.

Before running, some settings need to be configured in the config file.
The codebase is designed to be experiment-friendly for rapid development of new models and representation learning approaches, in that the config files are organized hierarchically.

Taking TAda2D as an example, each experiment (such as TAda2D_8x8 on Kinetics 400: `configs/projects/tada/k400/tada2d_8x8.yaml`) inherits its config from the following hierarchy:
```
--- base config file          [configs/pool/base.yaml]
--- base run config           [configs/pool/run/training/from_scratch_large.yaml]
--- base backbone config      [configs/pool/backbone/tada2d.yaml]
--- base experiment config    [configs/projects/tada/tada2d_k400.yaml]
--- current experiment config [configs/projects/tada/k400/tada2d_8x8.yaml]
```
Generally, the base config file `configs/pool/base.yaml` contains all the possible keys used in this codebase, and a lower-level config overrides its base config whenever the same key appears in both files.
A good practice is to set the parameters shared by all experiments in the base experiment config, and to set the parameters that differ between experiments in the current experiment config.
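
To make the override rule concrete, here is a hedged, hypothetical sketch: if both the base experiment config and the current experiment config set keys under `DATA`, the values from the current experiment config take effect. The values below are placeholders, not verified excerpts of the actual configs:
```
# base experiment config, e.g. configs/projects/tada/tada2d_k400.yaml (hypothetical excerpt)
DATA:
  NUM_INPUT_FRAMES: 16
  SAMPLING_RATE: 5

# current experiment config, e.g. configs/projects/tada/k400/tada2d_8x8.yaml (hypothetical excerpt)
DATA:
  NUM_INPUT_FRAMES: 8   # overrides the value inherited from the base experiment config
  SAMPLING_RATE: 8      # overrides the value inherited from the base experiment config
```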

For an example run, open `configs/projects/tada/tada2d_k400.yaml` and:
A. set `DATA.DATA_ROOT_DIR` and `DATA.DATA_ANNO_DIR` to point to Kinetics 400;
B. set a valid GPU count in `NUM_GPUS`.
Then the codebase can be run with:
```
python runs/run.py --cfg configs/projects/tada/k400/tada2d_8x8.yaml
```
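
For instance, the relevant keys might be set as follows; the paths and GPU count are placeholders to adapt to your environment, and the exact location of each key follows `configs/pool/base.yaml`:
```
# hypothetical excerpt of configs/projects/tada/tada2d_k400.yaml
NUM_GPUS: 8                                        # number of GPUs available for the run
DATA:
  DATA_ROOT_DIR: /path/to/kinetics400/videos       # placeholder: root directory of the videos
  DATA_ANNO_DIR: /path/to/kinetics400/annotations  # placeholder: directory containing the list files
```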
@@ -0,0 +1,49 @@
# MODEL ZOO

## Kinetics

| Dataset | architecture | depth | init | clips x crops | #frames x sampling rate | acc@1 | acc@5 | checkpoint | config |
| ------- | ------------ | ----- | ---- | ------------- | ----------------------- | ----- | ----- | ---------- | ------ |
| K400 | TAda2D | R50 | IN-1K | 10 x 3 | 8 x 8 | 76.3 | 92.4 | [`link`]() | configs/projects/tada/k400/tada2d_8x8.yaml |
| K400 | TAda2D | R50 | IN-1K | 10 x 3 | 16 x 5 | 76.9 | 92.7 | [`link`]() | configs/projects/tada/k400/tada2d_16x5.yaml |
| K400 | ViViT Fact. Enc. | B16x2 | IN-21K | 4 x 3 | 32 x 2 | 79.4 | 94.0 | [`link`]() | configs/projects/competition/k400/vivit_fac_enc_b16x2.yaml |

## Something-Something

| Dataset | architecture | depth | init | clips x crops | #frames | acc@1 | acc@5 | checkpoint | config |
| ------- | ------------ | ----- | ---- | ------------- | ------- | ----- | ----- | ---------- | ------ |
| SSV2 | TAda2D | R50 | IN-1K | 2 x 3 | 8 | 63.8 | 87.7 | [`link`]() | configs/projects/tada/ssv2/tada2d_8f.yaml |
| SSV2 | TAda2D | R50 | IN-1K | 2 x 3 | 16 | 65.2 | 89.1 | [`link`]() | configs/projects/tada/ssv2/tada2d_16f.yaml |

## Epic-Kitchens Action Recognition

| architecture | init | resolution | clips x crops | #frames x sampling rate | action acc@1 | verb acc@1 | noun acc@1 | checkpoint | config |
| ------------ | ---- | ---------- | ------------- | ----------------------- | ------------ | ---------- | ---------- | ---------- | ------ |
| ViViT Fact. Enc.-B16x2 | K700 | 320 | 4 x 3 | 32 x 2 | 46.3 | 67.4 | 58.9 | [`link`]() | configs/projects/competition/ek100/vivit_fac_enc.yaml |
| ir-CSN-R152 | K700 | 224 | 10 x 3 | 32 x 2 | 44.5 | 68.4 | 55.9 | [`link`]() | configs/projects/competition/ek100/csn.yaml |

## Epic-Kitchens Temporal Action Localization

| feature | classification | type | mAP@0.1 | mAP@0.2 | mAP@0.3 | mAP@0.4 | mAP@0.5 | Avg | checkpoint | config |
| ------- | -------------- | ---- | ------- | ------- | ------- | ------- | ------- | --- | ---------- | ------ |
| ViViT | ViViT | Verb | 22.90 | 21.93 | 20.74 | 19.08 | 16.00 | 20.13 | [`link`]() | configs/projects/epic-kitchen-tal/bmn-epic/vivit-os-local.yaml |
| ViViT | ViViT | Noun | 28.95 | 27.38 | 25.52 | 22.67 | 18.95 | 24.69 | [`link`]() | configs/projects/epic-kitchen-tal/bmn-epic/vivit-os-local.yaml |
| ViViT | ViViT | Action | 20.82 | 19.93 | 18.67 | 17.02 | 15.06 | 18.30 | [`link`]() | configs/projects/epic-kitchen-tal/bmn-epic/vivit-os-local.yaml |

## MoSI
Note: for the following models, decord 0.4.1 is used rather than the codebase default of 0.6.0.
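To match that setup, the older decord release can be installed with pip, for example:
```
pip install decord==0.4.1
```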

### Pre-train (without finetuning)
| dataset | backbone | checkpoint | config |
| ------- | -------- | ---------- | ------ |
| HMDB51 | R-2D3D-18 | [`link`]() | configs/projects/mosi/pt-hmdb/r2d3ds.yaml |
| HMDB51 | R(2+1)D-10 | [`link`]() | configs/projects/mosi/pt-hmdb/r2p1d.yaml |
| UCF101 | R-2D3D-18 | [`link`]() | configs/projects/mosi/pt-ucf/r2d3ds.yaml |
| UCF101 | R(2+1)D-10 | [`link`]() | configs/projects/mosi/pt-ucf/r2p1d.yaml |

### Finetuned
| dataset | backbone | acc@1 | acc@5 | checkpoint | config |
| ------- | -------- | ----- | ----- | ---------- | ------ |
| HMDB51 | R-2D3D-18 | 46.93 | 74.71 | [`link`]() | configs/projects/mosi/ft-hmdb/r2d3ds.yaml |
| HMDB51 | R(2+1)D-10 | 51.83 | 78.63 | [`link`]() | configs/projects/mosi/ft-hmdb/r2p1d.yaml |
| UCF101 | R-2D3D-18 | 71.75 | 89.14 | [`link`]() | configs/projects/mosi/ft-ucf/r2d3ds.yaml |
| UCF101 | R(2+1)D-10 | 82.79 | 95.78 | [`link`]() | configs/projects/mosi/ft-ucf/r2p1d.yaml |
@@ -1,10 +1,47 @@
# pytorch-video-understanding
This codebase provides a comprehensive video understanding solution for video classification and temporal detection.

Key features:
- Video classification: state-of-the-art video models, with self-supervised representation learning approaches for pre-training and a supervised classification pipeline for fine-tuning.
- Video temporal detection: strong features ready for both feature-level classification and localization, as well as a standard pipeline that takes advantage of these features for temporal action detection.

The approaches implemented in this repo include, but are not limited to, the following papers:

- Self-supervised Motion Learning from Static Images <br>
  [[Project](projects/mosi/README.md)] [[Paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Huang_Self-Supervised_Motion_Learning_From_Static_Images_CVPR_2021_paper)]
  **CVPR 2021**
- A Stronger Baseline for Ego-Centric Action Detection <br>
  [[Project](projects/epic-kitchen-tal/README.md)] [[Paper](https://arxiv.org/pdf/2106.06942)]
  **First-place** submission to the [EPIC-KITCHENS-100 Action Detection Challenge](https://competitions.codalab.org/competitions/25926#results)
- Towards Training Stronger Video Vision Transformers for EPIC-KITCHENS-100 Action Recognition <br>
  [[Project](projects/epic-kitchen-ar/README.md)] [[Paper](https://arxiv.org/pdf/2106.05058)]
  **Second-place** submission to the [EPIC-KITCHENS-100 Action Recognition Challenge](https://competitions.codalab.org/competitions/25923#results)
- TAda! Temporally-Adaptive Convolutions for Video Understanding <br>
  [[Project](projects/tada/README.md)] [[Paper](https://arxiv.org/pdf/2110.06178.pdf)]
  **Preprint**

# Latest

[2021-10] Code and models are released!

# Model Zoo

We include our pre-trained models in [MODEL_ZOO.md](MODEL_ZOO.md).

# Feature Zoo

We include strong features for [HACS](http://hacs.csail.mit.edu/) and [Epic-Kitchens-100](https://epic-kitchens.github.io/2021) in our [FEATURE_ZOO.md](FEATURE_ZOO.md).

# Guidelines

The general pipeline for using this repo is: installation, data preparation, and running.
See [GUIDELINES.md](GUIDELINES.md).

# Contributors

This codebase is written and maintained by [Ziyuan Huang](https://huang-ziyuan.github.io/), [Zhiwu Qing](https://scholar.google.com/citations?user=q9refl4AAAAJ&hl=zh-CN) and [Xiang Wang](https://scholar.google.com/citations?user=cQbXvkcAAAAJ&hl=zh-CN).

If you find our codebase useful, please consider citing the respective work :).

# Upcoming
[ParamCrop: Parametric Cubic Cropping for Video Contrastive Learning](https://arxiv.org/abs/2108.10501).
@@ -0,0 +1,33 @@
MODEL:
  NAME: irCSN
VIDEO:
  BACKBONE:
    DEPTH: 152
    META_ARCH: ResNet3D
    NUM_FILTERS: [64, 256, 512, 1024, 2048]
    NUM_INPUT_CHANNELS: 3
    NUM_OUT_FEATURES: 2048
    KERNEL_SIZE: [
      [3, 7, 7],
      [3, 3, 3],
      [3, 3, 3],
      [3, 3, 3],
      [3, 3, 3]
    ]
    DOWNSAMPLING: [true, false, true, true, true]
    DOWNSAMPLING_TEMPORAL: [false, false, true, true, true]
    NUM_STREAMS: 1
    EXPANSION_RATIO: 4
    BRANCH:
      NAME: CSNBranch
    STEM:
      NAME: DownSampleStem
    NONLOCAL:
      ENABLE: false
      STAGES: [5]
      MASK_ENABLE: false
  HEAD:
    NAME: BaseHead
    ACTIVATION: softmax
    DROPOUT_RATE: 0
    NUM_CLASSES: # !!!
@@ -0,0 +1,10 @@
MODEL:
  NAME: BaseVideoModel
VIDEO:
  DIM1D: 256
  DIM2D: 128
  DIM3D: 512
  BACKBONE_LAYER: 2
  BACKBONE_GROUPS_NUM: 4
  BACKBONE:
    META_ARCH: SimpleLocalizationConv
@@ -0,0 +1,33 @@
MODEL:
  NAME: R2D3D
VIDEO:
  BACKBONE:
    DEPTH: 18
    META_ARCH: ResNet3D
    NUM_FILTERS: [64, 64, 128, 256, 256]
    NUM_INPUT_CHANNELS: 3
    NUM_OUT_FEATURES: 256
    KERNEL_SIZE: [
      [1, 7, 7],
      [1, 3, 3],
      [1, 3, 3],
      [3, 3, 3],
      [3, 3, 3]
    ]
    DOWNSAMPLING: [true, false, true, true, true]
    DOWNSAMPLING_TEMPORAL: [false, false, false, true, true]
    NUM_STREAMS: 1
    EXPANSION_RATIO: 2
    BRANCH:
      NAME: R2D3DBranch
    STEM:
      NAME: DownSampleStem
    NONLOCAL:
      ENABLE: false
      STAGES: [5]
      MASK_ENABLE: false
  HEAD:
    NAME: BaseHead
    ACTIVATION: softmax
    DROPOUT_RATE: 0
    NUM_CLASSES: # !!!
@@ -0,0 +1,33 @@
MODEL:
  NAME: R2Plus1D
VIDEO:
  BACKBONE:
    DEPTH: 10
    META_ARCH: ResNet3D
    NUM_INPUT_CHANNELS: 3
    NUM_FILTERS: [64, 64, 128, 256, 512]
    NUM_OUT_FEATURES: 512
    KERNEL_SIZE: [
      [3, 7, 7],
      [3, 3, 3],
      [3, 3, 3],
      [3, 3, 3],
      [3, 3, 3]
    ]
    DOWNSAMPLING: [true, false, true, true, true]
    DOWNSAMPLING_TEMPORAL: [false, false, true, true, true]
    NUM_STREAMS: 1
    EXPANSION_RATIO: 2
    BRANCH:
      NAME: R2Plus1DBranch
    STEM:
      NAME: R2Plus1DStem
    NONLOCAL:
      ENABLE: false
      STAGES: [5]
      MASK_ENABLE: false
  HEAD:
    NAME: BaseHead
    ACTIVATION: softmax
    DROPOUT_RATE: 0
    NUM_CLASSES: # !!!
@@ -0,0 +1,21 @@
MODEL:
  NAME: S3DG
VIDEO:
  BACKBONE:
    META_ARCH: Inception3D
    NUM_OUT_FEATURES: 1024
    NUM_STREAMS: 1
    BRANCH:
      NAME: STConv3d
      GATING: true
    STEM:
      NAME: STConv3d
    NONLOCAL:
      ENABLE: false
      STAGES: [5]
      MASK_ENABLE: false
  HEAD:
    NAME: BaseHead
    ACTIVATION: softmax
    DROPOUT_RATE: 0
    NUM_CLASSES: # !!!
@@ -0,0 +1,59 @@
MODEL:
  NAME: SlowFast_4x16
VIDEO:
  BACKBONE:
    DEPTH: 50
    META_ARCH: Slowfast
    NUM_FILTERS: [64, 256, 512, 1024, 2048]
    NUM_INPUT_CHANNELS: 3
    NUM_OUT_FEATURES: 2048
    KERNEL_SIZE: [
      [
        [1, 7, 7],
        [1, 3, 3],
        [1, 3, 3],
        [1, 3, 3],
        [1, 3, 3],
      ],
      [
        [5, 7, 7],
        [1, 3, 3],
        [1, 3, 3],
        [1, 3, 3],
        [1, 3, 3],
      ],
    ]
    DOWNSAMPLING: [true, false, true, true, true]
    DOWNSAMPLING_TEMPORAL: [false, false, false, false, false]
    TEMPORAL_CONV_BOTTLENECK: [
      [false, false, false, true, true], # slow branch
      [false, true, true, true, true]    # fast branch
    ]
    NUM_STREAMS: 1
    EXPANSION_RATIO: 4
    BRANCH:
      NAME: SlowfastBranch
    STEM:
      NAME: DownSampleStem
    SLOWFAST:
      MODE: slowfast
      ALPHA: 8
      BETA: 8 # slow fast channel ratio
      CONV_CHANNEL_RATIO: 2
      KERNEL_SIZE: 5
      FUSION_CONV_BIAS: false
      FUSION_BN: true
      FUSION_RELU: true
    NONLOCAL:
      ENABLE: false
      STAGES: [5]
      MASK_ENABLE: false
  HEAD:
    NAME: SlowFastHead
    ACTIVATION: softmax
    DROPOUT_RATE: 0
    NUM_CLASSES: # !!!
DATA:
  NUM_INPUT_FRAMES: 32
  SAMPLING_RATE: 2