initial commit
huang-ziyuan committed Oct 27, 2021
1 parent 5f401a4 commit fa78f82
Showing 153 changed files with 15,475 additions and 7 deletions.
13 changes: 13 additions & 0 deletions FEATURE_ZOO.md
@@ -0,0 +1,13 @@
# FEATURE ZOO

Here, we provide strong features for temporal action localization on HACS and Epic-Kitchens-100.

| dataset | model | resolution | features | classification | average mAP |
| ------------ | ------------ | ------------ | ------------ | ------------ | ------------ |
| EK100 | ViViT Fact. Enc.-B16x2 | 32 x 2 | [features]() | [classification]() | 18.30 (Action) |
| EK100 | TAda2D | 8 x 8 | [features]() | [classification]() | 13.18 |
| HACS | TAda2D | 8 x 8 | [features]() | - | 32.3 |

Annotations used for temporal action localization with our codebase can be found [here]().

Pre-trained localization models using these features can be found in [MODEL_ZOO.md](MODEL_ZOO.md).
64 changes: 64 additions & 0 deletions GUIDELINES.md
@@ -0,0 +1,64 @@
# Guidelines for pytorch-video-understanding

## Installation

Requirements:
- Python>=3.6
- torch>=1.5
- torchvision (version corresponding with torch)
- simplejson==3.11.1
- decord>=0.6.0
- pyyaml
- einops
- oss2
- psutil
- tqdm
- pandas

Optional requirements:
- fvcore (for FLOPs calculation)
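
As a quick sanity check after installation, the following minimal Python snippet (a sketch, not part of the codebase) verifies that the dependencies listed above are importable and prints the versions that matter most:

```python
# Sanity-check sketch: confirm the requirements listed above import correctly.
import torch
import torchvision
import decord
import simplejson
import yaml      # provided by the pyyaml package
import einops
import oss2
import psutil
import tqdm
import pandas

print("torch:", torch.__version__)             # expected >= 1.5
print("torchvision:", torchvision.__version__)
print("decord:", decord.__version__)           # expected >= 0.6.0
```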

## Data preparation

For all datasets available in `datasets/base`, the name of each dataset list file is specified in the `_get_dataset_list_name` function.
The table below summarizes the names and formats of the lists for all supported datasets; a minimal sketch of parsing one of the plain-text lists follows the table.

| dataset | split | list file name | format |
| ------- | ----- | -------------- | ------ |
| epic-kitchens-100 | train | EPIC_100_train.csv | as downloaded |
| epic-kitchens-100 | val | EPIC_100_validation.csv | as downloaded |
| epic-kitchens-100 | test | EPIC_100_test_timestamps.csv | as downloaded |
| hmdb51 | train/val | hmdb51_train_list.txt/hmdb51_val_list.txt | "video_path, supervised_label" |
| imagenet | train/val | imagenet_train.txt/imagenet_val.txt | "image_path, supervised_label" |
| kinetics 400 | train/val | kinetics400_train_list.txt/kinetics400_val_list.txt | "video_path, supervised_label" |
| ssv2 | train | something-something-v2-train-with-label.json | json file with "label_idx" specifying the class and "id" specifying the name |
| ssv2 | val | something-something-v2-val-with-label.json | json file with "label_idx" specifying the class and "id" specifying the name |
| ucf101 | train/val | ucf101_train_list.txt/ucf101_val_list.txt | "video_path, supervised_label" |
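
For illustration, here is a minimal sketch (not part of the codebase) of parsing one of the plain-text lists in the "video_path, supervised_label" format, assuming comma-separated entries and a hypothetical file location:

```python
# Sketch: parse a "video_path, supervised_label" list such as
# kinetics400_train_list.txt (the file path and separator are assumptions).
samples = []
with open("kinetics400_train_list.txt") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        video_path, label = line.rsplit(",", 1)
        samples.append((video_path.strip(), int(label)))

print(f"{len(samples)} samples, e.g. {samples[0]}")
```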

For epic-kitchens-features, the file names are specified in the respective configs under `configs/projects/epic-kitchen-tal`.

## Running

The entry point for all runs is `runs/run.py`.

Before running, several settings need to be configured in the config file.
To keep experimentation friendly for rapid development of new models and representation learning approaches, the config files are organized hierarchically.

Taking TAda2D as an example, each experiment (such as TAda2D 8x8 on Kinetics 400: `configs/projects/tada/k400/tada2d_8x8.yaml`) inherits its config from the following hierarchy.
```
--- base config file [configs/pool/base.yaml]
--- base run config [configs/pool/run/training/from_scratch_large.yaml]
--- base backbone config [configs/pool/backbone/tada2d.yaml]
--- base experiment config [configs/projects/tada/tada2d_k400.yaml]
--- current experiment config [configs/projects/tada/k400/tada2d_8x8.yaml]
```
Generally, the base config file `configs/pool/base.yaml` contains all the possible keys used in this codebase, and a config lower in the hierarchy overwrites its base config whenever the same key appears in both files.
A good practice is to set the parameters shared by all experiments in the base experiment config, and to set the parameters that differ between experiments in the current experiment config.
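
To make the overriding behavior concrete, here is a small sketch using plain dictionaries (not the codebase's actual config implementation) of how a config lower in the hierarchy recursively overwrites its base; the keys are taken from the configs in this repo:

```python
# Sketch of hierarchical config merging: keys in `child` overwrite the same
# keys in `base`, recursing into nested sections.
def merge_configs(base: dict, child: dict) -> dict:
    merged = dict(base)
    for key, value in child.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_configs(merged[key], value)
        else:
            merged[key] = value
    return merged

base = {"DATA": {"NUM_INPUT_FRAMES": 8, "SAMPLING_RATE": 8}, "NUM_GPUS": 1}
experiment = {"DATA": {"NUM_INPUT_FRAMES": 16, "SAMPLING_RATE": 5}}
print(merge_configs(base, experiment))
# -> {'DATA': {'NUM_INPUT_FRAMES': 16, 'SAMPLING_RATE': 5}, 'NUM_GPUS': 1}
```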

For an example run, open `configs/projects/tada/tada2d_k400.yaml` and:
A. set `DATA.DATA_ROOT_DIR` and `DATA.DATA_ANNO_DIR` to point to the Kinetics 400 data and annotations;
B. set `NUM_GPUS` to the number of available GPUs.
Then the codebase can be run with:
```
python runs/run.py --cfg configs/projects/tada/k400/tada2d_8x8.yaml
```
49 changes: 49 additions & 0 deletions MODEL_ZOO.md
@@ -0,0 +1,49 @@
# MODEL ZOO

## Kinetics

| Dataset | architecture | depth | init | clips x crops | #frames x sampling rate | acc@1 | acc@5 | checkpoint | config |
| ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ |
| K400 | TAda2D | R50 | IN-1K | 10 x 3 | 8 x 8 | 76.3 | 92.4 | [`link`]() | configs/projects/tada/k400/tada2d_8x8.yaml |
| K400 | TAda2D | R50 | IN-1K | 10 x 3 | 16 x 5 | 76.9 | 92.7 | [`link`]() | configs/projects/tada/k400/tada2d_16x5.yaml |
| K400 | ViViT Fact. Enc. | B16x2 | IN-21K | 4 x 3 | 32 x 2 | 79.4 | 94.0 | [`link`]() | configs/projects/competition/k400/vivit_fac_enc_b16x2.yaml |

## Something-Something
| Dataset | architecture | depth | init | clips x crops | #frames | acc@1 | acc@5 | checkpoint | config |
| ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ |
| SSV2 | TAda2D | R50 | IN-1K | 2 x 3 | 8 | 63.8 | 87.7 | [`link`]() | configs/projects/tada/ssv2/tada2d_8f.yaml |
| SSV2 | TAda2D | R50 | IN-1K | 2 x 3 | 16 | 65.2 | 89.1 | [`link`]() | configs/projects/tada/ssv2/tada2d_16f.yaml |

## Epic-Kitchens Action Recognition

| architecture | init | resolution | clips x crops | #frames x sampling rate | action acc@1 | verb acc@1 | noun acc@1 | checkpoint | config |
| ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ |
| ViViT Fact. Enc.-B16x2 | K700 | 320 | 4 x 3 | 32 x 2 | 46.3 | 67.4 | 58.9 | [`link`]() | configs/projects/competition/ek100/vivit_fac_enc.yaml |
| ir-CSN-R152 | K700 | 224 | 10 x 3 | 32 x 2 | 44.5 | 68.4 | 55.9 | [`link`]() | configs/projects/competition/ek100/csn.yaml |

## Epic-Kitchens Temporal Action Localization

| feature | classification | type | mAP@0.1 | mAP@0.2 | mAP@0.3 | mAP@0.4 | mAP@0.5 | Avg | checkpoint | config |
| ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ |
| ViViT | ViViT | Verb | 22.90 | 21.93 | 20.74 | 19.08 | 16.00 | 20.13 | [`link`]() | configs/projects/epic-kitchen-tal/bmn-epic/vivit-os-local.yaml |
| ViViT | ViViT | Noun | 28.95 | 27.38 | 25.52 | 22.67 | 18.95 | 24.69 | [`link`]() | configs/projects/epic-kitchen-tal/bmn-epic/vivit-os-local.yaml |
| ViViT | ViViT | Action | 20.82 | 19.93 | 18.67 | 17.02 | 15.06 | 18.30 | [`link`]() | configs/projects/epic-kitchen-tal/bmn-epic/vivit-os-local.yaml |

## MoSI
Note: the following models use decord 0.4.1 rather than the codebase default of 0.6.0.

### Pre-train (without finetuning)
| dataset | backbone | checkpoint | config |
| ------- | -------- | ---------- | ------ |
| HMDB51 | R-2D3D-18 | [`link`]() | configs/projects/mosi/pt-hmdb/r2d3ds.yaml |
| HMDB51 | R(2+1)D-10 | [`link`]() | configs/projects/mosi/pt-hmdb/r2p1d.yaml |
| UCF101 | R-2D3D-18 | [`link`]() | configs/projects/mosi/pt-ucf/r2d3ds.yaml |
| UCF101 | R(2+1)D-10 | [`link`]() | configs/projects/mosi/pt-ucf/r2p1d.yaml |

### Finetuned
| dataset | backbone | acc@1 | acc@5 | checkpoint | config |
| ------- | -------- | ----- | ----- | ---------- | ------ |
| HMDB51 | R-2D3D-18 | 46.93 | 74.71 | [`link`]() | configs/projects/mosi/ft-hmdb/r2d3ds.yaml |
| HMDB51 | R(2+1)D-10 | 51.83 | 78.63 | [`link`]() | configs/projects/mosi/ft-hmdb/r2p1d.yaml |
| UCF101 | R-2D3D-18 | 71.75 | 89.14 | [`link`]() | configs/projects/mosi/ft-ucf/r2d3ds.yaml |
| UCF101 | R(2+1)D-10 | 82.79 | 95.78 | [`link`]() | configs/projects/mosi/ft-ucf/r2p1d.yaml |
51 changes: 44 additions & 7 deletions README.md
@@ -1,10 +1,47 @@
# pytorch-video-understanding
This codebase will provide a comprehensive video understanding solution, including state-of-the-art video models (both convolutional and transformer-based), self-supervised video representation learning approaches and temporal action detection methods, etc.
This codebase provides a comprehensive video understanding solution for video classification and temporal detection.

Works to be released soon:
- [Self-supervised Motion Learning from Static Images](https://openaccess.thecvf.com/content/CVPR2021/papers/Huang_Self-Supervised_Motion_Learning_From_Static_Images_CVPR_2021_paper).
- [Towards Training Stronger Video Vision Transformers for EPIC-KITCHENS-100 Action Recognition](https://arxiv.org/pdf/2106.05058)
- [A Stronger Baseline for Ego-Centric Action Detection](https://arxiv.org/pdf/2106.06942)
Key features:
- Video classification: State-of-the-art video models, with self-supervised representation learning approaches for pre-training and a supervised classification pipeline for fine-tuning.
- Video temporal detection: Strong features ready for both feature-level classification and localization, as well as a standard pipeline that uses these features for temporal action detection.

Other standard models and approaches for learning representations in videos will also be included.
Stay tuned!
The approaches implemented in this repo include but are not limited to the following papers:

- Self-supervised Motion Learning from Static Images <br>
[[Project](projects/mosi/README.md)] [[Paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Huang_Self-Supervised_Motion_Learning_From_Static_Images_CVPR_2021_paper)]
**CVPR 2021**
- A Stronger Baseline for Ego-Centric Action Detection <br>
[[Project](projects/epic-kitchen-tal/README.md)] [[Paper](https://arxiv.org/pdf/2106.06942)]
**First-place** submission to [EPIC-KITCHENS-100 Action Detection Challenge](https://competitions.codalab.org/competitions/25926#results)
- Towards Training Stronger Video Vision Transformers for EPIC-KITCHENS-100 Action Recognition <br>
[[Project](projects/epic-kitchen-ar/README.md)] [[Paper](https://arxiv.org/pdf/2106.05058)]
**Second-place** submission to [EPIC-KITCHENS-100 Action Recognition challenge](https://competitions.codalab.org/competitions/25923#results)
- TAda! Temporally-Adaptive Convolutions for Video Understanding <br>
[[Project](projects/tada/README.md)] [[Paper](https://arxiv.org/pdf/2110.06178.pdf)]
**Preprint**

# Latest

[2021-10] Code and models are released!

# Model Zoo

We include our pre-trained models in the [MODEL_ZOO.md](MODEL_ZOO.md).

# Feature Zoo

We include strong features for [HACS](http://hacs.csail.mit.edu/) and [Epic-Kitchens-100](https://epic-kitchens.github.io/2021) in our [FEATURE_ZOO.md](FEATURE_ZOO.md).

# Guidelines

The general pipeline for using this repo consists of installation, data preparation, and running.
See [GUIDELINES.md](GUIDELINES.md).

# Contributors

This codebase is written and maintained by [Ziyuan Huang](https://huang-ziyuan.github.io/), [Zhiwu Qing](https://scholar.google.com/citations?user=q9refl4AAAAJ&hl=zh-CN) and [Xiang Wang](https://scholar.google.com/citations?user=cQbXvkcAAAAJ&hl=zh-CN).

If you find our codebase useful, please consider citing the respective work :).

# Upcoming
[ParamCrop: Parametric Cubic Cropping for Video Contrastive Learning](https://arxiv.org/abs/2108.10501).
33 changes: 33 additions & 0 deletions configs/pool/backbone/csn.yaml
@@ -0,0 +1,33 @@
MODEL:
NAME: irCSN
VIDEO:
BACKBONE:
DEPTH: 152
META_ARCH: ResNet3D
NUM_FILTERS: [64, 256, 512, 1024, 2048]
NUM_INPUT_CHANNELS: 3
NUM_OUT_FEATURES: 2048
KERNEL_SIZE: [
[3, 7, 7],
[3, 3, 3],
[3, 3, 3],
[3, 3, 3],
[3, 3, 3]
]
DOWNSAMPLING: [true, false, true, true, true]
DOWNSAMPLING_TEMPORAL: [false, false, true, true, true]
NUM_STREAMS: 1
EXPANSION_RATIO: 4
BRANCH:
NAME: CSNBranch
STEM:
NAME: DownSampleStem
NONLOCAL:
ENABLE: false
STAGES: [5]
MASK_ENABLE: false
HEAD:
NAME: BaseHead
ACTIVATION: softmax
DROPOUT_RATE: 0
NUM_CLASSES: # !!!
10 changes: 10 additions & 0 deletions configs/pool/backbone/localization-conv.yaml
@@ -0,0 +1,10 @@
MODEL:
NAME: BaseVideoModel
VIDEO:
DIM1D: 256
DIM2D: 128
DIM3D: 512
BACKBONE_LAYER: 2
BACKBONE_GROUPS_NUM: 4
BACKBONE:
META_ARCH: SimpleLocalizationConv
33 changes: 33 additions & 0 deletions configs/pool/backbone/r2d3ds.yaml
@@ -0,0 +1,33 @@
MODEL:
NAME: R2D3D
VIDEO:
BACKBONE:
DEPTH: 18
META_ARCH: ResNet3D
NUM_FILTERS: [64, 64, 128, 256, 256]
NUM_INPUT_CHANNELS: 3
NUM_OUT_FEATURES: 256
KERNEL_SIZE: [
[1, 7, 7],
[1, 3, 3],
[1, 3, 3],
[3, 3, 3],
[3, 3, 3]
]
DOWNSAMPLING: [true, false, true, true, true]
DOWNSAMPLING_TEMPORAL: [false, false, false, true, true]
NUM_STREAMS: 1
EXPANSION_RATIO: 2
BRANCH:
NAME: R2D3DBranch
STEM:
NAME: DownSampleStem
NONLOCAL:
ENABLE: false
STAGES: [5]
MASK_ENABLE: false
HEAD:
NAME: BaseHead
ACTIVATION: softmax
DROPOUT_RATE: 0
NUM_CLASSES: # !!!
33 changes: 33 additions & 0 deletions configs/pool/backbone/r2p1d.yaml
@@ -0,0 +1,33 @@
MODEL:
NAME: R2Plus1D
VIDEO:
BACKBONE:
DEPTH: 10
META_ARCH: ResNet3D
NUM_INPUT_CHANNELS: 3
NUM_FILTERS: [64, 64, 128, 256, 512]
NUM_OUT_FEATURES: 512
KERNEL_SIZE: [
[3, 7, 7],
[3, 3, 3],
[3, 3, 3],
[3, 3, 3],
[3, 3, 3]
]
DOWNSAMPLING: [true, false, true, true, true]
DOWNSAMPLING_TEMPORAL: [false, false, true, true, true]
NUM_STREAMS: 1
EXPANSION_RATIO: 2
BRANCH:
NAME: R2Plus1DBranch
STEM:
NAME: R2Plus1DStem
NONLOCAL:
ENABLE: false
STAGES: [5]
MASK_ENABLE: false
HEAD:
NAME: BaseHead
ACTIVATION: softmax
DROPOUT_RATE: 0
NUM_CLASSES: # !!!
21 changes: 21 additions & 0 deletions configs/pool/backbone/s3dg.yaml
@@ -0,0 +1,21 @@
MODEL:
NAME: S3DG
VIDEO:
BACKBONE:
META_ARCH: Inception3D
NUM_OUT_FEATURES: 1024
NUM_STREAMS: 1
BRANCH:
NAME: STConv3d
GATING: true
STEM:
NAME: STConv3d
NONLOCAL:
ENABLE: false
STAGES: [5]
MASK_ENABLE: false
HEAD:
NAME: BaseHead
ACTIVATION: softmax
DROPOUT_RATE: 0
NUM_CLASSES: # !!!
59 changes: 59 additions & 0 deletions configs/pool/backbone/slowfast_4x16.yaml
@@ -0,0 +1,59 @@
MODEL:
NAME: SlowFast_4x16
VIDEO:
BACKBONE:
DEPTH: 50
META_ARCH: Slowfast
NUM_FILTERS: [64, 256, 512, 1024, 2048]
NUM_INPUT_CHANNELS: 3
NUM_OUT_FEATURES: 2048
KERNEL_SIZE: [
[
[1, 7, 7],
[1, 3, 3],
[1, 3, 3],
[1, 3, 3],
[1, 3, 3],
],
[
[5, 7, 7],
[1, 3, 3],
[1, 3, 3],
[1, 3, 3],
[1, 3, 3],
],
]
DOWNSAMPLING: [true, false, true, true, true]
DOWNSAMPLING_TEMPORAL: [false, false, false, false, false]
TEMPORAL_CONV_BOTTLENECK:
[
[false, false, false, true, true], # slow branch,
[false, true, true, true, true] # fast branch
]
NUM_STREAMS: 1
EXPANSION_RATIO: 4
BRANCH:
NAME: SlowfastBranch
STEM:
NAME: DownSampleStem
SLOWFAST:
MODE: slowfast
ALPHA: 8
BETA: 8 # slow fast channel ratio
CONV_CHANNEL_RATIO: 2
KERNEL_SIZE: 5
FUSION_CONV_BIAS: false
FUSION_BN: true
FUSION_RELU: true
NONLOCAL:
ENABLE: false
STAGES: [5]
MASK_ENABLE: false
HEAD:
NAME: SlowFastHead
ACTIVATION: softmax
DROPOUT_RATE: 0
NUM_CLASSES: # !!!
DATA:
NUM_INPUT_FRAMES: 32
SAMPLING_RATE: 2
