
Just Add $\pi$! Pose Induced Video Transformers for Understanding Activities of Daily Living.

[Paper] [Pretrained models]

This is the official code for the CVPR 2024 paper titled "Just Add $\pi$! Pose Induced Video Transformers for Understanding Activities of Daily Living"

Installation

First, create a conda environment and activate it:

conda create -n pivit python=3.7 -y
source activate pivit

Then, install the following packages:

  • torch & torchvision pip install torch===1.8.1+cu111 torchvision===0.9.1+cu111 -f https://download.pytorch.org/whl/torch_stable.html
  • fvcore: pip install 'git+https://github.com/facebookresearch/fvcore'
  • PyAV: conda install av -c conda-forge
  • misc: pip install simplejson einops timm psutil scikit-learn opencv-python tensorboard

Lastly, build the codebase by running:

git clone https://github.com/dominickrei/pi-vit
cd pi-vit
python setup.py build develop

Data preparation

We make use of the following action recognition datasets for evaluation: Toyota Smarthome, NTU RGB+D, and NTU RGB+D 120. Download the datasets from their respective sources and structure their directories in the following formats.

Smarthome

├── Smarthome
    ├── mp4
        ├── Cook.Cleandishes_p02_r00_v02_c03.mp4
        ├── Cook.Cleandishes_p02_r00_v14_c03.mp4
        ├── ...
    ├── skeletonv12
        ├── Cook.Cleandishes_p02_r00_v02_c03_pose3d.json
        ├── Cook.Cleandishes_p02_r00_v14_c03_pose3d.json
        ├── ...

NTU RGB+D

├── NTU
    ├── rgb
        ├── S001C001P001R001A001_rgb.avi
        ├── S001C001P001R001A002_rgb.avi
        ├── ...
    ├── skeletons
        ├── S001C001P001R001A001.skeleton.npy
        ├── S001C001P001R001A002.skeleton.npy
        ├── ...

Preparing CSVs

After downloading the datasets, create CSVs for the training, testing, and validation splits as train.csv, test.csv, and val.csv. The format of each CSV is:

path_to_video_1,path_to_video_1_skeleton,label_1
path_to_video_2,path_to_video_2_skeleton,label_2
...
path_to_video_N,path_to_video_N_skeleton,label_N
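
These CSVs can be generated with a short script. Below is a minimal sketch, assuming the Smarthome layout shown above; the paths are placeholders, and label_for_video is a hypothetical helper you would implement to map a filename to its integer class label:

import csv
import glob
import os

def label_for_video(video_path):
    # Hypothetical helper: derive the integer class label from the
    # filename, e.g. from the activity prefix 'Cook.Cleandishes'.
    raise NotImplementedError

rows = []
for video in sorted(glob.glob('/path/to/Smarthome/mp4/*.mp4')):
    name = os.path.splitext(os.path.basename(video))[0]
    skeleton = os.path.join('/path/to/Smarthome/skeletonv12', name + '_pose3d.json')
    rows.append([video, skeleton, label_for_video(video)])

with open('train.csv', 'w', newline='') as f:
    csv.writer(f).writerows(rows)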

Usage

We provide configs for training $\pi$-ViT on Smarthome and NTU in configs/. Please update the paths in the config to match the paths on your machine before using.
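
For reference, the path-like options mentioned in this README are nested YAML keys in the TimeSformer-style configs. A sketch of the relevant keys only (the full configs contain many more options, and the values here are placeholders):

TRAIN:
  CHECKPOINT_FILE_PATH: /path/to/pretrained/model
EXPERIMENTAL:
  HYPERFORMER_FEATURES_PATH: /path/to/hyperformer_features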

Training

Download the necessary pretrained models (Kinetics-400 for Smarthome and SSv2 for NTU) from this link and update TRAIN.CHECKPOINT_FILE_PATH to point to the downloaded model.

For example, to train $\pi$-ViT on Smarthome using 8 GPUs, run the following command:

python tools/run_net.py --cfg configs/Smarthome/PIViT_Smarthome.yaml NUM_GPUS 8
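
Config options can also be overridden on the command line instead of editing the YAML, following the same pattern as the testing command below, for example:

python tools/run_net.py --cfg configs/Smarthome/PIViT_Smarthome.yaml NUM_GPUS 8 TRAIN.CHECKPOINT_FILE_PATH /path/to/pretrained/model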

Testing

We provide the following trained models:

| Model | Dataset | mCA | Top-1 | Downloads (direct download) |
| --- | --- | --- | --- | --- |
| $\pi$-ViT | Smarthome CS | 72.9 | - | HuggingFace |
| $\pi$-ViT | Smarthome CV2 | 64.8 | - | HuggingFace |
| $\pi$-ViT | NTU-120 CS | - | 91.9 | HuggingFace |
| $\pi$-ViT | NTU-120 CSetup | - | 92.9 | HuggingFace |
| $\pi$-ViT | NTU-60 CS | - | 94.0 | HuggingFace |
| $\pi$-ViT | NTU-60 CV | - | 97.9 | HuggingFace |

After downloading a pretrained model, evaluate it using the command:

python tools/run_net.py --cfg configs/Smarthome/PIViT_Smarthome.yaml NUM_GPUS 8 TEST.CHECKPOINT_FILE_PATH /path/to/downloaded/model TRAIN.ENABLE False

Setting up skeleton features for $\pi$-ViT

During training, the 3D-SIM module in $\pi$-ViT requires features extracted from a pre-trained skeleton action recognition model, so every video in the training set must have a corresponding feature vector. The features should be stored in the directory indicated by the config option EXPERIMENTAL.HYPERFORMER_FEATURES_PATH.

$\pi$-ViT expects a directory containing one HDF5 file per video in the training dataset. For example, the directory structure for Smarthome should look like this:

├── /path/to/hyperformer_features
        ├── Cook.Cleandishes_p02_r00_v02_c03.h5
        ├── Cook.Cleandishes_p02_r00_v14_c03.h5
        ├── ...

Where Cook.Cleandishes_p02_r00_v02_c03.h5 is an HDF5 file containing a single dataset named data with shape 400x216. We provide a minimal example demonstrating how to save a feature vector in the format $\pi$-ViT expects:

import h5py
import numpy as np

# Random placeholder features; real features come from a pre-trained
# skeleton action recognition model (e.g. Hyperformer).
skeleton_features = np.random.rand(400, 216)

with h5py.File('random_tensor.h5', 'w') as f:
    f.create_dataset('data', data=skeleton_features)
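
To sanity-check a saved file, the dataset can be read back and its shape verified; a minimal sketch using h5py:

import h5py

with h5py.File('random_tensor.h5', 'r') as f:
    features = f['data'][:]  # load the full array into memory

assert features.shape == (400, 216)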

Due to the large size of the skeleton feature datasets, we do not upload them here. Instead, we provide the Hyperformer models pre-trained on Toyota Smarthome in hyperformer_models/. NTU-trained models, and details for running the Hyperformer model, are available here.

Citation & Acknowledgement

@inproceedings{reilly2024pivit,
    title={Just Add $\pi$! Pose Induced Video Transformers for Understanding Activities of Daily Living},
    author={Dominick Reilly and Srijan Das},
    booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    year={2024}
}

Our primary contributions can be found in:

This repository is built on top of TimeSformer.
