
Just Add $\pi$! Pose Induced Video Transformers for Understanding Activities of Daily Living.

[Paper] [Pretrained models]

This is the official code for the CVPR 2024 paper titled "Just Add $\pi$! Pose Induced Video Transformers for Understanding Activities of Daily Living"

Installation

First, create a conda environment and activate it:

conda create -n pivit python=3.7 -y
source activate pivit

Then, install the following packages:

  • torch & torchvision pip install torch===1.8.1+cu111 torchvision===0.9.1+cu111 -f https://download.pytorch.org/whl/torch_stable.html
  • fvcore: pip install 'git+https://github.com/facebookresearch/fvcore'
  • PyAV: conda install av -c conda-forge
  • misc: pip install simplejson einops timm psutil scikit-learn opencv-python tensorboard

Lastly, build the codebase by running:

git clone https://github.com/dominickrei/pi-vit
cd pi-vit
python setup.py build develop

Data preparation

We make use of the following action recognition datasets for evaluation: Toyota Smarthome, NTU RGB+D, and NTU RGB+D 120. Download the datasets from their respective sources and structure their directories in the following formats.

Smarthome

├── Smarthome
    ├── mp4
        ├── Cook.Cleandishes_p02_r00_v02_c03.mp4
        ├── Cook.Cleandishes_p02_r00_v14_c03.mp4
        ├── ...
    ├── skeletonv12
        ├── Cook.Cleandishes_p02_r00_v02_c03_pose3d.json
        ├── Cook.Cleandishes_p02_r00_v14_c03_pose3d.json
        ├── ...

NTU RGB+D

├── NTU
    ├── rgb
        ├── S001C001P001R001A001_rgb.avi
        ├── S001C001P001R001A002_rgb.avi
        ├── ...
    ├── skeletons
        ├── S001C001P001R001A001.skeleton.npy
        ├── S001C001P001R001A002.skeleton.npy
        ├── ...

Preparing CSVs

After downloading the datasets, create CSVs for the training, testing, and validation splits as train.csv, test.csv, and val.csv. The format of each CSV is:

path_to_video_1,path_to_video_1_skeleton,label_1
path_to_video_2,path_to_video_2_skeleton,label_2
...
path_to_video_N,path_to_video_N_skeleton,label_N
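
These CSVs can be generated with a short script. Below is a minimal sketch, assuming the Smarthome layout shown above; the paths are placeholders, and label_for_video is a hypothetical helper you would implement to map a filename to its integer class label:

import csv
import glob
import os

def label_for_video(video_path):
    # Hypothetical helper: derive the integer class label from the
    # filename, e.g. from the activity prefix 'Cook.Cleandishes'.
    raise NotImplementedError

rows = []
for video in sorted(glob.glob('/path/to/Smarthome/mp4/*.mp4')):
    name = os.path.splitext(os.path.basename(video))[0]
    skeleton = os.path.join('/path/to/Smarthome/skeletonv12', name + '_pose3d.json')
    rows.append([video, skeleton, label_for_video(video)])

with open('train.csv', 'w', newline='') as f:
    csv.writer(f).writerows(rows)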

Usage

We provide configs for training $\pi$-ViT on Smarthome and NTU in configs/. Please update the paths in the config to match the paths on your machine before using.
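
For reference, the path-like options mentioned in this README are nested YAML keys in the TimeSformer-style configs. A sketch of the relevant keys only (the full configs contain many more options, and the values here are placeholders):

TRAIN:
  CHECKPOINT_FILE_PATH: /path/to/pretrained/model
EXPERIMENTAL:
  HYPERFORMER_FEATURES_PATH: /path/to/hyperformer_features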

Training

Download the necessary pretrained models (Kinetics-400 for Smarthome and SSv2 for NTU) from this link and update TRAIN.CHECKPOINT_FILE_PATH to point to the downloaded model.

For example, to train $\pi$-ViT on Smarthome using 8 GPUs, run the following command:

python tools/run_net.py --cfg configs/Smarthome/PIViT_Smarthome.yaml NUM_GPUS 8
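
Config options can also be overridden on the command line instead of editing the YAML, following the same pattern as the testing command below, for example:

python tools/run_net.py --cfg configs/Smarthome/PIViT_Smarthome.yaml NUM_GPUS 8 TRAIN.CHECKPOINT_FILE_PATH /path/to/pretrained/model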

Testing

We provide the following trained models:

| Model | Dataset | mCA | Top-1 | Downloads (direct download) |
| --- | --- | --- | --- | --- |
| $\pi$-ViT | Smarthome CS | 72.9 | - | HuggingFace |
| $\pi$-ViT | Smarthome CV2 | 64.8 | - | HuggingFace |
| $\pi$-ViT | NTU-120 CS | - | 91.9 | HuggingFace |
| $\pi$-ViT | NTU-120 CSetup | - | 92.9 | HuggingFace |
| $\pi$-ViT | NTU-60 CS | - | 94.0 | HuggingFace |
| $\pi$-ViT | NTU-60 CV | - | 97.9 | HuggingFace |

After downloading a pretrained model, evaluate it using the command:

python tools/run_net.py --cfg configs/Smarthome/PIViT_Smarthome.yaml NUM_GPUS 8 TEST.CHECKPOINT_FILE_PATH /path/to/downloaded/model TRAIN.ENABLE False

Setting up skeleton features for $\pi$-ViT

During training, the 3D-SIM module in $\pi$-ViT requires features extracted from a pre-trained skeleton action recognition model, so every video in the training set must have a corresponding feature vector. The features should be stored in the directory indicated by the config option EXPERIMENTAL.HYPERFORMER_FEATURES_PATH.

$\pi$-ViT expects a directory containing one HDF5 file per video in the training dataset. For example, the directory structure for Smarthome should look like this:

├── /path/to/hyperformer_features
        ├── Cook.Cleandishes_p02_r00_v02_c03.h5
        ├── Cook.Cleandishes_p02_r00_v14_c03.h5
        ├── ...

Where Cook.Cleandishes_p02_r00_v02_c03.h5 is an HDF5 file containing a single dataset named data with shape 400x216. We provide a minimal example demonstrating how to save a feature vector in the format $\pi$-ViT expects:

import h5py
import numpy as np

# Random placeholder features; real features come from a pre-trained
# skeleton action recognition model (e.g. Hyperformer).
skeleton_features = np.random.rand(400, 216)

with h5py.File('random_tensor.h5', 'w') as f:
    f.create_dataset('data', data=skeleton_features)
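
To sanity-check a saved file, the dataset can be read back and its shape verified; a minimal sketch using h5py:

import h5py

with h5py.File('random_tensor.h5', 'r') as f:
    features = f['data'][:]  # load the full array into memory

assert features.shape == (400, 216)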

Due to the large size of the skeleton feature datasets, we do not upload them here. Instead, we provide the Hyperformer models pre-trained on Toyota Smarthome in hyperformer_models/. NTU-trained models, and details for running the Hyperformer model, are available here.

Citation & Acknowledgement

@inproceedings{reilly2024pivit,
    title={Just Add $\pi$! Pose Induced Video Transformers for Understanding Activities of Daily Living},
    author={Dominick Reilly and Srijan Das},
    booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    year={2024}
}

Our primary contributions can be found in:

This repository is built on top of TimeSformer.
