Interactive Spatiotemporal Token Attention Network for Skeleton-based General Interactive Action Recognition
This repository is the official implementation of Interactive Spatiotemporal Token Attention Network for Skeleton-based General Interactive Action Recognition (IROS 2023).
- 1. Change Log
- 2. Prerequisites
- 3. Prepare the Datasets
- 4. Run the Code
- 5. Acknowledgement
- 6. Citation
- [2023/12/19] Our paper is now available online in the IROS 2023 proceedings. Here's the link.
- [2023/07/15] Our paper has been accepted to IROS 2023. Visit our project website!
- [2023/03/07] Code uploaded.
To clone the main branch only (for code) and exclude the gh-pages branch (for project website), use the following git command:
git clone -b main https://github.com/Necolizer/ISTA-Net.git
pip install -r requirements.txt
Please refer to CTR-GCN and follow the instructions in its Data Preparation section to prepare NTU RGB+D 120.
For your convenience, here is an excerpt of those instructions:
Download
- Request dataset here: https://rose1.ntu.edu.sg/dataset/actionRecognition
- Download the skeleton-only datasets:
  - nturgbd_skeletons_s001_to_s017.zip (NTU RGB+D 60)
  - nturgbd_skeletons_s018_to_s032.zip (NTU RGB+D 120)
- Extract the above files to ./data/nturgbd_raw
Directory Structure
Put downloaded data into the following directory structure:
- data/
  - ntu/
  - ntu120/
  - nturgbd_raw/
    - nturgb+d_skeletons/ # from `nturgbd_skeletons_s001_to_s017.zip`
      ...
    - nturgb+d_skeletons120/ # from `nturgbd_skeletons_s018_to_s032.zip`
      ...
Generating Data
Generate NTU RGB+D 120 dataset:
cd ./data/ntu120
# Get skeleton of each performer
python get_raw_skes_data.py
# Remove the bad skeleton
python get_raw_denoised_data.py
# Transform the skeleton to the center of the first frame
python seq_transformation.py
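The comment on the last step describes translating each sequence so that it is centered on the first frame. The following is a minimal sketch of that idea only, not the code in seq_transformation.py. The function name `center_on_first_frame` and the default `origin_joint` index are assumptions made for illustration; check the script for the joint the real preprocessing uses.

```python
from typing import List

Joint = List[float]            # [x, y, z]
Frame = List[Joint]            # one entry per joint
SkeletonSeq = List[Frame]      # one entry per frame


def center_on_first_frame(seq: SkeletonSeq, origin_joint: int = 1) -> SkeletonSeq:
    """Translate every joint so the chosen joint of frame 0 sits at the origin.

    origin_joint is a hypothetical default used only for this sketch.
    """
    if not seq:
        return []
    ox, oy, oz = seq[0][origin_joint]
    return [
        [[x - ox, y - oy, z - oz] for x, y, z in frame]
        for frame in seq
    ]


# Toy sequence: 2 frames, 2 joints each (values are made up).
toy = [
    [[1.0, 2.0, 3.0], [2.0, 2.0, 2.0]],
    [[1.5, 2.5, 3.5], [2.5, 2.0, 2.0]],
]
centered = center_on_first_frame(toy)
print(centered[0][1])  # reference joint in frame 0 -> [0.0, 0.0, 0.0]
```

Subtracting one fixed origin from all frames keeps the motion between frames intact while removing the dependence on where the performer stood relative to the camera.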
Download
Download the dataset directly from a browser using the links in the SBU Readme, or use the script ./data/sbu/download_sbu.py:
cd ./data/sbu
python download_sbu.py --version clean --savedir ./SBU-Kinect-Interaction/Clean
python download_sbu.py --version noisy --savedir ./SBU-Kinect-Interaction/Noisy
Go to the savedir and unzip all the downloaded zip files with unzip '*.zip'
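If the unzip command is not available (for example on Windows), the same step can be done from Python. This is a sketch using only the standard library; the helper name `extract_all_zips` is hypothetical and not part of this repository.

```python
import zipfile
from pathlib import Path


def extract_all_zips(savedir: str) -> int:
    """Extract every .zip file found directly in savedir, in place.

    Returns the number of archives extracted.
    """
    root = Path(savedir)
    count = 0
    for archive in sorted(root.glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            # Extract next to the archive, like running unzip inside savedir.
            zf.extractall(root)
        count += 1
    return count
```

Call it once per savedir used in the download commands, for example `extract_all_zips("./SBU-Kinect-Interaction/Clean")` and then again for the Noisy folder.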
Directory Structure
path/to/your/SBU-Kinect-Interaction
├── Clean
│   ├── s01s02
│   │   ├── 01
│   │   │   └── 001
│   │   │       ├── depth_000055.png
│   │   │       ├── ...
│   │   │       ├── rgb_000055.png
│   │   │       ├── ...
│   │   │       └── skeleton_pos.txt
│   │   ├── 02
│   │   ├── ...
│   │   └── 08
│   ├── s01s03
│   ├── ...
│   └── s07s03
└── Noisy
    ├── ...
Generating Data
cd ./data/sbu
python getSBU.py --rootdir ./SBU-Kinect-Interaction/Clean --savedir ./SBU-Kinect-Interaction-Skeleton/Clean
python getSBU.py --rootdir ./SBU-Kinect-Interaction/Noisy --savedir ./SBU-Kinect-Interaction-Skeleton/Noisy
Download
- Request dataset here: https://h2odataset.ethz.ch/ . You can get the username and password from the download page.
- Download the dataset directly from the download page or using download_script.py in h2odataset repo (we have included it in ./data/h2o/download_script.py in this repo):
cd ./data/h2o
python download_script.py --username "username" --password "password" --mode pose --dest "dest folder path"
Select pose mode to download only pose (hand, object, egocentric view) without RGB-D images.
- Extract the downloaded files.
Directory Structure
path/to/your/extracted/files
├── label_split
├── subject1
│   ├── h1
│   │   ├── 0
│   │   │   └── cam4
│   │   │       ├── cam_pose
│   │   │       ├── hand_pose
│   │   │       ├── hand_pose_MANO
│   │   │       ├── obj_pose
│   │   │       ├── obj_pose_RT
│   │   │       ├── action_label
│   │   │       └── verb_label
│   │   ├── 1
│   │   ├── 2
│   │   ├── 3
│   │   └── ...
│   ├── h2
│   ├── k1
│   ├── k2
│   ├── o1
│   └── o2
├── subject2
├── subject3
├── subject4
└── object
Generating Data
Generate H2O pth files using ./data/h2o/generate_h2o.py.
cd ./data/h2o
python generate_h2o.py --root path/to/your/extracted/files --dest ./h2o_pth --frames 120
Download
- Submit an access request with your Google account in Google Drive. Download poses_60fps directly or using the scripts in assembly101-download-scripts.
- Download test_challenge.csv in GoogleDrive/fine-grained-annotations.
- Download 3 csv files in asb101 repo.
Directory Structure
path/to/your/download/root
├── fine-grained-annotations
│   ├── test_challenge.csv (@30fps) [This file is downloaded from Google Drive]
│   ├── actions.csv [This file is downloaded from the asb101 repo]
│   ├── train.csv (@60fps) [This file is downloaded from the asb101 repo]
│   └── validation.csv (@60fps) [This file is downloaded from the asb101 repo]
└── poses_60fps
    ├── nusar-2021_action_both_9011-a01_9011_user_id_2021-02-01_153724.json
    ├── nusar-2021_action_both_9011-b06b_9011_user_id_2021-02-01_154253.json
    ├── ...
Generating Data
cd ./data/asb
# Train & Validation Set
# Step 1:
python ./Preprocess/1_generate_pose_data.py --rootdir path/to/your/download/root/poses_60fps --csvdir path/to/your/download/root/fine-grained-annotations --savedir ./RAW_contex25_thresh0
# Step 2:
# Action (mandatory)
python ./Preprocess/2_get_final_dataset.py --data_path ./RAW_contex25_thresh0 --type action
# Verb (optional)
python ./Preprocess/2_get_final_dataset.py --data_path ./RAW_contex25_thresh0 --type verb
# Object (optional)
python ./Preprocess/2_get_final_dataset.py --data_path ./RAW_contex25_thresh0 --type noun
# Test Set
# Step 1:
python ./PreprocessTest/1_generate_pose_data.py --rootdir path/to/your/download/root/poses_60fps --csvdir path/to/your/download/root/fine-grained-annotations --savedir ./RAW_contex25_thresh0
# Step 2:
# Action (mandatory)
python ./PreprocessTest/2_get_final_dataset.py --data_path ./RAW_contex25_thresh0 --type action
# Verb (optional)
python ./PreprocessTest/2_get_final_dataset.py --data_path ./RAW_contex25_thresh0 --type verb
# Object (optional)
python ./PreprocessTest/2_get_final_dataset.py --data_path ./RAW_contex25_thresh0 --type noun
The test set has fewer valid samples than the provided test_challenge.csv lists. The 1018 invalid test samples (about 5%) have no pose data, so no prediction can be made for them. This may lower the accuracy reported on the CodaLab Challenge Page. More information can be found in the discussion in assembly101 Issue#4.
The Cross-subject (X-Sub) and Cross-set (X-Set) criteria are employed. Only the joint modality is used, without fusing multiple modalities, to ensure fair comparisons.
X-Sub
python main.py --config config/ntu/ntu26_xsub_joint.yaml
X-Set
python main.py --config config/ntu/ntu26_xset_joint.yaml
The 5-fold cross-validation approach suggested in SBU is adopted. To get the accuracy for each fold, set the fold argument to 0, 1, 2, 3 or 4 in sbu_noisy_joint.yaml and sbu_clean_joint.yaml. Run each command 5 times, once per fold value, and average the test results.
Noisy
python main.py --config config/sbu/sbu_noisy_joint.yaml
Clean
python main.py --config config/sbu/sbu_clean_joint.yaml
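The final number for the 5-fold protocol above is the mean of the five per-fold test accuracies. A small sketch of that last step follows; the function name `average_fold_accuracy` is hypothetical, and the accuracies in the example are placeholders, not results from the paper. Collecting the per-fold values from the training logs is left to you.

```python
from statistics import mean
from typing import Sequence


def average_fold_accuracy(fold_accuracies: Sequence[float], n_folds: int = 5) -> float:
    """Return the mean test accuracy over all folds.

    Raises ValueError if a fold result is missing, so that a partial run
    is not reported as the final number.
    """
    if len(fold_accuracies) != n_folds:
        raise ValueError(
            f"expected {n_folds} fold results, got {len(fold_accuracies)}"
        )
    return mean(fold_accuracies)


# Placeholder values for illustration only.
example = [0.5, 0.75, 1.0, 0.75, 0.5]
print(average_fold_accuracy(example))  # 0.7
```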
Train & Validate
python main.py --config config/h2o/h2o.yaml
Generate JSON File for Test Submission
python main.py --config config/h2o/h2o_get_test_results.yaml --weights path/to/your/checkpoint
Submit the zipped JSON file action_labels.json to the CodaLab Challenge H2O - Action to get the test results.
Train & Validate
# Action (mandatory): 1380 classes
python main.py --config config/asb/asb_action.yaml
# Verb (optional): 24 classes
python main.py --config config/asb/asb_verb.yaml
# Object (optional): 90 classes
python main.py --config config/asb/asb_noun.yaml
Generate JSON File for Test Submission
# Action (mandatory): 1380 classes
python main.py --config config/asb/asb_action_get_test_results.yaml --weights path/to/your/action/checkpoint
# Verb (optional): 24 classes
python main.py --config config/asb/asb_verb_get_test_results.yaml --weights path/to/your/verb/checkpoint
# Object (optional): 90 classes
python main.py --config config/asb/asb_noun_get_test_results.yaml --weights path/to/your/noun/checkpoint
Submit the zipped JSON file preds.json to the CodaLab Challenge Assembly101 3D Action Recognition to get the test results.
You can get a fused JSON file for action+verb+object with the following script, but you must first specify the path arguments inside the script:
# You should specify the paths in asb_fuse_json_files.py FIRST
python tools/asb_fuse_json_files.py
ATTENTION: preds.json for action is about 673M before compression, and for action+verb+object is about 727M before compression.
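Because the JSON files are this large, compressing them before upload matters. The sketch below zips a single JSON file with the standard library. The helper name `zip_json` is hypothetical, and placing the file at the root of the archive is an assumption; check the challenge page for the layout it expects.

```python
import zipfile
from pathlib import Path


def zip_json(json_path: str, zip_path: str) -> int:
    """Write json_path into a new DEFLATE-compressed zip.

    Returns the size of the resulting zip file in bytes.
    """
    src = Path(json_path)
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        # arcname keeps only the file name, so the JSON sits at the archive
        # root (assumed layout, not confirmed by this README).
        zf.write(src, arcname=src.name)
    return Path(zip_path).stat().st_size
```

For example, `zip_json("preds.json", "preds.zip")` writes preds.zip next to the JSON file and returns its size, which you can compare with the uncompressed size.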
We provide scripts in tools/dataset_viz to visualize dataset samples (pngs or gifs) for the above 4 datasets. Specify the args in those scripts and start visualizing general interactive actions!
We are grateful to the collaborators/maintainers of the STTFormer, CTR-GCN, MS-G3D, h2odataset and Assembly101 repositories. Thanks to the authors for their great work.
If you find this work or code helpful in your research, please consider citing:
@INPROCEEDINGS{wen2023interactive,
author={Wen, Yuhang and Tang, Zixuan and Pang, Yunsheng and Ding, Beichen and Liu, Mengyuan},
booktitle={2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
title={Interactive Spatiotemporal Token Attention Network for Skeleton-Based General Interactive Action Recognition},
year={2023},
volume={},
number={},
pages={7886-7892},
doi={10.1109/IROS55552.2023.10342472}}