Official implementation and data release for "Exo2EgoDVC: Dense Video Captioning of Egocentric Human Activities Using Web Instructional Videos" (WACV 2025). See the corresponding paper information below:
Exo2EgoDVC: Dense Video Captioning of Egocentric Human Activities Using Web Instructional Videos
Takehiko Ohkawa, Takuma Yagi, Taichi Nishimura, Ryosuke Furuta, Atsushi Hashimoto, Yoshitaka Ushiku, and Yoichi Sato
IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2025
[paper] [project]
[Dec 6th, 2024]: Released the repository
Python>=3.7
PyTorch>=1.5.1
Make a virtual conda environment following [PDVC-GitHub]. Our code runs on the PDVC environment.
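As a minimal sketch (the environment name and the exact PyTorch build below are placeholders; install the remaining dependencies as described in [PDVC-GitHub]), the setup may look like:

```bash
# Minimal sketch: the environment name and versions are illustrative only;
# follow the PDVC instructions for the remaining dependencies.
conda create -n pdvc python=3.7 -y
conda activate pdvc
# Any build satisfying PyTorch>=1.5.1 that matches your CUDA setup works.
pip install "torch>=1.5.1" torchvision
```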
Our work proposes a newly captured egocentric dense video captioning dataset, dubbed EgoYC2.
Download from [EgoYC2-download] and place `ho_feats`, `crop_feats`, and `face_track` according to the following structure:
```
data/
├── egoyc2
│   ├── captiondata
│   │   ├── egoyc2_eval_wacv25.json
│   │   ├── egoyc2_train_wacv25.json
│   │   └── para
│   │       └── para_egoyc2_eval_wacv25.json
│   └── features
│       ├── ho_feats
│       │   └── resnet_bn
│       └── crop_feats
│           └── resnet_bn
└── yc2
    ├── captiondata
    │   ├── para
    │   │   └── para_yc2_eval_wacv25.json
    │   ├── yc2_eval_wacv25.json
    │   └── yc2_train_wacv25.json
    ├── face_track
    ├── features
    │   ├── download_yc2_tsn_features.sh
    │   └── resnet_bn
    └── vocabulary_youcook2.json
```
- Features: `crop_feats` and `ho_feats` contain frame-wise features from the cropped images showing the hand interactions and features from the cropped images and the interacted hand-object regions, respectively.
- For the YC2 data download (`yc2/features/resnet_bn`), please refer to [PDVC-GitHub].
- The data split of YC2 (`captiondata`) is not compatible with the original split because it is aligned with the split of EgoYC2. The split used in our experiments is named `train/eval_wacv25`.
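After downloading, a quick sanity check (a sketch; adjust the paths if you place the data elsewhere) can confirm that the layout matches the tree above:

```bash
# Verify that the expected annotation/feature directories are in place.
for d in data/egoyc2/captiondata \
         data/egoyc2/features/ho_feats/resnet_bn \
         data/egoyc2/features/crop_feats/resnet_bn \
         data/yc2/captiondata \
         data/yc2/face_track \
         data/yc2/features/resnet_bn; do
  [ -d "$d" ] && echo "ok:      $d" || echo "missing: $d"
done
```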
The running scripts are found under `scripts/`.
For the pre-training on the source data (YC2), use `src_pt_yc2.sh` for the naive baseline and `src_vipt_yc2.sh` for our view-invariant pre-training.
For the fine-tuning on the target data (EgoYC2), use `trg_ft_egoyc2.sh` for the naive baseline and `trg_vift_yc2+egoyc2.sh` for our view-invariant fine-tuning.
Please replace the path of the pre-trained checkpoint (`PRE_PATH`) according to your experiment conditions.
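For reference, a typical workflow may look like the following (a sketch assuming the scripts are launched with bash from the repository root; the checkpoint path shown in the comment is a hypothetical example):

```bash
# Pre-training on the source data (YC2):
bash scripts/src_pt_yc2.sh       # naive baseline
bash scripts/src_vipt_yc2.sh     # our view-invariant pre-training

# Fine-tuning on the target data (EgoYC2), after setting PRE_PATH in the
# script to your pre-trained checkpoint, e.g. (hypothetical path):
#   PRE_PATH=save/src_vipt_yc2/model-best.pth
bash scripts/trg_ft_egoyc2.sh        # naive baseline
bash scripts/trg_vift_yc2+egoyc2.sh  # our view-invariant fine-tuning
```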
The implementation of captioning models is based on [PDVC-GitHub].
Please cite the following article if our code helps you.
```bibtex
@inproceedings{ohkawa:wacv25,
    author    = {Takehiko Ohkawa and Takuma Yagi and Taichi Nishimura and Ryosuke Furuta and Atsushi Hashimoto and Yoshitaka Ushiku and Yoichi Sato},
    title     = {{Exo2EgoDVC}: Dense Video Captioning of Egocentric Procedural Activities Using Web Instructional Videos},
    booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
    year      = {2025},
}
```