Better view here -> https://egocentricvision.github.io/EgocentricVision/
Here’s a curated selection of the latest papers in the exciting field of Egocentric Vision, drawn primarily from CVPR 2024 and ECCV 2024. 🌟
We’ve done our best to include the most relevant and interesting works, but apologies in advance if we missed any papers! 🙏 If we did, let us know, and we’ll gladly update the list.
👉 Ready to explore the trends shaping the future? Scroll down for the details and get inspired by the innovative approaches researchers are bringing to this space.
If you’re curious about papers from past years, see https://egocentricvision.github.io/EgocentricVision/ 📚 Happy reading!
- On the Utility of 3D Hand Poses for Action Recognition - Md Salman Shamil, Dibyadip Chatterjee, Fadime Sener, Shugao Ma, Angela Yao, ECCV 2024. [project page]
- Benchmarks and Challenges in Pose Estimation for Egocentric Hand Interactions with Objects - Zicong Fan, Takehiko Ohkawa, Linlin Yang, Nie Lin, Zhishan Zhou, Shihao Zhou, Jiajun Liang, Zhong Gao, Xuanyang Zhang, Xue Zhang, Fei Li, Liu Zheng, Feng Lu, Karim Abou Zeid, Bastian Leibe, Jeongwan On, Seungryul Baek, Aditya Prakash, Saurabh Gupta, Kun He, Yoichi Sato, Otmar Hilliges, Hyung Jin Chang, Angela Yao, ECCV 2024.
- Are Synthetic Data Useful for Egocentric Hand-Object Interaction Detection? - Rosario Leonardi, Antonino Furnari, Francesco Ragusa, Giovanni Maria Farinella, ECCV 2024. [project page]
- 3D Hand Pose Estimation in Everyday Egocentric Images - Aditya Prakash, Ruisen Tu, Matthew Chang, Saurabh Gupta, ECCV 2024. [project page]
- HOISDF: Constraining 3D Hand-Object Pose Estimation with Global Signed Distance Fields - Haozhe Qi, Chen Zhao, Mathieu Salzmann, Alexander Mathis, CVPR 2024. [code]
- Single-to-Dual-View Adaptation for Egocentric 3D Hand Pose Estimation - Ruicong Liu, Takehiko Ohkawa, Mingfang Zhang, Yoichi Sato, CVPR 2024. [code]
- AFF-ttention! Affordances and Attention models for Short-Term Object Interaction Anticipation - Lorenzo Mur-Labadia, Ruben Martinez-Cantin, Jose J Guerrero, Giovanni Maria Farinella, Antonino Furnari, ECCV 2024.
- Semantically Guided Representation Learning For Action Anticipation - Anxhelo Diko, Danilo Avola, Bardh Prenkaj, Federico Fontana, Luigi Cinque, ECCV 2024. [code]
- Summarize the Past to Predict the Future: Natural Language Descriptions of Context Boost Multimodal Object Interaction Anticipation - Razvan-George Pasca, Alexey Gavryushin, Muhammad Hamza, Yen-Ling Kuo, Kaichun Mo, Luc Van Gool, Otmar Hilliges, Xi Wang, CVPR 2024.
- Can't Make an Omelette Without Breaking Some Eggs: Plausible Action Anticipation Using Large Video-Language Models - Himangi Mittal, Nakul Agarwal, Shao-Yuan Lo, Kwonjoon Lee, CVPR 2024.
- Bidirectional Progressive Transformer for Interaction Intention Anticipation - Zichen Zhang, Hongchen Luo, Wei Zhai, Yu Kang, Yang Cao, ECCV 2024.
- SoundingActions: Learning How Actions Sound from Narrated Egocentric Videos - Changan Chen, Kumar Ashutosh, Rohit Girdhar, David Harwath, Kristen Grauman, CVPR 2024. [project page]
- The Audio-Visual Conversational Graph: From an Egocentric-Exocentric Perspective - Wenqi Jia, Miao Liu, Hao Jiang, Ishwarya Ananthabhotla, James M. Rehg, Vamsi Krishna Ithapu, Ruohan Gao, CVPR 2024. [project page]
- Learning Spatial Features from Audio-Visual Correspondence in Egocentric Videos - Sagnik Majumder, Ziad Al-Halah, Kristen Grauman, CVPR 2024. [project page]
- EventEgo3D: 3D Human Motion Capture from Egocentric Event Streams - Christen Millerdurai, Hiroyasu Akada, Jian Wang, Diogo Luvizon, Christian Theobalt, Vladislav Golyanik, CVPR 2024. [project page]
- Masked Video and Body-worn IMU Autoencoder for Egocentric Action Recognition - Mingfang Zhang, Yifei Huang, Ruicong Liu, Yoichi Sato, ECCV 2024.
- ActionSwitch: Class-agnostic Detection of Simultaneous Actions in Streaming Videos - Hyolim Kang, Jeongseok Hyun, Joungbin An, Youngjae Yu, Seon Joo Kim, ECCV 2024.
- HAT: History-Augmented Anchor Transformer for Online Temporal Action Localization - Sakib Reza, Yuexi Zhang, Mohsen Moghaddam, Octavia Camps, ECCV 2024. [code]
- UniMD: Towards Unifying Moment Retrieval and Temporal Action Detection - Yingsen Zeng, Yujie Zhong, Chengjian Feng, Lin Ma, ECCV 2024. [code]
- DyFADet: Dynamic Feature Aggregation for Temporal Action Detection - Le Yang, Ziwei Zheng, Yizeng Han, Hao Cheng, Shiji Song, Gao Huang, Fan Li, ECCV 2024. [code]
- Synchronization is All You Need: Exocentric-to-Egocentric Transfer for Temporal Action Segmentation with Unlabeled Synchronized Video Pairs - Camillo Quattrocchi, Antonino Furnari, Daniele Di Mauro, Mario Valerio Giuffrida, Giovanni Maria Farinella, ECCV 2024. [code]
- FACT: Frame-Action Cross-Attention Temporal Modeling for Efficient Action Segmentation - Zijia Lu, Ehsan Elhamifar, CVPR 2024. [code]
- Progress-Aware Online Action Segmentation for Egocentric Procedural Task Videos - Yuhan Shen, Ehsan Elhamifar, CVPR 2024. [code]
- EgoCVR: An Egocentric Benchmark for Fine-Grained Composed Video Retrieval - Thomas Hummel, Shyamgopal Karthik, Mariana-Iuliana Georgescu, Zeynep Akata, ECCV 2024. [code]
- Retrieval-Augmented Egocentric Video Captioning - Jilan Xu, Yifei Huang, Junlin Hou, Guo Chen, Yuejie Zhang, Rui Feng, Weidi Xie, CVPR 2024.
- ActionVOS: Actions as Prompts for Video Object Segmentation - Liangyang Ouyang, Ruicong Liu, Yifei Huang, Ryosuke Furuta, Yoichi Sato, ECCV 2024. [code]
- EgoLifter: Open-world 3D Segmentation for Egocentric Perception - Qiao Gu, Zhaoyang Lv, Duncan Frost, Simon Green, Julian Straub, Chris Sweeney, ECCV 2024. [project page]
- Learning to Segment Referred Objects from Narrated Egocentric Videos - Yuhan Shen, Huiyu Wang, Xitong Yang, Matt Feiszli, Ehsan Elhamifar, Lorenzo Torresani, Effrosyni Mavroudi, CVPR 2024.
- A SOUND APPROACH: Using Large Language Models to generate audio descriptions for egocentric text-audio retrieval - Andreea-Maria Oncescu, João F. Henriques, Andrew Zisserman, Samuel Albanie, A. Sophia Koepke, ICASSP 2024.
- LEGO: Learning EGOcentric Action Frame Generation via Visual Instruction Tuning - Bolin Lai, Xiaoliang Dai, Lawrence Chen, Guan Pang, James M. Rehg, Miao Liu, ECCV 2024.
- Vamos: Versatile Action Models for Video Understanding - Shijie Wang, Qi Zhao, Minh Quan Do, Nakul Agarwal, Kwonjoon Lee, Chen Sun, ECCV 2024. [project page]
- PALM: Predicting Actions through Language Models - Sanghwan Kim, Daoji Huang, Yongqin Xian, Otmar Hilliges, Luc Van Gool, Xi Wang, ECCV 2024.
- Text-Conditioned Resampler For Long Form Video Understanding - Bruno Korbar, Yongqin Xian, Alessio Tonioni, Andrew Zisserman, Federico Tombari, ECCV 2024.
- Grounded Question-Answering in Long Egocentric Videos - Shangzhe Di, Weidi Xie, CVPR 2024. [code]
- Video ReCap: Recursive Captioning of Hour-Long Videos - Md Mohaiminul Islam, Ngan Ho, Xitong Yang, Tushar Nagarajan, Lorenzo Torresani, Gedas Bertasius, CVPR 2024. [project page]
- Multimodal Cross-Domain Few-Shot Learning for Egocentric Action Recognition - Masashi Hatano, Ryo Hachiuma, Ryo Fujii, Hideo Saito, ECCV 2024. [project page]
- Listen to Look into the Future: Audio-Visual Egocentric Gaze Anticipation - Bolin Lai, Fiona Ryan, Wenqi Jia, Miao Liu, James M Rehg, ECCV 2024. [project page]
- Put Myself in Your Shoes: Lifting the Egocentric Perspective from Exocentric Videos - Mi Luo, Zihui Xue, Alex Dimakis, Kristen Grauman, ECCV 2024.
- EgoPoser: Robust Real-Time Egocentric Pose Estimation from Sparse and Intermittent Observations Everywhere - Jiaxi Jiang, Paul Streli, Manuel Meier, Christian Holz, ECCV 2024.
- EgoPoseFormer: A Simple Baseline for Stereo Egocentric 3D Human Pose Estimation - Chenhongyi Yang, Anastasia Tkach, Shreyas Hampali, Linguang Zhang, Elliot J Crowley, Cem Keskin, ECCV 2024. [code]
- Egocentric Whole-Body Motion Capture with FisheyeViT and Diffusion-Based Motion Refinement - Jian Wang, Zhe Cao, Diogo Luvizon, Lingjie Liu, Kripasindhu Sarkar, Danhang Tang, Thabo Beeler, Christian Theobalt, CVPR 2024. [project page]
- Attention-Propagation Network for Egocentric Heatmap to 3D Pose Lifting - Taeho Kang, Youngki Lee, CVPR 2024. [code]
- Spherical World-Locking for Audio-Visual Localization in Egocentric Videos - Heeseung Yun, Ruohan Gao, Ishwarya Ananthabhotla, Anurag Kumar, Jacob Donley, Chao Li, Gunhee Kim, Vamsi Krishna Ithapu, Calvin Murdock, ECCV 2024. [project page]
- Instance Tracking in 3D Scenes from Egocentric Videos - Yunhan Zhao, Haoyu Ma, Shu Kong, Charless Fowlkes, CVPR 2024. [code]
- Ex2Eg-MAE: A Framework for Adaptation of Exocentric Video Masked Autoencoders for Egocentric Social Role Understanding - Minh Tran, Yelin Kim, Che-Chun Su, Min Sun, Cheng-Hao Kuo, Mohammad Soleymani, ECCV 2024.
- LoCoNet: Long-Short Context Network for Active Speaker Detection - Xizi Wang, Feng Cheng, Gedas Bertasius, CVPR 2024. [code]
- EgoThink: Evaluating First-Person Perspective Thinking Capability of Vision-Language Models - Sijie Cheng, Zhicheng Guo, Jingwen Wu, Kechen Fang, Peng Li, Huaping Liu, Yang Liu, CVPR 2024. [code] [project page]
- 4Diff: 3D-Aware Diffusion Model for Third-to-First Viewpoint Translation - Feng Cheng, Mi Luo, Huiyu Wang, Alex Dimakis, Lorenzo Torresani, Gedas Bertasius, Kristen Grauman, ECCV 2024. [project page]
- Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos - Changan Chen, Puyuan Peng, Ami Baid, Zihui Xue, Wei-Ning Hsu, David Harwath, Kristen Grauman, ECCV 2024.
- SceneScript: Reconstructing Scenes With An Autoregressive Structured Language Model - Armen Avetisyan, Christopher Xie, Henry Howard-Jenkins, Tsun-Yi Yang, Samir Aroudj, Suvam Patra, Fuyang Zhang, Luke Holland, Duncan Frost, Campbell Orme, Jakob Engel, Edward Miller, Richard Newcombe, Vasileios Balntas, ECCV 2024. [project page]
- Mocap Everyone Everywhere: Lightweight Motion Capture With Smartwatches and a Head-Mounted Camera - Jiye Lee, Hanbyul Joo, CVPR 2024. [code]
- VideoLLM-online: Online Video Large Language Model for Streaming Video - Joya Chen, Zhaoyang Lv, Shiwei Wu, Kevin Qinghong Lin, Chenan Song, Difei Gao, Jia-Wei Liu, Ziteng Gao, Dongxing Mao, Mike Zheng Shou, CVPR 2024. [video]
- Error Detection in Egocentric Procedural Task Videos - Shih-Po Lee, Zijia Lu, Zekun Zhang, Minh Hoai, Ehsan Elhamifar, CVPR 2024. [code]
- EgoPAT3Dv2: Predicting 3D Action Target from 2D Egocentric Vision for Human-Robot Interaction - Irving Fang, Yuzhong Chen, Yifan Wang, Jianghan Zhang, Qiushi Zhang, Jiali Xu, Xibo He, Weibo Gao, Hao Su, Yiming Li, Chen Feng, ICRA 2024. [project page]
- EgoGen: An Egocentric Synthetic Data Generator - Gen Li, Kaifeng Zhao, Siwei Zhang, Xiaozhong Lyu, Mihai Dusmanu, Yan Zhang, Marc Pollefeys, Siyu Tang, CVPR 2024. [project page]
- Active Object Detection with Knowledge Aggregation and Distillation from Large Models - Dejie Yang, Yang Liu, CVPR 2024. [code]
- EgoPAT3Dv2 - The EgoPAT3Dv2 dataset includes 12 distinct scenes and more than 5,400 clips. It features data in various modalities, including RGB, depth, IMU, and point clouds. The dataset captures rearrangement tasks performed by different individuals in diverse scenes, recorded using Microsoft Azure Kinect devices mounted overhead. ICRA 2024. [paper]
- EgoExo-Fitness - EgoExo-Fitness is a full-body action understanding dataset with synchronized egocentric and third-person fitness videos, enriched with detailed annotations. It offers two-level action boundaries, technical keypoint checks, and quality scores to evaluate “what,” “when,” and “how well” actions are performed. ECCV 2024. [paper] [code]
- Nymeria - It captures synchronized multimodal data, including egocentric and third-person perspectives, for 300 hours of natural daily activities across 50 locations. It features detailed motion capture, hierarchical language descriptions (310.5K sentences, 8.64M words), and scenarios like cooking and hiking from 264 participants using advanced wearable devices. ECCV 2024. [paper]
- EgoPet - It offers over 84 hours of egocentric video footage from animals like dogs, cats, eagles, and turtles, showcasing their daily lives. It includes 6,646 video segments sourced from TikTok and YouTube, with cats and dogs representing the majority of data. This rich and diverse dataset highlights unique perspectives and behaviors across various species, enabling detailed analysis of animal interactions. ECCV 2024. [paper]
- EgoBody3M - The first large-scale real-image dataset for egocentric body tracking, with a realistic VR headset configuration and diverse subjects and motions. The dataset contains 2,688 sequences from 120 subjects. ECCV 2024. [paper]
- Ego-Exo4D - A vast multimodal, multiview video dataset capturing skilled human activities from both egocentric and exocentric perspectives (e.g., sports, music, dance). With 800+ participants in 13 cities, it offers 1,422 hours of combined footage, featuring diverse activities in 131 natural scene contexts, ranging from 1 to 42 minutes per video. CVPR 2024. [paper]
- UnrealEgo2 / UnrealEgo-RW - UnrealEgo2 is an expanded dataset capturing over 15,200 motions of realistic 3D human models with a glasses-based device, offering 1.25 million stereo views and comprehensive joint annotations. UnrealEgo-RW is a real-world dataset utilizing a compact mobile device with fisheye cameras, designed for versatile egocentric image capture in various environments. CVPR 2024. [paper] [code]
- TF2023 - A novel dataset featuring synchronized first-person and third-person views, including masks of camera wearers linked to their respective views. It consists of 208,794 training and 87,449 testing image pairs, with no actor overlap between sets. Each scene averages 4.29 actors, focusing on complex interactions like puzzle games, enhancing its value for cross-view matching in egocentric vision. CVPR 2024. [paper] [code]
- TACO - A large-scale dataset of real-world bimanual tool-object interactions, featuring 131 tool-action-object triplets across 2.5K motion sequences and 5.2M frames with egocentric and third-person views. TACO enables benchmarks in action recognition, hand-object motion forecasting, and grasp synthesis, advancing generalization research in human-object interactions. CVPR 2024. [paper]
- HOT3D - HOT3D is a benchmark dataset for egocentric vision-based understanding of 3D hand-object interactions. The dataset offers over 833 minutes (more than 3.7M images) of multi-view RGB/monochrome image streams showing 19 subjects interacting with 33 diverse rigid objects, multi-modal signals such as eye gaze and scene point clouds, as well as comprehensive ground-truth annotations including 3D poses of objects, hands, and cameras, and 3D models of hands and objects. 2024. [paper] [code]
- ADL4D - The ADL4D dataset offers a novel perspective on human-object interactions, providing video sequences of everyday activities involving multiple people and objects interacting simultaneously. 2024. [paper]
This work is still in progress...