Wearable cameras, smart glasses, and AR/VR headsets are gaining importance for research and commercial use. They feature a variety of sensors such as cameras, depth sensors, microphones, IMUs, and GPS. Advances in machine perception enable precise user localization (SLAM), eye tracking, and hand tracking. This data makes it possible to understand user behavior, unlocking new ways of interacting with augmented reality. Egocentric devices may soon automatically recognize user actions, surroundings, gestures, and social relationships. These devices have broad applications in assistive technology, education, fitness, entertainment, gaming, eldercare, robotics, and augmented reality, with the potential for positive societal impact.
Research in this data-intensive field was previously hampered by the scarcity of large datasets. The community's recent efforts have addressed this issue by releasing numerous large-scale datasets covering various aspects of egocentric perception, including HoloAssist, Ego4D, Ego-Exo4D, EPIC-KITCHENS, and HD-EPIC.
The goal of this workshop is to provide a lively discussion forum for researchers working in this challenging and fast-growing area, and to offer a means to unlock the potential of data-driven research with our datasets to further the state of the art.
We welcome submissions to the challenges from February to May (see important dates) through the leaderboards linked below. Participants in the challenges are required to submit a technical report on their method; this is a requirement for the competition. Reports should be 2-6 pages including references, should use the CVPR format, and should be submitted through the CMT website.
HoloAssist is a large-scale egocentric human interaction dataset, where two people collaboratively complete physical manipulation tasks.
Ego4D is a massive-scale egocentric dataset and benchmark suite collected across 74 worldwide locations in 9 countries, with over 3,670 hours of daily-life activity video. Please find details on our challenges below.
Ego-Exo4D is a diverse, large-scale, multi-modal, multi-view video dataset and benchmark challenge. Ego-Exo4D centers around simultaneously-captured egocentric and exocentric video of skilled human activities (e.g., sports, music, dance, bike repair).
Please check the EPIC-KITCHENS website for more information on the EPIC-KITCHENS challenges. Links to individual challenges are also reported below.
Please check the HD-EPIC website for more information on the HD-EPIC challenges. Links to individual challenges are also reported below.
You are invited to submit extended abstracts to the second edition of the joint egocentric vision workshop, which will be held alongside CVPR 2025 in Nashville.
These abstracts represent existing or ongoing work and will not be published as part of any proceedings. We welcome all works that focus on the egocentric domain; it is not necessary to use the Ego4D dataset in your work. We expect a submission may cover one or more of the following topics (this is a non-exhaustive list):
Extended abstracts should be 2-4 pages in length, including figures, tables, and references. We invite submissions of ongoing or already published work, as well as reports on demonstrations and prototypes. The joint egocentric vision workshop gives authors the opportunity to present their work to the egocentric community and to receive discussion and feedback. Accepted work will be presented as either an oral presentation (virtual or in-person) or a poster presentation. The review will be single-blind, so there is no need to anonymize your work; otherwise, submissions should follow the format of CVPR submissions (information can be found here). Accepted abstracts will not be published as part of a proceedings, so they can be uploaded to arXiv etc., and the links will be provided on the workshop's webpage. Submissions will be managed through the CMT website.
Event | Date |
---|---|
Challenges Leaderboards Open | Feb 2025 |
Challenges Leaderboards Close | 19 May 2025 (some challenges have extended their deadline, please check respective challenge's webpage) |
Challenges Technical Reports Deadline (on CMT) | 23 May 2025 |
Notification to Challenge Winners | 30 May 2025 |
Challenge Reports ArXiv Deadline | 6 June 2025 |
Extended Abstract Deadline (on CMT) | |
Extended Abstract Notification to Authors | 23 May 2025 |
Extended Abstracts ArXiv Deadline | 2 June 2025 |
Workshop Date | 12 June 2025 |
All dates are local to Nashville's time zone (Central Time).
Time | Event |
---|---|
08:45-09:00 | Welcome and Introductions |
09:00-09:30 | Invited Keynote 1: Siyu Tang, ETH Zürich, CH. Talk Title: Towards an egocentric multimodal foundation model |
09:30-10:00 | HoloAssist Challenges |
10:00-11:00 | Coffee Break and Poster Session |
11:00-11:30 | Invited Keynote 2: Kris Kitani, CMU, USA |
11:30-12:00 | EPIC-KITCHENS & HD-EPIC Challenges |
12:00-12:30 | Oral Presentations (Group 1) |
12:30-13:30 | Lunch Break |
13:30-14:00 | EgoVis Distinguished Papers Award |
14:00-14:30 | Invited Keynote 3: Xiaolong Wang, UCSD, USA |
14:30-15:30 | Ego4D & Ego-Exo4D Challenges |
15:30-16:00 | Coffee Break |
16:00-16:30 | Invited Keynote 4: Arsha Nagrani, Google DeepMind |
16:30-17:05 | Aria Gen2 |
17:05-17:35 | Oral Presentations (Group 2) |
17:35-17:45 | Conclusion |
EgoVis Poster Number | Title | Authors | Paper Link | CVPR 2025 Presentation Details |
---|---|---|---|---|
TBD | HOT3D: Hand and Object Tracking in 3D from Egocentric Multi-View Videos | Prithviraj Banerjee, Sindi Shkodrani, Pierre Moulon, Shreyas Hampali, Shangchen Han, Fan Zhang, Linguang Zhang, Jade Fountain, Edward Miller, Selen Basol, Richard Newcombe, Robert Wang, Jakob Engel, Tomas Hodan | link | TBD |
TBD | Generating 6DoF Object Manipulation Trajectories from Action Description in Egocentric Vision | Tomoya Yoshida, Shuhei Kurita, Taichi Nishimura, Shinsuke Mori | link | TBD |
TBD | FRAME: Floor-aligned Representation for Avatar Motion from Egocentric Video | Andrea Boscolo Camiletto, Jian Wang, Eduardo Alvarado, Rishabh Dabral, Thabo Beeler, Marc Habermann, Christian Theobalt | link | TBD |
TBD | Omnia de EgoTempo: Benchmarking Temporal Understanding of Multi-Modal LLMs in Egocentric Videos | Chiara Plizzari, Alessio Tonioni, Yongqin Xian, Achin Kulshrestha, Federico Tombari | link | TBD |
TBD | HaWoR: World-Space Hand Motion Reconstruction from Egocentric Videos | Jinglei Zhang, Jiankang Deng, Chao Ma, Rolandos Alexandros Potamias | link | TBD |
TBD | Layered motion fusion: Lifting motion segmentation to 3D in egocentric videos | Vadim Tschernezki, Diane Larlus, Andrea Vedaldi, Iro Laina | link | TBD |
TBD | REWIND: Real-Time Egocentric Whole-Body Motion Diffusion with Exemplar-Based Identity Conditioning | Jihyun Lee, Weipeng Xu, Alexander Richard, Shih-En Wei, Shunsuke Saito, Shaojie Bai, Te-Li Wang, Minhyuk Sung, Tae-Kyun Kim, Jason Saragih | link | TBD |
TBD | EgoTextVQA: Towards Egocentric Scene-Text Aware Video Question Answering | Sheng Zhou, Junbin Xiao, Qingyun Li, Yicong Li, Xun Yang, Dan Guo, Meng Wang, Tat-seng Chua, Angela Yao | link | TBD |
TBD | EgoLife: Towards Egocentric Life Assistant | Jingkang Yang, Shuai Liu, Hongming Guo, Yuhao Dong, Xiamengwei Zhang, Sicheng Zhang, Pengyun Wang, Zitang Zhou, Binzhu Xie, Ziyue Wang, Bei Ouyang, Zhengyu Lin, Marco Cominelli, Zhongang Cai, Bo Li, Yuanhan Zhang, Peiyuan Zhang, Fangzhou Hong, Joerg Widmer, Francesco Gringoli, Lei Yang, Ziwei Liu | link | TBD |
TBD | DIV-FF: Dynamic Image-Video Feature Fields For Environment Understanding in Egocentric Video | Lorenzo Mur-Labadia, Jose J. Guerrero, Ruben Martinez-Cantin | link | TBD |
TBD | Gazing Into Missteps: Leveraging Eye-Gaze for Unsupervised Mistake Detection in Egocentric Videos of Skilled Human Activities | Michele Mazzamuto, Antonino Furnari, Yoichi Sato, Giovanni Maria Farinella | link | TBD |
TBD | EgoLM: Multi-Modal Language Model of Egocentric Motions | Fangzhou Hong, Vladimir Guzov, Hyo Jin Kim, Yuting Ye, Richard Newcombe, Ziwei Liu, Lingni Ma | link | TBD |
TBD | EgoPressure: A Dataset for Hand Pressure and Pose Estimation in Egocentric Vision | Yiming Zhao, Taein Kwon, Paul Streli, Marc Pollefeys, Christian Holz | link | TBD |
TBD | HD-EPIC: A Highly-Detailed Egocentric Video Dataset | Toby Perrett, Ahmad Darkhalil, Saptarshi Sinha, Omar Emara, Sam Pollard, Kranti Kumar Parida, Kaiting Liu, Prajwal Gatti, Siddhant Bansal, Kevin Flanagan, Jacob Chalk, Zhifan Zhu, Rhodri Guerrier, Fahd Abdelazim, Bin Zhu, Davide Moltisanti, Michael Wray, Hazel Doughty, Dima Damen | link | TBD |
TBD | Estimating Body and Hand Motion in an Ego-sensed World | Brent Yi, Vickie Ye, Maya Zheng, Yunqi Li, Lea Müller, Georgios Pavlakos, Yi Ma, Jitendra Malik, Angjoo Kanazawa | link | TBD |
TBD | Bootstrap Your Own Views: Masked Ego-Exo Modeling for Fine-grained View-invariant Video Representations | Jungin Park, Jiyoung Lee, Kwanghoon Sohn | link | TBD |
TBD | GEM: A Generalizable Ego-vision Multimodal World Model for Fine-Grained Ego-Motion, Object Dynamics, and Scene Composition Control | Mariam Hassan, Sebastian Stapf, Ahmad Rahimi, Pedro M B Rezende, Yasaman Haghighi, David Brüggemann, Isinsu Katircioglu, Lin Zhang, Xiaoran Chen, Suman Saha, Marco Cannici, Elie Aljalbout, Botao Ye, Xi Wang, Aram Davtyan, Mathieu Salzmann, Davide Scaramuzza, Marc Pollefeys, Paolo Favaro, Alex Alahi | link | TBD |
TBD | Viewpoint Rosetta Stone: Unlocking Unpaired Ego-Exo Videos for View-invariant Representation Learning | Mi Luo, Zihui Xue, Alex Dimakis, Kristen Grauman | link | TBD |
TBD | Which Viewpoint Shows it Best? Language for Weakly Supervising View Selection in Multi-view Instructional Videos | Sagnik Majumder, Tushar Nagarajan, Ziad Al-Halah, Reina Pradhan, Kristen Grauman | link | TBD |
TBD | ExpertAF: Expert Actionable Feedback from Video | Kumar Ashutosh, Tushar Nagarajan, Georgios Pavlakos, Kris Kitani, Kristen Grauman | link | TBD |
TBD | FIction: 4D Future Interaction Prediction from Video | Kumar Ashutosh, Georgios Pavlakos, Kristen Grauman | link | TBD |
TBD | Progress-Aware Video Frame Captioning | Zihui Xue, Joungbin An, Xitong Yang, Kristen Grauman | link | TBD |
TBD | BIMBA: Selective-Scan Compression for Long-Range Video Question Answering | Md Mohaiminul Islam, Tushar Nagarajan, Huiyu Wang, Gedas Bertasius, Lorenzo Torresani | link | TBD |
TBD | VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos | Ziyang Wang, Shoubin Yu, Elias Stengel-Eskin, Jaehong Yoon, Feng Cheng, Gedas Bertasius, Mohit Bansal | link | TBD |
TBD | VidBot: Learning Generalizable 3D Actions from In-the-Wild 2D Human Videos for Zero-Shot Robotic Manipulation | Hanzhi Chen, Boyang Sun, Anran Zhang, Marc Pollefeys, Stefan Leutenegger | link | TBD |
This workshop follows in the footsteps of the following previous events:
EPIC-Kitchens and Ego4D Past Workshops:
Human Body, Hands, and Activities from Egocentric and Multi-view Cameras Past Workshops:
Project Aria Past Tutorials: