Third Joint Egocentric Vision (EgoVis) Workshop

Held in Conjunction with CVPR 2026

3/4 June 2026 - Denver, CO, USA
Room: TBD

This joint workshop aims to be the focal point for the egocentric computer vision community to meet and discuss progress in this fast-growing research area. It addresses egocentric vision comprehensively, covering key research challenges in video understanding, multi-modal data, interaction learning, self-supervised learning, and AR/VR, with applications to cognitive science and robotics.

Overview

Wearable cameras, smart glasses, and AR/VR headsets are gaining importance for research and commercial use. They feature a variety of sensors, including cameras, depth sensors, microphones, IMUs, and GPS. Advances in machine perception enable precise user localization (SLAM), eye tracking, and hand tracking. This data makes it possible to understand user behavior, unlocking new possibilities for interaction with augmented reality. Egocentric devices may soon automatically recognize user actions, surroundings, gestures, and social relationships. Such devices have broad applications in assistive technology, education, fitness, entertainment, gaming, eldercare, robotics, and augmented reality, with the potential for positive societal impact.

Previously, research in this field was held back by the scarcity of large datasets in a data-intensive area. The community's recent efforts have addressed this issue through the release of numerous large-scale datasets covering various aspects of egocentric perception, including HoloAssist, Ego4D, Ego-Exo4D, EPIC-KITCHENS, HD-EPIC, EgoCross, and CASTLE.

The goal of this workshop is to provide an exciting discussion forum for researchers working in this challenging and fast-growing area, and to help unlock the potential of data-driven research with these datasets to further the state of the art.

Challenges

We welcome submissions to the challenges from February to May (see important dates) through the leaderboards linked below. Participants in the challenges are required to submit a technical report on their method; this is a requirement for the competition. Reports should be 2-6 pages, including references, should use the CVPR format, and should be submitted through the CMT website.

HoloAssist Challenges

HoloAssist is a large-scale egocentric human interaction dataset, where two people collaboratively complete physical manipulation tasks.

Mistake Detection
Lead: Taein Kwon, Meta, Switzerland & Mahdi Rad, Microsoft, Switzerland
Summary: Mistake detection is defined following the convention of Assembly101, but applied to the fine-grained actions in our benchmark. We take the features of the fine-grained action clips from the beginning of the coarse-grained action until the end of the current action clip, and the model predicts a label from {correct, mistake}.
Challenge Link

Ego4D Challenges

Ego4D is a massive-scale, egocentric dataset and benchmark suite collected across 74 worldwide locations and 9 countries, with over 3,670 hours of daily-life activity video. Please find details below on our challenges:

Ego4D Episodic Memory
Track: Natural Language Queries
Lead: Suyog Jain, Meta, US
Summary: Given an egocentric video V and a natural language query Q, the goal is to identify a response track r such that the answer to Q can be deduced from r.
Challenge Link (coming soon)
Ego4D Forecasting
Track: Short-term object interaction anticipation
Lead: Antonino Furnari, University of Catania, IT
Summary: This task aims to predict the next human-object interaction happening after a given timestamp. Given an input video, the goal is to anticipate 1) the spatial positions of the active objects, 2) the category of each detected next active object, 3) how each active object will be used (verb), and 4) when the interaction will begin.
Current SOTA:Paper 1; Paper 2
Previous Winner: Top-5 Overall mAP: 7.21
Challenge Link
Ego4D Episodic Memory
Track: Goal Step
Lead: Yale Song, Meta, US
Summary: Given an untrimmed egocentric video, identify the temporal action segment corresponding to a natural language description of the step. Specifically, predict the (start_time, end_time) for a given keystep description.
Current SOTA: Paper
Previous Winner: 35.18 r@1, IoU=0.3
Challenge Link (coming soon)
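As a rough illustration of the metric reported above (recall@1 at a temporal IoU threshold of 0.3), a minimal sketch follows; function and variable names are our own and not part of the official evaluation code.

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start_time, end_time) segments."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_1(top1_preds, gts, iou_thresh=0.3):
    """Fraction of queries whose top-1 predicted segment overlaps the
    ground-truth segment with temporal IoU >= iou_thresh."""
    hits = sum(temporal_iou(p, g) >= iou_thresh for p, g in zip(top1_preds, gts))
    return hits / len(gts)
```

For example, a prediction of (0, 10) against a ground truth of (5, 15) has an IoU of 1/3 and counts as a hit at the 0.3 threshold.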

Ego-Exo4D Challenges

Ego-Exo4D is a diverse, large-scale, multi-modal, multi-view video dataset and benchmark challenge. Ego-Exo4D centers around simultaneously-captured egocentric and exocentric video of skilled human activities (e.g., sports, music, dance, bike repair).

EgoExo4D Pose Challenge
Track: Ego-Pose Body
Lead: Juanita Puentes Mozo, Los Andes
Summary: The EgoExo4D Body Pose Challenge aims to accurately estimate body pose using only first-person raw video and/or egocentric camera pose.
Current SOTA: EgoCast (MPJPE: 14.36)
Previous Winner: MPJPE: 15.32
Challenge Link (coming soon)
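The MPJPE figures above can be understood with a minimal sketch of the metric (the average per-joint Euclidean distance between predicted and ground-truth 3D joints); this is a generic illustration, not the official evaluation code.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean Per-Joint Position Error: the average Euclidean distance
    (in the same units as the inputs, e.g. cm) between predicted and
    ground-truth joints. pred and gt are (num_joints, 3) arrays."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())
```

Lower is better: a prediction offset from the ground truth by 5 cm at every joint yields an MPJPE of exactly 5.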
EgoExo4D Pose Challenge
Track: Ego-Pose Hands
Lead: Shan Shu, University of Pennsylvania, US
Summary:
Current SOTA:
Previous Winner:
Challenge Link (coming soon)

EPIC-Kitchens Challenges

Please check the EPIC-KITCHENS website for more information on the EPIC-KITCHENS challenges. Links to individual challenges are also reported below.

Action Recognition
Lead: Prajwal Gatti, University of Bristol, UK
Summary: Classify the action's verb and noun depicted in a trimmed video clip.
Current SOTA: Paper
Previous Winner: 48.1% - top 1 / 77.4% - top 5
Challenge Link
Action Detection
Lead: Francesco Ragusa, University of Catania, IT
Summary: The challenge requires detecting and recognising all action instances within an untrimmed video. The challenge will be carried out on the EPIC-KITCHENS-100 dataset.
Current SOTA: Results
Previous Winner: Action Avg. mAP 31.97
Challenge Link
Domain Adaptation Challenge for Action Recognition
Lead: Saptarshi Sinha, University of Bristol, UK
Summary: Given labelled videos from the source domain and unlabelled videos from the target domain, the goal is to classify actions in the target domain. An action is defined as a verb and noun depicted in a trimmed video clip.
Current SOTA: Paper
Previous Winner: 43.17 for action accuracy
Challenge Link
Multi-Instance Retrieval
Lead: Michael Wray, University of Bristol, UK
Summary: Perform cross-modal retrieval by searching between vision and text modalities.
Current SOTA: Paper
Previous Winner: Normalised Discounted Cumulative Gain (%) Avg. - 74.25
Challenge Link
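The nDCG figure above can be read against a minimal sketch of generic nDCG. Note this is an illustrative, simplified version: the actual benchmark derives graded relevance scores from caption similarity, which is not modelled here.

```python
import numpy as np

def dcg(relevances):
    """Discounted Cumulative Gain: each item's relevance is discounted
    by log2 of its (1-indexed) rank + 1."""
    rels = np.asarray(relevances, dtype=float)
    discounts = 1.0 / np.log2(np.arange(2, len(rels) + 2))
    return float(np.sum(rels * discounts))

def ndcg(relevances):
    """DCG of the returned ranking normalised by the DCG of the ideal
    (relevance-sorted) ranking, so a perfect ranking scores 1.0."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0
```

A ranking that returns the most relevant items first scores 1.0; placing relevant items lower in the list reduces the score.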
Semi-Supervised Video-Object Segmentation
Lead: Ahmad Darkhalil, University of Bristol, UK
Summary: Given a sub-sequence of frames with M object masks in the first frame, the goal of this challenge is to segment these objects through the remaining frames. Other objects not present in the first frame of the sub-sequence are excluded from this benchmark.
Current SOTA: Webpage
Challenge Link
EPIC-SOUNDS Audio-Based Interaction Recognition
Lead: Jacob Chalk, University of Bristol, UK
Summary: Recognise interactions from the audio data of EPIC-Sounds (i.e., classify the audio).
Current SOTA: User: JMCarrot
Previous Winner: N/A
Challenge Link
EPIC-SOUNDS Audio-Based Interaction Detection
Lead: Jacob Chalk, University of Bristol, UK
Summary: Classify all audio-based interactions (recognition) from audio data of EPIC-Sounds and predict their start and end times for a given video.
Current SOTA: User: shuming
Previous Winner: N/A
Challenge Link
Action Anticipation
Lead: Antonino Furnari, University of Catania, IT
Summary: The challenge requires the anticipation of a future action from the observation of a preceding video segment. The challenge will be carried out on the EPIC-KITCHENS-100 dataset.
Current SOTA: N/A
Previous Winner: N/A
Challenge Link

HD-EPIC Challenge

Please check the HD-EPIC website for more information on the HD-EPIC challenges. Links to individual challenges are also reported below.

HD-EPIC Challenges - VQA
Lead: Prajwal Gatti, University of Bristol, UK
Summary: Given a question belonging to any one of the seven types defined in the HD-EPIC VQA benchmark, the goal is to predict the correct answer among the five listed choices.
Current SOTA: Gemini Pro
Previous Winner: N/A
Challenge Link

EgoCross Challenge

Please check the EgoCross website for more information on the EgoCross challenge.

Task Description: Given an egocentric video from a novel domain that differs significantly from commonly seen scenarios (e.g., industrial or surgical environments rather than daily-life settings), the goal is to select the correct answer from four options (A, B, C, D) for a given query question.

Source-Limited Track
Lead: Yuqian Fu, INSAIT, BG
Summary: Participants are restricted to the provided baseline model and the given small support set, which may be used to fine-tune or guide the model for better transfer to the target domain. This track is designed to ensure a fair comparison of different adaptation algorithms.
Current SOTA: SFT-Qwen3VL (Average accuracy across four novel domains: 0.4608)
Previous Winner: N/A
Challenge Link
Open-Source Track
Lead: Yuqian Fu, INSAIT, BG
Summary: There are no restrictions on base models; even commercial models are encouraged to evaluate their performance on our challenging out-of-domain targets. Additional data (as long as it is not manually constructed to align specifically with the target domain) may be used for training, together with our provided support set.
Current SOTA: SFT-Qwen3VL (Average accuracy across four novel domains: 0.4608)
Previous Winner: N/A
Challenge Link

CASTLE Challenge

Please check the CASTLE website for more information on the CASTLE challenge.

CASTLE Challenge - VQA
Lead: Luca Rossetto, Dublin City University, IE
Summary: Given the entire dataset of over 600 hours of content from 15 different perspectives, the goal is to select the correct answer to a given question out of four possible options.
Current SOTA: N/A
Previous Winner: N/A
Challenge Link

Call for Papers

You are invited to submit papers to the third edition of the joint egocentric vision workshop, which will be held alongside CVPR 2026 in Denver.

These papers represent original work and will be published as part of the proceedings alongside CVPR. We welcome all works that focus on the egocentric domain; it is not necessary to use the datasets featured in this workshop. We expect a submission may cover one or more of the following topics (this is a non-exhaustive list):

Presentation Guidelines

All accepted papers will be presented as posters. The guidelines for the posters are the same as at the main conference.

Submission Instructions

Call for Abstracts

You are invited to submit extended abstracts to the third edition of the joint egocentric vision workshop, which will be held alongside CVPR 2026 in Denver.

These abstracts represent existing or ongoing work and will not be published as part of any proceedings. We welcome all works that focus on the egocentric domain; it is not necessary to use the Ego4D dataset within your work. We expect a submission may cover one or more of the following topics (this is a non-exhaustive list):

Format

Extended abstracts should be 2-4 pages, including figures, tables, and references. We invite submissions of ongoing or already published work, as well as reports on demonstrations and prototypes. The joint egocentric vision workshop gives authors the opportunity to present their work to the egocentric community to provoke discussion and feedback. Accepted work will be presented as either an oral presentation (virtual or in-person) or a poster presentation. Reviewing will be single-blind, so there is no need to anonymize your work; otherwise, submissions should follow the format of CVPR submissions (information can be found here). Accepted abstracts will not be published as part of any proceedings, so they can be uploaded to arXiv etc.; links will be provided on the workshop's webpage. Submissions will be managed through the CMT website.

Important Dates

NOTE: All dates are in Pacific Time (PT).

Paper Deadline (on CMT) 27 Feb 2026
Paper Notifications to Authors 3 April 2026
Camera Ready Deadline (on CMT) 7 April 2026
Challenges Leaderboards Open Feb 2026
Challenges Leaderboards Close 13 May 2026
Challenges Technical Reports Deadline (on CMT) 20 May 2026
Notification to Challenge Winners 27 May 2026
Challenge Reports ArXiv Deadline 1 June 2026
Extended Abstract Deadline (on CMT) 27 April 2026
Extended Abstract Notification to Authors 18 May 2026
Extended Abstracts ArXiv Deadline 25 May 2026
Workshop Date TBD

Program

All times are local to Denver (Mountain Daylight Time, MDT).
Workshop Location: Room TBD

Time Event
08:45-09:00 Welcome and Introductions
09:00-09:30 Invited Keynote 1: Marc Pollefeys, ETH Zurich, Switzerland
09:30-10:00 Oral Presentations (Group 1)
10:00-10:45 Coffee Break and First Poster Session
10:45-11:15 Invited Keynote 2: Saurabh Gupta, University of Illinois, USA
11:15-12:15 Challenges and Winning Solutions
12:15-12:45 Invited Keynote 3: Jawahar C V, IIIT Hyderabad, India
12:45-13:30 Lunch Break
13:30-14:00 EgoVis Distinguished Papers Award
14:00-14:30 Invited Keynote 4: Lorenzo Torresani, Northeastern University, USA
14:30-15:00 Oral Presentations (Group 2)
15:00-15:30 Invited Keynote 5: Hazel Doughty, Leiden University, Netherlands
15:30-16:15 Coffee Break and Second Poster Session
16:15-16:45 Invited Keynote 6: Ziwei Liu, Nanyang Technological University, Singapore
16:45-17:15 Panel Discussion
17:15-17:30 Conclusion

Invited Speakers


CV Jawahar

IIIT Hyderabad, India


Lorenzo Torresani

Northeastern University, USA


Marc Pollefeys

ETH Zurich, Switzerland


Hazel Doughty

Leiden University, Netherlands


Saurabh Gupta

University of Illinois, USA


Ziwei Liu

Nanyang Technological University, Singapore

Workshop Organisers


Siddhant Bansal

University of Bristol


Masashi Hatano

Keio University


Chiara Plizzari

Bocconi University


Antonino Furnari

University of Catania


Tushar Nagarajan

FAIR, Meta

Co-organizing Advisors


Dima Damen

University of Bristol and Google DeepMind


Giovanni Maria Farinella

University of Catania


Kristen Grauman

UT Austin


Jitendra Malik

UC Berkeley


Richard Newcombe

Reality Labs Research


Marc Pollefeys

ETH Zurich


Yoichi Sato

University of Tokyo


David Crandall

Indiana University

Related Past Events

This workshop follows in the footsteps of the following previous events:


EPIC-Kitchens and Ego4D Past Workshops:


Human Body, Hands, and Activities from Egocentric and Multi-view Cameras Past Workshops:

Project Aria Past Tutorials:

Acknowledgements

The Microsoft CMT service was used for managing the peer-reviewing process for this conference. This service was provided for free by Microsoft and they bore all expenses, including costs for Azure cloud services as well as for software development and support.