EgoProx: Evaluating MLLMs on Egocentric 3D Proximity Reasoning Across a Cognitive Hierarchy

CVPR 2026
Jinzhao Li1,2, Yinuo Chen1,*, Dongxu Piao1,*, Panwang Pan2,†, Yifan Yu2, Dong Wang2, Honglei Yan2, Liang Yue1, Shaofei Wang3, Yixin Chen3, Siyuan Huang3, Miao Liu1,‡
1College of AI, Tsinghua University 2ByteDance 3State Key Laboratory of General Artificial Intelligence, BIGAI
*Equal contribution. †Project Lead. ‡Corresponding author.

Abstract

Humans constantly reason about 3D proximity, the relations between their body and surrounding objects, to guide perception and action in daily life. Whether multimodal large language models (MLLMs) can perform such embodied 3D reasoning remains unclear. To this end, we introduce EgoProx, a benchmark for egocentric 3D proximity reasoning. We organize our tasks along a cognitive chain, covering intention, exploration, exploitation, and chain-of-actions reasoning. We also design an agent-based data engine that produces diverse and consistent QA pairs at scale. We benchmark prevailing MLLMs on EgoProx and conduct additional analyses with dataset-specific and task-specific instruction tuning. We observe large cross-domain gains, indicating that current MLLMs contain some spatial knowledge; however, they still struggle to effectively leverage it for spatial reasoning VQA.

Key Contributions

EgoProx Benchmark

We propose EgoProx, the first benchmark designed to evaluate whether MLLMs can reason about 3D perception–action coupling from an egocentric point of view, with four tasks organized along a cognitive hierarchy: Intention, Exploration, Exploitation, and Chain of Actions.

Agentic Data Engine

We develop an agent-based data generation pipeline that leverages a task-aware salient clip sampler and a 3D analysis toolset to automatically synthesize high-quality VQA data across diverse task categories.

Cross-Domain Spatial Analysis

Through extensive evaluation and cross-domain instruction-tuning experiments, we demonstrate that existing MLLMs already contain latent spatial knowledge acquired during pretraining, but unlocking this capability requires structured supervision.

Cognitive Hierarchy of EgoProx

EgoProx organizes egocentric 3D proximity reasoning as a cognitive hierarchy. Intention captures immediate gaze and head-orientation shifts toward goals; Exploration studies navigation steps toward objects or locations; Exploitation evaluates upcoming human-object interactions; and Chain of Actions asks models to reason over multi-step future behavior and the spatial relationships among action locations. Approximate proximity models coarse metric rotations and translations, while relative proximity captures directional relationships.

Cognitive hierarchy examples in EgoProx

Four-Level Egocentric 3D Proximity Reasoning

The benchmark requires models to use long-term contextual cues, spatial dependencies, and action-state changes from first-person videos, moving from local intention cues toward higher-order action chains.

Task distribution chart for EgoProx

Task Distribution

EgoProx contains 2,405 VQA samples collected from two complementary egocentric datasets: 1,016 samples from Aria Digital Twin and 1,389 samples from EgoExo4D. The benchmark covers a broad spectrum of proximity reasoning scenarios and is organized according to a four-level cognitive hierarchy consisting of Intention (30.27%), Exploration (15.71%), Exploitation (46.37%), and Chain of Actions (7.65%).

Intention: 30.27%
Exploration: 15.71%
Exploitation: 46.37%
Chain of Actions: 7.65%
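
The per-level sample counts are not listed above, but they can be approximated from the reported total and percentages. The following Python sketch does this; the resulting counts are derived from rounded percentages and should be read as estimates, not as figures quoted from the paper. The function and variable names are illustrative.

```python
# Estimate per-level sample counts from the reported total and percentages.
# These counts are derived here; they are not quoted from the source.
TOTAL_SAMPLES = 2405  # 1,016 (Aria Digital Twin) + 1,389 (EgoExo4D)

LEVEL_SHARE_PERCENT = {
    "Intention": 30.27,
    "Exploration": 15.71,
    "Exploitation": 46.37,
    "Chain of Actions": 7.65,
}


def estimate_counts(total, shares):
    """Return the nearest-integer sample count for each level."""
    return {level: round(total * pct / 100) for level, pct in shares.items()}


estimated_counts = estimate_counts(TOTAL_SAMPLES, LEVEL_SHARE_PERCENT)
for level, count in estimated_counts.items():
    print(f"{level}: ~{count} samples")
```

Under this rounding the four estimates add up to the reported total of 2,405, which suggests the percentages and the total are mutually consistent.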

Agentic Data Engine

Overview of our agent-based data construction pipeline. The agent first identifies salient moments with an interaction- and fixation-based sampler, then uses the 3D Analysis Toolset to extract spatial cues such as object positions, gaze targets, occupancy maps, and action chains. It then invokes the Spatial Calculator to derive 3D distances, orientations, and proximity relations, producing structured 3D proximity ground truth. Final benchmark question-answer pairs are compiled through necessary post-processing.

Agentic data engine overview

From Video to Structured 3D Proximity QA

The Agentic Data Engine converts egocentric video streams and raw annotations into scalable 3D proximity reasoning VQA data.
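
To make the Spatial Calculator step more concrete, the following is a minimal Python sketch of the kind of computation it could perform: turning an object position into a coarse metric cue (distance, bearing) and a directional label. It is not the authors' implementation. The function name `proximity_cues`, the coordinate convention (x right, y up, z forward in the camera wearer's frame), and the 15-degree threshold are all assumptions chosen for illustration.

```python
import math


def proximity_cues(obj_xyz, ahead_threshold_deg=15.0):
    """Hypothetical helper (not from the paper).

    obj_xyz is an object position in the camera wearer's frame, in metres,
    under an assumed convention: x points right, y up, z forward.
    """
    x, y, z = obj_xyz

    # Approximate proximity: coarse metric quantities.
    distance_m = math.sqrt(x * x + y * y + z * z)
    # Horizontal bearing: 0 is straight ahead, positive is to the right.
    yaw_deg = math.degrees(math.atan2(x, z))

    # Relative proximity: a directional label derived from the bearing.
    if abs(yaw_deg) <= ahead_threshold_deg:
        direction = "ahead"
    elif abs(yaw_deg) >= 180.0 - ahead_threshold_deg:
        direction = "behind"
    elif yaw_deg > 0:
        direction = "right"
    else:
        direction = "left"

    return {"distance_m": distance_m, "yaw_deg": yaw_deg, "direction": direction}


cues = proximity_cues((1.0, 0.0, 1.0))
print(cues["direction"], round(cues["distance_m"], 2), round(cues["yaw_deg"], 1))
```

Separating the metric values from the directional label mirrors the distinction between approximate and relative proximity: a question can target either one while the ground truth comes from the same 3D annotation.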

Leaderboard

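An asterisk (*) marks the best score among evaluated MLLMs for each metric (shown as red cells in the original table). Human Level is shown as a reference.

| Model | Intention Approx. | Intention Relative | Exploration Approx. | Exploration Relative | Exploitation Approx. | Exploitation Relative | Chain of Actions Act-Acc | Chain of Actions Rel-Acc-S | Chain of Actions Rel-Acc-L |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Human Level | 62.50 | 75.33 | 60.00 | 63.15 | 82.02 | 85.25 | 80.23 | 63.25 | 83.12 |
| Proprietary Models | | | | | | | | | |
| Gemini-2.5-Pro | 42.75* | 37.13 | 36.90 | 29.32 | 50.24* | 45.17* | 25.14* | 17.03 | 52.36 |
| GPT-5 | 33.16 | 40.35* | 41.18* | 34.55* | 46.45 | 45.17* | 21.74 | 20.83 | 52.71* |
| Open-source Models | | | | | | | | | |
| LLaVA-NeXT-Video-7B | 23.06 | 27.19 | 18.72 | 23.56 | 31.04 | 29.29 | 1.09 | 16.67 | 33.33 |
| MiniCPM-V 2.6 | 28.50 | 29.24 | 22.99 | 9.42 | 37.44 | 31.60 | 2.63 | 25.00* | 50.00 |
| InternVL 2.5-8B | 26.94 | 28.95 | 18.72 | 19.37 | 36.02 | 33.77 | 8.25 | 6.25 | 52.08 |
| Qwen2.5-VL-7B | 33.68 | 29.24 | 27.27 | 20.42 | 38.63 | 34.34 | 5.98 | 2.27 | 46.21 |
| Qwen2.5-VL-32B | 31.35 | 33.92 | 30.48 | 17.80 | 45.97 | 40.55 | 10.33 | 7.02 | 45.61 |
| Qwen2.5-VL-72B | 30.83 | 35.38 | 29.41 | 24.08 | 46.21 | 40.26 | 13.04 | 14.24 | 48.61 |
| Qwen3-VL-235B | 34.46 | 33.33 | 28.34 | 27.75 | 45.97 | 42.28 | 10.87 | 11.67 | 51.25 |
| Qwen3-VL-Plus | 26.42 | 34.21 | 20.32 | 21.47 | 48.34 | 42.28 | 10.87 | 7.50 | 41.67 |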

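Even strong proprietary models remain far below human performance, especially on Chain of Actions, where the best Act-Acc among evaluated models is 25.14 against 80.23 for Human Level. The results also show that model scaling alone brings limited gains for 3D proximity reasoning: within the Qwen2.5-VL family, the 72B model scores below the 7B model on Intention Approx. (30.83 vs. 33.68).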
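
One way to quantify the gap to human performance is to subtract the best proprietary score from Human Level for each metric. The Python sketch below does this with the leaderboard values for Human Level, Gemini-2.5-Pro, and GPT-5. The function and variable names are illustrative and are not part of any released EgoProx code.

```python
# Leaderboard values copied from the table; names below are illustrative.
METRICS = [
    "Intention Approx.", "Intention Relative",
    "Exploration Approx.", "Exploration Relative",
    "Exploitation Approx.", "Exploitation Relative",
    "Act-Acc", "Rel-Acc-S", "Rel-Acc-L",
]
HUMAN = [62.50, 75.33, 60.00, 63.15, 82.02, 85.25, 80.23, 63.25, 83.12]
PROPRIETARY = {
    "Gemini-2.5-Pro": [42.75, 37.13, 36.90, 29.32, 50.24, 45.17, 25.14, 17.03, 52.36],
    "GPT-5": [33.16, 40.35, 41.18, 34.55, 46.45, 45.17, 21.74, 20.83, 52.71],
}


def gap_to_human(human, models):
    """Human score minus the best model score, per metric."""
    gaps = {}
    for i, metric in enumerate(METRICS):
        best = max(scores[i] for scores in models.values())
        gaps[metric] = round(human[i] - best, 2)
    return gaps


gaps = gap_to_human(HUMAN, PROPRIETARY)
widest = max(gaps, key=gaps.get)
print(f"Widest gap: {widest} ({gaps[widest]} points)")
```

With these values the widest gap falls on Act-Acc, a Chain of Actions metric, which matches the observation that higher-order action reasoning is where current models lag most.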

Cross-Category Experiment

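An asterisk (*) marks the best score in each evaluation column (shown as red cells in the original table). A dash (-) appears in the columns of the category used for tuning.

| Model | Intention Approx. | Intention Relative | Exploration Approx. | Exploration Relative | Exploitation Approx. | Exploitation Relative | Chain of Actions Act-Acc | Chain of Actions Rel-Acc-S | Chain of Actions Rel-Acc-L |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen2.5-VL-7B | 33.68 | 29.24 | 27.27 | 20.42 | 38.63 | 34.34 | 5.98 | 2.27 | 46.21* |
| Qwen2.5-VL-7B + Intention Tuning | - | - | 32.09* | 24.61* | 64.93* | 39.11* | 3.80 | 0.00 | 14.29 |
| Qwen2.5-VL-7B + Exploration Tuning | 45.34 | 35.09 | - | - | 45.26 | 34.34 | 7.61* | 14.29* | 40.48 |
| Qwen2.5-VL-7B + Exploitation Tuning | 56.48* | 36.55* | 27.27 | 20.42 | - | - | 4.35 | 0.00 | 16.67 |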

Training on a small amount of data from one category often improves performance on other categories. Intention tuning transfers especially well (for example, Exploitation Approx. rises from 38.63 to 64.93), supporting the paper's cognitive hierarchy: intentional cues guide location reasoning and action-conditioned 3D reasoning. Exploration tuning also improves the Act-Acc (5.98 to 7.61) and Rel-Acc-S (2.27 to 14.29) metrics on Chain of Actions, because navigation supervision provides useful signals about action locations; Rel-Acc-L does not improve under any of the three tuning settings.
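
The transfer effect can be read off the table as a per-metric score change relative to the untuned Qwen2.5-VL-7B baseline. The Python sketch below computes those changes from the reported values; `None` stands in for the dashes in the tuned category's own columns. The helper name `transfer_deltas` is illustrative, not part of the paper.

```python
# Scores copied from the cross-category table; None replaces "-" entries.
METRICS = [
    "Intention Approx.", "Intention Relative",
    "Exploration Approx.", "Exploration Relative",
    "Exploitation Approx.", "Exploitation Relative",
    "Act-Acc", "Rel-Acc-S", "Rel-Acc-L",
]
BASE = [33.68, 29.24, 27.27, 20.42, 38.63, 34.34, 5.98, 2.27, 46.21]
TUNED = {
    "Intention Tuning": [None, None, 32.09, 24.61, 64.93, 39.11, 3.80, 0.00, 14.29],
    "Exploration Tuning": [45.34, 35.09, None, None, 45.26, 34.34, 7.61, 14.29, 40.48],
    "Exploitation Tuning": [56.48, 36.55, 27.27, 20.42, None, None, 4.35, 0.00, 16.67],
}


def transfer_deltas(base, tuned):
    """Score change per metric; None where no tuned score is reported."""
    return [None if t is None else round(t - b, 2) for b, t in zip(base, tuned)]


for setting, scores in TUNED.items():
    deltas = transfer_deltas(BASE, scores)
    improved = [m for m, d in zip(METRICS, deltas) if d is not None and d > 0]
    print(f"{setting}: improved on {improved}")
```

Listing only the metrics with a positive change makes it easy to see which categories benefit from each tuning setting and which do not.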

Cross-Dataset Experiment

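An asterisk (*) marks the best score in each evaluation column (shown as red cells in the original tables).

ADT Evaluation

| Model | Intention | Exploration | Exploitation |
| --- | --- | --- | --- |
| Qwen2.5-VL-7B | 35.94 | 23.81 | 47.64 |
| EgoExo4D Only Tuning | 48.70* | 27.78* | 64.57* |

EgoExo4D Evaluation

| Model | Intention | Exploration | Exploitation |
| --- | --- | --- | --- |
| Qwen2.5-VL-7B | 26.74 | - | 32.52 |
| ADT Only Tuning | 50.58* | - | 43.67* |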

Fine-tuning on one source dataset improves proximity reasoning on the other despite the recording-domain gap between ADT and EgoExo4D. The smaller gain on ADT Exploration (23.81 to 27.78, compared with 35.94 to 48.70 on Intention and 47.64 to 64.57 on Exploitation) is consistent with the data-source design: EgoExo4D training data does not contain exploration-type questions.

Qualitative Results

Intention

Qualitative results for Intention

Exploration

Qualitative results for Exploration

Exploitation

Qualitative results for Exploitation

Chain of Actions

Qualitative results for Chain of Actions

BibTeX

@misc{li2026egoproxevaluatingmllmsegocentric,
      title={EgoProx: Evaluating MLLMs on Egocentric 3D Proximity Reasoning Across a Cognitive Hierarchy}, 
      author={Jinzhao Li and Yinuo Chen and Dongxu Piao and Panwang Pan and Yifan Yu and Dong Wang and Honglei Yan and Liang Yue and Shaofei Wang and Yixin Chen and Siyuan Huang and Miao Liu},
      year={2026},
      eprint={2605.24456},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2605.24456}, 
}