EgoProx: Evaluating MLLMs on Egocentric 3D Proximity Reasoning

Abstract

Humans constantly reason about 3D proximity, the relations between their body and surrounding objects, to guide perception and action in daily life. Whether multimodal large language models (MLLMs) can perform such embodied 3D reasoning remains unclear. To this end, we introduce EgoProx, a benchmark for egocentric 3D proximity reasoning. We organize our tasks along a cognitive chain, covering intention, exploration, exploitation, and chain-of-actions reasoning. We also design an agent-based data engine that produces diverse and consistent QA pairs at scale. We benchmark prevailing MLLMs on EgoProx and conduct additional analyses with dataset-specific and task-specific instruction tuning. We observe large cross-domain gains, indicating that current MLLMs contain some spatial knowledge; however, they still struggle to leverage it effectively for spatial reasoning VQA.

Key Contributions

EgoProx Benchmark

We propose EgoProx, the first benchmark designed to evaluate whether MLLMs can reason about 3D perception–action coupling from an egocentric point of view, with four tasks organized along a cognitive hierarchy: Intention, Exploration, Exploitation, and Chain of Actions.

Agentic Data Engine

We develop an agent-based data generation pipeline that leverages a task-aware salient clip sampler and a 3D analysis toolset to automatically synthesize high-quality VQA data across diverse task categories.

Cross-Domain Spatial Analysis

Through extensive evaluation and cross-domain instruction-tuning experiments, we demonstrate that existing MLLMs already contain latent spatial knowledge acquired during pretraining, but unlocking this capability requires structured supervision.

Cognitive Hierarchy of EgoProx

EgoProx organizes egocentric 3D proximity reasoning as a cognitive hierarchy. Intention captures immediate gaze and head-orientation shifts toward goals; Exploration studies navigation steps toward objects or locations; Exploitation evaluates upcoming human-object interactions; and Chain of Actions asks models to reason over multi-step future behavior and the spatial relationships among action locations. Approximate proximity models coarse metric rotations and translations, while relative proximity captures directional relationships.
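
To make the distinction between the two answer types concrete, the following minimal sketch labels a single target offset both ways. Everything in it is an illustrative assumption rather than the benchmark's actual definition: the coordinate convention (x to the wearer's right, z straight ahead), the 0.5 m and 30 degree bin sizes, and the function names `approximate_proximity` and `relative_proximity` are all hypothetical.

```python
import math


def approximate_proximity(dx: float, dz: float,
                          dist_step: float = 0.5,
                          angle_step: float = 30.0) -> tuple[float, float]:
    """Coarse metric answer: (translation in metres, rotation in degrees).

    dx is the offset to the wearer's right and dz the offset straight ahead
    (hypothetical convention). Positive rotation means turning to the right.
    Both values are snapped to the nearest bin (assumed bin sizes).
    """
    distance = math.hypot(dx, dz)
    yaw = math.degrees(math.atan2(dx, dz))
    return (round(distance / dist_step) * dist_step,
            round(yaw / angle_step) * angle_step)


def relative_proximity(dx: float, dz: float) -> str:
    """Directional answer: the dominant side on which the target lies."""
    if abs(dx) >= abs(dz):
        return "right" if dx > 0 else "left"
    return "front" if dz > 0 else "behind"


# A target 1 m to the right and 2 m ahead of the wearer.
print(approximate_proximity(1.0, 2.0), relative_proximity(1.0, 2.0))
```

Under these assumptions the approximate answer is a binned metric quantity, while the relative answer is a categorical direction, which is why the two are scored in separate columns of the result tables.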

Four-Level Egocentric 3D Proximity Reasoning

The benchmark requires models to use long-term contextual cues, spatial dependencies, and action-state changes from first-person videos, moving from local intention cues toward higher-order action chains.

Task Distribution

EgoProx contains 2,405 VQA samples collected from two complementary egocentric datasets: 1,016 samples from Aria Digital Twin and 1,389 samples from EgoExo4D. The benchmark covers a broad spectrum of proximity reasoning scenarios and is organized according to a four-level cognitive hierarchy consisting of Intention (30.27%), Exploration (15.71%), Exploitation (46.37%), and Chain of Actions (7.65%).

Intention: 30.27%

Exploration: 15.71%

Exploitation: 46.37%

Chain of Actions: 7.65%
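
As a quick consistency check on these figures, the sketch below verifies that the two source-dataset counts add up to the benchmark total and that the four task shares sum to 100%. It also converts the shares into per-task sample counts; those counts are not stated in the source, so they are rounded estimates only and may differ from the true counts by a sample.

```python
TOTAL_SAMPLES = 2405
SOURCE_COUNTS = {"Aria Digital Twin": 1016, "EgoExo4D": 1389}
TASK_SHARES = {  # percentages as reported above
    "Intention": 30.27,
    "Exploration": 15.71,
    "Exploitation": 46.37,
    "Chain of Actions": 7.65,
}


def implied_counts(total: int, shares: dict[str, float]) -> dict[str, int]:
    """Estimate per-task counts by rounding total * share (an approximation)."""
    return {task: round(total * pct / 100) for task, pct in shares.items()}


counts = implied_counts(TOTAL_SAMPLES, TASK_SHARES)
print(sum(SOURCE_COUNTS.values()))          # total across the two datasets
print(round(sum(TASK_SHARES.values()), 2))  # total share in percent
print(counts)                               # estimated per-task counts
```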

Agentic Data Engine

Overview of our agent-based data construction pipeline. The agent first identifies salient moments with an interaction- and fixation-based sampler, then uses the 3D Analysis Toolset to extract spatial cues such as object positions, gaze targets, occupancy maps, and action chains. It then invokes the Spatial Calculator to derive 3D distances, orientations, and proximity relations, producing structured 3D proximity ground truth. Final benchmark question-answer pairs are compiled through necessary post-processing.

From Video to Structured 3D Proximity QA

The Agentic Data Engine converts egocentric video streams and raw annotations into scalable 3D proximity reasoning VQA data.
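
The last two stages of the pipeline, deriving a spatial relation and turning it into a QA pair, can be sketched as follows. This is a minimal illustration, not the actual Spatial Calculator: the `CameraPose` class, the `spatial_relation` and `build_qa` functions, the yaw convention (0 degrees facing +z, positive toward +x, with +x treated as the wearer's right), and the question template are all assumptions introduced here.

```python
import math
from dataclasses import dataclass


@dataclass
class CameraPose:
    """Hypothetical wearer pose: world-frame position plus heading."""
    position: tuple[float, float, float]  # (x, y, z) in metres
    yaw_deg: float  # 0 = facing +z, positive = rotated toward +x (assumed)


def spatial_relation(pose: CameraPose,
                     target: tuple[float, float, float]) -> dict[str, float]:
    """Return 3D distance and signed horizontal bearing from wearer to target."""
    dx = target[0] - pose.position[0]
    dy = target[1] - pose.position[1]
    dz = target[2] - pose.position[2]
    distance = math.sqrt(dx * dx + dy * dy + dz * dz)
    world_bearing = math.degrees(math.atan2(dx, dz))
    # Express the bearing relative to the heading, wrapped to [-180, 180).
    bearing = (world_bearing - pose.yaw_deg + 180.0) % 360.0 - 180.0
    return {"distance_m": distance, "bearing_deg": bearing}


def build_qa(object_name: str, relation: dict[str, float]) -> dict:
    """Compile one relative-proximity QA pair from the structured ground truth."""
    side = "right" if relation["bearing_deg"] > 0 else "left"
    return {
        "question": f"Is the {object_name} to your left or to your right?",
        "answer": side,
        "ground_truth": relation,
    }


# Wearer at the origin facing +x; a mug sits 2 m along +z.
pose = CameraPose(position=(0.0, 0.0, 0.0), yaw_deg=90.0)
relation = spatial_relation(pose, (0.0, 0.0, 2.0))
print(build_qa("mug", relation))
```

Keeping the numeric relation alongside the textual answer mirrors the idea of structured 3D proximity ground truth: the same record could feed either an approximate (metric) or a relative (directional) question during post-processing.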

Leaderboard

Red cells indicate the best score among evaluated MLLMs for each metric. Human Level is shown as a reference.

Model	Intention (Approx.)	Intention (Relative)	Exploration (Approx.)	Exploration (Relative)	Exploitation (Approx.)	Exploitation (Relative)	Chain of Actions (Act-Acc)	Chain of Actions (Rel-Acc-S)	Chain of Actions (Rel-Acc-L)
Human Level	62.50	75.33	60.00	63.15	82.02	85.25	80.23	63.25	83.12
Proprietary Models
Gemini-2.5-Pro	42.75	37.13	36.90	29.32	50.24	45.17	25.14	17.03	52.36
GPT-5	33.16	40.35	41.18	34.55	46.45	45.17	21.74	20.83	52.71
Open-source Models
LLaVA-NeXT-Video-7B	23.06	27.19	18.72	23.56	31.04	29.29	1.09	16.67	33.33
MiniCPM-V 2.6	28.50	29.24	22.99	9.42	37.44	31.60	2.63	25.00	50.00
InternVL 2.5-8B	26.94	28.95	18.72	19.37	36.02	33.77	8.25	6.25	52.08
Qwen2.5-VL-7B	33.68	29.24	27.27	20.42	38.63	34.34	5.98	2.27	46.21
Qwen2.5-VL-32B	31.35	33.92	30.48	17.80	45.97	40.55	10.33	7.02	45.61
Qwen2.5-VL-72B	30.83	35.38	29.41	24.08	46.21	40.26	13.04	14.24	48.61
Qwen3-VL-235B	34.46	33.33	28.34	27.75	45.97	42.28	10.87	11.67	51.25
Qwen3-VL-Plus	26.42	34.21	20.32	21.47	48.34	42.28	10.87	7.50	41.67

Even strong proprietary models remain far below human performance, especially on Chain of Actions. The results also show that model scaling alone brings limited gains for 3D proximity reasoning.
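
The size of the human-model gap can be read directly off the leaderboard. The sketch below copies the scores from the table above, finds the best evaluated model per metric, and reports the remaining gap to Human Level. The metric labels and helper names (`best_per_metric`, `human_gaps`) are our own shorthand, and ties are resolved in favour of the first model listed.

```python
METRICS = [
    "Intention/Approx.", "Intention/Relative",
    "Exploration/Approx.", "Exploration/Relative",
    "Exploitation/Approx.", "Exploitation/Relative",
    "Chain of Actions/Act-Acc", "Chain of Actions/Rel-Acc-S",
    "Chain of Actions/Rel-Acc-L",
]
HUMAN = [62.50, 75.33, 60.00, 63.15, 82.02, 85.25, 80.23, 63.25, 83.12]
MODELS = {
    "Gemini-2.5-Pro": [42.75, 37.13, 36.90, 29.32, 50.24, 45.17, 25.14, 17.03, 52.36],
    "GPT-5": [33.16, 40.35, 41.18, 34.55, 46.45, 45.17, 21.74, 20.83, 52.71],
    "LLaVA-NeXT-Video-7B": [23.06, 27.19, 18.72, 23.56, 31.04, 29.29, 1.09, 16.67, 33.33],
    "MiniCPM-V 2.6": [28.50, 29.24, 22.99, 9.42, 37.44, 31.60, 2.63, 25.00, 50.00],
    "InternVL 2.5-8B": [26.94, 28.95, 18.72, 19.37, 36.02, 33.77, 8.25, 6.25, 52.08],
    "Qwen2.5-VL-7B": [33.68, 29.24, 27.27, 20.42, 38.63, 34.34, 5.98, 2.27, 46.21],
    "Qwen2.5-VL-32B": [31.35, 33.92, 30.48, 17.80, 45.97, 40.55, 10.33, 7.02, 45.61],
    "Qwen2.5-VL-72B": [30.83, 35.38, 29.41, 24.08, 46.21, 40.26, 13.04, 14.24, 48.61],
    "Qwen3-VL-235B": [34.46, 33.33, 28.34, 27.75, 45.97, 42.28, 10.87, 11.67, 51.25],
    "Qwen3-VL-Plus": [26.42, 34.21, 20.32, 21.47, 48.34, 42.28, 10.87, 7.50, 41.67],
}


def best_per_metric(models: dict[str, list[float]]) -> dict[str, tuple[str, float]]:
    """Map each metric to (best model, best score); first model wins ties."""
    best = {}
    for i, metric in enumerate(METRICS):
        name = max(models, key=lambda m: models[m][i])
        best[metric] = (name, models[name][i])
    return best


def human_gaps(human: list[float],
               best: dict[str, tuple[str, float]]) -> dict[str, float]:
    """Gap in points between Human Level and the best model on each metric."""
    return {metric: human[i] - best[metric][1] for i, metric in enumerate(METRICS)}


best = best_per_metric(MODELS)
gaps = human_gaps(HUMAN, best)
for metric in METRICS:
    print(f"{metric}: best={best[metric][0]} gap={gaps[metric]:.2f}")
```

On these numbers the largest remaining gap is on Chain of Actions Act-Acc, where the best model (Gemini-2.5-Pro, 25.14) trails Human Level (80.23) by about 55 points.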

Cross-Category Experiment

Red cells indicate the best score in each evaluation column.

Model	Intention (Approx.)	Intention (Relative)	Exploration (Approx.)	Exploration (Relative)	Exploitation (Approx.)	Exploitation (Relative)	Chain of Actions (Act-Acc)	Chain of Actions (Rel-Acc-S)	Chain of Actions (Rel-Acc-L)
Qwen2.5-VL-7B	33.68	29.24	27.27	20.42	38.63	34.34	5.98	2.27	46.21
Qwen2.5-VL-7B + Intention Tuning	-	-	32.09	24.61	64.93	39.11	3.80	0.00	14.29
Qwen2.5-VL-7B + Exploration Tuning	45.34	35.09	-	-	45.26	34.34	7.61	14.29	40.48
Qwen2.5-VL-7B + Exploitation Tuning	56.48	36.55	27.27	20.42	-	-	4.35	0.00	16.67

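Training on a small amount of data from one category often improves performance on other categories. Intention tuning transfers especially well to Exploration and Exploitation (for example, Exploitation Approx. rises from 38.63 to 64.93), supporting the paper's cognitive hierarchy: intentional cues guide location reasoning and action-conditioned 3D reasoning. Exploration tuning also improves two of the three chain-of-actions metrics, Act-Acc (5.98 to 7.61) and Rel-Acc-S (2.27 to 14.29), because navigation supervision provides useful signals about action locations. Rel-Acc-L, by contrast, falls below the 46.21 baseline under all three tuning settings.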

Cross-Dataset Experiment

Red cells indicate the best score in each evaluation column.

ADT Evaluation
Model	Intention	Exploration	Exploitation
Qwen2.5-VL-7B	35.94	23.81	47.64
EgoExo4D Only Tuning	48.70	27.78	64.57
EgoExo4D Evaluation
Model	Intention	Exploration	Exploitation
Qwen2.5-VL-7B	26.74	-	32.52
ADT Only Tuning	50.58	-	43.67

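Fine-tuning on one source dataset improves proximity reasoning on the other despite the recording-domain gap between ADT and EgoExo4D. For example, EgoExo4D-only tuning raises ADT Intention from 35.94 to 48.70, and ADT-only tuning raises EgoExo4D Intention from 26.74 to 50.58. The smaller gain on ADT Exploration (23.81 to 27.78) is consistent with the data-source design: EgoExo4D training data does not contain exploration-type questions.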

Qualitative Results

[Figure panels: Intention, Exploration, Exploitation, Chain of Actions]

BibTeX

@misc{li2026egoproxevaluatingmllmsegocentric,
      title={EgoProx: Evaluating MLLMs on Egocentric 3D Proximity Reasoning Across a Cognitive Hierarchy}, 
      author={Jinzhao Li and Yinuo Chen and Dongxu Piao and Panwang Pan and Yifan Yu and Dong Wang and Honglei Yan and Liang Yue and Shaofei Wang and Yixin Chen and Siyuan Huang and Miao Liu},
      year={2026},
      eprint={2605.24456},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2605.24456}, 
}