VPTnav | Aaron Rock Menezes

Why this matters

Human children figure out line-of-sight by around 1–2 years old. Hide a toy behind a couch and a toddler crawls around to look — they already represent that the object persists and reason about what they (and others) can or can’t see.

Modern vision models cannot do this reliably. They can name objects in a scene. They can sometimes encode depth. But ask “can the person on the left see the cup behind the box?” and they fall back on shortcuts — texture cues, object co-occurrence, training-set priors that look like spatial reasoning until you control for them.

This is more than a benchmark gap. It’s an AI safety and interpretability problem:

If a vision system can’t tell what’s visible to whom, what is it learning? Shortcut features? Statistical priors masquerading as understanding?
Autonomous systems acting on a fake model of what others perceive are dangerous in obvious ways (driving, robotics, assistive agents).
From a mechanistic interpretability angle: line-of-sight failures are a clean test bed — there’s a real geometric ground truth, and the model either has it internally or doesn’t.
From a computational neuroscience angle: VPT is a developmental milestone with a clear neural signature. A model that closes the gap is a candidate computational model of that ability.

VPTnav — what it is

VPTnav is two things:

A synthetic benchmark with three tasks (Depth Ordering, VPT1, VPT2) that isolate distinct visual-spatial faculties.
A procedural environment generator built on Isaac Sim / Isaac Lab — rejection-sampled to remove dataset biases, designed to scale.

Each scene has the same three entities:

Entity	Role
Red ball	The target object
Green camera	The perspective-taker (asks “what can this camera see?”)
Pink cone	The spatial reference (used for VPT2)

The three tasks

1. Depth Ordering

Question: Which is closer to the viewpoint the image is rendered from, the green camera or the red ball? Labels: 0 (camera closer) / 1 (ball closer), 50/50 split.

Why it’s in the benchmark: it’s a sanity check that models can localize both entities, and it tests depth perception independently of perspective reasoning. A model that fails depth ordering can’t reasonably be expected to do VPT1 — it isn’t even seeing the entities properly.
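As a rough illustration of how such a label could be derived from scene geometry, here is a minimal sketch. It assumes "closer" means Euclidean distance from the rendering viewpoint (depth along the optical axis would be an equally plausible definition), and the function name `depth_order_label` is hypothetical rather than taken from the VPTnav code.

```python
import math


def depth_order_label(viewpoint, camera_pos, ball_pos):
    """Return 0 if the green camera is closer to the viewpoint, 1 if the ball is.

    Assumption: distance is Euclidean, measured from the rendering viewpoint.
    Exact ties fall to label 1 here; a generator would likely reject
    near-ties instead of labeling them.
    """
    d_camera = math.dist(viewpoint, camera_pos)
    d_ball = math.dist(viewpoint, ball_pos)
    return 0 if d_camera < d_ball else 1


# Camera 1 unit away, ball 3 units away: the camera is closer.
print(depth_order_label((0, 0, 0), (1, 0, 0), (3, 0, 0)))
```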

2. VPT1 — Line of sight

Question: Can the green camera see the red ball? Labels: Yes (in_view) or No (occluded or outside_fov).

Distribution:

Label	Sub-reason	Share
Yes	in_view	50%
No	occluded (object blocks line-of-sight)	25%
No	outside_fov (camera facing wrong way)	25%

The reason-balance matters. If you don’t decompose No, a model can hit 75% just by predicting “outside_fov” whenever there’s no obvious obstacle — without ever reasoning about geometry. The split forces the model to distinguish two genuinely different failure modes.
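The two "No" reasons correspond to two different geometric tests, which the following sketch makes explicit. It is an illustration under simplifying assumptions, not the benchmark's labeling code: the field of view is modeled as a cone with a hypothetical `half_fov_deg`, occluders are spheres, and the ball is treated as a point. All function and parameter names are invented for the example.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _norm(a):
    return math.sqrt(_dot(a, a))


def vpt1_label(cam_pos, cam_forward, ball_pos, occluders, half_fov_deg=45.0):
    """Return (label, reason) for "can the camera see the ball?".

    occluders: list of (center, radius) spheres (an assumption of this sketch).
    """
    to_ball = _sub(ball_pos, cam_pos)
    dist = _norm(to_ball)

    # Test 1: is the ball inside the viewing cone around the forward axis?
    cos_angle = _dot(to_ball, cam_forward) / (dist * _norm(cam_forward))
    if cos_angle < math.cos(math.radians(half_fov_deg)):
        return "No", "outside_fov"

    # Test 2: does the camera-to-ball segment pass through any occluder?
    direction = tuple(x / dist for x in to_ball)
    for center, radius in occluders:
        # Parameter of the point on the segment closest to the sphere center.
        t = _dot(_sub(center, cam_pos), direction)
        t = max(0.0, min(dist, t))
        closest = tuple(c + t * d for c, d in zip(cam_pos, direction))
        if _norm(_sub(center, closest)) < radius:
            return "No", "occluded"

    return "Yes", "in_view"
```

Checking the field of view first is a design choice of this sketch: a ball that is both behind the camera and behind an obstacle is reported as outside_fov.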

3. VPT2 — Perspective-relative spatial reasoning

Question: From the green camera’s POV, is the pink cone to the left or right of the red ball? Labels: 0 (left) / 1 (right), 50/50 split.

This is harder than VPT1. To answer correctly the model has to:

Take the camera’s perspective (mental rotation)
Project the cone and ball into that perspective
Compare their lateral positions in that view

VPT1 tests visibility — a binary geometric fact. VPT2 tests what the scene actually looks like from somewhere else. That’s perspective-taking proper, with a chunk of mental simulation baked in (you’re answering a question about an image the model never sees).
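The three steps can be written down as a small geometric computation. The sketch below assumes a pinhole camera in a right-handed world frame, where the camera's right axis is `forward × up`, and compares the horizontal image coordinates of the cone and the ball. The function name and signature are hypothetical.

```python
def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def vpt2_label(cam_pos, cam_forward, cam_up, cone_pos, ball_pos):
    """Return 0 if the cone is left of the ball in the camera's view, else 1."""
    # Step 1: take the camera's perspective (build its right axis).
    right = _cross(cam_forward, cam_up)

    def image_x(point):
        # Step 2: project the point into the camera's view (pinhole model).
        rel = _sub(point, cam_pos)
        depth = _dot(rel, cam_forward)
        if depth <= 0:
            raise ValueError("point is behind the camera")
        return _dot(rel, right) / depth

    # Step 3: compare lateral positions in that view.
    return 1 if image_x(cone_pos) > image_x(ball_pos) else 0


ball, cone = (5, 0, 0), (5, -1, 0)
up = (0, 0, 1)
# The same two world positions give opposite answers from opposite sides.
print(vpt2_label((0, 0, 0), (1, 0, 0), up, cone, ball))
print(vpt2_label((10, 0, 0), (-1, 0, 0), up, cone, ball))
```

The usage lines show why the task cannot be solved from the rendered image's own left/right layout: the answer depends on where the green camera is and which way it faces.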

Dataset & procedural generation

Each task has 512 environments, split 50/50 into 256 train and 256 held-out; the held-out half is split again into 128 val and 128 test. Each environment is rendered from 10 unique viewpoints, giving ~2.5k train / 1.3k val / 1.3k test images per task.
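The split arithmetic can be checked with a short sketch. It assumes the split is made at the environment level, as the counts suggest, so that images of one environment never appear in two splits. The function name and the seeded shuffle are illustrative choices, not the project's actual code.

```python
import random


def split_environments(n_envs=512, images_per_env=10, seed=0):
    """Split environment ids into train / val / test and count images."""
    env_ids = list(range(n_envs))
    random.Random(seed).shuffle(env_ids)

    half = n_envs // 2      # 256 train environments
    quarter = half // 2     # 128 val, 128 test
    splits = {
        "train": env_ids[:half],
        "val": env_ids[half:half + quarter],
        "test": env_ids[half + quarter:],
    }
    image_counts = {name: len(ids) * images_per_env for name, ids in splits.items()}
    return splits, image_counts


splits, image_counts = split_environments()
print(image_counts)
```

With the defaults this yields 2,560 train, 1,280 val, and 1,280 test images, matching the rounded figures above.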

Generation pipeline:

Procedural scene synthesis in Isaac Sim / Isaac Lab — random poses, colors, scales, scene clutter
Rejection sampling to enforce reason-balance and remove correlated cues (e.g., reject any scene where occlusion correlates with object color)
Scales — pipeline runs in parallel across Isaac Lab environments; trivial to extend to 10k+ envs if needed
Visual variance only: objects vary in position, scale, and color across episodes but are never labeled semantically. This forces models to ground their reasoning in observed geometry, not training-set priors
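The reason-balance part of the rejection-sampling step can be sketched as a quota scheme: keep drawing random scenes and accept one only while its reason still has room. This is a minimal illustration, not the Isaac Lab pipeline. `sample_scene` is a hypothetical callable returning a dict with a `"reason"` key, and the decorrelation checks (such as the color example above) are left out.

```python
import random

# Target shares taken from the VPT1 distribution table.
TARGET_SHARES = {"in_view": 0.50, "occluded": 0.25, "outside_fov": 0.25}


def sample_balanced(sample_scene, n_scenes, max_tries=100_000):
    """Rejection-sample scenes until every reason meets its quota."""
    quotas = {reason: round(share * n_scenes) for reason, share in TARGET_SHARES.items()}
    total = sum(quotas.values())
    counts = {reason: 0 for reason in quotas}
    accepted = []
    tries = 0

    while len(accepted) < total:
        if tries >= max_tries:
            raise RuntimeError("could not fill quotas; sampler too skewed?")
        tries += 1
        scene = sample_scene()
        reason = scene["reason"]
        if counts[reason] < quotas[reason]:
            accepted.append(scene)
            counts[reason] += 1
        # Otherwise the scene is rejected and a new one is drawn.

    return accepted, counts


# Toy sampler with a deliberately skewed raw distribution.
rng = random.Random(0)
reasons = ["in_view", "occluded", "outside_fov"]
toy = lambda: {"reason": rng.choices(reasons, weights=[0.7, 0.1, 0.2])[0]}
accepted, counts = sample_balanced(toy, 512)
print(counts)
```

The rarest reason sets the cost: the less often the raw generator produces it, the more scenes are discarded before its quota fills.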

Evaluation — ~500-model timm zoo

Two evaluation regimes on ~497 pre-trained models from HuggingFace timm:

Linear Probe (LP) — freeze backbone, train linear classifier on extracted features
Full Fine-tune (FT) — train the whole model
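The difference between the two regimes comes down to which parameters receive gradients. The PyTorch sketch below shows this with a toy backbone standing in for a pre-trained model. In practice a headless backbone would typically come from something like `timm.create_model(name, pretrained=True, num_classes=0)`. The helper names, the toy network, and the SGD optimizer are assumptions of this example, not the project's training code.

```python
import torch
from torch import nn


def make_toy_backbone(feat_dim=16):
    """Hypothetical stand-in for a pre-trained feature extractor."""
    return nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, feat_dim), nn.ReLU())


def build_classifier(backbone, feat_dim, num_classes=2, linear_probe=True):
    """Attach a linear head; freeze the backbone when linear probing."""
    if linear_probe:
        for param in backbone.parameters():
            param.requires_grad = False
        # With real backbones, also keep the frozen part in eval mode so
        # BatchNorm statistics and dropout do not change during training.
    head = nn.Linear(feat_dim, num_classes)
    return nn.Sequential(backbone, head)


def trainable_parameters(model):
    return [p for p in model.parameters() if p.requires_grad]


def train_step(model, optimizer, images, labels):
    """One gradient step on a batch; returns the loss value."""
    loss = nn.functional.cross_entropy(model(images), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


lp_model = build_classifier(make_toy_backbone(), 16, linear_probe=True)
ft_model = build_classifier(make_toy_backbone(), 16, linear_probe=False)
# LP trains only the head's weight and bias; FT trains every tensor.
print(len(trainable_parameters(lp_model)), len(trainable_parameters(ft_model)))
```

Passing only `trainable_parameters(model)` to the optimizer keeps the linear probe from touching backbone weights at all.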

VPT1 results

Chance is 50%.

Statistic	Linear Probe	Full Fine-tune
Mean across ~493 models	55.2%	57.3%
Best single model	62.1%	70.7%
Worst single model	49.6%	51.0%

What the numbers say

Linear probes are near chance. Even the best frozen-feature extractor lands at ~62% — barely above random on a clean geometric task with a known answer.
Fine-tuning helps, but not as much as you’d think. Full FT gains ~2pp on average; the best FT model hits ~71%. Still far from solved. Humans on this task are at ceiling (~100%).
The worst models sit at essentially chance under both regimes (49.6% LP, 51.0% FT). A meaningful chunk of the ~500-model zoo learns nothing useful for line-of-sight, frozen or fine-tuned.
Bigger inputs help slightly (384px > 224px). Architecture family matters more than input size — BEiTv2, EVA, Swin variants lead under FT; DINOv2 leads under LP.

The takeaway: modern vision backbones cannot do line-of-sight reasoning from a single frame. The shortcut surface they normally rely on isn’t available, and what’s left is weak.

Depth Ordering & VPT2 results

In progress. Currently running the ~500-model sweep on both tasks; will update once results are in.

What’s next

Depth Ordering + VPT2 results across the full timm zoo
Compare LP/FT performance on VPT1 vs VPT2 — does perspective-taking degrade more than line-of-sight?
Tie back into NavJEPA — does an action-conditioned latent predictor close the gap that frozen backbones can’t?

Stack: Python · PyTorch · Isaac Sim / Isaac Lab · HuggingFace timm · WebDataset · SLURM (Oscar HPC)

VPTnav repo →