VPTnav
Synthetic benchmark + procedural RL env generation pipeline for testing depth, line-of-sight, and perspective-taking in vision models. Built at scale on Isaac Sim / Isaac Lab.
Why this matters
Human children figure out line-of-sight by around 1–2 years old. Hide a toy behind a couch and a toddler crawls around to look — they already represent that the object persists and reason about what they (and others) can or can’t see.
Modern vision models cannot do this reliably. They can name objects in a scene. They can sometimes encode depth. But ask “can the person on the left see the cup behind the box?” and they fall back on shortcuts — texture cues, object co-occurrence, training-set priors that look like spatial reasoning until you control for them.
This is more than a benchmark gap. It’s an AI safety and interpretability problem:
- If a vision system can’t tell what’s visible to whom, what is it learning? Shortcut features? Statistical priors masquerading as understanding?
- Autonomous systems acting on a fake model of what others perceive are dangerous in obvious ways (driving, robotics, assistive agents).
- From a mechanistic interpretability angle: line-of-sight failures are a clean test bed — there’s a real geometric ground truth, and the model either has it internally or doesn’t.
- From a computational neuroscience angle: VPT is a developmental milestone with a clear neural signature. A model that closes the gap is a candidate computational model of that ability.
VPTnav — what it is
VPTnav is two things:
- A synthetic benchmark with three tasks (Depth Ordering, VPT1, VPT2) that isolate distinct visual-spatial faculties.
- A procedural environment generator built on Isaac Sim / Isaac Lab — rejection-sampled to remove dataset biases, designed to scale.
Each scene has the same three entities:
| Entity | Role |
|---|---|
| Red ball | The target object |
| Green camera | The perspective-taker (asks “what can this camera see?”) |
| Pink cone | The spatial reference (used for VPT2) |
The three tasks
1. Depth Ordering
Question: Is the camera closer to the viewpoint, or is the ball? Labels: 0 (camera closer) / 1 (ball closer), 50/50 split.
Why it’s in the benchmark: it’s a sanity check that models can localize both entities, and it tests depth perception independently of perspective reasoning. A model that fails depth ordering can’t reasonably be expected to do VPT1 — it isn’t even seeing the entities properly.
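As a rough illustration of how a depth-ordering label could be derived from scene geometry, here is a minimal sketch. It assumes the label compares Euclidean distances from the rendering viewpoint (rather than depth along the optical axis); the function name `depth_label` and the example coordinates are hypothetical, not taken from the pipeline.

```python
import math

def depth_label(viewpoint, camera_pos, ball_pos):
    """Return 0 if the green camera is closer to the viewpoint, 1 if the ball is.

    Assumption: "closer" means smaller Euclidean distance to the viewpoint.
    Ties fall through to 1 here; a real generator would presumably reject them.
    """
    d_camera = math.dist(viewpoint, camera_pos)
    d_ball = math.dist(viewpoint, ball_pos)
    return 0 if d_camera < d_ball else 1

# Hypothetical scene: the camera sits nearer to the viewpoint than the ball.
viewpoint = (0.0, 0.0, 1.5)
camera_pos = (2.0, 0.0, 0.5)
ball_pos = (4.0, 1.0, 0.2)
print(depth_label(viewpoint, camera_pos, ball_pos))  # 0
```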
2. VPT1 — Line of sight
Question: Can the green camera see the red ball? Labels: Yes (in_view) or No (occluded or outside_fov).
Distribution:
| Label | Sub-reason | Share |
|---|---|---|
| Yes | in_view | 50% |
| No | occluded (object blocks line-of-sight) | 25% |
| No | outside_fov (camera facing wrong way) | 25% |
The reason-balance matters. If the No cases are not decomposed and balanced by reason, a model can score well by learning a single cue (say, whether the camera faces away) without ever reasoning about occlusion geometry. With the 25/25 split, a model that handles only one of the two No reasons tops out at 75%, so scoring higher requires distinguishing two genuinely different failure modes.
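The three sub-reasons can be illustrated with a small geometric check. This is only a sketch under stated assumptions: the ball is treated as a point, occluders are approximated as spheres, the field of view is a simple cone with a hypothetical 30° half-angle, and `outside_fov` is tested before `occluded`. None of these names or choices come from the actual pipeline.

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _norm(a):
    return math.sqrt(_dot(a, a))

def vpt1_reason(cam_pos, cam_forward, ball_pos, occluders, half_fov_deg=30.0):
    """Classify the ball as 'in_view', 'occluded', or 'outside_fov'.

    occluders: list of (center, radius) spheres standing in for scene geometry.
    Assumes the ball is not located exactly at the camera position.
    """
    to_ball = _sub(ball_pos, cam_pos)
    dist = _norm(to_ball)
    direction = tuple(c / dist for c in to_ball)
    # Angle between the camera's forward axis and the direction to the ball.
    cos_angle = _dot(direction, cam_forward) / _norm(cam_forward)
    if cos_angle < math.cos(math.radians(half_fov_deg)):
        return "outside_fov"
    for center, radius in occluders:
        # Closest point on the camera-to-ball segment to the sphere center.
        t = max(0.0, min(dist, _dot(_sub(center, cam_pos), direction)))
        closest = tuple(p + t * d for p, d in zip(cam_pos, direction))
        if _norm(_sub(center, closest)) < radius:
            return "occluded"
    return "in_view"

def vpt1_label(reason):
    """Binary VPT1 answer: Yes only when the ball is in view."""
    return "Yes" if reason == "in_view" else "No"

cam, fwd = (0.0, 0.0, 0.0), (1.0, 0.0, 0.0)
print(vpt1_reason(cam, fwd, (5.0, 0.0, 0.0), []))                          # in_view
print(vpt1_reason(cam, fwd, (5.0, 0.0, 0.0), [((2.5, 0.0, 0.0), 0.5)]))    # occluded
print(vpt1_reason(cam, fwd, (-5.0, 0.0, 0.0), []))                         # outside_fov
```

Keeping the reason alongside the binary label is what makes it possible to balance the two No sub-classes during generation.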
3. VPT2 — Perspective-relative spatial reasoning
Question: From the green camera’s POV, is the pink cone to the left or right of the red ball? Labels: 0 (left) / 1 (right), 50/50 split.
This is harder than VPT1. To answer correctly the model has to:
- Take the camera’s perspective (mental rotation)
- Project the cone and ball into that perspective
- Compare their lateral positions in that view
VPT1 tests visibility — a binary geometric fact. VPT2 tests what the scene actually looks like from somewhere else. That’s perspective-taking proper, with a chunk of mental simulation baked in (you’re answering a question about an image the model never sees).
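The three steps above can be sketched geometrically. The snippet below assumes a right-handed world frame with z up, a camera that is not looking straight up or down, and that both objects lie in front of the camera; it compares the perspective-projected horizontal coordinates of the cone and the ball. The helper names (`image_x`, `vpt2_label`) are hypothetical.

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def _unit(a):
    n = math.sqrt(_dot(a, a))
    return tuple(x / n for x in a)

def image_x(cam_pos, cam_forward, point, world_up=(0.0, 0.0, 1.0)):
    """Horizontal image coordinate of `point` in the camera's view (right is positive)."""
    forward = _unit(cam_forward)
    right = _unit(_cross(forward, world_up))  # camera's right-hand axis
    rel = _sub(point, cam_pos)
    depth = _dot(rel, forward)
    if depth <= 0:
        raise ValueError("point is behind the camera")
    # Perspective divide: lateral offset scaled by distance along the view axis.
    return _dot(rel, right) / depth

def vpt2_label(cam_pos, cam_forward, cone_pos, ball_pos):
    """Return 0 if the cone appears left of the ball from the camera, else 1."""
    cone_x = image_x(cam_pos, cam_forward, cone_pos)
    ball_x = image_x(cam_pos, cam_forward, ball_pos)
    return 0 if cone_x < ball_x else 1

cone, ball = (5.0, 1.0, 0.0), (5.0, 0.0, 0.0)
# Same objects, two opposite camera placements: the answer flips.
print(vpt2_label((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), cone, ball))    # 0
print(vpt2_label((10.0, 0.0, 0.0), (-1.0, 0.0, 0.0), cone, ball))  # 1
```

The flip between the two calls is the point of the task: the label depends on the green camera's pose, not on where the objects sit in the rendered image.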
Dataset & procedural generation
Each task has 512 environments, split 256 train / 128 val / 128 test (an initial 50/50 train/test split, with val carved from the test half). Each environment is rendered from 10 unique viewpoints, giving ~2.5k train / ~1.3k val / ~1.3k test images per task.
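The split arithmetic can be checked with a short sketch. It assumes splitting happens at the environment level (so all 10 views of an environment land in the same split), which the counts above suggest; the function name and seed are illustrative.

```python
import random

def split_environments(n_envs=512, images_per_env=10, seed=0):
    """Split environment ids 50/25/25 into train/val/test and count images."""
    env_ids = list(range(n_envs))
    random.Random(seed).shuffle(env_ids)
    n_train = n_envs // 2
    n_val = (n_envs - n_train) // 2  # val is carved out of the held-out half
    splits = {
        "train": env_ids[:n_train],
        "val": env_ids[n_train:n_train + n_val],
        "test": env_ids[n_train + n_val:],
    }
    counts = {name: len(ids) * images_per_env for name, ids in splits.items()}
    return splits, counts

splits, counts = split_environments()
print(counts)  # {'train': 2560, 'val': 1280, 'test': 1280}
```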
Generation pipeline:
- Procedural scene synthesis in Isaac Sim / Isaac Lab — random poses, colors, scales, scene clutter
- Rejection sampling to enforce reason-balance and remove correlated cues (e.g., reject any scene where occlusion correlates with object color)
- Scales — pipeline runs in parallel across Isaac Lab environments; trivial to extend to 10k+ envs if needed
- Object handling = visual variance only — objects vary in position, scale, color across episodes but are never labeled semantically. Forces models to ground reasoning in observed geometry, not training-set priors
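The rejection-sampling step might look roughly like the sketch below. It covers only the quota-filling part (reason-balance), not the decorrelation of cues such as color, and it uses a toy scene sampler in place of Isaac Sim. The names `rejection_sample` and `toy_scene`, and the sampler's skewed weights, are hypothetical.

```python
import random
from collections import Counter

def rejection_sample(sample_scene, classify, quotas, max_tries=100_000):
    """Draw scenes until every label bucket in `quotas` is exactly full."""
    accepted, counts = [], Counter()
    target = sum(quotas.values())
    tries = 0
    while len(accepted) < target:
        if tries >= max_tries:
            raise RuntimeError("quota not reached; the sampler may be too biased")
        tries += 1
        scene = sample_scene()
        reason = classify(scene)
        # Accept only while this bucket still has room; otherwise reject.
        if counts[reason] < quotas.get(reason, 0):
            accepted.append(scene)
            counts[reason] += 1
    return accepted, counts

rng = random.Random(0)

def toy_scene():
    """Stand-in for a simulator draw; deliberately biased toward in_view."""
    reason = rng.choices(["in_view", "occluded", "outside_fov"],
                         weights=[0.7, 0.1, 0.2])[0]
    return {"reason": reason}

# VPT1 shares (50% / 25% / 25%) applied to 512 environments.
quotas = {"in_view": 256, "occluded": 128, "outside_fov": 128}
scenes, counts = rejection_sample(toy_scene, lambda s: s["reason"], quotas)
print(dict(counts))
```

Even though the toy sampler produces `in_view` 70% of the time, the accepted set matches the target shares exactly; the cost is the discarded draws.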
Evaluation — ~500-model timm zoo
Two evaluation regimes on ~497 pre-trained models from HuggingFace timm:
- Linear Probe (LP) — freeze backbone, train linear classifier on extracted features
- Full Fine-tune (FT) — train the whole model
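A minimal sketch of the linear-probe regime in PyTorch follows. To keep it self-contained it uses a tiny stand-in backbone and random tensors instead of real images; with timm, a backbone created via `timm.create_model(name, pretrained=True, num_classes=0)` returns pooled features and could be substituted. The hyperparameters and the `linear_probe` helper are illustrative assumptions, not the settings used in the sweep.

```python
import torch
from torch import nn

def linear_probe(backbone, feat_dim, images, labels, epochs=50, lr=0.1):
    """Freeze `backbone`, extract features once, and train a linear head on them."""
    backbone.eval()
    for p in backbone.parameters():
        p.requires_grad_(False)  # LP: backbone weights never change
    with torch.no_grad():
        feats = backbone(images)
    head = nn.Linear(feat_dim, 2)  # binary task, e.g. VPT1 Yes/No
    opt = torch.optim.SGD(head.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(head(feats), labels)
        loss.backward()
        opt.step()
    return head

torch.manual_seed(0)
# Stand-in backbone (assumption); a pretrained feature extractor would go here.
backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 64), nn.ReLU())
images = torch.randn(16, 3, 32, 32)
labels = torch.randint(0, 2, (16,))
head = linear_probe(backbone, 64, images, labels)
print(head.weight.shape)
```

Full fine-tuning differs in that the backbone parameters stay trainable and are passed to the optimizer together with the head.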
VPT1 results
Chance is 50%.
| | Linear Probe | Full Fine-tune |
|---|---|---|
| Mean across ~493 models | 55.2% | 57.3% |
| Best single model | 62.1% | 70.7% |
| Worst single model | 49.6% | 51.0% |
What the numbers say
- Linear probes are near chance (55.2% on average). Even the best frozen-feature extractor lands at ~62%, only ~12pp above chance on a clean geometric task with a known answer.
- Fine-tuning helps, but not as much as you’d think. Full FT gains ~2pp on average; the best FT model hits ~71%. Still far from solved. Humans on this task are at ceiling (~100%).
- The worst models sit essentially at chance under both regimes (49.6% LP, 51.0% FT). A meaningful chunk of the ~500-model zoo learns nothing useful for line-of-sight, frozen or fine-tuned.
- Bigger inputs help slightly (384px > 224px). Architecture family matters more than input size — BEiTv2, EVA, Swin variants lead under FT; DINOv2 leads under LP.
The takeaway: modern vision backbones cannot do line-of-sight reasoning from a single frame. The shortcut surface they normally rely on isn’t available, and what’s left is weak.
Depth Ordering & VPT2 results
In progress. Currently running the ~500-model sweep on both tasks; will update once results are in.
What’s next
- Depth Ordering + VPT2 results across the full timm zoo
- Compare LP/FT performance on VPT1 vs VPT2 — does perspective-taking degrade more than line-of-sight?
- Tie back into NavJEPA — does an action-conditioned latent predictor close the gap that frozen backbones can’t?
Stack: Python · PyTorch · Isaac Sim / Isaac Lab · HuggingFace timm · WebDataset · SLURM (Oscar HPC)