NavJEPA
Lightweight action-conditioned mental simulation in latent space for visual perspective-taking
Setup
NavJEPA is built on top of VPTnav. VPTnav establishes that vision models can’t answer visual perspective-taking questions from a single frame — they need to move and predict the consequence of that movement. NavJEPA is that predictor.
Architecture
frozen ViT CLS → linear+BN adapter → action-conditioned Transformer predictor → dreamed future latent → linear probe
- Frozen ViT-S DINOv2 encoder (CLS token only).
- Action-conditioned Transformer predictor: given a sequence of actions, it rolls the latent forward in time.
- SIGReg regularization to prevent embedding collapse.
- Scheduled sampling for stable multi-step autoregressive rollout.
- Linear probes evaluated every epoch on dreamed final latents.
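
The pipeline above can be sketched in PyTorch roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual code: the class name `LatentPredictor`, the latent width (256), the action dimension (4), the layer counts, and the way actions are fused with latents (additive projection) are all hypothetical. The only size taken from a known fact is the 384-dimensional CLS token of DINOv2 ViT-S, and random tensors stand in for the frozen encoder's output. SIGReg is omitted; scheduled sampling is shown in its simplest form, one coin flip per step for the whole batch.

```python
import torch
import torch.nn as nn


class LatentPredictor(nn.Module):
    """Illustrative action-conditioned predictor over frozen CLS embeddings."""

    def __init__(self, enc_dim=384, dim=256, action_dim=4, max_len=32):
        super().__init__()
        # Linear + BatchNorm adapter applied to the frozen encoder's CLS token.
        self.adapter = nn.Sequential(nn.Linear(enc_dim, dim), nn.BatchNorm1d(dim))
        # Assumed fusion: actions are projected and added to the latent tokens.
        self.action_proj = nn.Linear(action_dim, dim)
        self.pos = nn.Parameter(torch.zeros(max_len, dim))
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=4, dim_feedforward=4 * dim, batch_first=True
        )
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, dim)

    def step(self, latents, actions):
        # latents: (B, t, dim), actions: (B, t, action_dim) -> next latent (B, dim)
        t = latents.shape[1]
        tokens = latents + self.action_proj(actions) + self.pos[:t]
        return self.head(self.backbone(tokens)[:, -1])

    def rollout(self, cls0, actions, targets=None, teacher_prob=0.0):
        # cls0: (B, enc_dim); actions: (B, T, action_dim)
        # targets: optional (B, T, dim) ground-truth latents in adapter space.
        history = [self.adapter(cls0)]
        preds = []
        for t in range(actions.shape[1]):
            context = torch.stack(history, dim=1)           # (B, t + 1, dim)
            z_next = self.step(context, actions[:, : t + 1])
            preds.append(z_next)
            # Scheduled sampling: with probability teacher_prob feed the true
            # latent back in, otherwise feed the model's own prediction.
            use_truth = targets is not None and torch.rand(()).item() < teacher_prob
            history.append(targets[:, t] if use_truth else z_next)
        return torch.stack(preds, dim=1)                    # (B, T, dim)


model = LatentPredictor().eval()
cls0 = torch.randn(8, 384)       # stand-in for frozen DINOv2 ViT-S CLS tokens
actions = torch.randn(8, 5, 4)   # stand-in for a 5-step action sequence
with torch.no_grad():
    dreamed = model.rollout(cls0, actions)   # (8, 5, 256)
probe = nn.Linear(256, 2)                    # linear probe on the final dreamed latent
logits = probe(dreamed[:, -1])               # (8, 2)
```

Lowering `teacher_prob` toward 0 over training moves the model from teacher-forced inputs to feeding on its own predictions, which is the regime it faces in a multi-step rollout at evaluation time.
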
Why latent, not pixel?
The motivation isn’t computational. It’s that pixels are mostly noise we don’t care about.
The texture of an object, the color of the back wall, lighting variation, surface micro-detail — none of that information helps answer “can A see B?”. What matters is the semantics of the world: where an object is, what space it occupies, what occludes what.
A pixel-reconstruction predictor wastes its capacity rendering all the noise. A latent predictor is forced to keep what’s semantically useful and throw away the rest. NavJEPA imagines the consequence of an action directly at the representational level, which is much closer to what an agent actually needs in order to reason about line-of-sight.
Results
Best run (exp11) on the v18_compiled_20k dataset:
| Metric | Value |
|---|---|
| val/cosine | 0.677 |
| val/full_episode/horizon_cos03 | 7.24 |
| test/full_episode/horizon_cos03 | 6.27 |
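
The `horizon_cos03` metrics are not defined here. One plausible reading, offered purely as an assumption, is the number of rollout steps for which the cosine similarity between the dreamed latent and the true latent stays at or above 0.3, averaged over episodes. Under that reading, a per-episode horizon could be computed as in the sketch below; the function name `cosine_horizon` and the threshold default are hypothetical.

```python
import torch
import torch.nn.functional as F


def cosine_horizon(pred, target, threshold=0.3):
    """Count the leading rollout steps whose cosine similarity is >= threshold.

    pred, target: (T, D) dreamed and ground-truth latents for one episode.
    Assumed definition; the project's actual metric may differ.
    """
    cos = F.cosine_similarity(pred, target, dim=-1)   # (T,)
    below = (cos < threshold).nonzero()               # indices of failing steps
    if below.numel() == 0:
        return cos.shape[0]                           # never dropped below threshold
    return int(below[0, 0].item())                    # first failing step index


target = torch.eye(4)            # four orthogonal unit latents, one per step
pred = target.clone()
pred[2] = -pred[2]               # step 2 points the opposite way (cosine = -1)
horizon = cosine_horizon(pred, target)   # steps 0 and 1 pass, so horizon is 2
```
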
Stack: PyTorch · Isaac Lab / Isaac Sim · WebDataset · SLURM (Oscar HPC)