Time-Masked Autoencoders for Fluid Dynamics
Temporal masking in video autoencoders for Shallow Water simulations — collaboration with Imperial College London
Background
Fluid dynamics simulations (Navier-Stokes, Shallow Water, etc.) produce spatiotemporal fields that evolve under known physical laws. Two distinct challenges:
- Forecasting — given the first N timesteps, predict the next M
- Reconstruction under partial observation — fill in missing frames from sparse observations (think: satellite imagery with cloud cover, or sensor dropouts)
Both reduce to “learn the dynamics well enough to interpolate/extrapolate.” Standard video prediction models (ConvLSTM, transformer baselines) struggle here because the structure of fluid dynamics is highly regular (governed by PDEs) but the appearance is chaotic in detail.
The question
Masked Autoencoders (MAE, VideoMAE) work shockingly well for image and video representation learning: randomly hide 75-90% of patches, learn to reconstruct the hidden ones from whatever stays visible, and somehow the resulting features generalize beautifully.
What if we apply this to fluid dynamics? Specifically, mask entire temporal frames (not random spatial patches) — does this force the model to learn the underlying dynamical structure?
Approach
- Architecture: ViT-style autoencoder operating on stacks of Shallow Water simulation frames
- Time masking: randomly remove k frames from the input sequence; train to reconstruct the full sequence
- Masking ratios tested: 25%, 50%, 80% of input frames
- Forecasting variant: predict up to 20 future frames from a short history
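To make the masking step concrete, here is a minimal NumPy sketch of how whole frames could be dropped from a sequence. It is an illustration rather than the project code: the function names (`time_mask`, `forecast_mask`, `apply_mask`), the `(T, H, W)` layout, and the random placeholder data are all assumptions. It also shows that forecasting can be written as the special case where every frame after a short history is hidden.

```python
import numpy as np

def time_mask(num_frames, mask_ratio, rng):
    """Return a boolean array that is True where a whole frame is hidden."""
    num_masked = int(round(num_frames * mask_ratio))
    hidden = np.zeros(num_frames, dtype=bool)
    # Sample distinct frame indices so exactly num_masked frames are hidden.
    hidden[rng.choice(num_frames, size=num_masked, replace=False)] = True
    return hidden

def forecast_mask(num_frames, history):
    """Forecasting as masking: hide every frame after the first `history`."""
    hidden = np.zeros(num_frames, dtype=bool)
    hidden[history:] = True
    return hidden

def apply_mask(frames, hidden):
    """Split a (T, H, W) sequence into its visible frames and their time indices."""
    visible_idx = np.flatnonzero(~hidden)
    return frames[visible_idx], visible_idx

rng = np.random.default_rng(0)
frames = rng.standard_normal((20, 64, 64))  # placeholder data, not simulation output
hidden = time_mask(len(frames), 0.8, rng)   # 80% of the frames are hidden
visible, visible_idx = apply_mask(frames, hidden)
```

In this sketch the encoder would only see `visible` together with `visible_idx` (so it knows where each frame sits in time), and the decoder would be asked to produce all 20 frames.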
Results
| Setup | Mask % | Frames predicted | SSIM |
|---|---|---|---|
| Reconstruction | 50% | (filled in) | ~0.92 |
| Reconstruction | 80% | (filled in) | ~0.85 |
| Forecasting | n/a | 20 future | 0.80+ |
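For readers unfamiliar with SSIM, the sketch below computes a simplified, single-window version of it in NumPy. This is an assumption-laden illustration, not the evaluation code behind the table: standard SSIM averages the same expression over local sliding windows (for example, as scikit-image's `structural_similarity` does), whereas `global_ssim` here is a hypothetical helper that treats the whole frame as one window. The constants 0.01 and 0.03 are the conventional defaults.

```python
import numpy as np

def global_ssim(x, y, data_range):
    """Single-window SSIM between two frames of equal shape.

    Simplification: statistics are taken over the whole frame instead of
    being averaged over local windows as in standard SSIM.
    """
    c1 = (0.01 * data_range) ** 2  # stabilizes the luminance term
    c2 = (0.03 * data_range) ** 2  # stabilizes the contrast/structure term
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x**2 + mu_y**2 + c1) * (var_x + var_y + c2)
    return numerator / denominator

rng = np.random.default_rng(0)
target = rng.random((64, 64))                           # placeholder frame in [0, 1)
noisy = target + 0.1 * rng.standard_normal((64, 64))    # stand-in for a prediction
score_same = global_ssim(target, target, data_range=1.0)
score_noisy = global_ssim(target, noisy, data_range=1.0)
```

A score of 1 means the two frames are identical; anything that shifts the mean, rescales the contrast, or decorrelates the structure pulls the score below 1.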
What we learned
- High temporal masking is a strong regularizer — forces the model to learn what the dynamics must look like, not memorize pixel-level shortcuts
- Generalization improved on out-of-distribution initial conditions compared to a forecaster trained without masking
- Spatial-only masking underperformed, which suggests that hiding whole frames, rather than patches within them, is what pushes the model toward the PDE structure
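One way to see why the two masking schemes differ is to compare what stays visible at each timestep. The sketch below is a hedged illustration, not the project code: it assumes "spatial-only" means hiding random patches independently in every frame, and the patch-grid size and function names are hypothetical. At the same overall ratio, temporal masking leaves some timesteps with no information at all, while spatial masking always leaves part of every frame visible.

```python
import numpy as np

def temporal_mask(t, hp, wp, ratio, rng):
    """Hide every patch of a random subset of frames (True = hidden)."""
    hidden = np.zeros((t, hp, wp), dtype=bool)
    num_frames = int(round(t * ratio))
    hidden[rng.choice(t, size=num_frames, replace=False)] = True
    return hidden

def spatial_mask(t, hp, wp, ratio, rng):
    """Hide a random subset of patches independently in each frame."""
    num_patches = int(round(hp * wp * ratio))
    hidden = np.zeros((t, hp * wp), dtype=bool)
    for frame in hidden:  # each row is a view, so this edits `hidden` in place
        frame[rng.choice(hp * wp, size=num_patches, replace=False)] = True
    return hidden.reshape(t, hp, wp)

rng = np.random.default_rng(0)
t_mask = temporal_mask(20, 8, 8, 0.5, rng)
s_mask = spatial_mask(20, 8, 8, 0.5, rng)

# Count timesteps at which the model sees nothing at all.
fully_hidden_temporal = int(t_mask.all(axis=(1, 2)).sum())
fully_hidden_spatial = int(s_mask.all(axis=(1, 2)).sum())
```

With spatial masking the model can, in principle, fill gaps by interpolating from neighbouring patches in the same frame. With temporal masking that shortcut is unavailable, so the only route to a hidden frame is through how the field evolves in time.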
What came of it
This was a focused collaboration with Imperial College London researchers working on physics-informed ML. The findings fed into their larger program on scalable PDE surrogates. For me personally, it was the project that hooked me on predictive world models — the same intuition (predict in latent space, throw away noise) shows up in NavJEPA today.
Stack: Python · PyTorch · TensorFlow · NumPy
Period: Nov 2023 – Jan 2024