mindweather
SAE feature steering on Gemma 3 — bend a language model's emotional weather via sparse autoencoder features
Why
Sparse autoencoders (SAEs) have become a central tool in mechanistic interpretability: they decompose a model’s internal activations into a sparse, hopefully interpretable feature basis. Gemma Scope 2 (Google DeepMind, 2025) released open SAEs trained on Gemma 3, making it possible for anyone with a GPU to do real interpretability work on a real, capable open-weights model.
mindweather is my tinkering project on top of Gemma Scope 2: find emotion-specific features in Gemma 3’s residual stream and use them to steer generation. The goal isn’t a product — it’s to build hands-on intuition for what SAE features actually are, how clean they are, and what “steering” looks like in practice.
Pipeline
1. Load + sanity check
Gemma 3-1B-IT and its SAE run on MPS (MacBook M4). Verify that the SAE’s reconstruction MSE on captured residuals is low, and capture those residuals via PyTorch forward hooks.
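A minimal sketch of this sanity check, using a toy stack of linear layers and an untrained ReLU autoencoder in place of Gemma 3 and the Gemma Scope 2 SAE. Both stand-ins, and the names `ToySAE` and `capture_hook`, are assumptions for illustration. It shows only the forward-hook mechanics and the MSE computation; with the real model the hook would go on the chosen decoder layer and the SAE would carry trained weights.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class ToySAE(nn.Module):
    """Stand-in autoencoder (ReLU encoder, linear decoder), not the real SAE."""
    def __init__(self, d_model, d_sae):
        super().__init__()
        self.enc = nn.Linear(d_model, d_sae)
        self.dec = nn.Linear(d_sae, d_model)

    def encode(self, x):
        return torch.relu(self.enc(x))

    def decode(self, f):
        return self.dec(f)

# Toy "model": four linear blocks standing in for transformer layers.
model = nn.Sequential(*[nn.Linear(16, 16) for _ in range(4)])
sae = ToySAE(d_model=16, d_sae=64)

captured = {}

def capture_hook(module, inputs, output):
    # Some layers return a tuple whose first element is the hidden state;
    # handle both cases rather than assume one.
    hidden = output[0] if isinstance(output, tuple) else output
    captured["resid"] = hidden.detach()

handle = model[2].register_forward_hook(capture_hook)
with torch.no_grad():
    model(torch.randn(1, 5, 16))  # (batch, tokens, d_model)
handle.remove()  # always remove hooks once the capture is done

resid = captured["resid"]
with torch.no_grad():
    recon = sae.decode(sae.encode(resid))

mse = torch.mean((recon - resid) ** 2).item()
# Normalizing by the mean squared residual makes the number scale-free.
normalized_mse = mse / torch.mean(resid ** 2).item()
print(f"MSE={mse:.4f}  normalized={normalized_mse:.4f}")
```

The toy autoencoder is untrained, so its error is not expected to be low. With real residuals, whose norms are large, the normalized value is likely easier to judge than the raw MSE.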
2. Find emotion features
Run 12 prompts per emotion (anger, sadness, joy, fear, disgust, surprise, love, anxiety) plus 20 neutral prompts through the model, dump residuals at layer 13, encode them through the SAE, and rank features by the difference in mean activation, mean(emotion) − mean(neutral). Output: ranked feature IDs per emotion.
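The ranking step can be sketched with synthetic activations. This assumes each prompt has already been reduced to one SAE activation vector (for example by averaging over tokens, which is an assumption here, not something the pipeline description specifies). `rank_features` is a hypothetical helper name, and the data is random with one planted feature.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features = 50

# Synthetic SAE activations: rows are prompts, columns are features.
neutral = rng.random((20, n_features)) * 0.1   # 20 neutral prompts
emotion = rng.random((12, n_features)) * 0.1   # 12 prompts for one emotion
emotion[:, 7] += 2.0  # plant a feature that fires only on emotion prompts

def rank_features(emotion_acts, neutral_acts, top_k=5):
    """Rank features by mean(emotion) - mean(neutral), largest first."""
    diff = emotion_acts.mean(axis=0) - neutral_acts.mean(axis=0)
    order = np.argsort(-diff)[:top_k]
    return [(int(i), float(diff[i])) for i in order]

top = rank_features(emotion, neutral)
for feat_id, score in top:
    print(f"feature {feat_id:>3}  diff={score:+.3f}")
```

A plain mean difference favors features with large activations, so a feature that is strong on emotion prompts but also fires on several other emotions can still rank highly. Checking each candidate against the other emotion sets is one way to tell whether it is specific.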
3. Steer
Steer generation by adding the decoder rows of selected features into the residual stream, each multiplied by a chosen scale. Several emotions can be mixed in one run, and scales are signed: a negative scale suppresses the feature.
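A minimal sketch of the steering arithmetic, again on a toy layer with a random decoder matrix rather than the real model and SAE. `build_steering_vector` and `make_steering_hook` are hypothetical names, and the feature IDs and scales are arbitrary. It relies on the PyTorch behavior that a forward hook returning a value replaces the module output.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, d_sae = 16, 64

# Stand-in decoder matrix with unit-norm rows (one row per SAE feature).
W_dec = torch.randn(d_sae, d_model)
W_dec = W_dec / W_dec.norm(dim=1, keepdim=True)

def build_steering_vector(W_dec, mix):
    """Sum scale * decoder_row over a {feature_id: signed_scale} mapping."""
    vec = torch.zeros(W_dec.shape[1])
    for feat_id, scale in mix.items():
        vec = vec + scale * W_dec[feat_id]
    return vec

def make_steering_hook(vec):
    def hook(module, inputs, output):
        # Returning a value from a forward hook replaces the module output.
        if isinstance(output, tuple):
            return (output[0] + vec,) + output[1:]
        return output + vec
    return hook

layer = nn.Linear(d_model, d_model)  # stands in for the steered layer
x = torch.randn(1, 5, d_model)       # (batch, tokens, d_model)

with torch.no_grad():
    baseline = layer(x)
    vec = build_steering_vector(W_dec, {3: 4.0, 10: -2.0})  # negative = suppress
    handle = layer.register_forward_hook(make_steering_hook(vec))
    steered = layer(x)
    handle.remove()
    restored = layer(x)  # hook removed, so output matches the baseline again

print("added norm per token:", vec.norm().item())
```

The same vector is added at every token position here. With a real model the vector would also need to match the dtype and device of the residual stream.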
Example
python steer.py --prompt "Pretend you are losing your mind. Describe a normal Tuesday." \
--mix sadness=350,fear=350,surprise=350,anger=300 --max-new 200
Okay… okay… *sniff* …let me try to describe this. It’s a Tuesday. Right? But what is a Tuesday? … It was a blue clock, a big, rusty blue clock! And it didn’t know how many times it had to spin! Then the floor. Oh, the floor! It was cold! Like a thousand tiny monsters were hiding underneath it!
Curated features (cleanest single-emotion picks)
| Emotion | Feat ID | Notes |
|---|---|---|
| anger | 5088 | clean anger-only |
| sadness | 2697 | strong, sadness-only |
| joy | 2562 | clean |
| fear | 15713 | fear-only |
| love | 293 | strongest specific feat |
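One way to connect this table to a `--mix` string is a small lookup plus parser. This is a hypothetical sketch, not the parsing code in `steer.py`; the `CURATED` mapping and `parse_mix` are names introduced here, and the mapping covers only the five emotions listed in the table.

```python
# Feature IDs copied from the curated table above.
CURATED = {
    "anger": 5088,
    "sadness": 2697,
    "joy": 2562,
    "fear": 15713,
    "love": 293,
}

def parse_mix(text):
    """Turn 'sadness=350,fear=-200' into {feature_id: signed_scale}."""
    mix = {}
    for part in text.split(","):
        name, sep, value = part.partition("=")
        name = name.strip()
        if not sep:
            raise ValueError(f"expected name=scale, got {part!r}")
        if name not in CURATED:
            raise ValueError(f"no curated feature for emotion {name!r}")
        mix[CURATED[name]] = float(value)
    return mix

print(parse_mix("sadness=350,fear=-200"))
```

Rejecting unknown names early avoids silently steering with nothing when an emotion is misspelled.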
Scale calibration
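Decoder rows are unit-norm, so each scale below is the norm of the vector added for that feature. For comparison, the layer 13 residual norm is about 6500 per token.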
| Scale | Effect |
|---|---|
| 100-200 | Barely visible |
| 300-500 | Mild stylistic shift |
| 500-900 | Clear emotion bleed-through |
| 1000-1500 | Strong, coherence at risk |
| 2000+ | Gibberish |
Sweet spot: 500–700.
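To put the scales in context, the snippet below expresses each one as a fraction of the residual norm. It assumes a single unit-norm decoder row and takes the approximate figure of 6500 from the text; `relative_size` is a hypothetical helper.

```python
RESID_NORM = 6500.0  # approximate layer 13 residual norm per token

def relative_size(scale, resid_norm=RESID_NORM):
    """Norm of one scaled unit-norm decoder row as a fraction of the residual norm."""
    return abs(scale) / resid_norm

for scale in (100, 300, 500, 700, 1000, 2000):
    print(f"scale {scale:>5}: {relative_size(scale):.1%} of residual norm")
```

By this measure the 500–700 range corresponds to a perturbation of roughly 8% to 11% of the residual norm.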
What I learned
- Single-emotion features are surprisingly clean at layer 13. Many emotions have a clear winner feature that’s almost emotion-pure.
- Multi-emotion mixing is non-trivial. Adding two strong emotion vectors doesn’t give you a “scared and sad” output — it often crashes coherence. The geometry of the residual stream isn’t well-modeled as a simple sum of emotion subspaces.
- Layer matters. Layer 13 features look semantic. Layers 17 and 22 (other available SAEs) appear to shift toward different abstractions, but I have not compared them directly yet.
- SAE-based interp is real now in a way it wasn’t even 18 months ago. The tooling, the trained SAEs, the precedent papers — it’s accessible.
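One diagnostic that may help when a mix loses coherence is to check the total size of the combined steering vector and how aligned the chosen decoder rows are. The sketch below uses orthonormal stand-in rows rather than real SAE weights, so it only illustrates the arithmetic: scales add in quadrature when rows are orthogonal, and the sum grows larger when rows are positively aligned. `mix_norm` and `pairwise_cosine` are hypothetical helper names.

```python
import numpy as np

def mix_norm(rows, scales):
    """Norm of sum_i scales[i] * rows[i]."""
    combined = (np.asarray(scales, dtype=float)[:, None] * rows).sum(axis=0)
    return float(np.linalg.norm(combined))

def pairwise_cosine(rows):
    """Cosine similarity between every pair of rows."""
    unit = rows / np.linalg.norm(rows, axis=1, keepdims=True)
    return unit @ unit.T

# Stand-in decoder rows: four orthonormal vectors (an assumption, not real weights).
rows = np.eye(8)[:4]
scales = [350, 350, 350, 300]  # the mix from the example command

total = mix_norm(rows, scales)
cos = pairwise_cosine(rows)
print(f"combined norm: {total:.1f}")
print("max off-diagonal cosine:", float(np.abs(cos - np.eye(4)).max()))
```

Under the orthogonality assumption, the four-emotion mix from the example has a combined norm of about 676. Real decoder rows will generally not be exactly orthogonal, so the actual value could be higher or lower.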
What’s next
- Gradio app with sliders per emotion + A/B view
- Layer 17 / 22 SAE comparison on the same prompt set
- Explore non-emotion features (writing style, persona, topic) via same pipeline
- Transcoder-based cross-layer steering
Stack: Python · PyTorch · transformer_lens · Gemma Scope 2 · HuggingFace
Status: Phases 1–3 done (load, feature discovery, CLI steering). Phases 4–5 in progress.