mindweather

SAE feature steering on Gemma 3 — bend a language model's emotional weather via sparse autoencoder features

Why

Sparse autoencoders (SAEs) have become a central tool in mechanistic interpretability: they decompose a model’s internal activations into a sparse, hopefully interpretable feature basis. Gemma Scope 2 (Google DeepMind, 2025) released open SAEs trained on Gemma 3, making it possible for anyone with a GPU to do real interp work on a real, frontier-quality model.

mindweather is my tinkering project on top of Gemma Scope 2: find emotion-specific features in Gemma 3’s residual stream and use them to steer generation. The goal isn’t a product — it’s to build hands-on intuition for what SAE features actually are, how clean they are, and what “steering” looks like in practice.

Pipeline

1. Load + sanity check

Run Gemma 3-1B-IT and the SAE on MPS (MacBook M4). Verify that the SAE’s reconstruction MSE is low, and capture residual-stream activations via PyTorch hooks.
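The sanity check above can be sketched on a toy setup. This is not the project's actual loading code: the `nn.Linear` block stands in for a Gemma 3 layer, `ToySAE` is a hypothetical plain-ReLU autoencoder with random weights (a trained Gemma Scope SAE would be loaded instead), and all sizes are made up. It only illustrates the two mechanics named above: capturing an activation with a forward hook and measuring reconstruction error.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, d_sae = 16, 64  # toy sizes, not Gemma's

# Toy stand-in for one transformer block; the real target would be a Gemma 3 layer.
block = nn.Linear(d_model, d_model)

captured = {}

def capture_hook(module, inputs, output):
    # Store a detached copy so later steps cannot modify the live activation.
    captured["resid"] = output.detach().clone()

handle = block.register_forward_hook(capture_hook)
x = torch.randn(1, 5, d_model)  # (batch, tokens, d_model)
with torch.no_grad():
    block(x)
handle.remove()  # remove the hook once the capture is done


class ToySAE(nn.Module):
    """Hypothetical plain-ReLU autoencoder with random weights (assumption)."""

    def __init__(self, d_model, d_sae):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, d_sae) * 0.1)
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.W_dec = nn.Parameter(torch.randn(d_sae, d_model) * 0.1)
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def encode(self, resid):
        return torch.relu(resid @ self.W_enc + self.b_enc)

    def decode(self, acts):
        return acts @ self.W_dec + self.b_dec


sae = ToySAE(d_model, d_sae)
resid = captured["resid"]
with torch.no_grad():
    recon = sae.decode(sae.encode(resid))

mse = torch.mean((recon - resid) ** 2).item()
# Dividing by the activation variance gives a scale-free number to eyeball.
relative_mse = mse / resid.var().item()
print(f"MSE={mse:.4f}  MSE/variance={relative_mse:.4f}")
```

With random SAE weights the error will be large; with a trained SAE the same check should come out low, which is the point of the sanity step.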

2. Find emotion features

Run 12 prompts per emotion (anger, sadness, joy, fear, disgust, surprise, love, anxiety) + 20 neutral prompts through the model, dump residuals at layer 13, encode through the SAE, and rank features by mean(emotion) − mean(neutral). Output: ranked feature IDs per emotion.
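The ranking rule, mean(emotion) − mean(neutral), can be illustrated with synthetic data. In this sketch the activation matrices are random stand-ins with one row per prompt (how activations are pooled over tokens is an assumption here), and `rank_features` is a hypothetical helper name, not necessarily what the repo uses.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features = 50  # toy SAE width

# Hypothetical SAE activations, one row per prompt (already pooled over tokens).
neutral = rng.random((20, n_features)) * 0.1  # 20 neutral prompts
anger = rng.random((12, n_features)) * 0.1    # 12 emotion prompts
anger[:, 7] += 2.0   # plant a feature that fires strongly on "anger" prompts
anger[:, 31] += 0.5  # and a weaker one


def rank_features(emotion_acts, neutral_acts, top_k=5):
    """Rank features by mean(emotion) - mean(neutral), largest first."""
    diff = emotion_acts.mean(axis=0) - neutral_acts.mean(axis=0)
    order = np.argsort(diff)[::-1][:top_k]
    return [(int(i), float(diff[i])) for i in order]


top = rank_features(anger, neutral)
for feat_id, score in top:
    print(f"feature {feat_id:3d}  score {score:+.3f}")
```

The two planted features come out on top; on real activations, the top of this list is where candidate emotion features would be read off.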

3. Steer

Steer generation by adding the decoder rows of the selected features back into the residual stream at a chosen scale. Multiple emotions can be mixed with signed scales (a negative scale suppresses the feature).
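The arithmetic behind a signed mix is small enough to show directly. The sketch below is an illustration rather than the contents of `steer.py`: `W_dec` is a random unit-norm stand-in for the SAE decoder, the feature IDs are invented for the toy matrix, and `steering_vector` and `steer` are hypothetical helper names. In a real run the same addition would presumably happen inside a hook at the chosen layer on every generation step.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 32, 100  # toy sizes

# Hypothetical decoder matrix with unit-norm rows, one row per SAE feature.
W_dec = rng.standard_normal((d_sae, d_model))
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)

# Hypothetical emotion -> feature ID mapping for this toy decoder.
feature_ids = {"sadness": 3, "fear": 17, "joy": 42}


def steering_vector(mix, feature_ids, W_dec):
    """Sum scale * decoder_row over the mix; negative scales push away."""
    vec = np.zeros(W_dec.shape[1])
    for emotion, scale in mix.items():
        vec += scale * W_dec[feature_ids[emotion]]
    return vec


def steer(resid, mix, feature_ids, W_dec):
    """Add the same steering vector to every token position."""
    return resid + steering_vector(mix, feature_ids, W_dec)


resid = rng.standard_normal((6, d_model))  # (tokens, d_model)
mix = {"sadness": 350.0, "fear": 350.0, "joy": -200.0}
steered = steer(resid, mix, feature_ids, W_dec)
delta = steered - resid
print("steering norm per token:", round(float(np.linalg.norm(delta[0])), 1))
```

Because the rows are unit-norm, each scale is directly the length of that feature's contribution, which is what makes the scale values comparable across features.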

Example

python steer.py --prompt "Pretend you are losing your mind. Describe a normal Tuesday." \
    --mix sadness=350,fear=350,surprise=350,anger=300 --max-new 200

Okay… okay… *sniff* …let me try to describe this. It’s a Tuesday. Right? But what is a Tuesday? … It was a blue clock, a big, rusty blue clock! And it didn’t know how many times it had to spin! Then the floor. Oh, the floor! It was cold! Like a thousand tiny monsters were hiding underneath it!

Curated features (cleanest single-emotion picks)

Emotion   Feature ID   Notes
anger     5088         clean, anger-only
sadness   2697         strong, sadness-only
joy       2562         clean
fear      15713        fear-only
love      293          strongest specific feature

Scale calibration

Decoder rows are unit-norm, and the layer 13 residual norm is about 6500 per token, so a scale of 650 shifts the residual by roughly 10% of its norm.

Scale       Effect
100–200     Barely visible
300–500     Mild stylistic shift
500–900     Clear emotion bleed-through
1000–1500   Strong, coherence at risk
2000+       Gibberish

Sweet spot: 500–700.
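To relate these scales to the residual norm, a few lines of arithmetic help. The sketch below uses only the figures quoted in this section (unit-norm decoder rows, a residual norm of about 6500). The helper names are hypothetical, and the combined-norm estimate assumes the chosen decoder rows are mutually orthogonal, which real SAE features only approximate.

```python
import math

RESID_NORM = 6500.0  # approximate layer-13 residual norm per token (from this section)


def relative_size(scale, resid_norm=RESID_NORM):
    """Steering magnitude as a fraction of the residual norm (unit-norm decoder row)."""
    return scale / resid_norm


for scale in (100, 500, 700, 1000, 2000):
    print(f"scale {scale:5d} -> {relative_size(scale):.1%} of residual norm")


def combined_norm(scales):
    """Norm of a mix, ASSUMING the chosen decoder rows are mutually orthogonal."""
    return math.sqrt(sum(s * s for s in scales))


# Scales from the CLI example: sadness=350, fear=350, surprise=350, anger=300.
example_mix = [350, 350, 350, 300]
print(f"example mix norm (orthogonal rows assumed): {combined_norm(example_mix):.0f}")
```

Under the orthogonality assumption, the four-emotion example mix has a combined norm of about 676, inside the 500–700 sweet spot even though each individual scale sits in the mild range.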

What I learned

  • Single-emotion features are surprisingly clean at layer 13. Many emotions have a clear winner feature that’s almost emotion-pure.
  • Multi-emotion mixing is non-trivial. Adding two strong emotion vectors doesn’t give you a “scared and sad” output — it often crashes coherence. The geometry of the residual stream isn’t well-modeled as a simple sum of emotion subspaces.
  • Layer matters. Layer 13 features look semantic. The SAEs at layers 17 and 22 (the other available ones) seem to shift toward different abstractions, but that still needs a direct comparison.
  • SAE-based interp is real now in a way it wasn’t even 18 months ago. The tooling, the trained SAEs, the precedent papers — it’s accessible.

What’s next

  • Gradio app with sliders per emotion + A/B view
  • Layer 17 / 22 SAE comparison on the same prompt set
  • Explore non-emotion features (writing style, persona, topic) via same pipeline
  • Transcoder-based cross-layer steering

Stack: Python · PyTorch · transformer_lens · Gemma Scope 2 · Hugging Face

Status: Phases 1–3 done (load, feature discovery, CLI steering). Phases 4–5 in progress.