mindweather | Aaron Rock Menezes

Bend a language model’s emotional weather.

Why

Sparse Autoencoders (SAEs) have become a central tool in mechanistic interpretability: they decompose a model’s internal activations into a sparse, hopefully interpretable feature basis. Gemma Scope 2 (Google DeepMind, 2025) released open SAEs trained on Gemma 3, making it possible for anyone with a GPU to do real interp work on a real frontier-quality model.

mindweather is my tinkering project on top of Gemma Scope 2: find emotion-specific features in Gemma 3’s residual stream and use them to steer generation. The goal isn’t a product — it’s to build hands-on intuition for what SAE features actually are, how clean they are, and what “steering” looks like in practice.

Pipeline

1. Load + sanity check

Gemma 3-1B-IT plus the SAE, running on MPS (MacBook M4). Verify that the SAE’s reconstruction MSE is low, and capture residuals via PyTorch forward hooks.
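A minimal sketch of what this sanity check can look like. It does not load Gemma or a Gemma Scope SAE: `ToyBlock` and `ToySAE` are hypothetical stand-ins (a plain ReLU autoencoder with random weights), so the numbers it prints only demonstrate the mechanics of capturing a residual with a forward hook and scoring the reconstruction.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class ToyBlock(nn.Module):
    """Stand-in for one transformer block; output is shaped like a residual stream."""
    def __init__(self, d_model):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, x):
        return x + self.proj(x)

class ToySAE(nn.Module):
    """Hypothetical ReLU SAE (assumption: the real SAE has its own architecture and weights)."""
    def __init__(self, d_model, n_features):
        super().__init__()
        self.enc = nn.Linear(d_model, n_features)
        self.dec = nn.Linear(n_features, d_model)

    def encode(self, x):
        return torch.relu(self.enc(x))

    def decode(self, f):
        return self.dec(f)

d_model, n_features = 16, 64
block = ToyBlock(d_model)
sae = ToySAE(d_model, n_features)

captured = {}

def save_residual(module, inputs, output):
    # With a real decoder layer the output may be a tuple; here it is a tensor.
    captured["resid"] = output.detach()

handle = block.register_forward_hook(save_residual)
with torch.no_grad():
    block(torch.randn(2, 5, d_model))  # (batch, seq, d_model)
handle.remove()  # always remove hooks once the capture is done

resid = captured["resid"]
with torch.no_grad():
    recon = sae.decode(sae.encode(resid))

mse = torch.mean((resid - recon) ** 2).item()
# Dividing by the variance gives a rough, scale-free "fraction of variance unexplained".
fvu = mse / resid.var().item()
print(f"MSE={mse:.4f}  FVU={fvu:.4f}")
```

Because the toy SAE is untrained, its error will be large; with a trained SAE the same check should give a small normalised error. Raw MSE depends on the residual scale, so the normalised figure is usually the easier one to read.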

2. Find emotion features

Run 12 prompts per emotion (anger, sadness, joy, fear, disgust, surprise, love, anxiety) + 20 neutral prompts through the model, dump residuals at layer 13, encode through the SAE, and rank features by mean(emotion) − mean(neutral). Output: ranked feature IDs per emotion.
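The ranking step can be sketched as follows. The activation tensors here are random stand-ins for SAE-encoded residuals (one row per prompt, for example after averaging over tokens, which is an assumption about the pooling), and `rank_features` is a hypothetical helper name, not a function from the project.

```python
import torch

torch.manual_seed(0)
n_features = 32

# Stand-ins for SAE feature activations, shape (n_prompts, n_features).
neutral_acts = torch.rand(20, n_features)   # 20 neutral prompts
emotion_acts = torch.rand(12, n_features)   # 12 prompts for one emotion
emotion_acts[:, 7] += 2.0                   # plant a feature that fires on the emotion set

def rank_features(emotion_acts, neutral_acts, top_k=5):
    """Rank features by mean(emotion) - mean(neutral), largest difference first."""
    diff = emotion_acts.mean(dim=0) - neutral_acts.mean(dim=0)
    scores, ids = torch.topk(diff, k=top_k)
    return list(zip(ids.tolist(), scores.tolist()))

ranking = rank_features(emotion_acts, neutral_acts)
for feat_id, score in ranking:
    print(f"feature {feat_id:>3}  diff={score:+.3f}")
```

A plain mean difference favours features with large activations overall, so a feature that ranks highly for several emotions is probably a general "emotional text" feature rather than an emotion-specific one.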

3. Steer

Steer generation by adding the decoder rows of selected features back into the residual at a chosen scale. Multiple emotions can be mixed using signed scales (a negative scale suppresses the feature).
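A rough sketch of the steering arithmetic, assuming a decoder matrix with one unit-norm row per feature. `W_dec`, `build_steering_vector`, and `make_steering_hook` are illustrative names, the decoder is random, and `nn.Identity` stands in for the steered layer, so this shows the mechanism only, not the project's actual code.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, n_features = 16, 64

# Hypothetical decoder matrix: one unit-norm row per SAE feature.
W_dec = torch.randn(n_features, d_model)
W_dec = W_dec / W_dec.norm(dim=1, keepdim=True)

def build_steering_vector(W_dec, mix):
    """mix maps feature id -> signed scale; a negative scale pushes away from the feature."""
    vec = torch.zeros(W_dec.shape[1])
    for feat_id, scale in mix.items():
        vec += scale * W_dec[feat_id]
    return vec

def make_steering_hook(vec):
    def hook(module, inputs, output):
        # Returning a value from a forward hook replaces the module's output.
        return output + vec.to(output.dtype)
    return hook

layer = nn.Identity()            # stand-in for the block at the steered layer
mix = {3: 5.0, 10: -2.0}         # toy feature ids and scales
vec = build_steering_vector(W_dec, mix)

x = torch.randn(1, 4, d_model)   # (batch, seq, d_model)
handle = layer.register_forward_hook(make_steering_hook(vec))
steered = layer(x)               # every token position gets the same offset
handle.remove()
plain = layer(x)                 # hook removed: output is unchanged again

print((steered - plain).norm(dim=-1))
```

In a real run the hook would stay registered for the whole generation call, and the scales would be in the hundreds rather than single digits because the residual norm is much larger than in this toy.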

Example

python steer.py --prompt "Pretend you are losing your mind. Describe a normal Tuesday." \
    --mix sadness=350,fear=350,surprise=350,anger=300 --max-new 200

Okay… okay… *sniff* …let me try to describe this. It’s a Tuesday. Right? But what is a Tuesday? … It was a blue clock, a big, rusty blue clock! And it didn’t know how many times it had to spin! Then the floor. Oh, the floor! It was cold! Like a thousand tiny monsters were hiding underneath it!

Curated features (cleanest single-emotion picks)

Emotion	Feat ID	Notes
anger	5088	clean anger-only
sadness	2697	strong, sadness-only
joy	2562	clean
fear	15713	fear-only
love	293	strongest specific feat

Scale calibration

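Decoder rows are unit-norm, so for a single feature the scale is the norm of the vector added to the residual. The layer 13 residual norm is roughly 6500 per token.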

Scale	Effect
100-200	Barely visible
300-500	Mild stylistic shift
500-900	Clear emotion bleed-through
1000-1500	Strong, coherence at risk
2000+	Gibberish

Sweet spot: 500–700.
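To put these scales in context, here is a small calculation of how large a single steering vector is relative to the residual. It assumes the two figures quoted above (unit-norm decoder rows and a residual norm of about 6500); `relative_size` is a hypothetical helper.

```python
RESID_NORM = 6500.0  # approximate layer-13 residual norm per token, as quoted above

def relative_size(scale, resid_norm=RESID_NORM):
    """Norm of a single-feature steering vector as a fraction of the residual norm."""
    return scale / resid_norm

for scale in (100, 300, 500, 700, 1000, 2000):
    print(f"scale {scale:>5}: {relative_size(scale):6.1%} of the residual norm")
```

On these numbers the 500 to 700 range corresponds to a push of roughly 8% to 11% of the residual norm, and 2000 is about 31%.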

What I learned

Single-emotion features are surprisingly clean at layer 13. Many emotions have a clear winner feature that’s almost emotion-pure.
Multi-emotion mixing is non-trivial. Adding two strong emotion vectors doesn’t give you a “scared and sad” output; it often crashes coherence. This suggests the residual stream isn’t well modeled as a simple sum of emotion subspaces.
Layer matters. Layer 13 features look semantic. Layers 17 and 22 (other available SAEs) seem to shift toward different abstractions, but this needs a direct comparison.
SAE-based interp is real now in a way it wasn’t even 18 months ago. The tooling, the trained SAEs, the precedent papers — it’s accessible.
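On the mixing point above, one check that may be worth running is the total norm of a mix, since several moderate scales can add up to one large push. The sketch below is plain arithmetic under a strong simplifying assumption: all decoder rows are unit-norm and share a single pairwise cosine (`pairwise_cos`, a made-up parameter). Real decoder rows would need their cosines measured.

```python
import math
from itertools import combinations

def combined_norm(scales, pairwise_cos=0.0):
    """Norm of sum_i s_i * d_i for unit-norm d_i that all share one pairwise cosine."""
    squares = sum(s * s for s in scales)
    cross = sum(2 * a * b * pairwise_cos for a, b in combinations(scales, 2))
    return math.sqrt(squares + cross)

# Scales taken from the CLI example earlier in this page.
mix = {"sadness": 350, "fear": 350, "surprise": 350, "anger": 300}
scales = list(mix.values())

for cos in (0.0, 0.3, 1.0):
    print(f"pairwise cosine {cos:.1f}: combined norm {combined_norm(scales, cos):.0f}")
```

If the rows were exactly orthogonal, this four-way mix would have a combined norm of about 676. Any positive overlap between the emotion features pushes it higher, up to 1350 if they all pointed the same way. This does not explain the coherence loss on its own, but it helps separate "the mix is simply too large" from "the directions interact badly".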

What’s next

Gradio app with sliders per emotion + A/B view
Layer 17 / 22 SAE comparison on the same prompt set
Explore non-emotion features (writing style, persona, topic) via the same pipeline
Transcoder-based cross-layer steering

Stack: Python · PyTorch · transformer_lens · Gemma Scope 2 · HuggingFace

Status: Phases 1–3 done (load, feature discovery, CLI steering). Phases 4–5 in progress.