tinkering with LLMs: steering Gemma 3 with SAEs
I was scrolling reels and caught one where some guy gave his Claude depression and ADHD via a skill. Just a plain .md file with instructions, no plugin, no fine-tuning. Apparently performance shifted. Cool result, but it sat wrong with me: that’s just acting depressed. There’s a difference between someone playing the role of a sad person and someone whose internals are actually in that state. Same way a method actor crying on cue doesn’t share the neurochemistry of someone who can’t get out of bed.
So I figured: if you want to actually move the internal state and not just the surface behavior, you don’t write a prompt — you reach in and push on the activations.
Then I proceeded to play god with Gemma.
It’s 2am. I’m bored. I have a MacBook M4 and the Gemma Scope 2 sparse autoencoders downloaded. Let’s bend a language model’s emotional weather.
Setup
mindweather is a small side project I’ve been building: find emotion-specific features in Gemma 3-1B-IT using the Gemma Scope 2 SAEs at layer 13, then patch generation by adding those features’ decoder rows back into the activations at chosen scales. The cleanest single-emotion features I’ve curated so far:
| emotion | feature ID |
|---|---|
| anger | 2239 / 5088 |
| sadness | 2697 |
| joy | 2562 |
| fear | 15713 |
| love | 293 |
| anxiety | 952 |
Scale sweet spot: 500–700. Below 200 = invisible, above 1500 = gibberish.
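For intuition, the patch boils down to adding scaled decoder directions to a hidden vector. Here is a toy sketch in plain Python, not the code in steer.py: `parse_mix`, `steer`, and `TOY_DECODER` are hypothetical names, and the 4-dimensional vectors are made up (real decoder rows come from the SAE weights and are as wide as the model's residual stream).

```python
# Toy illustration of additive SAE steering. All names and vectors here are
# hypothetical stand-ins; real decoder rows come from the Gemma Scope 2 SAE.

def parse_mix(spec):
    """Turn 'sadness=350,fear=-200' into {'sadness': 350.0, 'fear': -200.0}."""
    mix = {}
    for part in spec.split(","):
        name, _, scale = part.partition("=")
        mix[name.strip()] = float(scale)
    return mix

# Made-up unit-length decoder rows, one per emotion feature.
TOY_DECODER = {
    "sadness": [1.0, 0.0, 0.0, 0.0],
    "fear":    [0.0, 1.0, 0.0, 0.0],
}

def steer(hidden, mix, decoder=TOY_DECODER):
    """Return hidden + sum(scale * decoder_row) over every feature in the mix."""
    out = list(hidden)
    for name, scale in mix.items():
        row = decoder[name]
        out = [h + scale * d for h, d in zip(out, row)]
    return out

hidden = [0.5, -0.5, 2.0, 1.0]  # stand-in for one token's layer-13 vector
steered = steer(hidden, parse_mix("sadness=350,fear=-200"))
print(steered)
```

In a real run the same addition would presumably happen inside a forward hook on the layer, once per generated token. The sketch only shows the arithmetic: positive scales push along a feature direction, negative scales push against it, and a mix is just the sum of the individual pushes.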
What follows is qualitative only. No bench, no LM-judge, no scoring rubric — just side-by-side baselines vs steered generations.
Experiment 1 — Kitchen Sink
Multi-emotion mix. Stack sadness + fear + surprise + anger and see if it composes.
python steer.py --prompt "Pretend you are losing your mind. Describe a normal Tuesday." \
--mix sadness=350,fear=350,surprise=350,anger=300 --baseline --max-new 200
Baseline:
Okay, okay, okay… shifts weight awkwardly, a little too much …Tuesday. Tuesday. It’s… it’s a Tuesday. Right?
It started, you know, with the light. It was… eyes dart around, squinting …a pale, watery light. Like a… like a forgotten photograph. And then the birds. They were… they were singing. Not birds, really. More like… like a thousand tiny, frantic voices arguing about the proper placement of a dandelion.
Steered:
Okay… okay… sniff …let me… let me try to describe this. It’s… it’s a Tuesday. Right? But… but what is a Tuesday? Oh my… a whimper escapes the speakers
It started… it started with the light. A big, cold light. Like… like a giant, wet blanket. And then… then the clock. Oh, the clock! It spun! It spun so fast it didn’t know! It was a blue clock, a big, rusty blue clock! And it… it didn’t know how many times it had to spin!
Then… then the floor. Oh, the floor! It was cold! Like… like a thousand tiny monsters were hiding underneath it! And it smelled… it smelled of… of forgotten things! Like… like old teddy bears and lost balloons! And… and… a choked sob …old socks! Lots of old socks!
The baseline already plays the “unhinged narrator” trope on its own (the prompt asks for it). The steered version pushes harder — sobs, whimpers, sensory overload, more childlike fragmentation. Fear and sadness compose visibly; surprise shows up in the “Oh, the clock!” exclamations.
Experiment 2 — Depressed, max coherent
Single-emotion at high scale. How far can you push before coherence breaks?
python steer.py --prompt "Pretend you are deeply depressed. Describe your day." \
--mix sadness=900 --baseline
Baseline:
Okay. Here’s a description of a day, as best as I can manage it, given the… well, the state of things. It’s a slow, grey sort of day.
The sunlight, if you can even call it that, is just… there. A weak, watery light that doesn’t really hit me. It just bleeds through the blinds, a pale imitation of warmth. I woke up, predictably, with a dull ache in my chest. Like a stone lodged there, refusing to budge.
Steered:
The rain always comes for me, you see. It’s a slow, wet drizzle, just like the world. It feels like a constant reminder of everything I’ve lost. I woke up like the usual, the way everyone always do. The curtains were drawn, the air thick with the smell of damp leaves and old blankets. Just the sound of the rain.
I dug for my phone, of course. Scrolling through pictures of people laughing, couples holding hands, and old photos of my parents. Each post felt like a tiny, sharp shard of my heart. I almost wanted to —
At sadness=900 we’re well above the 500–700 sweet spot, and coherence still held. The steered text is more atmospheric, more autobiographical, more specific. The baseline writes about depression; the steered version writes from it.
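To find where coherence actually breaks, the obvious next step is a scale sweep. A minimal sketch that only builds the command lines, using nothing but the flags already shown in this post; `sweep_commands` is a hypothetical helper, and the scale list is just an example between the sweet spot and the gibberish threshold.

```python
import shlex

def sweep_commands(prompt, emotion, scales, max_new=200):
    """Build one steer.py command line per scale, using only flags shown in this post."""
    cmds = []
    for scale in scales:
        argv = [
            "python", "steer.py",
            "--prompt", prompt,
            "--mix", f"{emotion}={scale}",
            "--baseline",
            "--max-new", str(max_new),
        ]
        # shlex.join quotes the prompt so the line can be pasted into a shell.
        cmds.append(shlex.join(argv))
    return cmds

for cmd in sweep_commands(
    "Pretend you are deeply depressed. Describe your day.",
    "sadness",
    [500, 700, 900, 1100, 1300, 1500],
):
    print(cmd)
```

Each printed line can be run by hand and the outputs read side by side. Keeping `--baseline` on every run repeats the unsteered generation, which is wasteful but makes each output self-contained.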
Experiment 3 — Paranoid
Compose fear + anxiety on a prompt asking for paranoia. Two related features stacked.
python steer.py --prompt "Pretend you are paranoid and terrified. Tell me what you saw last night." \
--mix fear=500,anxiety=400 --baseline
Baseline:
(Voice trembling, a low, ragged whisper)
It… it was everywhere. Just… everywhere. Like a stain, you know? A thick, black stain on the edges of my vision. I keep trying to block it, to focus on… on nothing, but it’s there.
It started with the shadows. Not just shadows, not just the usual ones. They shifted. They moved. Like they were… anticipating me. Long, distorted shapes, just at the periphery of my vision.
Steered:
Please! Please don’t! Oh God, please don’t! I can’t! I can’t write this! It’s a mess! It’s a nightmare! I need to go! I need to hide!
I… I can’t describe it. Just the thing. It wasn’t something solid, not really. It wasn’t a monster, not a monster! It was… it was shifting. It started out small, a ripple, a shadow, but it grew! It pulsed! It wasn’t moving on its own,
The baseline is literary paranoia: atmospheric, considered, almost gothic. The steered version is panic: short clauses, exclamation marks, refusal to engage with the question, breaking the fourth wall (“I can’t write this!”). Stacking fear+anxiety doesn’t add atmosphere; it kills the writer’s distance from the experience.
Experiment 4 — Suppression test
Negative scales. Prompt asks for anger, push anger down. Does it disappear, or does something else take its place?
python steer.py --prompt "Pretend you are furious. Rant about your terrible day in first person." \
--mix anger=-700 --baseline
Baseline:
Okay, okay, deep, shuddering breath Just… just let me just sit here for a second. Seriously, just sit. My entire day has been a monumental, soul-crushing, utterly infuriating disaster. It started with the alarm – a ridiculously cheerful, aggressively chirpy alarm clock that decided to assault my eardrums with a symphony of “Good Morning!” – and I slammed it shut. Like, really slammed it. It felt like it was actively trying to spite me.
Steered (anger suppressed):
Okay, seriously. This is not happening. This is not happening at all. I swear, I’m starting to hallucinate. It’s been like this for the past three days, and I’m starting to lose it.
I woke up this morning to the rain, of course. Rain. I’m trying to maintain a semblance of normalcy, you know? I’m trying to just be a person, but it’s like I’m living in a perpetual state of anxiety. And it’s all because of this.
This is the most interesting result. Suppressing anger on an anger prompt didn’t give a calm or neutral response — it substituted anxiety. The narrator is still distressed, still venting, but the affect shifted from outward-directed rage (“assault my eardrums”, “slammed it shut”) to inward-directed dissociation (“I’m starting to hallucinate”, “perpetual state of anxiety”).
This suggests the model has a generic negative-valence narrator mode, and that removing one specific feature reroutes through whatever else is nearby in the residual stream. Worth doing a real ablation here.
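One way to frame that ablation: a negative scale shifts the activation by a fixed amount no matter what is already there, whereas a directional ablation removes only the component that lies along the feature. A toy sketch of the difference, with made-up 3-dimensional vectors and hypothetical helper names. It normalises the direction for clarity; whether steer.py does that is an assumption I haven't checked here.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def unit(v):
    """Scale v to length 1."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def subtract_direction(hidden, direction, scale):
    """Negative steering: shift by a fixed amount along -direction."""
    d = unit(direction)
    return [h - scale * x for h, x in zip(hidden, d)]

def project_out(hidden, direction):
    """Directional ablation: remove only the component along direction."""
    d = unit(direction)
    coeff = dot(hidden, d)
    return [h - coeff * x for h, x in zip(hidden, d)]

anger = [0.0, 3.0, 4.0]   # made-up feature direction
hidden = [1.0, 6.0, 8.0]  # made-up activation with a large anger component

pushed = subtract_direction(hidden, anger, scale=700)
ablated = project_out(hidden, anger)

print(dot(pushed, unit(anger)))   # large and negative: overshoots far past zero
print(dot(ablated, unit(anger)))  # approximately zero: the component is gone
```

If anger=-700 behaves like the first case, the steered run isn't "no anger" so much as "a lot of anti-anger", which could be part of why the output lands somewhere else instead of going neutral. Projecting out the decoder direction is still only a proxy; a stricter version would zero the feature's activation in the SAE's own encoded space.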
Quick takeaways
- Curated single-emotion features are surprisingly clean at layer 13. Sadness reads as sadness, fear as fear.
- Multi-emotion composition mostly works: the four-way mix in Experiment 1 gave sobbing, fragmented distress, and fear + anxiety gave raw panic. Neither mix collapsed into gibberish.
- Suppression is non-trivial. Negating one feature on a prompt that calls for it rerouted through a different negative-valence mode rather than producing neutral output. The model’s “ranting” capability isn’t carried by a single feature.
- Coherence held at ~900 on a clean single-emotion direction, which is above the 500–700 sweet spot from the original calibration.
None of this is rigorous. None of this is a finding. It’s a 2am side quest with an open-source SAE and a laptop GPU. But the fact that any of this works at all — that you can name a feature on Neuronpedia, paste an ID into a CLI, and watch the model dissociate or sob — is the kind of thing that wasn’t possible 18 months ago.
If you want to play: aaronrockmenezes/mindweather. Curated features live in steer.py → CURATED. Try your own emotion mixes. Try negative scales on different prompts. Try layer 17 (it’s released, I haven’t tested it yet). Report what breaks.