Mixture of Experts for NER | Aaron Rock Menezes

Why I built it

Mixture of Experts (MoE) was having a moment in 2024: Mixtral, GLaM, and Switch Transformer had all shown that sparsely routed models can scale more efficiently than dense ones. Most of that discussion was at the billion-parameter scale. I wanted to answer two questions:

Can I get a from-scratch implementation right? That means routing, load balancing, and sparse dispatch: the details that papers gloss over.
Does MoE help at small scale? Would a sparse MoE layer give meaningful gains to a standard sequence labeling model (BiLSTM + CRF) on CoNLL 2003 NER?

Implementation

MoE layer (from scratch in PyTorch)

N parallel expert MLPs (N is configurable)
Top-2 gating — each token routed to its 2 highest-scoring experts
Load-balancing auxiliary loss to prevent expert collapse (a small number of experts hogging all tokens)
Sparse dispatch — only the top-k experts actually compute for each token
Drop-in replacement for the FFN block in a standard Transformer layer, or for the feed-forward layer that follows a BiLSTM encoder
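The pieces listed above can be sketched roughly as follows. This is a minimal illustration, not the code from the repository: the class name `Top2MoE`, the helper `load_balancing_loss`, the GELU expert MLPs, and the default sizes are all assumptions. The auxiliary loss follows the Switch-Transformer-style form (number of experts times the sum over experts of dispatch share times mean router probability), which may differ from the variant actually used.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def load_balancing_loss(probs: torch.Tensor, top_idx: torch.Tensor) -> torch.Tensor:
    """Switch-Transformer-style auxiliary loss: num_experts * sum_e f_e * P_e."""
    num_experts = probs.shape[-1]
    # f_e: share of routing slots assigned to expert e (no gradient flows here).
    slots = F.one_hot(top_idx, num_experts).float().sum(dim=1)  # (tokens, experts)
    f = slots.mean(dim=0) / top_idx.shape[-1]
    # P_e: mean router probability for expert e (this term carries the gradient).
    p = probs.mean(dim=0)
    return num_experts * torch.sum(f * p)


class Top2MoE(nn.Module):
    """Hypothetical sparse MoE feed-forward layer with top-k gating (k=2 by default)."""

    def __init__(self, d_model: int, d_hidden: int, num_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor):
        # x: (batch, seq_len, d_model); flatten so routing is decided per token.
        tokens = x.reshape(-1, x.shape[-1])
        probs = F.softmax(self.router(tokens), dim=-1)        # (tokens, experts)
        top_p, top_idx = probs.topk(self.top_k, dim=-1)       # (tokens, k)
        top_p = top_p / top_p.sum(dim=-1, keepdim=True)       # renormalise over chosen experts

        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            token_ids, slot = (top_idx == e).nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue  # sparse dispatch: an expert with no tokens does no work
            weight = top_p[token_ids, slot].unsqueeze(-1)
            out.index_add_(0, token_ids, weight * expert(tokens[token_ids]))

        aux_loss = load_balancing_loss(probs, top_idx)
        return out.reshape(x.shape), aux_loss, top_idx
```

In training, the auxiliary term would be added to the task loss as something like `task_loss + aux_weight * aux_loss`, where `aux_weight` is a hypothetical tunable hyperparameter. With this normalisation the loss is 1.0 under perfectly uniform routing and rises to 2.0 when all tokens collapse onto two of four experts.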

Sequence model

BiLSTM encoder
MoE layer replacing the standard FFN
CRF decoder for structured prediction
Trained on CoNLL 2003 English NER (PER, ORG, LOC, MISC)
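To make the wiring concrete, here is a rough sketch of how such a tagger could be laid out. It is an assumption-heavy illustration: the class name `BiLSTMTagger`, the layer sizes, and the `ffn` argument are hypothetical, a dense MLP stands in for the MoE layer, and the CRF decoder is left out (the per-token emission scores returned here are what a CRF layer would consume).

```python
import torch
import torch.nn as nn


class BiLSTMTagger(nn.Module):
    """Hypothetical BiLSTM tagger with a swappable feed-forward slot."""

    def __init__(self, vocab_size: int, num_tags: int, d_embed: int = 50,
                 d_lstm: int = 64, ffn: nn.Module = None):
        super().__init__()
        d_model = 2 * d_lstm  # forward and backward LSTM states are concatenated
        self.embed = nn.Embedding(vocab_size, d_embed, padding_idx=0)
        self.lstm = nn.LSTM(d_embed, d_lstm, batch_first=True, bidirectional=True)
        # Slot where an MoE layer would go. Any module mapping
        # (batch, seq_len, d_model) -> (batch, seq_len, d_model) fits here;
        # a dense MLP is used as the stand-in baseline.
        self.ffn = ffn if ffn is not None else nn.Sequential(
            nn.Linear(d_model, 2 * d_model), nn.GELU(), nn.Linear(2 * d_model, d_model)
        )
        self.emit = nn.Linear(d_model, num_tags)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        h, _ = self.lstm(self.embed(token_ids))  # (batch, seq_len, d_model)
        h = self.ffn(h)
        return self.emit(h)                      # emission scores for a CRF decoder


# CoNLL 2003 with BIO tags: 4 entity types x {B, I} + O = 9 tags.
model = BiLSTMTagger(vocab_size=100, num_tags=9)
emissions = model(torch.randint(1, 100, (2, 7)))
```

An MoE layer that also returns an auxiliary loss would need a thin wrapper, or a small change to `forward`, so that the extra loss term reaches the training loop.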

Results

Metric	Baseline (BiLSTM + CRF)	MoE-BiLSTM	Δ
Accuracy	89.4%	improved	+12%
F1 Score	0.71	0.94	+32%

What I learned

Load balancing is critical. Without it, one or two experts dominate routing and the rest never train. The auxiliary loss weight needs careful tuning (too high → uniform routing, no specialization; too low → collapse).
Experts do specialize. Visualizing gate weights showed clear preferences: some experts fired predominantly on PER tokens (person names), others on ORG / LOC. The auxiliary loss didn’t kill specialization; it just prevented degenerate collapse.
MoE worked even at small scale, at least on this task. The win wasn’t capacity (with sparse routing, the MoE layer doesn’t activate meaningfully more params per token at inference) but inductive bias: forcing the model to discover token-type-specific feature subspaces.
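One simple way to check for the kind of specialization described above is to tabulate, per entity type, how routing slots are spread across experts. The sketch below is an illustration only: the function name `expert_usage_by_tag` is hypothetical, and the toy routing data is made up to show the output shape, not taken from the actual experiment.

```python
from collections import Counter, defaultdict


def expert_usage_by_tag(top_idx, tags, num_experts):
    """Return {entity_type: [share of routing slots going to each expert]}.

    top_idx: per-token lists of selected expert ids (e.g. tensor.tolist()).
    tags:    per-token BIO tags such as "B-PER", "I-ORG", "O".
    """
    counts = defaultdict(Counter)
    for experts, tag in zip(top_idx, tags):
        entity = tag.split("-", 1)[-1]  # "B-PER" -> "PER", "O" -> "O"
        counts[entity].update(experts)
    usage = {}
    for entity, c in counts.items():
        total = sum(c.values())
        usage[entity] = [c[e] / total for e in range(num_experts)]
    return usage


# Made-up routing decisions for five tokens with top-2 gating over 4 experts.
toy_top_idx = [[0, 1], [0, 1], [2, 3], [2, 0], [3, 1]]
toy_tags = ["B-PER", "I-PER", "B-ORG", "B-LOC", "O"]
usage = expert_usage_by_tag(toy_top_idx, toy_tags, num_experts=4)
```

A row that is far from uniform for a given entity type suggests that some experts prefer that type; rows that look the same across all types suggest little specialization.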

What came of it

Mostly a learning exercise — implementing MoE from scratch built the muscle for reading later MoE papers (Mixtral, DeepSeek-V2/V3) with way more depth. Code is on GitHub for anyone who wants a clean reference implementation.

Stack: Python · PyTorch · HuggingFace Datasets · seqeval

Period: April – May 2024