Mixture of Experts for NER
MoE layer from scratch + BiLSTM for CoNLL 2003 NER: 32% relative F1 improvement (0.71 → 0.94)
Why I built it
Mixture of Experts (MoE) was having a moment in 2024: Mixtral, GLaM, and Switch Transformer had all shown that sparse routing lets a model grow its parameter count without a matching growth in per-token compute. Most of the discussion was at the billion-parameter scale. I wanted to understand two things:
- Can you get the from-scratch implementation right? Routing, load balancing, sparse dispatch — the details that papers gloss over
- Does MoE help at small scale? Standard sequence labeling models (BiLSTM + CRF) on CoNLL 2003 NER — would a sparse MoE layer give meaningful gains there?
Implementation
MoE layer (from scratch in PyTorch)
- k parallel expert MLPs (k configurable)
- Top-2 gating: each token is routed to its 2 highest-scoring experts
- Load-balancing auxiliary loss to prevent expert collapse (a small number of experts hogging all tokens)
- Sparse dispatch — only the top-k experts actually compute for each token
- Drop-in replacement for the FFN block in a standard Transformer, or for the feed-forward layer on top of a BiLSTM encoder
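The pieces in this list can be sketched in a few dozen lines of PyTorch. This is a minimal illustration rather than the code from the repo: the class name `MoELayer`, the hidden size `d_hidden`, the ReLU experts, and the Switch-Transformer-style form of the balance loss are all assumptions made for the example, and padding tokens are not masked out.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoELayer(nn.Module):
    """Sparse top-k mixture of expert MLPs (illustrative sketch)."""

    def __init__(self, d_model, d_hidden, num_experts=4, top_k=2):
        super().__init__()
        self.num_experts, self.top_k = num_experts, top_k
        self.gate = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):
        shape = x.shape
        tokens = x.reshape(-1, shape[-1])                # (N, d_model)
        probs = F.softmax(self.gate(tokens), dim=-1)     # (N, E) router probabilities
        top_p, top_i = probs.topk(self.top_k, dim=-1)    # (N, k) weights and expert ids
        top_p = top_p / top_p.sum(dim=-1, keepdim=True)  # renormalize over chosen experts

        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            # Sparse dispatch: find the tokens that picked expert e in any of their k slots.
            token_idx, slot = (top_i == e).nonzero(as_tuple=True)
            if token_idx.numel() == 0:
                continue  # this expert does no work for this batch
            weighted = top_p[token_idx, slot].unsqueeze(-1) * expert(tokens[token_idx])
            out.index_add_(0, token_idx, weighted)

        # Balance loss (assumed form): E * sum_e f_e * P_e, where f_e is the share of
        # routing slots assigned to expert e and P_e is its mean router probability.
        f = F.one_hot(top_i, self.num_experts).float().sum(dim=1).mean(dim=0) / self.top_k
        P = probs.mean(dim=0)
        aux_loss = self.num_experts * (f * P).sum()
        return out.reshape(shape), aux_loss


torch.manual_seed(0)
layer = MoELayer(d_model=16, d_hidden=32, num_experts=4, top_k=2)
x = torch.randn(2, 5, 16)          # (batch, seq_len, d_model)
y, aux_loss = layer(x)
```

In training, the auxiliary term would be added to the task loss with a small weight, for example `loss = task_loss + aux_weight * aux_loss`, where `aux_weight` is a hypothetical hyperparameter name. With this form of the loss, perfectly uniform routing gives a value of 1 and full collapse onto one expert gives a value equal to the number of experts.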
Sequence model
- BiLSTM encoder
- MoE layer replacing the standard FFN
- CRF decoder for structured prediction
- Trained on CoNLL 2003 English NER (PER, ORG, LOC, MISC)
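To make the wiring concrete, here is a rough sketch of the encoder side: embeddings, a BiLSTM, a pluggable feed-forward block, and a linear layer producing per-tag emission scores. The names (`BiLSTMTagger`, `block`, `emit`) and all sizes are assumptions for illustration. The CRF decoder is left out, and the block here is a plain dense MLP; an MoE layer with the same input and output width would take its place, with any auxiliary loss it returns passed through to the training loop.

```python
import torch
import torch.nn as nn


class BiLSTMTagger(nn.Module):
    """Embedding -> BiLSTM -> feed-forward block -> emission scores (CRF omitted)."""

    def __init__(self, vocab_size, num_tags, d_embed=50, d_lstm=64, block=None):
        super().__init__()
        d_model = 2 * d_lstm  # forward and backward states are concatenated
        self.embed = nn.Embedding(vocab_size, d_embed, padding_idx=0)
        self.lstm = nn.LSTM(d_embed, d_lstm, batch_first=True, bidirectional=True)
        # Dense stand-in for the block that the MoE layer would replace.
        self.block = block if block is not None else nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, d_model)
        )
        self.emit = nn.Linear(d_model, num_tags)

    def forward(self, token_ids):
        h, _ = self.lstm(self.embed(token_ids))  # (batch, seq_len, 2 * d_lstm)
        h = self.block(h)                        # same shape in and out
        return self.emit(h)                      # (batch, seq_len, num_tags)


# CoNLL 2003 in BIO format has 9 tags: O plus B-/I- for PER, ORG, LOC, MISC.
model = BiLSTMTagger(vocab_size=100, num_tags=9)
token_ids = torch.randint(1, 100, (3, 7))  # made-up ids, batch of 3 sentences
scores = model(token_ids)
```

A CRF layer would consume `scores` as its emission potentials and add learned tag-to-tag transition scores on top.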
Results
| Metric | Baseline (BiLSTM + CRF) | MoE-BiLSTM | Δ |
|---|---|---|---|
| Accuracy | 89.4% | improved | +12% |
| F1 Score | 0.71 | 0.94 | +32% (relative) |
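Accuracy and F1 measure different things here, assuming the F1 is the entity-level score that seqeval computes: accuracy counts individual tokens (most of which are O), while F1 only credits an entity when its whole span and type match. The dependency-free sketch below mimics that distinction on a toy example. It is not the project's evaluation code, and the helper names are made up for illustration.

```python
def bio_spans(tags):
    """Return the set of (label, start, end) entity spans in one BIO-tagged sentence."""
    spans, start, label = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # trailing "O" flushes the last span
        inside = tag.startswith("I-") and label == tag[2:]
        if start is not None and not inside:
            spans.add((label, start, i))
            start, label = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, label = i, tag[2:]
    return spans


def token_accuracy(y_true, y_pred):
    pairs = [(t, p) for ts, ps in zip(y_true, y_pred) for t, p in zip(ts, ps)]
    return sum(t == p for t, p in pairs) / len(pairs)


def entity_f1(y_true, y_pred):
    tp = n_true = n_pred = 0
    for ts, ps in zip(y_true, y_pred):
        gold, pred = bio_spans(ts), bio_spans(ps)
        tp += len(gold & pred)  # exact span and type match only
        n_true += len(gold)
        n_pred += len(pred)
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_true
    return 2 * precision * recall / (precision + recall)


y_true = [["O", "O", "O", "B-MISC", "I-MISC", "I-MISC", "O"], ["B-PER", "I-PER", "O"]]
y_pred = [["O", "O", "B-MISC", "I-MISC", "I-MISC", "I-MISC", "O"], ["B-PER", "I-PER", "O"]]

acc = token_accuracy(y_true, y_pred)  # 0.8: 8 of 10 tokens match
f1 = entity_f1(y_true, y_pred)        # 0.5: 1 of 2 entities matches exactly
```

One boundary error costs two tokens of accuracy but a whole entity of F1, which is why the two columns can sit far apart.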
What I learned
- Load balancing is critical. Without it, one or two experts dominate routing and the rest never train. The auxiliary loss weight needs careful tuning (too high → uniform routing, no specialization; too low → collapse).
- Experts do specialize. Visualizing gate weights showed clear preferences — some experts fired predominantly on PER tokens (proper nouns), others on ORG / LOC. The auxiliary loss didn’t kill specialization, it just prevented degenerate collapse.
- MoE worked even at small scale, at least on this task. The win wasn’t capacity (with top-2 routing, the MoE layer doesn’t activate meaningfully more params per token at inference) but inductive bias: forcing the model to discover token-type-specific feature subspaces.
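One simple way to check for the kind of specialization described above is to tabulate, per entity type, which experts the router picked. The sketch below assumes the top-k expert indices have already been pulled out of the gate as plain lists (for example via `.tolist()` on a tensor). The function name and the routing data are made up for illustration and are not results from this project.

```python
from collections import Counter, defaultdict


def expert_share_by_label(labels, top_experts):
    """Share of routing slots each expert receives, grouped by entity type.

    labels: one BIO tag per token; top_experts: chosen expert ids per token.
    """
    counts = defaultdict(Counter)
    for label, experts in zip(labels, top_experts):
        entity = label.split("-", 1)[-1]  # "B-PER" -> "PER", "O" -> "O"
        counts[entity].update(experts)
    return {
        entity: {e: n / sum(c.values()) for e, n in sorted(c.items())}
        for entity, c in counts.items()
    }


labels = ["B-PER", "I-PER", "O", "B-ORG", "O"]
top_experts = [[0, 2], [0, 2], [1, 3], [1, 2], [3, 1]]  # invented top-2 choices
shares = expert_share_by_label(labels, top_experts)
```

A heatmap of `shares` (entity types by experts) makes both failure modes easy to spot: near-identical rows suggest the balance loss is washing out specialization, while a single dominant column suggests collapse.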
What came of it
Mostly a learning exercise — implementing MoE from scratch built the muscle for reading later MoE papers (Mixtral, DeepSeek-V2/V3) with way more depth. Code is on GitHub for anyone who wants a clean reference implementation.
Stack: Python · PyTorch · HuggingFace Datasets · seqeval
Period: April – May 2024