Mixture of Experts for NER

MoE layer from scratch + BiLSTM for CoNLL 2003 NER: 32% relative F1 improvement (0.71 → 0.94)

Why I built it

Mixture of Experts (MoE) was having a moment in 2024, with Mixtral, GLaM, and Switch Transformer all showing that sparse routing can scale more compute-efficiently than dense models. Most of the discussion was at the billion-parameter scale. I wanted to understand:

  1. Can you get the from-scratch implementation right? Routing, load balancing, sparse dispatch — the details that papers gloss over
  2. Does MoE help at small scale? Standard sequence labeling models (BiLSTM + CRF) on CoNLL 2003 NER — would a sparse MoE layer give meaningful gains there?

Implementation

MoE layer (from scratch in PyTorch)

  • N parallel expert MLPs (N is configurable)
  • Top-2 gating: each token is routed to its 2 highest-scoring experts
  • Load-balancing auxiliary loss to prevent expert collapse (a small number of experts hogging all tokens)
  • Sparse dispatch: only the 2 selected experts actually compute for each token
  • Drop-in replacement for a position-wise feed-forward block (the FFN in a Transformer layer, or the feed-forward layer applied to BiLSTM outputs)
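
The pieces listed above fit together roughly as in the sketch below. This is a minimal illustration, not the project's actual code: the class name `MoELayer`, the constructor arguments, the ReLU expert MLPs, and the Switch-Transformer-style form of the auxiliary loss are all assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoELayer(nn.Module):
    """Hypothetical top-k gated mixture of expert MLPs (illustrative names)."""

    def __init__(self, d_model, d_hidden, num_experts=4, top_k=2):
        super().__init__()
        self.num_experts = num_experts
        self.top_k = top_k
        self.gate = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(d_model, d_hidden),
                nn.ReLU(),
                nn.Linear(d_hidden, d_model),
            )
            for _ in range(num_experts)
        ])

    def forward(self, x):
        # x: (batch, seq_len, d_model), flattened to (num_tokens, d_model)
        orig_shape = x.shape
        tokens = x.reshape(-1, orig_shape[-1])

        # Gating: softmax over experts, keep the top-k per token, renormalize.
        probs = F.softmax(self.gate(tokens), dim=-1)        # (T, E)
        top_p, top_idx = probs.topk(self.top_k, dim=-1)     # (T, k)
        top_p = top_p / top_p.sum(dim=-1, keepdim=True)

        # Sparse dispatch: each expert only sees the tokens routed to it.
        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            token_ids, slot = (top_idx == e).nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue
            weight = top_p[token_ids, slot].unsqueeze(-1)
            out.index_add_(0, token_ids, weight * expert(tokens[token_ids]))

        # Load-balancing loss (assumed Switch-style form):
        # num_experts * sum_e(fraction routed to e * mean gate prob of e).
        assigned = F.one_hot(top_idx, self.num_experts).float().sum(dim=1)
        frac_routed = assigned.mean(dim=0) / self.top_k
        mean_prob = probs.mean(dim=0)
        aux_loss = self.num_experts * (frac_routed * mean_prob).sum()

        return out.reshape(orig_shape), aux_loss


torch.manual_seed(0)
layer = MoELayer(d_model=16, d_hidden=32, num_experts=4, top_k=2)
x = torch.randn(2, 5, 16)
y, aux = layer(x)
print(y.shape, aux.item())
```

In training, the auxiliary term would typically be added to the task loss with a small weight, for example `loss = task_loss + aux_weight * aux`, where `aux_weight` is a hypothetical hyperparameter name.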

Sequence model

  • BiLSTM encoder
  • MoE layer on top of the encoder outputs, replacing the standard feed-forward layer
  • CRF decoder for structured prediction
  • Trained on CoNLL 2003 English NER (PER, ORG, LOC, MISC)
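
To make the CRF decoding step in the list above concrete, here is a small Viterbi sketch in plain Python. It assumes per-token emission scores (which in this architecture would come from the layers below the CRF) and a tag-to-tag transition matrix. The three-tag set and all scores are hand-picked for illustration; in a trained CRF the transitions are learned, and this is not the project's decoder.

```python
def viterbi_decode(emissions, transitions):
    """Return the highest-scoring tag index sequence.

    emissions[t][j]   : score of tag j at position t
    transitions[i][j] : score of moving from tag i to tag j
    """
    num_tags = len(transitions)
    score = list(emissions[0])
    backpointers = []
    for emit in emissions[1:]:
        new_score, pointers = [], []
        for j in range(num_tags):
            # Best previous tag for landing on tag j at this position.
            best_i = max(range(num_tags), key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j] + emit[j])
            pointers.append(best_i)
        score = new_score
        backpointers.append(pointers)

    # Walk the backpointers from the best final tag.
    best = max(range(num_tags), key=lambda j: score[j])
    path = [best]
    for pointers in reversed(backpointers):
        best = pointers[best]
        path.append(best)
    path.reverse()
    return path


tags = ["O", "B-PER", "I-PER"]
# Only constraint in this toy setup: O -> I-PER is (almost) forbidden.
transitions = [
    [0.0, 0.0, -1e4],  # from O
    [0.0, 0.0, 0.0],   # from B-PER
    [0.0, 0.0, 0.0],   # from I-PER
]
emissions = [
    [1.0, 0.8, 0.0],   # token 1: O looks slightly better locally
    [0.0, 0.2, 1.0],   # token 2: I-PER looks best locally
    [2.0, 0.0, 0.0],   # token 3: clearly O
]
decoded = [tags[i] for i in viterbi_decode(emissions, transitions)]
print(decoded)  # ['B-PER', 'I-PER', 'O']
```

Greedy per-token decoding would pick O then I-PER, which is not a valid BIO sequence. The transition scores push the first token to B-PER instead, which is the kind of structured correction a CRF layer provides.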

Results

Metric      Baseline (BiLSTM + CRF)    MoE-BiLSTM    Δ
Accuracy    89.4%                      improved      +12%
F1 Score    0.71                       0.94          +32% (relative; +0.23 absolute)
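
For reference, the F1 delta can be read either as an absolute or a relative gain. The short check below uses only the two F1 values reported in the table; the variable names are just for illustration.

```python
baseline_f1 = 0.71
moe_f1 = 0.94

absolute_gain = moe_f1 - baseline_f1          # difference in F1 points
relative_gain = absolute_gain / baseline_f1   # gain as a share of the baseline

print(f"absolute: +{absolute_gain:.2f} F1")   # absolute: +0.23 F1
print(f"relative: +{relative_gain:.0%}")      # relative: +32%
```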

What I learned

  • Load balancing is critical. Without it, 1-2 experts dominate routing and the rest never train. Auxiliary loss weight needs careful tuning (too high → uniform routing, no specialization; too low → collapse).
  • Experts do specialize. Visualizing gate weights showed clear preferences: some experts fired predominantly on PER tokens (person names), others on ORG / LOC. The auxiliary loss didn’t kill specialization; it just prevented degenerate collapse.
  • MoE worked even at small scale, at least on this task. The win wasn’t capacity (only the 2 selected experts run per token, so the parameters active at inference stay modest even though the total count grows) but inductive bias: forcing the model to discover token-type-specific feature subspaces.
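
One simple way to check for the kind of specialization described above is to tally which experts each entity type gets routed to. The sketch below assumes you have, for every token, its BIO tag and the indices of its top-2 experts. The function name `routing_by_tag` and the toy inputs are made up for illustration and are not the project's actual gate statistics.

```python
from collections import Counter, defaultdict


def routing_by_tag(tags, top_experts):
    """Share of routing slots each expert receives, grouped by entity type."""
    counts = defaultdict(Counter)
    for tag, experts in zip(tags, top_experts):
        entity = tag.split("-")[-1]  # "B-PER" -> "PER", "O" -> "O"
        counts[entity].update(experts)
    return {
        entity: {e: n / sum(c.values()) for e, n in sorted(c.items())}
        for entity, c in counts.items()
    }


# Toy inputs: one BIO tag and one (expert, expert) pair per token.
tags = ["B-PER", "I-PER", "O", "B-ORG", "O", "B-LOC"]
top_experts = [(0, 1), (0, 2), (3, 1), (2, 3), (3, 1), (2, 1)]

shares = routing_by_tag(tags, top_experts)
for entity, dist in shares.items():
    print(entity, dist)
```

A distribution that is near-uniform for every entity type suggests the auxiliary loss weight is too high; a distribution where one expert takes almost everything for all types suggests collapse.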

What came of it

Mostly a learning exercise — implementing MoE from scratch built the muscle for reading later MoE papers (Mixtral, DeepSeek-V2/V3) with way more depth. Code is on GitHub for anyone who wants a clean reference implementation.

Stack: Python · PyTorch · HuggingFace Datasets · seqeval

Period: April – May 2024