Concept. Also known as: MoE, sparse mixture of experts, sparse MoE

Mixture of experts

An architecture where only a small fraction of a model's parameters activate per token — keeping quality high while cutting the compute cost of each forward pass.

In a nutshell

A mixture-of-experts model replaces a single dense feed-forward block with many parallel "expert" sub-networks and a learned router that picks a handful of them (often just one or two) for each token. The total parameter count can be enormous, but most weights sit idle on any given pass, so the per-token compute matches that of a much smaller dense model, even though every expert still has to be held in memory. The hard part is the router: if it keeps picking the same few experts, the others receive almost no training signal and the extra capacity is wasted. Balancing load across experts during training, commonly with an auxiliary loss, is what makes or breaks the approach.
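To make the routing and balancing ideas concrete, here is a minimal NumPy sketch of one sparse layer's forward pass. It is an illustration under stated assumptions, not any particular model's implementation: the function name `moe_layer`, the two-layer ReLU experts, the renormalised top-k gates, and all shapes are hypothetical choices, and the balancing term follows the general form described for Switch Transformer (number of experts times the sum of token fraction times mean router probability), adapted here to top-k routing.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def moe_layer(tokens, router_w, experts_w1, experts_w2, top_k=2):
    """Sparse MoE forward pass (illustrative shapes, all assumed).

    tokens:     (T, d)     one row per token
    router_w:   (d, E)     router weights, one column per expert
    experts_w1: (E, d, h)  first layer of each expert
    experts_w2: (E, h, d)  second layer of each expert
    """
    num_tokens = tokens.shape[0]
    num_experts = router_w.shape[1]

    # Router: a probability over experts for every token.
    probs = softmax(tokens @ router_w)                      # (T, E)
    top_idx = np.argsort(-probs, axis=-1)[:, :top_k]        # (T, k) chosen experts
    top_p = np.take_along_axis(probs, top_idx, axis=-1)     # (T, k)
    gates = top_p / top_p.sum(axis=-1, keepdims=True)       # renormalise over the chosen k

    # Each expert only processes the tokens routed to it; the rest of its
    # weights stay idle for this pass.
    out = np.zeros_like(tokens)
    for e in range(num_experts):
        token_ids, slot = np.nonzero(top_idx == e)
        if token_ids.size == 0:
            continue  # this expert received no tokens
        hidden = np.maximum(tokens[token_ids] @ experts_w1[e], 0.0)  # ReLU
        out[token_ids] += gates[token_ids, slot][:, None] * (hidden @ experts_w2[e])

    # Load-balancing term (assumed Switch-style form): equals 1.0 when the
    # router is perfectly uniform and grows as tokens and probability mass
    # pile onto the same experts.
    assigned = np.zeros((num_tokens, num_experts))
    np.put_along_axis(assigned, top_idx, 1.0, axis=-1)
    frac_tokens = assigned.mean(axis=0) / top_k             # share of routing slots per expert
    mean_prob = probs.mean(axis=0)                          # average router probability per expert
    aux_loss = num_experts * np.sum(frac_tokens * mean_prob)
    return out, aux_loss, frac_tokens

# Toy usage with random weights (sizes are arbitrary).
rng = np.random.default_rng(0)
T, d, h, E = 8, 4, 16, 4
tokens = rng.normal(size=(T, d))
out, aux_loss, frac_tokens = moe_layer(
    tokens,
    rng.normal(size=(d, E)),
    rng.normal(size=(E, d, h)),
    rng.normal(size=(E, h, d)),
    top_k=2,
)
print(out.shape, frac_tokens, aux_loss)
```

In training, a term like `aux_loss` would be added to the main loss with a small weight so the router is nudged toward spreading tokens evenly. The Python loop over experts is only for readability; real systems batch and dispatch tokens to experts in parallel.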

Where it came from

Year: 1991
Source: Jacobs et al., "Adaptive Mixtures of Local Experts" (Neural Computation)
Why it mattered: Introduced the gating-network idea. The sparsely gated, large-scale variant was popularised by Shazeer et al. (2017), originally in recurrent language models, and then scaled up in transformers by Switch Transformer (Fedus et al., 2021).
