Concept. Also known as: MoE, sparse mixture of experts, sparse MoE

Mixture of experts

An architecture where only a small fraction of a model's parameters activate per token — keeping quality high while cutting the compute cost of each forward pass.

Learn first

Scaling laws

In a nutshell

A mixture-of-experts model replaces a single dense feed-forward block with many parallel "expert" sub-networks and a learned router that picks a handful of them for each token. The total parameter count can be enormous, but most weights sit idle on any given pass, so the per-token compute cost matches that of a dense model far smaller than the full parameter tally suggests. The hard part is the router: if it always picks the same experts, the others receive little training signal and atrophy, and the extra capacity buys nothing. Balancing load across experts during training is what makes or breaks the approach.
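The routing and load-balancing ideas above can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not any particular model's implementation: the names softmax, route_top_k, moe_layer and load_balance_loss are hypothetical, the router is assumed to be a simple linear scorer, and the auxiliary penalty follows the general form used in Switch Transformer (number of experts times the sum of routed share times mean router probability), extended here to top-k by counting assignments.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k):
    """Pick the k highest-probability experts for one token.

    Returns (expert_index, weight) pairs; weights are the router
    probabilities renormalised over the chosen experts (an assumption,
    one of several conventions in use).
    """
    probs = softmax(router_logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


def moe_layer(token, router_weights, experts, k):
    """Run one token through only its k selected experts and mix the outputs."""
    # Hypothetical linear router: one weight vector per expert.
    logits = [sum(t * w for t, w in zip(token, row)) for row in router_weights]
    output = [0.0] * len(token)
    for index, weight in route_top_k(logits, k):
        expert_out = experts[index](token)  # the other experts never run
        output = [o + weight * e for o, e in zip(output, expert_out)]
    return output


def load_balance_loss(batch_logits, k):
    """Auxiliary penalty: n_experts * sum_i(f_i * p_i).

    f_i is the share of routing assignments that went to expert i and
    p_i is the mean router probability for expert i across the batch.
    Evenly spread routing gives 1.0; with k=1, collapse onto a single
    expert approaches n_experts.
    """
    n_experts = len(batch_logits[0])
    n_tokens = len(batch_logits)
    counts = [0] * n_experts
    prob_sums = [0.0] * n_experts
    for logits in batch_logits:
        for i, p in enumerate(softmax(logits)):
            prob_sums[i] += p
        for i, _ in route_top_k(logits, k):
            counts[i] += 1
    assignments = n_tokens * k
    return n_experts * sum(
        (counts[i] / assignments) * (prob_sums[i] / n_tokens)
        for i in range(n_experts)
    )
```

With four experts and k=1, a batch whose tokens each favour a different expert scores 1.0 on this penalty, while a batch that sends every token to the same expert scores close to 4. Adding a small multiple of the penalty to the training loss is one common way to discourage the router from collapsing onto a few experts.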

Where it came from

Year: 1991

Source: Jacobs et al., "Adaptive Mixtures of Local Experts" (Neural Computation)

Why it mattered: Introduced the gating-network idea. The sparsely gated, large-scale variant was popularised by Shazeer et al. (2017), who applied it between recurrent layers, and it was carried into transformers at scale by Switch Transformer (Fedus et al., 2021).

In megatrends

Artificial Intelligence

Models, agents, and AI–human collaboration — general-purpose capability scaling into every domain.

Finds citing this concept

Hunter Alpha

A flagship on a gaming card
