▸ Concept

KV-cache quantization

Storing a transformer's key-value (KV) attention cache in lower-precision numbers to cut the memory it occupies during inference, a bottleneck that limits how long a context a model can hold.

Learn first

Local inference

Scaling laws

In a nutshell

During autoregressive decoding, a transformer stores a key vector and a value vector for every token in its context window, at every layer and for every key-value head. Together these form the KV cache. At long contexts or high concurrency, the cache can occupy more GPU memory than the model weights. Quantizing it (to 8-bit, 4-bit, or lower) shrinks each entry, so more context fits in the same memory and more requests can run in parallel. The hard part is that the cache is appended to on every generated token and read in full at every decoding step, so quantization error in any entry affects every later token, and the quality cost is harder to bound than for weight quantization.
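To make the memory arithmetic concrete, here is a minimal Python sketch. The model shape (32 layers, 8 key-value heads, head dimension 128, 32,768 tokens) is a hypothetical example rather than any particular model, and the helper names `kv_cache_bytes`, `quantize_int8`, and `dequantize_int8` are illustrative. The size estimate ignores the small overhead of stored scales. The quantizer shown is plain symmetric int8 with one scale per vector; production schemes typically use finer-grained scales (per group, per channel, or per token).

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size, bits_per_value):
    """Approximate KV-cache size in bytes (scale/zero-point overhead ignored)."""
    # Factor of 2: one key tensor and one value tensor per layer.
    n_values = 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size
    return n_values * bits_per_value // 8


def quantize_int8(vector):
    """Symmetric int8 quantization with a single scale for the whole vector."""
    max_abs = max(abs(x) for x in vector)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer code and clamp to the int8 range used here.
    codes = [max(-127, min(127, round(x / scale))) for x in vector]
    return codes, scale


def dequantize_int8(codes, scale):
    """Map integer codes back to approximate floating-point values."""
    return [c * scale for c in codes]


# Hypothetical model shape, chosen only for illustration.
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=32_768, batch_size=1)
fp16_bytes = kv_cache_bytes(bits_per_value=16, **shape)
int4_bytes = kv_cache_bytes(bits_per_value=4, **shape)
print(f"fp16: {fp16_bytes / 2**30:.2f} GiB, int4: {int4_bytes / 2**30:.2f} GiB")
# prints "fp16: 4.00 GiB, int4: 1.00 GiB"

# Round-trip a made-up key vector to see the per-entry error quantization adds.
key = [0.12, -0.87, 0.45, 0.03]
codes, scale = quantize_int8(key)
restored = dequantize_int8(codes, scale)
max_error = max(abs(a - b) for a, b in zip(key, restored))
print(codes, f"max error: {max_error:.5f}")
```

With this scheme the round-trip error of each entry is bounded by half the scale. That bounds the error of a single cached value, not the downstream effect on attention outputs over a long sequence, which is the harder quantity to control.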

In megatrends

Artificial Intelligence

Models, agents, and AI–human collaboration — general-purpose capability scaling into every domain.

How this connects

Finds citing this concept

The paper that spooked the chip market
