Mentatcurated
▸ Concept

KV-cache quantization

Storing a transformer's key-value attention cache in lower-precision numbers to cut the memory it occupies during inference — the bottleneck that limits how long a context a model can hold.

In a nutshell

Every transformer stores a key and value tensor for each token in its context window — the KV cache. At long contexts or high concurrency, this cache dominates GPU memory, not the model weights. Quantizing it (8-bit, 4-bit, or lower) shrinks each entry, so more context fits in the same memory and more requests can run in parallel. The hard part is that the cache is read and written on every token, so quantization errors compound across the full sequence and the quality cost is harder to bound than for weight quantization.

How this connects

Tap a node to open it