Mentat · curated
Artificial Intelligence high · first-party

The paper that spooked the chip market

A Google Research paper on compressing an AI model's working memory roughly sixfold rattled the memory-chip market, because cheaper inference could mean fewer of the expensive chips that serve it.

When a large model answers you, it keeps a running scratchpad of every token it has already processed — the key-value cache. For long prompts that scratchpad balloons into the dominant cost of inference, and it lives on high-bandwidth memory, the priciest silicon in an AI server. TurboQuant compresses that scratchpad from sixteen bits per value down to roughly three and a half, with almost no measurable loss in answer quality.
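To make the scale concrete, here is a back-of-the-envelope sketch of cache size at the two bit widths. The model shape (32 layers, 8 key-value heads, head dimension 128) and the 128,000-token context are hypothetical values chosen for illustration, not figures from the paper, and the estimate ignores any per-block metadata a real quantizer would store. Note that sixteen bits down to three and a half is a ratio of about 4.6; a full sixfold saving would correspond to roughly 2.7 bits per value.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, tokens, bits_per_value):
    """Estimate key-value cache size in bytes for one sequence."""
    # Two tensors (keys and values) per layer, each holding one
    # head_dim-sized vector per KV head per token.
    values = 2 * layers * kv_heads * head_dim * tokens
    return values * bits_per_value / 8


# Hypothetical model shape and context length (assumptions, not from the paper).
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128
TOKENS = 128_000

full = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, TOKENS, 16)
compressed = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, TOKENS, 3.5)

print(f"16-bit cache:  {full / 2**30:.2f} GiB")
print(f"3.5-bit cache: {compressed / 2**30:.2f} GiB")
print(f"compression:   {full / compressed:.2f}x")
```

Because the cache grows linearly with the token count, the ratio stays the same at any context length; only the absolute savings change.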

The compression applies only to the model's scratchpad, not its weights or its training, so the sell-off was reacting to a change in just one narrow slice of memory demand.

The trick is that it needs no retraining and no calibration data. It rotates the cache through a random transform that flattens the lopsided number distributions that usually make aggressive quantization wreck accuracy, then quantizes close to the theoretical limit on how much information can be discarded. Google reports up to six times less cache memory and as much as eight times faster attention on an H100; independent observers who reran it measured memory and speed improvements closer to 30 to 40 percent in practice.
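The rotate-then-quantize idea can be illustrated with a toy sketch. This is not the paper's algorithm: a randomized Hadamard transform stands in for the random rotation, a plain uniform 4-bit quantizer stands in for the paper's near-optimal one, and the input is a synthetic vector with one injected outlier. All function names here are made up for the example. The point is only that spreading an outlier across every coordinate narrows the range the quantizer must cover, which lowers the reconstruction error.

```python
import math
import random


def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two)."""
    out = list(vec)
    n = len(out)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [x * scale for x in out]


def rotate(vec, signs):
    """Random sign flips followed by the Hadamard transform: an orthogonal map."""
    return fwht([s * x for s, x in zip(signs, vec)])


def unrotate(vec, signs):
    """Inverse of rotate: the normalized Hadamard matrix is its own inverse."""
    return [s * x for s, x in zip(signs, fwht(vec))]


def quantize_uniform(vec, bits):
    """Snap each value to one of 2**bits evenly spaced levels, then dequantize."""
    lo, hi = min(vec), max(vec)
    if hi == lo:
        return list(vec)
    step = (hi - lo) / (2**bits - 1)
    return [lo + round((x - lo) / step) * step for x in vec]


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


rng = random.Random(0)
n = 256
vec = [rng.gauss(0.0, 1.0) for _ in range(n)]
vec[7] = 50.0  # synthetic outlier that stretches the quantization range
signs = [rng.choice((-1.0, 1.0)) for _ in range(n)]

bits = 4
err_plain = mse(vec, quantize_uniform(vec, bits))
err_rotated = mse(vec, unrotate(quantize_uniform(rotate(vec, signs), bits), signs))

print(f"4-bit error without rotation: {err_plain:.4f}")
print(f"4-bit error with rotation:    {err_rotated:.4f}")
```

Because the rotation is orthogonal, error measured after undoing it equals the error introduced in the rotated space, so the comparison between the two numbers is fair. The transform depends only on the random signs, not on the data, which is consistent with the claim that no calibration set is needed.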

The eye-catching part was financial. The paper landed and memory-chip stocks dropped — Micron, Western Digital, SanDisk all down on the day — because if each query needs a fraction of the cache memory it used to, the AI build-out may need fewer of the chips everyone assumed it would buy. A compression result on an abstract data structure briefly repriced the companies that make the physical thing it shrinks.

The lenses

Novelty 3
Impact · breadth 3
Impact · depth 3
Actionable 3
Substance 5
Hype 3

The facts

Free / open?: Paper is public (ICLR 2026); an independent open-source reimplementation exists on GitHub
What it does: Shrinks an AI model's working memory ~6x with near-zero accuracy loss, no retraining needed
Independent gains: Real-world memory and speed improvements measured nearer 30-40%, below Google's headline figures
Open research.google →
