A flagship on a gaming card
Alibaba's new open-weight model fires only 3 billion parameters per token it writes, yet roughly matches last year's flagship, which fired more than seven times as many, and it runs on an $800 gaming card.
Alibaba released a model that activates 3 billion parameters for each token it generates. A year ago, the company's best model activated 22 billion to do comparable work. The new one keeps 35 billion in memory but lights up only a sliver at a time, and on Alibaba's own scorecard it lands roughly even with that older flagship — ahead on a broad knowledge test, behind on harder reasoning and some coding tasks. The headline isn't 'small beats big.' It's 'sparse rivals dense': the model isn't tiny, it's selective.
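To make "selective" concrete, here is a minimal Python sketch. It assumes the model is a mixture-of-experts design, which the article implies but does not spell out, and it is not Alibaba's actual router. The expert count (8), the top-k value (2), the router scores, and the function names are all illustrative assumptions; only the 35, 22 and 3 billion figures come from the article.

```python
# Toy illustration of sparse activation. Sizes and scores are made up;
# this is not the real model's routing code.

def top_k_experts(scores, k):
    """Return the indices of the k highest-scoring experts, best first."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


def sparse_forward(token_value, experts, scores, k):
    """Run only the k selected experts and average their outputs.

    Real routers usually weight each expert by its gate probability;
    a plain average keeps the toy short.
    """
    chosen = top_k_experts(scores, k)
    outputs = [experts[i](token_value) for i in chosen]
    return sum(outputs) / len(outputs), chosen


# Eight stand-in "experts"; each one just scales its input.
experts = [lambda x, m=m: x * m for m in range(1, 9)]
# Hypothetical router scores for a single token.
scores = [0.1, 0.7, 0.05, 0.9, 0.2, 0.3, 0.0, 0.15]

output, chosen = sparse_forward(2.0, experts, scores, k=2)
print(chosen)   # only two of the eight experts run for this token
print(output)

# Figures from the article, in billions of parameters.
TOTAL_B, ACTIVE_B, OLD_ACTIVE_B = 35, 3, 22
active_share = ACTIVE_B / TOTAL_B      # share of the weights used per token
old_to_new = OLD_ACTIVE_B / ACTIVE_B   # older flagship's active count vs. the new one
print(f"{active_share:.1%} of weights active per token")
print(f"older flagship activated {old_to_new:.1f}x as many")
```

Every expert is scored, but only the chosen few do any work, which is why the per-token cost tracks the 3 billion active parameters rather than the 35 billion stored.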
The catch: every benchmark here is Alibaba's own, and even on that scorecard the model trails a flagship a year its senior on harder reasoning and some coding tasks.
That selectivity is the whole point. Because so few parameters fire per token, the model generates over a hundred tokens a second on a single consumer graphics card costing around $800, whereas the model it rivals needed datacenter-class memory just to load. Frontier-adjacent coding and agent work that used to mean renting a server now fits on a desk, under an Apache 2.0 licence that lets anyone download and run it.
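A rough back-of-envelope shows how much memory the full 35 billion parameters need at different precisions. The bit widths below are assumptions for illustration, and the estimate covers weights only (no KV cache, activations or runtime overhead); the article does not say which format or which card was used.

```python
def weight_memory_gb(params_billions, bits_per_param):
    """Approximate memory for the weights alone, in gigabytes (10^9 bytes)."""
    bytes_total = params_billions * 1e9 * bits_per_param / 8
    return bytes_total / 1e9


TOTAL_PARAMS_B = 35  # from the article: all 35 billion stay resident in memory

for bits in (16, 8, 4):  # assumed precisions, not confirmed by the article
    print(f"{bits:>2}-bit: ~{weight_memory_gb(TOTAL_PARAMS_B, bits):.1f} GB")
```

Sparsity cuts the compute spent per token, not the memory needed to hold the model: all 35 billion parameters still have to be loaded, so the storage precision decides whether they fit on a given card.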
This isn't a breakthrough so much as the cost curve doing what it has been doing since DeepSeek and Mistral popularised the same trick: each generation, the parameters you actually pay to run drop while quality holds. The proof of how fast that curve moves is the model's own obsolescence: Alibaba shipped a whole newer generation within two months. The durable signal isn't this one release; it's that models 'good enough to self-host' are getting cheaper to run faster than the hardware itself is getting cheaper.
Pull the weights from the Hugging Face model card and run them through Ollama or llama.cpp to see the 100-tokens-a-second figure on your own GPU.
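One way to check the throughput figure yourself is to ask a local Ollama server for its own timing statistics. The sketch below assumes Ollama is running on its default port and that you have already pulled the model; `MODEL_TAG` is a placeholder, not a real tag, and the helper names are hypothetical. It relies on the `eval_count` and `eval_duration` (nanoseconds) fields that Ollama returns from `/api/generate`.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODEL_TAG = "your-model-tag"  # placeholder: replace with the tag you actually pulled


def tokens_per_second(stats):
    """Generation speed from Ollama's eval_count and eval_duration (nanoseconds)."""
    return stats["eval_count"] / stats["eval_duration"] * 1e9


def measure(prompt, model=MODEL_TAG, url=OLLAMA_URL):
    """Send one non-streaming request and return the measured tokens per second."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False})
    request = urllib.request.Request(
        url,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as reply:
        stats = json.load(reply)
    return tokens_per_second(stats)


# Usage, with the Ollama server running and the model pulled:
#   print(f"{measure('Summarise how sparse models work.'):.1f} tokens/s")
```

A single short prompt gives a noisy number; averaging a few longer generations should give a fairer comparison with the quoted figure.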
Get the tool at huggingface.co