Artificial Intelligence medium · first-party

A flagship on a gaming card

Alibaba's new open-weight model fires only 3 billion parameters per word it writes, yet matches last year's flagship that fired more than seven times as many — and it runs on an $800 gaming card.

tool Qwen team, Alibaba · 2 min read

Alibaba released a model that activates 3 billion parameters for each token it generates. A year ago, the company's best model activated 22 billion to do comparable work. The new one keeps 35 billion in memory but lights up only a sliver at a time, and on Alibaba's own scorecard it lands roughly even with that older flagship — ahead on a broad knowledge test, behind on harder reasoning and some coding tasks. The headline isn't 'small beats big.' It's 'sparse rivals dense': the model isn't tiny, it's selective.
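The arithmetic behind those figures is easy to check. Here is a minimal sketch in Python that uses only the parameter counts quoted above; the function and variable names are illustrative, not anything from Alibaba.

```python
# Parameter counts as quoted in the article, in billions.
NEW_TOTAL_B = 35    # held in memory by the new model
NEW_ACTIVE_B = 3    # activated per generated token
OLD_ACTIVE_B = 22   # activated per token by last year's flagship


def active_fraction(active_b: float, total_b: float) -> float:
    """Share of a model's parameters used for each token."""
    return active_b / total_b


def activation_ratio(old_active_b: float, new_active_b: float) -> float:
    """How many times more parameters the older model fired per token."""
    return old_active_b / new_active_b


print(f"Active share per token: {active_fraction(NEW_ACTIVE_B, NEW_TOTAL_B):.1%}")
print(f"Old vs new active parameters: {activation_ratio(OLD_ACTIVE_B, NEW_ACTIVE_B):.1f}x")
```

On these numbers, roughly 8.6% of the weights fire per token, and the older flagship fired about 7.3 times as many.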

The catch: every benchmark here is Alibaba's own, and the model still trails a flagship a year its senior on harder reasoning and some coding tasks.

That selectivity is the whole point. Because so few parameters fire per token, the model runs at over a hundred words a second on a single consumer graphics card costing around $800 — where the model it rivals needed datacenter-class memory to load at all. Frontier-adjacent coding and agent work that used to mean renting a server now fits on a desk, under an Apache 2.0 licence that lets anyone download and run it.
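How much memory the full 35 billion parameters need depends on how many bits each weight is stored in. The sketch below is a back-of-envelope estimate rather than a figure from Alibaba: the bit widths are assumptions chosen for illustration, and it counts the weights only, ignoring the KV cache and runtime overhead.

```python
TOTAL_PARAMS_B = 35  # total parameters quoted in the article, in billions


def weight_memory_gb(total_params_b: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in gigabytes (10^9 bytes)."""
    total_bytes = total_params_b * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# Assumed storage formats: 16-bit, 8-bit and 4-bit weights.
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: ~{weight_memory_gb(TOTAL_PARAMS_B, bits):.1f} GB")
```

Under these assumptions the weights take about 70 GB at 16 bits, 35 GB at 8 bits and 17.5 GB at 4 bits. Whether a given format fits depends on the card's memory, which the article does not state.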

This isn't a breakthrough so much as the cost curve doing what it has been doing since DeepSeek and Mistral popularised the same trick: each generation, the parameters you actually pay to run drop while quality holds. The proof of how fast that curve moves is the model's own obsolescence: Alibaba shipped a whole newer generation within two months. The durable signal isn't this one release. It's that models good enough to self-host are getting lighter faster than the hardware to run them is getting cheaper.

Want to try it?

Pull the weights from the Hugging Face model card and run them through Ollama or llama.cpp to see whether you get 100-plus tokens a second on your own GPU.
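One way to measure the speed yourself is to read the timing fields Ollama returns with each response. This is a rough sketch, assuming an Ollama server is already running on its default local port and the model has been pulled. `MODEL_TAG` is a hypothetical placeholder, since the article does not give the model's tag, and `measure` is an illustrative helper name.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODEL_TAG = "your-model-tag"  # hypothetical placeholder: use the tag you actually pulled


def tokens_per_second(stats: dict) -> float:
    """Generation speed from Ollama's eval_count and eval_duration (nanoseconds)."""
    return stats["eval_count"] / (stats["eval_duration"] / 1e9)


def measure(prompt: str, model: str = MODEL_TAG) -> float:
    """Send one non-streaming request to the local server and return tokens/sec."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        stats = json.load(response)
    return tokens_per_second(stats)

# Example (needs a running Ollama server and a pulled model):
# print(f"{measure('Explain mixture of experts in two sentences.'):.1f} tokens/sec")
```

A single short prompt gives a noisy reading, so it is worth averaging over a few longer generations before comparing against the maker's claim.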

Get the tool at huggingface.co →

The lenses

Novelty 2

Impact · breadth 3

Impact · depth 3

Actionable 4

Substance 4

Hype 2

The facts

Open? Yes: Apache 2.0, downloadable weights

Runs on: A single ~$800 consumer GPU, 100+ words/sec

The catch: Maker-reported benchmarks; matches but doesn't beat last year's flagship across the board

Concepts

Local inference · Scaling laws · Mixture of experts

