Mentatcurated
Open Models / Quietly overlooked

nanoVLM

A complete vision-language model in about 750 lines — small enough to read in an afternoon, and that's exactly the point.

Most frontier models are effectively unreadable: millions of lines and layers of infrastructure that obscure what's actually happening. nanoVLM is the opposite — a from-scratch vision-language model whose entire training loop fits in a single, legible file.

It won't win a benchmark, and it isn't trying to. Its value is pedagogical: you can follow, line by line, how an image becomes tokens, fuses with text, and gets decoded into a description. The whole pipeline stops being a black box.
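To make those three stages concrete, here is a minimal NumPy sketch of the same shape of pipeline. It is not nanoVLM's code: every name, size, and weight matrix below is a hypothetical stand-in, the weights are random rather than trained, and the "decoder" is a mean-pooling toy where a real model would use a transformer with causal attention. It only illustrates the data flow: patches become image tokens, image tokens are concatenated with text embeddings, and tokens are emitted greedily one at a time.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes, all hypothetical and far smaller than any real model.
PATCH, D_VISION, D_MODEL, VOCAB = 4, 16, 32, 50

def patchify(image, patch=PATCH):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    # Bring the two patch-grid axes to the front, then flatten each patch.
    return grid.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)

# Random matrices stand in for trained weights.
W_patch = rng.normal(size=(PATCH * PATCH * 3, D_VISION))  # vision encoder stand-in
W_proj = rng.normal(size=(D_VISION, D_MODEL))             # modality projector
E_text = rng.normal(size=(VOCAB, D_MODEL))                # token embedding table
W_head = rng.normal(size=(D_MODEL, VOCAB))                # output head

def image_to_tokens(image):
    """Image -> patch features -> vectors in the language model's space."""
    return patchify(image) @ W_patch @ W_proj

def fuse(image_tokens, text_ids):
    """Prepend image tokens to the embedded text prompt as one sequence."""
    return np.concatenate([image_tokens, E_text[text_ids]], axis=0)

def greedy_decode(sequence, steps=5):
    """Toy decoder: mean-pool the sequence, take the argmax token, append, repeat."""
    generated = []
    for _ in range(steps):
        logits = sequence.mean(axis=0) @ W_head
        next_id = int(np.argmax(logits))
        generated.append(next_id)
        sequence = np.concatenate([sequence, E_text[next_id][None, :]], axis=0)
    return generated

image = rng.random((8, 8, 3))       # 8x8 RGB image -> four 4x4 patches
prompt_ids = np.array([1, 2, 3])    # hypothetical prompt token ids
image_tokens = image_to_tokens(image)
sequence = fuse(image_tokens, prompt_ids)
generated = greedy_decode(sequence)
print(image_tokens.shape, sequence.shape, generated)
```

The point of the sketch is that the language model never sees pixels. Once the projector has mapped patch features into the same vector space as the text embeddings, an image is just a few extra positions at the front of the sequence.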

There's a tradition here — Karpathy's nanoGPT did it for language models and quietly taught a generation how transformers work. nanoVLM is the same move for multimodal, and it's easy to miss precisely because it isn't loud.

Why it's here

Understanding beats access. A readable reference implementation does more for the field's literacy than another closed model behind an API — and almost nobody is talking about it.

The facts

Language: Python
License: Apache-2.0
Stars: 3.1k
Updated: this week
Repository: github.com/huggingface/nanoVLM