nanoVLM
A complete vision-language model in about 750 lines — small enough to read in an afternoon, and that's exactly the point.
Most frontier models are effectively unreadable: millions of lines and layers of infrastructure that obscure what's actually happening. nanoVLM is the opposite — a from-scratch vision-language model whose entire training loop fits in a single, legible file.
It won't win a benchmark, and it isn't trying to. Its value is pedagogical: you can follow, line by line, how an image becomes tokens, fuses with text, and gets decoded into a description. The whole pipeline stops being a black box.
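To make those three steps concrete, here is a toy sketch in plain Python. It is not nanoVLM's code: every name and size below (patchify, linear, PATCH, D_MODEL, VOCAB, the random weights) is a hypothetical stand-in, and the transformer layers a real model runs in its vision encoder and language decoder are left out entirely. It only shows the shape of the data flow: patches become vectors, those vectors are joined with text embeddings into one sequence, and the last position is projected to vocabulary scores.

```python
import random

def patchify(image, patch):
    """Split an H x W grayscale image (list of rows) into flattened patch vectors."""
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            patches.append([image[top + dy][left + dx]
                            for dy in range(patch) for dx in range(patch)])
    return patches

def linear(vec, weight):
    """Multiply a vector by a (len(vec) x d_out) weight matrix."""
    return [sum(v * row[j] for v, row in zip(vec, weight))
            for j in range(len(weight[0]))]

def random_matrix(rows, cols, rng):
    """Untrained placeholder weights; a real model learns these."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
PATCH, D_MODEL, VOCAB = 4, 8, 16  # toy sizes, not nanoVLM's configuration

image = [[rng.random() for _ in range(8)] for _ in range(8)]  # 8x8 toy image
prompt_ids = [3, 7, 1]  # stand-in token ids for a text prompt

# 1. Image becomes tokens: each 4x4 patch is projected to a D_MODEL vector.
patch_proj = random_matrix(PATCH * PATCH, D_MODEL, rng)
image_tokens = [linear(p, patch_proj) for p in patchify(image, PATCH)]

# 2. Fuse with text: look up text embeddings and join both into one sequence.
embedding = random_matrix(VOCAB, D_MODEL, rng)
text_tokens = [embedding[i] for i in prompt_ids]
sequence = image_tokens + text_tokens

# 3. Decode: a real decoder would run transformer layers over `sequence`;
#    this sketch skips them and scores the vocabulary from the last position.
lm_head = random_matrix(D_MODEL, VOCAB, rng)
logits = linear(sequence[-1], lm_head)
next_id = max(range(VOCAB), key=lambda i: logits[i])  # greedy pick
```

With an 8x8 image and 4x4 patches, the image contributes four tokens and the prompt three, so the decoder would see a sequence of seven vectors. Repeating step 3 with each picked token appended is what turns the scores into a description.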
There's a tradition here: Karpathy's nanoGPT did the same for language models and quietly taught a generation how transformers work. nanoVLM makes the same move for multimodal models, and it's easy to miss precisely because it isn't loud.
Understanding beats access. A readable reference implementation does more for the field's literacy than another closed model behind an API — and almost nobody is talking about it.