Mentatcurated
concept also: VLM, vision language model, multimodal model

Vision-language model

A model that takes both images and text as input and reasons over them together — describing pictures, answering questions about them, or grounding language in what it sees.

VLMs fuse a vision encoder with a language model so the two share a representation. They power image captioning, document understanding, screenshots-to-actions for agents, and most "look at this and tell me…" interactions.

The interesting recent thread is how small a competent VLM can be — readable reference implementations now fit in a few hundred lines, which makes the architecture teachable rather than mysterious.

Related concepts