# VLM (Vision-Language Model)

> A VLM jointly understands images and text — reading documents, screenshots, charts, and photos and reasoning about them in language.

**A vision-language model (VLM) is a model that takes images alongside text and reasons over both — describing a photo, extracting a table from a scanned invoice, reading a dashboard screenshot, or explaining a chart.**

Architecturally, a vision encoder turns images into tokens the language model attends to natively, so the LLM's reasoning applies directly to visual content. That collapsed what used to be pipelines: OCR + layout analysis + parsing became one call to a model that *reads the page* — the shift covered in [Using VLMs for OCR, Documents, and Video](/guides/vision/vlm-ocr-documents), and packaged for extraction work in the [multimodal-document-extractor](/skills/data/multimodal-document-extractor) skill.
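A rough sketch of the "images become tokens" idea: the image is cut into fixed-size patches, each patch becomes one token, and those tokens sit in the same sequence as the text tokens. The 16-pixel patch size, the `<img_i>` placeholder strings, and both function names are assumptions for illustration only; real encoders emit embedding vectors, and patch sizes and token merging vary by model.

```python
from math import ceil


def image_token_count(width, height, patch_size=16):
    """Number of patch tokens for an image, assuming one token per patch.

    patch_size=16 is an illustrative assumption, not any specific model's value.
    """
    cols = ceil(width / patch_size)
    rows = ceil(height / patch_size)
    return rows * cols


def build_sequence(image_size, text_tokens, patch_size=16):
    """Place image patch tokens and text tokens in one sequence.

    Placeholder strings stand in for the embedding vectors a real
    vision encoder would produce.
    """
    width, height = image_size
    n = image_token_count(width, height, patch_size)
    image_tokens = [f"<img_{i}>" for i in range(n)]
    return image_tokens + list(text_tokens)


sequence = build_sequence((224, 224), ["What", "is", "shown", "?"])
print(len(sequence))  # 196 image tokens + 4 text tokens
```

The language model attends over the whole list, which is why no separate OCR or layout stage is needed in this picture.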

Most frontier model APIs now accept image input, and open-weight families like [Qwen3-VL](/tools/qwen3-vl) brought the capability to self-hosting. Beyond documents, VLMs are the perception layer of [computer-use agents](/glossary/computer-use) (reading UIs to act on them) and of coding agents verifying their own frontend work from screenshots. The practical craft is managing image resolution deliberately: resolution sets the accuracy ceiling for small text, and it also drives the token bill, since larger images typically encode into more tokens.
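One way to make the resolution trade-off concrete is to pick a token budget per image and downscale to fit it. The sketch below assumes one token per 28x28-pixel patch, which is a hypothetical figure; check the actual accounting for the model you use. The function names `estimate_tokens` and `fit_to_token_budget` are also illustrative, not part of any library.

```python
from math import ceil, floor, sqrt


def estimate_tokens(width, height, patch_size=28):
    """Estimate image tokens, assuming one token per patch (hypothetical rule)."""
    return ceil(width / patch_size) * ceil(height / patch_size)


def fit_to_token_budget(width, height, max_tokens, patch_size=28):
    """Return (new_width, new_height, tokens) that fits within max_tokens.

    Keeps the aspect ratio approximately, snaps to whole patches,
    and never upscales the image.
    """
    if estimate_tokens(width, height, patch_size) <= max_tokens:
        return width, height, estimate_tokens(width, height, patch_size)

    # Uniform scale factor that brings the patch count down to the budget.
    scale = min(1.0, sqrt(max_tokens * patch_size**2 / (width * height)))
    cols = max(1, floor(width * scale / patch_size))
    rows = max(1, floor(height * scale / patch_size))

    # Extreme aspect ratios can still overshoot after clamping to one patch,
    # so trim the longer side until the budget holds.
    while cols * rows > max_tokens and (cols > 1 or rows > 1):
        if cols >= rows:
            cols -= 1
        else:
            rows -= 1

    new_width = min(width, cols * patch_size)
    new_height = min(height, rows * patch_size)
    return new_width, new_height, estimate_tokens(new_width, new_height, patch_size)


print(fit_to_token_budget(2800, 1400, max_tokens=1000))
```

For pages with small text, a tighter crop at full resolution may preserve legibility better than shrinking the whole image, so a budget like this is a starting point rather than a rule.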

---

_Source: https://agentscamp.com/glossary/vision-language-model — Term on AgentsCamp._
