VLM (Vision-Language Model)
A VLM jointly understands images and text — reading documents, screenshots, charts, and photos and reasoning about them in language.
A vision-language model (VLM) is a model that takes images alongside text and reasons over both — describing a photo, extracting a table from a scanned invoice, reading a dashboard screenshot, or explaining a chart.
Architecturally, a vision encoder turns images into tokens the language model attends to natively, so the LLM's reasoning applies directly to visual content. That collapsed what used to be pipelines: OCR + layout analysis + parsing became one call to a model that reads the page — the shift covered in Using VLMs for OCR, Documents, and Video, and packaged for extraction work in the multimodal-document-extractor skill.
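To make "turns images into tokens" concrete, here is a minimal sketch of the tiling step only. It assumes a ViT-style encoder that cuts the image into fixed-size square patches; the text above does not name a specific encoder design, and real encoders go on to pass each patch through learned layers, which this sketch omits. The `patchify` function and the 2-pixel patch size are illustrative, not taken from any particular model.

```python
from typing import List

Image = List[List[float]]  # grayscale image as rows of pixel values


def patchify(image: Image, patch: int) -> List[List[float]]:
    """Split an H x W image into non-overlapping patch x patch tiles.

    Each tile is flattened into one vector. In a ViT-style encoder
    (an assumption here) every such vector becomes one visual token
    after a learned projection, which is not shown.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tile = [
                image[top + dy][left + dx]
                for dy in range(patch)
                for dx in range(patch)
            ]
            tokens.append(tile)
    return tokens


# A toy 4 x 4 image whose pixel values are 0..15, read row by row.
image = [[float(y * 4 + x) for x in range(4)] for y in range(4)]
tokens = patchify(image, patch=2)
print(len(tokens), len(tokens[0]))  # 4 patches of 4 values each
```

The number of tokens grows with the image area, which is why resolution drives both cost and how much fine detail the language model can see.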
The major frontier model APIs now accept image input, and open-weight families like Qwen3-VL brought the capability to self-hosting. Beyond documents, VLMs are the perception layer of computer-use agents (reading UIs to act on them) and of coding agents that verify their own frontend work from screenshots. The practical craft is managing image resolution deliberately: it sets both the accuracy ceiling for small text and the token bill.
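The resolution trade-off can be estimated before sending anything to a model. The sketch below assumes one token per 28 x 28 pixel tile; that figure is a placeholder, since each model and API has its own formula, so substitute the one from your provider's documentation. The function names `estimate_image_tokens` and `scale_for_budget` are hypothetical helpers, not part of any library.

```python
import math

# Assumption: one token per 28 x 28 pixel tile. Real models and APIs use
# their own formulas; replace PATCH with the figure your provider documents.
PATCH = 28


def estimate_image_tokens(width: int, height: int, patch: int = PATCH) -> int:
    """Rough token count: how many patch-sized tiles cover the image."""
    return math.ceil(width / patch) * math.ceil(height / patch)


def scale_for_budget(width: int, height: int, max_tokens: int,
                     patch: int = PATCH) -> float:
    """Approximate largest scale factor (at most 1.0) that fits the budget."""
    if max_tokens < 1:
        raise ValueError("max_tokens must be at least 1")

    def tokens_at(scale: float) -> int:
        return estimate_image_tokens(
            max(1, round(width * scale)), max(1, round(height * scale)), patch
        )

    # Tokens grow with area, so start from the square root of the ratio.
    scale = min(1.0, math.sqrt(max_tokens / estimate_image_tokens(width, height, patch)))
    # Rounding up to whole tiles can overshoot; shrink in small steps until it fits.
    while tokens_at(scale) > max_tokens:
        scale *= 0.98
    return scale


width, height = 2480, 3508  # an A4 page scanned at 300 DPI
print(estimate_image_tokens(width, height))  # 11214 under the 28-pixel assumption
scale = scale_for_budget(width, height, max_tokens=1500)
print(f"scale {scale:.2f}; 30 px text becomes about {30 * scale:.0f} px tall")
```

Printing the resulting text height alongside the scale makes the cost of a tight budget visible: if the smallest text you care about ends up only a few pixels tall, cropping to the relevant region is usually a better choice than shrinking the whole page.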
Frequently asked questions
- Do VLMs replace OCR?
  For most document understanding, effectively yes. Classic OCR outputs characters and leaves structure to you; a VLM reads the page like a person (tables, layout, handwriting, checkboxes, the figure the text refers to) and can return structured data directly. Dedicated OCR still wins on raw character accuracy for clean, high-volume scanning at minimal cost.
- What are VLMs' weak spots?
  Precise counting and measurement, dense small text at low resolution, exact spatial coordinates, and hallucinated detail when an image is ambiguous: the model fills gaps plausibly, like any LLM. Resolution settings matter more than people expect: token cost scales with image size, and downscaling silently destroys small text.
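Because a VLM can return structured data directly but may also fill gaps with plausible detail, it helps to check each reply before trusting it. The following is a minimal sketch, assuming the model was prompted to answer with JSON for an invoice. The field names (`invoice_number`, `total`, `line_items`, `amount`) are a hypothetical schema chosen for illustration, and the arithmetic check is only one example of a consistency test.

```python
import json
from decimal import Decimal
from typing import List

# Hypothetical schema for illustration; use the fields your prompt asks for.
REQUIRED_FIELDS = {"invoice_number", "total", "line_items"}


def check_extraction(raw_reply: str) -> List[str]:
    """Return the problems found in a model's JSON reply (empty means it passed)."""
    try:
        data = json.loads(raw_reply)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]

    missing = sorted(REQUIRED_FIELDS - data.keys())
    if missing:
        return [f"missing field: {name}" for name in missing]

    try:
        total = Decimal(str(data["total"]))
        items_sum = sum(
            (Decimal(str(item["amount"])) for item in data["line_items"]),
            Decimal("0"),
        )
    except (KeyError, TypeError, ArithmeticError):
        return ["amounts are missing or not numeric"]

    # A misread digit usually breaks this identity, so it catches many errors.
    if items_sum != total:
        return [f"line items sum to {items_sum} but total is {total}"]
    return []


good = '{"invoice_number": "A-17", "total": "30.00", "line_items": [{"amount": "10.00"}, {"amount": "20.00"}]}'
bad = '{"invoice_number": "A-17", "total": "35.00", "line_items": [{"amount": "10.00"}, {"amount": "20.00"}]}'
print(check_extraction(good))
print(check_extraction(bad))
```

A reply that fails a check like this is a candidate for a retry at higher resolution or for human review, rather than silent acceptance.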
Related
- Using Vision-Language Models for OCR, Documents, and Video Understanding: How to use vision-language models for OCR, documents, and video: how they differ from traditional OCR, their failure modes, and getting reliable output.
- Multimodal AI: Multimodal AI processes more than one kind of input or output (text, images, audio, video) in a single model, like an LLM that reads screenshots or speaks.
- Multimodal Document Extractor: Extract structured data from documents and images with a vision-language model. Define the target schema, prompt the VLM to fill it from the page (invoices, forms, receipts, statements, IDs), and verify critical fields against the source. Use when you need reliable structured output from messy, varied, or scanned documents that defeat template-based OCR.
- Qwen3-VL: Alibaba Qwen's open-weights vision-language model family (2B–235B, Apache-2.0): image and document understanding, OCR, visual reasoning, and video.
- Computer Use: Computer use is an AI agent operating software through its real interface: reading the screen, moving the cursor, clicking, and typing like a person would.
- How Computer-Use Agents Work: Inside the perception-action loop that lets AI operate real software (screenshots in, clicks out), plus grounding, reliability, and when to use APIs instead.