# Using Vision-Language Models for OCR, Documents, and Video Understanding

> How to use vision-language models for OCR, documents, and video: how they differ from traditional OCR, their failure modes, and getting reliable output.

Vision-language models read images and text together, so they grasp layout, tables, charts, and handwriting, where traditional OCR only extracts characters. They are powerful on varied documents but can hallucinate exact values, so you constrain output to a schema and verify critical fields. This guide covers VLM vs. OCR, document and video understanding, and open vs. proprietary models.

"OCR" used to mean one thing: convert pixels of text into characters. Vision-language models (VLMs) change the job entirely — they read an image *and* understand it, so they can pull the line items out of an invoice, tell you whether a form is signed, read handwriting, interpret a chart, and answer questions about a page. This guide is about when that's the right tool, where it bites you, and how to get output you can trust.

## VLM vs. traditional OCR

Traditional OCR transcribes characters: fast, cheap, deterministic, and excellent on clean printed text. It struggles the moment a document has structure or variety — tables, multi-column layouts, forms, stamps, handwriting, poor scans — because it has no understanding of what it's reading.

A VLM reads the image and the text together, so it grasps **layout and meaning**: it knows the number in the bottom-right cell is the total, that a block is a shipping address, that a signature box is empty. For messy, varied documents it generalizes without the per-format templates that make classic document pipelines brittle.

> [!WARNING]
> The failure mode that matters is **faithfulness**. A VLM can occasionally misread or hallucinate an *exact* value (a total, a date, an account number) while producing confident, well-formatted output. Never trust a critical extracted value just because the JSON parsed. Constrain output to a schema and **verify the fields that matter** against the source (or a traditional OCR pass) before acting on them.
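
One way to cross-check extracted values against a traditional OCR pass is a simple containment test: normalize both sides and flag any field whose value never appears in the OCR text. The sketch below is a minimal illustration under stated assumptions. The function names, the field names, and the sample data are all hypothetical, and the OCR text is assumed to come from whatever engine you already run.

```python
import re


def normalize(value: str) -> str:
    """Lowercase and drop everything except letters and digits."""
    return re.sub(r"[^0-9a-z]", "", value.lower())


def unverified_fields(extracted: dict[str, str], ocr_text: str) -> list[str]:
    """Return the names of fields whose value does not appear in the OCR text."""
    haystack = normalize(ocr_text)
    return [
        name
        for name, value in extracted.items()
        if not normalize(value) or normalize(value) not in haystack
    ]


# Hypothetical example data: the VLM misread one digit of the total.
ocr_text = "Invoice INV-2041  Date: 2025-03-14  Total due: $1,284.50"
extracted = {"invoice_id": "INV-2041", "date": "2025-03-14", "total": "1,234.50"}

print(unverified_fields(extracted, ocr_text))  # ['total']
```

This check is deliberately conservative: a value the model reformatted (say, a date written differently) is flagged even if it is right, and a very short value can match by accident. It is a cheap first filter for review routing, not proof of correctness.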

## Getting reliable structured output

The reliable pattern for document extraction:

1. **Define the schema**: the exact fields, types, and enums you want, with clear descriptions.
2. **Prompt with structured output**: use the provider's structured-output/JSON mode so the result conforms (see [Structured Output vs JSON Mode vs Function Calling](/guides/concepts/structured-output-2026)).
3. **Verify critical fields**: check totals, IDs, and dates against the source; record a confidence signal for each field (for example, whether it matched an OCR pass or passed an arithmetic check) and route low-confidence pages to human review.

The [multimodal-document-extractor](/skills/data/multimodal-document-extractor) skill packages exactly this loop.
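
As a rough illustration of steps 1 and 3, the sketch below defines a small invoice schema, parses a model response into it, and applies an arithmetic check before anything is acted on. It uses only the Python standard library. The schema, the currency enum, and the `raw` response are hypothetical, and the provider call from step 2 is omitted because its API differs from vendor to vendor.

```python
import json
from dataclasses import dataclass
from decimal import Decimal, InvalidOperation

ALLOWED_CURRENCIES = {"USD", "EUR", "GBP"}  # hypothetical enum for this sketch


@dataclass
class LineItem:
    description: str
    amount: Decimal


@dataclass
class Invoice:
    invoice_id: str
    currency: str
    line_items: list[LineItem]
    total: Decimal


def parse_invoice(raw_json: str) -> Invoice:
    """Parse model output into the schema; raise ValueError on any mismatch."""
    try:
        data = json.loads(raw_json)
        invoice = Invoice(
            invoice_id=str(data["invoice_id"]),
            currency=str(data["currency"]),
            line_items=[
                LineItem(str(item["description"]), Decimal(str(item["amount"])))
                for item in data["line_items"]
            ],
            total=Decimal(str(data["total"])),
        )
    except (KeyError, TypeError, InvalidOperation, json.JSONDecodeError) as exc:
        raise ValueError(f"output does not match schema: {exc!r}") from exc
    if invoice.currency not in ALLOWED_CURRENCIES:
        raise ValueError(f"unexpected currency: {invoice.currency}")
    return invoice


def needs_review(invoice: Invoice) -> list[str]:
    """Return reasons to route this document to human review (empty if none)."""
    reasons = []
    if not invoice.invoice_id.strip():
        reasons.append("missing invoice_id")
    if sum((item.amount for item in invoice.line_items), Decimal("0")) != invoice.total:
        reasons.append("line items do not sum to total")
    return reasons


# Hypothetical model response: valid JSON, valid schema, wrong total.
raw = (
    '{"invoice_id": "INV-2041", "currency": "USD", "line_items": ['
    '{"description": "Widgets", "amount": "1000.00"}, '
    '{"description": "Shipping", "amount": "84.50"}], "total": "1284.50"}'
)
print(needs_review(parse_invoice(raw)))  # ['line items do not sum to total']
```

The example response parses cleanly and matches the schema, yet the arithmetic check still catches it. That is why schema conformance alone is not treated as verification here.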

## Video understanding

Video is the same idea extended over time: sample frames, give the model temporal context, and it can caption, answer questions, detect events, and search within the footage. The practical constraint is tokens — every frame costs context, latency, and money — so you sample at a rate that captures what matters and chunk long videos deliberately rather than feeding every frame.
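
The token arithmetic behind sampling and chunking can be sketched in a few lines. The function below is a hypothetical planner, not any library's API: it assumes a fixed token cost per frame (in practice this varies by model and image resolution) and a per-request token budget reserved for frames, then returns the timestamps to sample, grouped into chunks that fit the budget.

```python
def plan_chunks(
    duration_s: float,
    tokens_per_frame: int,
    token_budget: int,
    sample_fps: float = 1.0,
) -> list[list[float]]:
    """Return per-chunk frame timestamps (seconds) that fit the token budget."""
    frames_per_chunk = token_budget // tokens_per_frame
    if frames_per_chunk < 1:
        raise ValueError("token budget is smaller than a single frame")
    total_frames = int(duration_s * sample_fps)
    timestamps = [round(i / sample_fps, 3) for i in range(total_frames)]
    return [
        timestamps[i : i + frames_per_chunk]
        for i in range(0, total_frames, frames_per_chunk)
    ]


# Hypothetical numbers: a 10-minute clip, one frame every 2 seconds,
# 250 tokens per frame, 20,000 tokens of frame budget per request.
chunks = plan_chunks(
    duration_s=600, tokens_per_frame=250, token_budget=20_000, sample_fps=0.5
)
print(len(chunks), len(chunks[0]), chunks[1][0])  # 4 80 160.0
```

With these assumed numbers, 300 sampled frames split into four requests of at most 80 frames each. Doubling the sampling rate would double both the request count and the cost, which is the trade-off to tune against how quickly the events you care about happen.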

## Open vs. proprietary models

Open-weights VLMs like [Qwen3-VL](/tools/qwen3-vl) (Apache-2.0) are strong on many OCR and document tasks and can be **self-hosted** for privacy, cost control at volume, and offline operation. Proprietary frontier VLMs may lead on the hardest reasoning and require no infrastructure on your side. The choice is the usual one: see [Self-Host vs API](/guides/mlops/self-host-vs-api-llm), and for serving an open model, see the [llm-inference-engineer](/agents/data-ai/llm-inference-engineer). Whatever you pick, decide by measured accuracy on *your* documents, not by a public benchmark.
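
Measuring accuracy on your own documents can start very simply: label a small set by hand, run each candidate model over it, and compare per-field exact-match rates. The sketch below assumes each model's output has already been collected into one dictionary per document. The function name, field names, and sample values are hypothetical.

```python
from collections import defaultdict


def field_accuracy(
    predictions: list[dict[str, str]],
    ground_truth: list[dict[str, str]],
) -> dict[str, float]:
    """Exact-match accuracy per field across a labeled set of documents."""
    if len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground truth must align one-to-one")
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for predicted, expected in zip(predictions, ground_truth):
        for field, expected_value in expected.items():
            total[field] += 1
            # A missing field counts as wrong, same as a wrong value.
            if predicted.get(field) == expected_value:
                correct[field] += 1
    return {field: correct[field] / total[field] for field in total}


# Hypothetical hand-labeled set and one model's extractions.
ground_truth = [
    {"invoice_id": "INV-2041", "total": "1284.50"},
    {"invoice_id": "INV-2042", "total": "310.00"},
]
model_a = [
    {"invoice_id": "INV-2041", "total": "1234.50"},
    {"invoice_id": "INV-2042", "total": "310.00"},
]
print(field_accuracy(model_a, ground_truth))  # {'invoice_id': 1.0, 'total': 0.5}
```

Reporting accuracy per field rather than per document shows where a model is weak. A model that is perfect on IDs but unreliable on totals may still be usable if totals are the field you verify anyway.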

---

_Source: https://agentscamp.com/guides/vision/vlm-ocr-documents — Guide on AgentsCamp._