# Multimodal AI

> Multimodal AI processes more than one kind of input or output — text, images, audio, video — in a single model, like an LLM that reads screenshots or speaks.

**Multimodal AI refers to models that work across more than one modality — accepting or producing combinations of text, images, audio, and video — rather than text alone.**

The practical 2026 baseline: frontier models are natively multimodal on the input side (paste a screenshot into Claude Code and it *sees* the broken layout), [vision-language models](/glossary/vision-language-model) read documents at a quality that once required dedicated OCR, speech models run realtime conversation, and image generation is a commodity API. Modalities stopped being separate products and became input and output types.

For builders, two domains dominate. **Documents and screens**: VLMs replaced OCR-then-parse pipelines with direct understanding, which is the basis of [document extraction](/guides/vision/vlm-ocr-documents) and of [computer-use agents](/glossary/computer-use) that read UIs. **Voice**: the [STT → LLM → TTS pipeline](/guides/voice/build-a-voice-agent) and its realtime successors add a conversational layer to any agent. The recurring engineering theme is token cost: images and audio consume [context](/glossary/context-window) fast, so every choice of image resolution or audio chunk length is a budget decision.
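
To make the "resolution and chunking are budget decisions" point concrete, here is a minimal sketch of budgeting before sending media to a model. The rates are assumptions for illustration only: `PIXELS_PER_TOKEN` and `TOKENS_PER_SECOND` are hypothetical constants, and the helper names are not from any library. Real providers publish their own formulas, so substitute those before relying on the numbers.

```python
import math

# Hypothetical rates, for illustration only. Check your provider's docs.
PIXELS_PER_TOKEN = 750      # assumed: image cost scales with pixel area
TOKENS_PER_SECOND = 25.0    # assumed: audio cost scales with duration


def estimate_image_tokens(width: int, height: int,
                          pixels_per_token: int = PIXELS_PER_TOKEN) -> int:
    """Estimate image token cost as pixel area divided by an assumed rate."""
    return math.ceil(width * height / pixels_per_token)


def fit_to_budget(width: int, height: int, max_tokens: int,
                  pixels_per_token: int = PIXELS_PER_TOKEN) -> tuple[int, int]:
    """Return dimensions scaled down (aspect ratio kept) to fit max_tokens."""
    if estimate_image_tokens(width, height, pixels_per_token) <= max_tokens:
        return width, height
    # Area must shrink to max_tokens * pixels_per_token, so each side
    # shrinks by the square root of the area ratio.
    scale = math.sqrt(max_tokens * pixels_per_token / (width * height))
    return max(1, math.floor(width * scale)), max(1, math.floor(height * scale))


def chunk_audio(duration_s: float, max_tokens: int,
                tokens_per_second: float = TOKENS_PER_SECOND
                ) -> list[tuple[float, float]]:
    """Split an audio duration into (start, end) spans that each fit max_tokens."""
    if max_tokens <= 0 or tokens_per_second <= 0:
        raise ValueError("max_tokens and tokens_per_second must be positive")
    chunk_len = max_tokens / tokens_per_second
    chunks = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_len, duration_s)
        chunks.append((start, end))
        start = end
    return chunks


if __name__ == "__main__":
    print(estimate_image_tokens(1920, 1080))
    print(fit_to_budget(1920, 1080, max_tokens=1000))
    print(chunk_audio(100.0, max_tokens=1000))
```

Under these assumed rates, halving both image dimensions cuts the estimate to roughly a quarter, which is why downscaling a screenshot is usually the cheapest lever before trimming text from the prompt.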

---

_Source: https://agentscamp.com/glossary/multimodal-ai — Term on AgentsCamp._