Computer Use
Computer use is the agent capability of operating software through its real interface the way a person does: perceiving the screen visually and acting through mouse and keyboard (moving the cursor, clicking, typing), with no API required.
It's the generalization of tool use to interfaces never designed for machines: a VLM reads the screenshot, the agent loop issues primitive actions (click, type, scroll), and each new frame is the observation that drives the next step. Anthropic shipped the first frontier version of the capability in late 2024; by 2026 it powers browser-using agents in products from coding tools to Google's Antigravity, with dedicated frameworks (Browser Use, Stagehand, Skyvern) industrializing the browser case.
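The loop described above can be sketched in a few lines. This is a minimal illustration, not any vendor's implementation: `capture`, `decide`, and `execute` are hypothetical callables standing in for the screenshot source, the VLM call, and the input driver, and the `Action` type is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Action:
    """One primitive action (hypothetical type for this sketch)."""
    kind: str                               # "click", "type", "scroll", or "done"
    xy: Optional[Tuple[int, int]] = None    # screen coordinates for a click
    text: Optional[str] = None              # text for a type action


def run_loop(
    goal: str,
    capture: Callable[[], bytes],                          # returns the current frame
    decide: Callable[[str, bytes, List[Action]], Action],  # stands in for the VLM call
    execute: Callable[[Action], None],                     # stands in for the input driver
    max_steps: int = 20,
) -> List[Action]:
    """Run the perception-action loop and return the actions that were executed."""
    history: List[Action] = []
    for _ in range(max_steps):
        frame = capture()                       # observe: a fresh screenshot
        action = decide(goal, frame, history)   # decide: one model call per step
        if action.kind == "done":               # the model reports the goal is met
            break
        execute(action)                         # act: click, type, or scroll
        history.append(action)
    return history
```

The `max_steps` cap is a design choice for the sketch: because every step costs a model call over an image, a hard bound keeps a confused agent from looping indefinitely.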
The engineering reality is less flattering: computer use is slower, costlier, and less reliable than structured automation, because every step is a model call over an image. So the hierarchy holds: use an API when one exists, structured browser tools like Playwright MCP or Chrome DevTools MCP when the DOM is reachable, and pixel-level computer use for everything else. Put human gates on anything that spends money or sends email, since a mis-grounded click is this modality's signature failure.
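One way to encode that hierarchy and the human gate is sketched below. Everything here is illustrative: the `Modality` enum, the `GATED_ACTIONS` set, and the `approve` callback are hypothetical names chosen for the example, not part of Playwright MCP, Chrome DevTools MCP, or any other tool mentioned above.

```python
from enum import Enum
from typing import Callable


class Modality(Enum):
    API = "api"
    STRUCTURED_BROWSER = "structured_browser"   # e.g. a DOM-level browser tool
    PIXEL = "pixel_computer_use"


def choose_modality(has_api: bool, dom_reachable: bool) -> Modality:
    """Prefer the most structured interface that is available."""
    if has_api:
        return Modality.API
    if dom_reachable:
        return Modality.STRUCTURED_BROWSER
    return Modality.PIXEL


# Assumed labels for actions that need sign-off (hypothetical for this sketch).
GATED_ACTIONS = {"spend_money", "send_email"}


def perform(action_type: str, run: Callable[[], None],
            approve: Callable[[str], bool]) -> bool:
    """Run the action, asking a human first if it is gated. Returns True if it ran."""
    if action_type in GATED_ACTIONS and not approve(action_type):
        return False            # human declined, nothing is executed
    run()
    return True
```

In a real system `approve` would block on a person's decision; here it is just a callable so the gate can be exercised in isolation.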
Frequently asked questions
- How does computer use actually work?
- A perception-action loop: the agent receives a screenshot (or accessibility/DOM data), a vision-language model decides the next action — click these coordinates, type this text, scroll — the action executes, and a fresh screenshot comes back as the observation. Reliability comes from grounding (finding the right element) and recovery (noticing a mis-click and correcting).
- When is computer use the right tool versus an API?
- It is a last resort by design: APIs and structured tools (like browser automation via Playwright MCP) are faster, cheaper, and far more reliable when they exist. Computer use earns its keep where there is no API (legacy desktop software, arbitrary websites, vendor portals) or where the task is inherently visual. If an MCP server covers it, use that first.
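The grounding-and-recovery idea from the first answer can be sketched as a verify-and-retry wrapper around a single click. This is an assumption-laden illustration: `locate`, `click`, and `verify` are hypothetical callables for finding the target element, issuing the click, and checking a fresh observation for the expected result.

```python
from typing import Callable, Optional, Tuple

Point = Tuple[int, int]


def click_with_recovery(
    locate: Callable[[], Optional[Point]],   # grounding: find the target on screen
    click: Callable[[Point], None],          # issue the click at those coordinates
    verify: Callable[[], bool],              # did the expected change appear?
    max_attempts: int = 3,
) -> Optional[int]:
    """Return the attempt number that succeeded, or None if all attempts failed."""
    for attempt in range(1, max_attempts + 1):
        target = locate()
        if target is None:       # grounding failed, try again on a fresh look
            continue
        click(target)
        if verify():             # recovery check: confirm the click landed
            return attempt
    return None
```

Returning `None` rather than raising leaves the caller free to decide what happens next, for example escalating to a human.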
Related
- AI Agent: An AI agent is an LLM-driven system that pursues a goal in a loop (calling tools, observing results, iterating) instead of returning one answer.
- VLM (Vision-Language Model): A VLM jointly understands images and text, reading documents, screenshots, charts, and photos and reasoning about them in language.
- Playwright MCP: Microsoft's open-source MCP server that gives AI agents structured browser automation via Playwright's accessibility tree.
- Chrome DevTools MCP: Google's official MCP server that gives coding agents a live Chrome: Puppeteer automation plus DevTools network, console, and performance insights.
- Human-in-the-Loop (HITL): Human-in-the-loop design inserts human judgment at decisive points in an AI workflow: approving actions, resolving ambiguity, owning the irreversible steps.
- How Computer-Use Agents Work: Inside the perception-action loop that lets AI operate real software (screenshots in, clicks out), plus grounding, reliability, and when to use APIs instead.
- Browser Use: The most-adopted open-source browser-agent framework. Point an LLM at a task and it drives a real browser: navigating, clicking, typing, extracting.
- Daytona: Sub-90ms agent sandboxes, isolated computers with snapshots, volumes, Git and LSP tools, on Linux, Windows, or Android; AGPL self-host or managed cloud.
- E2B: Open-source Firecracker-microVM sandboxes where AI agents safely execute untrusted code, with stateful code interpreters, full Linux, pause/resume, and desktop VMs.
- Multimodal AI: Multimodal AI processes more than one kind of input or output (text, images, audio, video) in a single model, like an LLM that reads screenshots or speaks.