Voice & Multimodal — AI Agents, Skills & Tools

Agents, skills, guides, tools, and commands for voice & multimodal — 22 curated resources for building with AI coding agents.

Agent

Voice Agent Engineer

Use this agent to build or fix a real-time voice agent — the streaming STT → LLM → TTS pipeline, conversational (mouth-to-ear) latency, turn-taking, barge-in/interruptions, and per-stage provider selection. Examples — "our voice bot feels laggy and talks over people, fix the turn-taking and latency", "build a phone agent that transcribes, answers with our LLM, and speaks back", "get our voice agent's response time under a second".

sonnet · 6

Skill

Multimodal Document Extractor

Extract structured data from documents and images with a vision-language model — define the target schema, prompt the VLM to fill it from the page (invoices, forms, receipts, statements, IDs), and verify critical fields against the source. Use when you need reliable structured output from messy, varied, or scanned documents that defeat template-based OCR.

invocable · v1.0.0
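
The schema-then-verify idea in this skill can be sketched in a few lines. This is a minimal illustration, not the skill's actual implementation: the `Invoice` schema, the `CRITICAL_FIELDS` tuple, and the `unverified_fields` helper are all hypothetical names, and the VLM call itself is left out. The sketch assumes you already have the model's extracted values plus some plain-text rendering of the page to check them against.

```python
from dataclasses import dataclass, asdict


@dataclass
class Invoice:
    """Hypothetical target schema the VLM is asked to fill."""
    invoice_number: str
    total: str
    currency: str


# Fields that must be found verbatim in the source before we trust them (an assumption).
CRITICAL_FIELDS = ("invoice_number", "total")


def unverified_fields(extracted: Invoice, source_text: str) -> list[str]:
    """Return the critical fields whose extracted value does not appear in the source text."""
    # Compare with whitespace removed and case folded, so layout differences do not matter.
    normalized_source = "".join(source_text.split()).lower()
    values = asdict(extracted)
    missing = []
    for field in CRITICAL_FIELDS:
        value = "".join(str(values[field]).split()).lower()
        if not value or value not in normalized_source:
            missing.append(field)
    return missing


SOURCE_TEXT = "Invoice No: INV-1042\nTotal due: 1,250.00 EUR"
good = Invoice(invoice_number="INV-1042", total="1,250.00", currency="EUR")
bad = Invoice(invoice_number="INV-1042", total="1,260.00", currency="EUR")
```

Here `unverified_fields(good, SOURCE_TEXT)` returns an empty list, while the misread total in `bad` is flagged for review. A literal substring check is deliberately strict; a real pipeline might normalize number and date formats first.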

Guide

Add Image Understanding to Your App

A practical guide to sending images to a vision model and getting reliable, structured results: base64 vs URL, resolution, prompting, cost, and errors.

7m read · AgentsCamp

Guide

Multimodal Embeddings and Image Search

How multimodal embeddings put images and text in one vector space, and how to build text-to-image and image-to-image search on top of it.

6m read · AgentsCamp
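
To make the shared-vector-space idea concrete, here is a minimal sketch of nearest-neighbor search by cosine similarity. The three-dimensional vectors are made-up placeholders, not output from any real model; in practice both the image vectors and the text query vector would come from the same multimodal embedding model. The names `IMAGE_INDEX`, `TEXT_QUERY`, and `search` are hypothetical.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vec: list[float], index: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the names of the k indexed items most similar to the query vector."""
    scored = [(cosine(query_vec, vec), name) for name, vec in index.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]


# Placeholder image embeddings (assumed values for illustration only).
IMAGE_INDEX = {
    "beach.jpg": [0.9, 0.1, 0.0],
    "forest.jpg": [0.1, 0.9, 0.1],
    "city.jpg": [0.0, 0.2, 0.9],
}

# Placeholder embedding of a text query such as "sunny coastline".
TEXT_QUERY = [0.8, 0.2, 0.1]

text_to_image = search(TEXT_QUERY, IMAGE_INDEX, k=2)
image_to_image = search(IMAGE_INDEX["city.jpg"], IMAGE_INDEX, k=1)
```

Because text and images live in one space, the same `search` function serves both text-to-image and image-to-image lookups; only the source of the query vector changes. At scale, the linear scan would typically be replaced by an approximate nearest-neighbor index.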

Guide

Multimodal RAG over PDFs, Scans & Charts: Two Approaches That Actually Work

RAG over visual documents — PDFs, scans, charts — where text-only extraction loses tables and layout. Parse-then-text vs embed-the-page-image, with trade-offs.

6m read · AgentsCamp

Guide

Screenshot-to-Code: Building UIs from Images with AI

Turn a screenshot, mockup, or Figma frame into working frontend code with AI vision models — the realistic workflow, the right tools, and the honest pitfalls.

5m read · AgentsCamp

Guide

Vision-Language Models Compared (2026)

Which vision-language model to reach for, by job: Claude, GPT, Gemini, and open models like Qwen3-VL compared on OCR, charts, grounding, video, and cost.

5m read · AgentsCamp

Guide

Using Vision-Language Models for OCR, Documents, and Video Understanding

How to use vision-language models for OCR, documents, and video: how they differ from traditional OCR, their failure modes, and getting reliable output.

3m read · AgentsCamp

Guide

Best Speech-to-Text APIs in 2026

The STT field, honestly ranked — Deepgram and AssemblyAI's hosted duel, Whisper as the open baseline, Cartesia Ink for latency — and how to pick by workload.

2m read · AgentsCamp

Guide

Best Text-to-Speech APIs in 2026

The TTS APIs worth building on — ElevenLabs for quality and breadth, Cartesia Sonic for realtime latency — and how to choose for agents vs produced audio.

2m read · AgentsCamp

Guide

How to Build a Voice Agent: The STT → LLM → TTS Pipeline

How to build a real-time voice agent: the STT → LLM → TTS pipeline, the latency budget that makes or breaks it, and how to wire each stage.

3m read · AgentsCamp
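
A latency budget for the STT → LLM → TTS pipeline is simple arithmetic: sum the per-stage delays and compare the total with the target. The sketch below assumes a one-second target and uses invented per-stage numbers purely for illustration; they are not measurements of any provider. `Stage`, `PIPELINE`, and `check_budget` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    latency_ms: int  # assumed delay this stage adds before the user hears audio


def check_budget(stages: list[Stage], budget_ms: int) -> dict:
    """Sum stage latencies, compare with the budget, and name the slowest stage."""
    total = sum(stage.latency_ms for stage in stages)
    slowest = max(stages, key=lambda stage: stage.latency_ms)
    return {
        "total_ms": total,
        "within_budget": total <= budget_ms,
        "slowest": slowest.name,
    }


# Illustrative placeholder numbers, not benchmarks.
PIPELINE = [
    Stage("end-of-turn detection", 200),
    Stage("STT final transcript", 150),
    Stage("LLM time to first token", 350),
    Stage("TTS time to first audio", 150),
    Stage("network and playback", 100),
]

report = check_budget(PIPELINE, budget_ms=1000)
```

With these placeholder values the total is 950 ms, just inside the budget, and the LLM's time to first token is the largest single contributor. Substituting measured values per stage shows where to spend optimization effort first.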

Guide

Realtime Voice Agents: Build on LiveKit, Buy Vapi, or Pipeline with Pipecat

The three ways to ship a realtime voice agent in 2026 — open infrastructure, managed platform, or OSS pipeline — and how speech-to-speech models fit in.

2m read · AgentsCamp

Tool

AssemblyAI

Speech AI platform: promptable Universal-3 Pro STT, a flat-rate Voice Agent API, and speech understanding — summarization, sentiment, PII redaction.

freemium · voice

Tool

Cartesia

Real-time voice AI on state-space models — Sonic streaming TTS, Ink STT with native turn detection, and Line, a code-first voice-agent platform.

freemium · voice

Tool

Deepgram

A voice-AI platform with fast, accurate speech-to-text (Nova) and low-latency text-to-speech (Aura), plus a bundled Voice Agent API.

freemium · voice

Tool

ElevenLabs

A voice-AI platform for high-quality text-to-speech, voice cloning, dubbing, and real-time conversational agents, via API.

freemium · voice

Tool

fal

fal is a generative-media inference cloud for running image, video, and audio diffusion models fast — 1,000+ models, a simple API, and pay-per-use pricing.

freemium · platform

Tool

LiveKit

Open-source realtime infrastructure — a WebRTC server plus the LiveKit Agents framework for production voice AI, with turn detection, telephony, and cloud.

freemium · voice

Tool

Pipecat

An open-source Python framework for real-time voice and multimodal conversational AI — it orchestrates streaming STT, LLM, and TTS into composable pipelines.

open source · voice

Tool

Qwen3-VL

Alibaba Qwen's open-weights vision-language model family (2B–235B, Apache-2.0): image and document understanding, OCR, visual reasoning, and video.

open source · platform

Tool

Vapi

The API-first voice-agent platform — assemble phone and web agents from any STT/LLM/TTS mix, with telephony, squads, and tool calling handled for you.

paid · voice

Tool

Whisper

OpenAI's open-weights speech-to-text — the MIT-licensed multilingual model family that made self-hosted transcription a default, with a huge ecosystem.

open source · voice