agentscamp

Voice & Multimodal — AI Agents, Skills & Tools

Agents, skills, guides, tools, and commands for voice & multimodal — 16 curated resources for building with AI coding agents.

Agent

Voice Agent Engineer

Use this agent to build or fix a real-time voice agent — the streaming STT → LLM → TTS pipeline, conversational (mouth-to-ear) latency, turn-taking, barge-in/interruptions, and per-stage provider selection. Examples — "our voice bot feels laggy and talks over people, fix the turn-taking and latency", "build a phone agent that transcribes, answers with our LLM, and speaks back", "get our voice agent's response time under a second".

sonnet · 6
Skill

Multimodal Document Extractor

Extract structured data from documents and images with a vision-language model — define the target schema, prompt the VLM to fill it from the page (invoices, forms, receipts, statements, IDs), and verify critical fields against the source. Use when you need reliable structured output from messy, varied, or scanned documents that defeat template-based OCR.

invocable · v1.0.0
Guide

Using Vision-Language Models for OCR, Documents, and Video Understanding

How to use vision-language models for OCR, documents, and video: how they differ from traditional OCR, their failure modes, and getting reliable output.

2m read · AgentsCamp
Guide

Best Speech-to-Text APIs in 2026

The STT field, honestly ranked — Deepgram and AssemblyAI's hosted duel, Whisper as the open baseline, Cartesia Ink for latency — and how to pick by workload.

2m read · AgentsCamp
Guide

Best Text-to-Speech APIs in 2026

The TTS APIs worth building on — ElevenLabs for quality and breadth, Cartesia Sonic for realtime latency — and how to choose for agents vs produced audio.

2m read · AgentsCamp
Guide

How to Build a Voice Agent: The STT → LLM → TTS Pipeline

How to build a real-time voice agent: the STT → LLM → TTS pipeline, the latency budget that makes or breaks it, and how to wire each stage.

3m read · AgentsCamp
Guide

Realtime Voice Agents: Build on LiveKit, Buy Vapi, or Pipeline with Pipecat

The three ways to ship a realtime voice agent in 2026 — open infrastructure, managed platform, or OSS pipeline framework — and how speech-to-speech models change it.

2m read · AgentsCamp
Tool

AssemblyAI

Speech AI platform: Universal STT models (promptable Universal-3 Pro), a flat-rate Voice Agent API, and speech understanding — summarization, sentiment, PII redaction.

freemium · voice
Tool

Cartesia

Real-time voice AI on state-space models — Sonic streaming TTS, Ink STT with native turn detection, and Line, a code-first voice-agent platform.

freemium · voice
Tool

Deepgram

A voice-AI platform with fast, accurate speech-to-text (Nova) and low-latency text-to-speech (Aura), plus a bundled Voice Agent API.

freemium · voice
Tool

ElevenLabs

A voice-AI platform for high-quality text-to-speech, voice cloning, dubbing, and real-time conversational agents, via API.

freemium · voice
Tool

LiveKit

Open-source realtime infrastructure — a WebRTC server plus the LiveKit Agents framework for production voice AI, with turn detection, telephony, and cloud.

freemium · voice
Tool

Pipecat

An open-source Python framework for real-time voice and multimodal conversational AI — it orchestrates streaming STT, LLM, and TTS into composable pipelines.

open source · voice
Tool

Qwen3-VL

Alibaba Qwen's open-weights vision-language model family (2B–235B, Apache-2.0): image and document understanding, OCR, visual reasoning, and video.

open source · platform
Tool

Vapi

The API-first voice-agent platform — assemble phone and web agents from any STT/LLM/TTS mix, with telephony, squads, and tool calling handled for you.

paid · voice
Tool

Whisper

OpenAI's open-weights speech-to-text — the MIT-licensed multilingual model family that made self-hosted transcription a default, with a huge ecosystem.

open source · voice