MLOps & AI Infra — AI Agents, Skills & Tools

Agents, skills, guides, tools, and commands for MLOps and AI infra: 58 curated resources for building with AI coding agents.

Agent

Finetuning Engineer

Use this agent to fine-tune an open-weight model end to end — confirming fine-tuning is the right tool, preparing the dataset, choosing the method (LoRA/QLoRA vs. full), running training, and proving the result beats the prompted baseline on a held-out eval set. Examples — "fine-tune a small model to match our support tone and answer format", "we have 800 labeled examples — LoRA-tune and show it beats prompting", "our fine-tune overfits and forgot general ability — fix the data and run".

sonnet · 6

Agent

LLM Inference Engineer

Use this agent to serve and optimize self-hosted LLM inference — sizing GPUs, configuring a serving engine like vLLM (continuous batching, PagedAttention, tensor parallelism), applying quantization, and tuning throughput and tail latency against a cost and p95 budget. Examples — "serve Llama-3-70B at p95 under 2s on our GPUs", "our self-hosted model is slow and the GPUs sit half-idle — raise throughput", "quantize this model to fit one GPU without wrecking quality".

sonnet · 6

Agent

Voice Agent Engineer

Use this agent to build or fix a real-time voice agent — the streaming STT → LLM → TTS pipeline, conversational (mouth-to-ear) latency, turn-taking, barge-in/interruptions, and per-stage provider selection. Examples — "our voice bot feels laggy and talks over people, fix the turn-taking and latency", "build a phone agent that transcribes, answers with our LLM, and speaks back", "get our voice agent's response time under a second".

sonnet · 6

Skill

Finetune Dataset Builder

Turn raw examples into a training-ready fine-tuning dataset — normalize to the trainer's chat/instruction format, deduplicate (including near-duplicates), strip PII, balance, validate the schema and token lengths, and carve a leak-free eval split. Use when you have raw examples and need a clean, formatted, split dataset before training.

invocable · v1.0.0
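
As a rough illustration of the deduplicate and leak-free split steps described for this skill, the sketch below uses only the Python standard library. The record shape (`prompt` and `response` keys), the function names, and the normalization rule are assumptions made for illustration, not the skill's actual implementation. Near-duplicate detection, PII stripping, balancing, and token-length checks are left out.

```python
import hashlib
import random


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return " ".join(text.lower().split())


def dedupe(records: list[dict]) -> list[dict]:
    """Drop exact duplicates (after normalization), keeping the first occurrence."""
    seen: set[str] = set()
    unique = []
    for rec in records:
        joined = normalize(rec["prompt"]) + "\x00" + normalize(rec["response"])
        key = hashlib.sha256(joined.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


def split_leak_free(records: list[dict], eval_fraction: float = 0.2, seed: int = 0):
    """Split by normalized prompt so one prompt never appears in both train and eval."""
    groups: dict[str, list[dict]] = {}
    for rec in records:
        groups.setdefault(normalize(rec["prompt"]), []).append(rec)
    keys = sorted(groups)                 # stable order before the seeded shuffle
    random.Random(seed).shuffle(keys)
    n_eval = max(1, round(len(keys) * eval_fraction))
    eval_keys = set(keys[:n_eval])
    train = [r for k in keys if k not in eval_keys for r in groups[k]]
    eval_set = [r for k in keys if k in eval_keys for r in groups[k]]
    return train, eval_set
```

Grouping by prompt before splitting is the design choice that matters here: a plain random split over rows can put two answers to the same prompt on opposite sides, which inflates eval scores.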

Skill

QLoRA Finetune Runner

Run a QLoRA (4-bit LoRA) fine-tune of an open-weight model from a prepared dataset — set up the config, train memory-efficiently (e.g. with Unsloth/PEFT), watch for overfitting, save the adapter, and run a quick eval against the prepared split. Use when you have a clean dataset and want to execute a parameter-efficient fine-tune on a single GPU.

invocable · v1.0.0

Guide

Best Tools for Running LLMs Locally in 2026

The local LLM stack, ranked by job: Ollama for serving tools, LM Studio and Jan for desktop exploration, llama.cpp for control, vLLM when it's real serving.

2m read · AgentsCamp

Guide

Ollama vs LM Studio: Running LLMs Locally (2026)

Ollama vs LM Studio compared — CLI-first server for developers vs polished desktop app for exploring local models. Which local LLM tool fits how you work.

2m read · AgentsCamp

Guide

vLLM vs Ollama: Local Convenience or Serving Throughput? (2026)

vLLM vs Ollama compared — developer-friendly local runtime vs high-throughput production inference engine. Concurrency, hardware, and when to graduate.

2m read · AgentsCamp

Guide

Deploying LLMs to Production: A Reliability & Cost Checklist

Take an LLM feature from prototype to production: API vs self-host, provider fallback, retries, caching, observability, eval gates, and safe rollout.

5m read · AgentsCamp

Guide

Preparing a Fine-Tuning Dataset: Cleaning, Synthetic Data, and Eval Splits

The dataset is the model. How to build a fine-tuning dataset that works — format, curation, cleaning, synthetic augmentation, and a leak-free eval split.

3m read · AgentsCamp

Guide

Fine-Tune vs RAG vs Prompt vs Distill: The 2026 Decision Tree

When to reach for prompt engineering, RAG, fine-tuning, or distillation — what each actually changes, where each fails, and how to combine them.

3m read · AgentsCamp

Guide

Self-Host vs API: When Does Running Your Own LLM Actually Pay Off?

The real economics of self-hosting an LLM vs. calling a hosted API — GPU utilization, privacy, latency, and the hidden ops costs that decide the crossover.

4m read · AgentsCamp

Guide

Using Vision-Language Models for OCR, Documents, and Video Understanding

How to use vision-language models for OCR, documents, and video: how they differ from traditional OCR, their failure modes, and getting reliable output.

3m read · AgentsCamp

Guide

Best Speech-to-Text APIs in 2026

The STT field, honestly ranked — Deepgram and AssemblyAI's hosted duel, Whisper as the open baseline, Cartesia Ink for latency — and how to pick by workload.

2m read · AgentsCamp

Guide

Best Text-to-Speech APIs in 2026

The TTS APIs worth building on — ElevenLabs for quality and breadth, Cartesia Sonic for realtime latency — and how to choose for agents vs produced audio.

2m read · AgentsCamp

Guide

How to Build a Voice Agent: The STT → LLM → TTS Pipeline

How to build a real-time voice agent: the STT → LLM → TTS pipeline, the latency budget that makes or breaks it, and how to wire each stage.

3m read · AgentsCamp

Tool

AssemblyAI

Speech AI platform: promptable Universal-3 Pro STT, a flat-rate Voice Agent API, and speech understanding — summarization, sentiment, PII redaction.

freemium · voice

Tool

Baseten

Production inference platform for ML and LLM models — autoscaling GPU deployments, scale-to-zero, and packaging via the open-source Truss framework.

paid · platform

Tool

Browserbase

Managed headless-browser infrastructure for AI agents and web automation — serverless cloud browsers with stealth, proxies, live view, and Playwright/Stagehand.

freemium · platform

Tool

Cartesia

Real-time voice AI on state-space models — Sonic streaming TTS, Ink STT with native turn detection, and Line, a code-first voice-agent platform.

freemium · voice

Tool

Deepgram

A voice-AI platform with fast, accurate speech-to-text (Nova) and low-latency text-to-speech (Aura), plus a bundled Voice Agent API.

freemium · voice

Tool

ElevenLabs

A voice-AI platform for high-quality text-to-speech, voice cloning, dubbing, and real-time conversational agents, via API.

freemium · voice

Tool

fal

fal is a generative-media inference cloud for running image, video, and audio diffusion models fast — 1,000+ models, a simple API, and pay-per-use pricing.

freemium · platform

Tool

Fireworks AI

Fast production inference for open models — serverless and dedicated GPU deployments, fine-tuning, and an OpenAI-compatible API on the FireAttention engine.

freemium · platform

Tool

Groq

GroqCloud runs open-weight LLMs on custom LPU hardware for very fast, low-latency inference through an OpenAI-compatible API.

freemium · platform

Tool

Jan

An open-source ChatGPT alternative that runs fully offline — a polished desktop app over llama.cpp with a model hub, MCP support, and a local API server.

open source · platform

Tool

LiveKit

Open-source realtime infrastructure — a WebRTC server plus the LiveKit Agents framework for production voice AI, with turn detection, telephony, and cloud.

freemium · voice

Tool

llama.cpp

The C/C++ inference engine that made local LLMs possible — GGUF quantization, every GPU backend, and an OpenAI-compatible server, with no dependencies.

open source · cli

Tool

LM Studio

A desktop app for discovering, downloading, and running open-weight LLMs locally with a GUI and a local OpenAI-compatible server.

freemium · platform

Tool

Modal

Serverless AI infrastructure in pure Python — GPU functions with sub-second cold starts, secure sandboxes for agent code, batch jobs, and per-second billing.

freemium · platform

Tool

Ollama

An open-source tool to run open-weight LLMs locally with a single command, including a local OpenAI-compatible API.

open source · cli

Tool

Pipecat

An open-source Python framework for real-time voice and multimodal conversational AI — it orchestrates streaming STT, LLM, and TTS into composable pipelines.

open source · voice

Tool

Qwen3-VL

Alibaba Qwen's open-weights vision-language model family (2B–235B, Apache-2.0): image and document understanding, OCR, visual reasoning, and video.

open source · platform

Tool

Replicate

Run and deploy any open ML model — LLMs, image, video, audio — through one API with pay-per-second billing, and package your own with open-source Cog.

freemium · platform

Tool

Together AI

A cloud for running, fine-tuning, and deploying open-source models (Llama, DeepSeek, Qwen) via an OpenAI-compatible API plus dedicated GPU endpoints.

freemium · platform

Tool

turbopuffer

A serverless vector and full-text search database built on object storage (S3/GCS/Azure) — usage-based pricing, hybrid search, and low cost per GB at scale.

paid · platform

Tool

Unsloth

An open-source library that makes LoRA/QLoRA fine-tuning of LLMs roughly 2x faster and far more memory-efficient, so you can fine-tune on a single GPU.

open source · sdk

Tool

Vapi

The API-first voice-agent platform — assemble phone and web agents from any STT/LLM/TTS mix, with telephony, squads, and tool calling handled for you.

paid · voice

Tool

vLLM

A high-throughput, memory-efficient inference and serving engine for LLMs, with PagedAttention, continuous batching, and an OpenAI-compatible API server.

open source · sdk

Tool

Whisper

OpenAI's open-weights speech-to-text — the MIT-licensed multilingual model family that made self-hosted transcription a default, with a huge ecosystem.

open source · voice

Command

Scaffold a vLLM Serving Config

Scaffold a vLLM serving config for a model on a target GPU — pick precision/quantization and parallelism to fit, set batching and context length, and expose an OpenAI-compatible server.

/scaffold-vllm-config <model + target GPU(s) and VRAM, or a description of the serving workload>
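
The "pick precision and parallelism to fit" step can be sketched as simple arithmetic: weight memory is roughly parameter count times bytes per parameter, and the tensor-parallel degree is the smallest GPU count whose per-GPU share fits the budget. The helper names, the 30% reserve for KV cache and activations, and the candidate degrees below are assumptions for illustration, not what the command actually generates. The flags in `serve_args` (`--tensor-parallel-size`, `--max-model-len`, `--gpu-memory-utilization`) are vLLM engine arguments, but check them against the installed version.

```python
# Approximate bytes per parameter for common weight precisions.
BYTES_PER_PARAM = {"fp16": 2.0, "bf16": 2.0, "fp8": 1.0, "int4": 0.5}


def weight_gb(params_billion: float, precision: str) -> float:
    """Approximate weight memory in GB: 1e9 params * bytes per param / 1e9."""
    return params_billion * BYTES_PER_PARAM[precision]


def min_tensor_parallel(params_billion, precision, vram_gb,
                        usable_fraction=0.9, kv_reserve_fraction=0.3):
    """Smallest tensor-parallel degree whose per-GPU weight shard fits the budget.

    usable_fraction mirrors a GPU memory utilization cap; kv_reserve_fraction
    is a hypothetical share held back for KV cache and activations.
    Returns None if even 8 GPUs are not enough.
    """
    per_gpu_budget = vram_gb * usable_fraction * (1 - kv_reserve_fraction)
    needed = weight_gb(params_billion, precision)
    for tp in (1, 2, 4, 8):
        if needed / tp <= per_gpu_budget:
            return tp
    return None


def serve_args(model: str, tp: int, max_model_len: int, utilization: float = 0.9):
    """Build an argument list for `vllm serve` (a sketch, not a full config)."""
    return [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tp),
        "--max-model-len", str(max_model_len),
        "--gpu-memory-utilization", str(utilization),
    ]
```

Under these assumptions, a 70B model in fp16 needs about 140 GB for weights alone, so it spreads across four 80 GB GPUs, while a 4-bit version (about 35 GB) fits on one. Real sizing also depends on context length and batch size, which drive KV cache use.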

Term