Data & ML — AI Agents, Skills & Tools

Agents, skills, guides, tools, and commands for Data & ML: 42 curated resources for building with AI coding agents.

Agent

Data Engineer

Use this agent to build and maintain data pipelines — ingestion, ELT/ETL, warehouse modeling, orchestration, and data-quality tests. Examples — building an idempotent ingestion job, modeling a fact/dimension table in dbt, writing a safe backfill for a changed schema.

sonnet · 6

Agent

Data Scientist

Use this agent for data analysis — exploration, statistics, SQL, and clear findings. Examples — analyzing a dataset, writing an analytical SQL query, summarizing experiment results.

sonnet

Agent

ML Engineer

Use this agent for production ML — pipelines, training, serving, evaluation, and MLOps. Examples — building a training pipeline, deploying a model, setting up evaluation.

opus

Agent

Postgres Migration Engineer

Use this agent to plan and execute a zero-downtime Postgres schema migration — decomposing a breaking change into expand-contract steps, writing batched backfills, building indexes CONCURRENTLY, validating constraints online, and keeping every step reversible with the project's migration tooling. Examples — "add a NOT NULL column to a 200M-row table without downtime", "rename a column safely across a rolling deploy", "split this risky migration into reversible expand/contract steps".

sonnet · 6
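The batched backfill mentioned in this card can be sketched roughly as follows. This is a minimal illustration, not the agent's actual output: the `users` table, the `status` column, and the `'active'` fill value are hypothetical, and the standard-library `sqlite3` module stands in for Postgres so the sketch runs anywhere. Against Postgres the same loop would run through a driver of your choice, likely with a short pause between batches.

```python
import sqlite3


def backfill_in_batches(conn, batch_size=1000):
    """Fill users.status where it is NULL, one short transaction per batch.

    Table and column names are hypothetical. Small batches keep each
    transaction (and the locks it holds) short-lived.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    total = 0
    while True:
        with conn:  # commits on success, rolls back on error
            cur = conn.execute(
                """
                UPDATE users
                SET status = 'active'
                WHERE id IN (
                    SELECT id FROM users
                    WHERE status IS NULL
                    ORDER BY id
                    LIMIT ?
                )
                """,
                (batch_size,),
            )
        if cur.rowcount == 0:
            return total  # nothing left to backfill
        total += cur.rowcount
```

Because the loop only touches rows that are still NULL, it can be interrupted and re-run safely, which is what makes it a reasonable fit for the expand step of an expand-contract migration.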

Agent

Prompt Engineer

Use this agent to design and iterate the prompts behind an LLM-powered product feature — instructions, few-shot examples, tool schemas, and the evals that prove a change actually helped. Examples — "this classification prompt is flaky, make it reliable", "design the system prompt and function schema for our support agent", "our extraction prompt regressed after I tweaked it, set up evals so this stops happening".

sonnet · 6

Agent

Vector Search Engineer

Use this agent to design, build, and tune the vector-database layer of a search or RAG system — schema and index design (HNSW/IVF + quantization), metadata/payload filtering, hybrid (dense + sparse) search, and ingestion/upsert pipelines — sized to a real latency, recall, and cost budget. Examples — "set up pgvector for our docs with HNSW and filtered search", "our Qdrant queries are slow and recall dropped after quantization", "add metadata filtering so search only returns the current tenant's documents".

sonnet · 6

Agent

SQL Pro

Use this agent for SQL itself — correct joins and window functions, indexing, EXPLAIN plans, schema design, and safe migrations on Postgres/MySQL. Examples — making a slow query fast, designing a normalized schema, writing a reversible migration.

sonnet · 6

Skill

Multimodal Document Extractor

Extract structured data from documents and images with a vision-language model — define the target schema, prompt the VLM to fill it from the page (invoices, forms, receipts, statements, IDs), and verify critical fields against the source. Use when you need reliable structured output from messy, varied, or scanned documents that defeat template-based OCR.

invocable · v1.0.0

Skill

SQL Optimizer

Diagnose a slow SQL query from its execution plan and propose a verified optimization — finding the real bottleneck (sequential scan, missing or unused index, bad join order, app-side N+1) and measuring the fix before and after. Use when a query is slow and you need a fix backed by EXPLAIN ANALYZE, not a guess.

invocable · v1.0.0

Skill

Embedding Index Tuner

Tune a vector index — HNSW graph parameters and quantization — to hit a recall target at the lowest latency and memory, by sweeping settings against a fixed query set instead of trusting defaults. Use when vector search is slow or memory-hungry, when recall dropped after enabling quantization, or when standing up an index and you need defensible parameters.

invocable · v1.0.0
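The sweep this skill describes comes down to two measurements per setting: recall against exact (brute-force) neighbours and latency. A minimal sketch of the bookkeeping is below. The function names and the `(setting, recall, latency_ms)` tuple layout are assumptions made for illustration, not part of the skill, and the measurements themselves must come from your own index and query set.

```python
def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the exact top-k neighbours that the index also returned."""
    if k <= 0:
        raise ValueError("k must be positive")
    truth = set(exact_ids[:k])
    if not truth:
        return 0.0
    return len(truth & set(approx_ids[:k])) / len(truth)


def mean_recall(results, k):
    """Average recall@k over (approx_ids, exact_ids) pairs, one pair per query."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(recall_at_k(a, e, k) for a, e in results) / len(results)


def pick_cheapest(sweep, target_recall):
    """Return the lowest-latency (setting, recall, latency_ms) row that meets
    the recall target, or None if no setting does."""
    meets_target = [row for row in sweep if row[1] >= target_recall]
    if not meets_target:
        return None
    return min(meets_target, key=lambda row: row[2])
```

Keeping the query set and the exact ground truth fixed across the sweep is what makes the recall numbers comparable between settings.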

Skill

Postgres Index Strategist

Recommend the right Postgres index for a query or workload — choosing B-Tree vs. GIN vs. BRIN vs. partial/covering/expression, checking for redundant or unused indexes, and verifying the choice against the query plan. Use when a query needs an index, when deciding an index type for jsonb/array/full-text/time-series data, or when auditing an over-indexed table.

invocable · v1.0.0

Guide

Choosing Embeddings in 2026: OpenAI vs Cohere vs Voyage vs Open-Source

A decision guide for picking an embedding model for retrieval — accuracy, dimensions, cost, multilingual and domain fit, self-hosting, and lock-in.

4m read · AgentsCamp

Guide

How Embeddings Work: Vectors, Similarity, and Choosing a Model

What an embedding actually is, how similarity is measured, how the models are trained, and the practical rules for using embeddings well in search and RAG.

6m read · AgentsCamp
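As a concrete illustration of "how similarity is measured", cosine similarity is one of the most common choices for embeddings. The sketch below is plain Python with no dependencies and is not taken from the guide itself. The function names are hypothetical, and real systems would normally use a vectorized library or the database's built-in distance operator.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    """Euclidean (L2) length of a vector."""
    return math.sqrt(dot(a, a))


def cosine_similarity(a, b):
    """Cosine of the angle between a and b, in the range [-1, 1]."""
    na, nb = norm(a), norm(b)
    if na == 0 or nb == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot(a, b) / (na * nb)


def normalize(a):
    """Scale a vector to unit length."""
    n = norm(a)
    if n == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / n for x in a]
```

Once vectors are normalized to unit length, the plain dot product gives the same value as cosine similarity, which is why many pipelines normalize embeddings once at ingestion time.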

Guide

Best Vector Database in 2026: pgvector vs Pinecone vs Qdrant vs Weaviate vs Milvus vs Chroma vs LanceDB

A decision guide to vector databases — embedded, server, or managed; whether you already run Postgres; and which fits your scale, filtering, and RAG needs.

5m read · AgentsCamp

Guide

Indexing Postgres at Scale: B-Tree vs GIN vs BRIN and the Hidden Cost of Over-Indexing

A practical guide to choosing Postgres index types — B-Tree, GIN, BRIN, partial, and covering — and why every index you add taxes every write.

4m read · AgentsCamp

Guide

Vector Search at Scale: ANN Indexes, Quantization & Sharding

How to run vector search over millions to billions of vectors without blowing latency, memory, or cost — index families, quantization, filtering, and sharding.

6m read · AgentsCamp

Guide

Zero-Downtime Postgres Migrations: The Expand-Contract Playbook for 2026

How to change a live Postgres schema without downtime or broken deploys — the expand-contract pattern, safe column changes, batched backfills, and CONCURRENTLY.

5m read · AgentsCamp

Guide

Using Vision-Language Models for OCR, Documents, and Video Understanding

How to use vision-language models for OCR, documents, and video: how they differ from traditional OCR, their failure modes, and getting reliable output.

3m read · AgentsCamp

Tool

Chroma

An open-source, Python-first vector database that runs in-process — the fastest path from pip install to a working retrieval prototype.

open source · sdk

Tool

Docling

Open-source Python library that parses PDFs, DOCX, PPTX, HTML, and images into structured Markdown and JSON with layout, tables, and reading order for RAG.

open source · sdk

Tool

LanceDB

An open-source embedded vector database built on the Lance columnar format — serverless, multimodal, and designed to scale on local disk or object storage.

open source · sdk

Tool

Marker

Open-source pipeline that converts PDFs, images, and Office docs into clean Markdown, JSON, or HTML fast, with optional LLM assist for tables and equations.

open source · sdk

Tool

Mem0

A memory layer for AI agents and apps — persistent, personalized long-term memory across sessions.

open source · sdk

Tool

Milvus

An open-source vector database built for billion-scale similarity search, with a distributed architecture and a wide menu of index types.

open source · platform

Tool

pgroll

An open-source CLI for zero-downtime, reversible Postgres schema migrations using the expand-contract pattern behind versioned schema views.

open source · cli

Tool

pgvector

An open-source Postgres extension that adds a vector type and HNSW/IVFFlat indexes for similarity search inside your existing database.

open source · sdk

Tool

Pinecone

A fully managed, serverless vector database for similarity search and RAG — no nodes to run, indexes to tune, or infrastructure to operate.

freemium · platform

Tool

Qdrant

An open-source vector database written in Rust, built for low-latency similarity search at scale.

open source · platform

Tool

Qwen3-VL

Alibaba Qwen's open-weights vision-language model family (2B–235B, Apache-2.0): image and document understanding, OCR, visual reasoning, and video.

open source · platform

Tool

Reducto

High-accuracy document ingestion API — parsing, agentic OCR, table and figure extraction, and splitting that turns messy PDFs into LLM-ready data for RAG.

freemium · platform

Tool

Unstructured

Open-source library plus hosted Platform/API that turns messy documents — PDF, HTML, docx, images, email — into clean, chunked JSON for LLMs and RAG.

freemium · platform

Tool

Voyage AI

Embedding and reranking models tuned for retrieval, now part of MongoDB.

freemium · platform

Tool

Weaviate

An open-source vector database with built-in hybrid search, pluggable vectorizer modules, and GraphQL/REST/gRPC APIs.

open source · platform

Command

DB Migrate

Generate and apply a database migration the safe way — using the project's migration tool, with expand-contract discipline for breaking changes, lock-free DDL, and a reversible up/down.

/db-migrate <the schema change to make, or a path to a pending migration to review>

Command

Scaffold a pgvector Schema & HNSW Index

Scaffold a production-ready pgvector schema and HNSW index for a corpus — matching the project's migration tooling, distance metric, and embedding dimensions.

/scaffold-pgvector-schema <table/corpus name and embedding dimensions, or a description of the data>

Command

Seed Data

Generate realistic, referentially-consistent seed data and a re-runnable seed script from your actual schema — types and constraints respected, plausible values, FK-dependency insert order, idempotent, never aimed at production.

/seed-data <optional: tables and row volume>
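A re-runnable seed script of the kind this command describes might look roughly like the sketch below. Everything in it is illustrative: the `customers` and `orders` tables, the sample rows, and the `environment` guard are hypothetical, and `sqlite3` stands in for the project's real database so the example runs without setup. The points it tries to show are parent-before-child insert order and `ON CONFLICT DO NOTHING` for idempotency.

```python
import sqlite3

# Hypothetical schema: orders reference customers, so customers seed first.
SCHEMA = """
CREATE TABLE IF NOT EXISTS customers (
    id INTEGER PRIMARY KEY,
    email TEXT NOT NULL UNIQUE
);
CREATE TABLE IF NOT EXISTS orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(id),
    total_cents INTEGER NOT NULL CHECK (total_cents >= 0)
);
"""

# Illustrative rows only; fixed primary keys make re-runs deterministic.
CUSTOMERS = [(1, "ada@example.com"), (2, "grace@example.com")]
ORDERS = [(1, 1, 2599), (2, 1, 1200), (3, 2, 4350)]


def seed(conn, environment="development"):
    """Insert seed rows in FK-dependency order; safe to run repeatedly."""
    if environment == "production":
        raise RuntimeError("refusing to seed a production database")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite needs this to enforce FKs
    conn.executescript(SCHEMA)
    with conn:  # one transaction: all seed rows land, or none do
        conn.executemany(
            "INSERT INTO customers (id, email) VALUES (?, ?) "
            "ON CONFLICT DO NOTHING",
            CUSTOMERS,
        )
        conn.executemany(
            "INSERT INTO orders (id, customer_id, total_cents) VALUES (?, ?, ?) "
            "ON CONFLICT DO NOTHING",
            ORDERS,
        )
```

Calling `seed(sqlite3.connect(":memory:"))` twice leaves the same two customers and three orders in place, since rows that already exist are skipped rather than duplicated.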

Command

Profile Postgres Queries

Profile a Postgres workload to find the queries actually costing you — rank by total time with pg_stat_statements, EXPLAIN the worst offenders, and recommend the highest-leverage fix.

/profile-postgres-queries <database/connection details, a slow endpoint, or a description of the workload>

Term