Data & ML — AI Agents, Skills & Tools
Agents, skills, guides, tools, and commands for data & ML: 30 curated resources for building with AI coding agents.
Data Engineer
Use this agent to build and maintain data pipelines — ingestion, ELT/ETL, warehouse modeling, orchestration, and data-quality tests. Examples — building an idempotent ingestion job, modeling a fact/dimension table in dbt, writing a safe backfill for a changed schema.
Data Scientist
Use this agent for data analysis — exploration, statistics, SQL, and clear findings. Examples — analyzing a dataset, writing an analytical SQL query, summarizing experiment results.
ML Engineer
Use this agent for production ML — pipelines, training, serving, evaluation, and MLOps. Examples — building a training pipeline, deploying a model, setting up evaluation.
Postgres Migration Engineer
Use this agent to plan and execute a zero-downtime Postgres schema migration — decomposing a breaking change into expand-contract steps, writing batched backfills, building indexes CONCURRENTLY, validating constraints online, and keeping every step reversible with the project's migration tooling. Examples — "add a NOT NULL column to a 200M-row table without downtime", "rename a column safely across a rolling deploy", "split this risky migration into reversible expand/contract steps".
Prompt Engineer
Use this agent to design and iterate the prompts behind an LLM-powered product feature — instructions, few-shot examples, tool schemas, and the evals that prove a change actually helped. Examples — "this classification prompt is flaky, make it reliable", "design the system prompt and function schema for our support agent", "our extraction prompt regressed after I tweaked it, set up evals so this stops happening".
Vector Search Engineer
Use this agent to design, build, and tune the vector-database layer of a search or RAG system — schema and index design (HNSW/IVF + quantization), metadata/payload filtering, hybrid (dense + sparse) search, and ingestion/upsert pipelines — sized to a real latency, recall, and cost budget. Examples — "set up pgvector for our docs with HNSW and filtered search", "our Qdrant queries are slow and recall dropped after quantization", "add metadata filtering so search only returns the current tenant's documents".
SQL Pro
Use this agent for SQL itself — correct joins and window functions, indexing, EXPLAIN plans, schema design, and safe migrations on Postgres/MySQL. Examples — making a slow query fast, designing a normalized schema, writing a reversible migration.
Multimodal Document Extractor
Extract structured data from documents and images with a vision-language model — define the target schema, prompt the VLM to fill it from the page (invoices, forms, receipts, statements, IDs), and verify critical fields against the source. Use when you need reliable structured output from messy, varied, or scanned documents that defeat template-based OCR.
SQL Optimizer
Diagnose a slow SQL query from its execution plan and propose a verified optimization — finding the real bottleneck (sequential scan, missing or unused index, bad join order, app-side N+1) and measuring the fix before and after. Use when a query is slow and you need a fix backed by EXPLAIN ANALYZE, not a guess.
Embedding Index Tuner
Tune a vector index — HNSW graph parameters and quantization — to hit a recall target at the lowest latency and memory, by sweeping settings against a fixed query set instead of trusting defaults. Use when vector search is slow or memory-hungry, when recall dropped after enabling quantization, or when standing up an index and you need defensible parameters.
Postgres Index Strategist
Recommend the right Postgres index for a query or workload: choosing among B-Tree, GIN, and BRIN, deciding whether a partial, covering, or expression index fits, checking for redundant or unused indexes, and verifying the choice against the query plan. Use when a query needs an index, when deciding an index type for jsonb/array/full-text/time-series data, or when auditing an over-indexed table.
Choosing Embeddings in 2026: OpenAI vs Cohere vs Voyage vs Open-Source
A decision guide for picking an embedding model for retrieval — accuracy, dimensions, cost, multilingual and domain fit, self-hosting, and lock-in.
Best Vector Database in 2026: pgvector vs Pinecone vs Qdrant vs Weaviate vs Milvus vs Chroma vs LanceDB
A decision guide to vector databases — embedded, server, or managed; whether you already run Postgres; and which fits your scale, filtering, and RAG needs.
Indexing Postgres at Scale: B-Tree vs GIN vs BRIN and the Hidden Cost of Over-Indexing
A practical guide to choosing Postgres index types — B-Tree, GIN, BRIN, partial, and covering — and why every index you add taxes every write.
Zero-Downtime Postgres Migrations: The Expand-Contract Playbook for 2026
How to change a live Postgres schema without downtime or broken deploys — the expand-contract pattern, safe column changes, batched backfills, and CONCURRENTLY.
Using Vision-Language Models for OCR, Documents, and Video Understanding
How to use vision-language models for OCR, documents, and video: how they differ from traditional OCR, their failure modes, and getting reliable output.
Chroma
An open-source, Python-first vector database that runs in-process, giving a short path from pip install to a working retrieval prototype.
LanceDB
An open-source embedded vector database built on the Lance columnar format — serverless, multimodal, and designed to scale on local disk or object storage.
Mem0
A memory layer for AI agents and apps — persistent, personalized long-term memory across sessions.
Milvus
An open-source vector database built for billion-scale similarity search, with a distributed architecture and a wide menu of index types.
pgroll
An open-source CLI for zero-downtime, reversible Postgres schema migrations using the expand-contract pattern behind versioned schema views.
pgvector
An open-source Postgres extension that adds a vector type and HNSW/IVFFlat indexes for similarity search inside your existing database.
Pinecone
A fully managed, serverless vector database for similarity search and RAG — no nodes to run, indexes to tune, or infrastructure to operate.
Qdrant
An open-source vector database written in Rust, built for low-latency similarity search at scale.
Qwen3-VL
Alibaba Qwen's open-weights vision-language model family (2B–235B, Apache-2.0): image and document understanding, OCR, visual reasoning, and video.
Voyage AI
Embedding and reranking models tuned for retrieval, now part of MongoDB.
Weaviate
An open-source vector database with built-in hybrid search, pluggable vectorizer modules, and GraphQL/REST/gRPC APIs.
DB Migrate
Generate and apply a database migration the safe way — using the project's migration tool, with expand-contract discipline for breaking changes, lock-free DDL, and a reversible up/down.
Scaffold a pgvector Schema & HNSW Index
Scaffold a production-ready pgvector schema and HNSW index for a corpus — matching the project's migration tooling, distance metric, and embedding dimensions.
Profile Postgres Queries
Profile a Postgres workload to find the queries actually costing you — rank by total time with pg_stat_statements, EXPLAIN the worst offenders, and recommend the highest-leverage fix.
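As a rough illustration of the "rank by total time" step in the entry above, here is a minimal Python sketch. It is not the command's actual implementation. The SQL string uses the `pg_stat_statements` column names from Postgres 13 and later; the `StatementStat` class, the `rank_by_total_time` helper, and the sample rows are hypothetical and exist only for this example.

```python
from dataclasses import dataclass

# Query one might run against the pg_stat_statements view
# (column names as of Postgres 13+; shown for context, not executed here).
TOP_QUERIES_SQL = """
SELECT query, calls, total_exec_time, mean_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;
"""


@dataclass
class StatementStat:
    """Hypothetical container for one pg_stat_statements row."""
    query: str
    calls: int
    total_exec_time_ms: float


def rank_by_total_time(stats, top_n=5):
    """Return (stat, share_of_total) pairs, most expensive first."""
    grand_total = sum(s.total_exec_time_ms for s in stats)
    ranked = sorted(stats, key=lambda s: s.total_exec_time_ms, reverse=True)
    return [
        (s, s.total_exec_time_ms / grand_total if grand_total else 0.0)
        for s in ranked[:top_n]
    ]


# Made-up sample rows, not real measurements.
sample = [
    StatementStat("SELECT id FROM big_report WHERE day = $1", 12, 30_000.0),
    StatementStat("SELECT * FROM orders WHERE id = $1", 900_000, 45_000.0),
    StatementStat("UPDATE sessions SET seen_at = now() WHERE id = $1", 50_000, 25_000.0),
]

ranked = rank_by_total_time(sample)
for stat, share in ranked:
    mean_ms = stat.total_exec_time_ms / stat.calls
    print(f"{share:6.1%}  calls={stat.calls:>7}  mean={mean_ms:9.3f} ms  {stat.query}")
```

In this made-up sample the cheap lookup on `orders` ranks first because it runs 900,000 times, which is the reason to sort by total time rather than mean time before running EXPLAIN on the top entries.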