DevOps & Infra — AI Agents, Skills & Tools

Agents, skills, guides, tools, and commands for DevOps & infra: 45 curated resources for building with AI coding agents.

Agent

LLM Cost Optimizer

Use this agent to cut the cost and latency of an application's LLM API usage without losing quality — audit where the tokens and dollars go, then apply caching, model right-sizing, prompt trimming, batching, and budgets, proven against an eval bar. Examples — "our OpenAI bill tripled, find where the spend is and cut it", "this endpoint's p95 is 8s, bring it down", "right-size models per task and add prompt caching to our chat feature".

sonnet · 6

Agent

Dependency Manager

Use this agent to upgrade project dependencies safely — batching low-risk bumps apart from breaking majors and verifying each step. Examples — clearing months of stale packages, taking a single major version with migration notes, resolving a peer-dependency conflict.

sonnet · 5

Agent

CI/CD Engineer

Use this agent to design, speed up, and harden CI/CD pipelines on any provider (GitHub Actions, GitLab CI, CircleCI, Buildkite). Examples — setting up a build→test→deploy pipeline from scratch, cutting a 25-minute CI run down with caching and matrix parallelism, adding a canary or blue-green deploy with automatic rollback, or reviewing a workflow for leaked secrets, over-broad tokens, and unpinned third-party actions.

sonnet · 5

Agent

Cloud Architect

Use this agent to design a cloud architecture on AWS, GCP, or Azure — compute, networking, data stores, IAM, and cost trade-offs. Examples — choosing serverless vs containers for a new service, designing a multi-account network boundary, picking a database and estimating its monthly cost.

sonnet · 3

Agent

DevOps Engineer

Use this agent for CI/CD, infrastructure, and automation. Examples — writing a CI pipeline, containerizing an app, infrastructure-as-code changes.

sonnet

Agent

Incident Responder

Use this agent during a live production incident to restore service fast and learn from it — triage and severity, mitigation-first action (roll back, fail over, shed load), change correlation, status updates, and the blameless postmortem. Examples — an alert just fired and the API is 5xx-ing, a deploy broke checkout and you need to decide rollback vs. forward-fix, latency is climbing and the pager is going off, or you're writing the postmortem the morning after.

opus · 4

Agent

Kubernetes Specialist

Use this agent for Kubernetes — manifests, Helm, troubleshooting, scaling, and resource tuning. Examples — debugging a CrashLoopBackOff, writing a Deployment, tuning requests/limits.

sonnet

Agent

SRE Engineer

Use this agent to make reliability measurable: SLIs/SLOs and error budgets, observability, symptom-based alerting, incident response, and capacity. Examples — defining an SLO for a checkout API, fixing a noisy pager, writing a blameless postmortem.

sonnet · 6

Agent

Terraform Specialist

Use this agent for Terraform and infrastructure-as-code — module design, remote state, plan/apply safety, drift, and provider pinning. Examples — reviewing a plan for destroys before apply, designing a reusable module, resolving state drift after a console change.

sonnet · 6

Skill

Rate Limiter Designer

Design and implement API rate limiting that actually holds under load — pick the algorithm (token bucket vs sliding-window-counter vs fixed window) and justify it, choose the limiting key and per-tier limits, use cross-instance atomic storage, and return standard 429 signals. Use when protecting an API from abuse or scrapers, enforcing per-tier quotas, or replacing an in-memory limiter that breaks behind multiple replicas.

invocable · v1.0.0
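
As a rough illustration of the token-bucket option named in this card, here is a minimal in-memory sketch. The class name `TokenBucket`, its fields, and the helper `retry_after_seconds` are hypothetical names chosen for this example; a real deployment behind multiple replicas would keep this state in shared atomic storage, which the sketch does not attempt.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    """Single-process token bucket (illustrative only, not cross-instance safe)."""
    capacity: float        # maximum burst size, in tokens
    refill_per_sec: float  # sustained rate, in tokens per second
    tokens: float          # tokens currently available
    updated_at: float      # timestamp of the last refill, in seconds

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Refill in proportion to elapsed time, capped at capacity.
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated_at = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


def retry_after_seconds(bucket: TokenBucket, cost: float = 1.0) -> float:
    """Seconds until `cost` tokens are available; a candidate Retry-After value for a 429."""
    missing = max(0.0, cost - bucket.tokens)
    return missing / bucket.refill_per_sec
```

The caller passes `now` explicitly so the logic stays deterministic in tests; in a service it would typically come from a monotonic clock.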

Skill

Connection Pool Tuner

Size and tune a database connection pool from the real constraint — the database's shared max_connections and its core count — so total connections (per-instance pool × instance count) stay safely under the cap and a too-large pool stops adding latency. Use when the app throws 'too many connections' or pool-acquire timeouts, when the DB is saturated by connection count, or when deploying to serverless.

invocable · v1.0.0
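
The cap arithmetic in this card (per-instance pool × instance count staying under max_connections) can be sketched as below. The function names, the `reserved` allowance for admin and migration sessions, and the 80% headroom default are assumptions made for illustration, not values taken from the skill itself.

```python
def max_pool_per_instance(
    max_connections: int,
    instance_count: int,
    reserved: int = 10,       # assumed allowance for admin/migration sessions
    headroom_pct: int = 80,   # assumed share of the remainder the app may use
) -> int:
    """Largest per-instance pool size that keeps the fleet under the database cap."""
    if instance_count < 1:
        raise ValueError("instance_count must be at least 1")
    usable = (max_connections - reserved) * headroom_pct // 100
    return max(0, usable // instance_count)


def total_connections(pool_size: int, instance_count: int) -> int:
    """Worst-case connections the fleet can open at once."""
    return pool_size * instance_count
```

A result of 0 signals that the fleet cannot fit under the cap at all, which usually points toward an external pooler or fewer instances rather than a smaller pool.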

Skill

Deadlock Diagnoser

Diagnose a database deadlock from the engine's own deadlock report, reconstruct the lock cycle (A holds 1 wants 2, B holds 2 wants 1), name the root cause — almost always two code paths locking the same rows in different orders — and fix it with consistent lock ordering, shorter transactions, and a retry-the-victim safeguard. Use when the DB logs deadlock errors, when transactions intermittently fail under load, or when queries mysteriously block each other.

invocable · v1.0.0

Skill

Migration Writer

Write a safe, reversible, zero-downtime database migration using expand-contract (add the new shape, backfill in batches, switch reads/writes, then drop the old) so every deploy stays compatible with the running app version. Use when adding or changing schema on a live system, renaming/dropping a column, adding NOT NULL or a foreign key on a large table, or when a migration risks locks, table rewrites, or an irreversible step.

invocable · v1.0.0

Skill

Query Plan Analyzer

Read a slow query's execution plan and turn it into a concrete fix — the exact index to add, the rewrite, or the ANALYZE to run — by getting the REAL plan with EXPLAIN ANALYZE (actual rows + timing, not estimates), finding the offending node, and confirming the fix removes it. Use when one specific query is slow and you need to know WHY, not just that it is.

invocable · v1.0.0

Skill

Runbook Writer

Write an operational runbook a half-asleep on-call engineer can execute at 3am — scoped to ONE alert, leading with how to confirm the problem, the copy-pasteable mitigation that stops user pain, then diagnosis, escalation, and verification. Use when an alert has no documented response, after an incident exposed a missing procedure, or when standing up on-call for a service.

invocable · v1.0.0

Skill

Alerting Rules Tuner

Cut alert noise and make every page mean something — rewrite alerting rules to fire on user-felt symptoms (error rate, latency SLO burn, failed requests) instead of causes (high CPU, full disk), with duration windows and severity routing so only urgent, actionable conditions reach a human. Use when on-call is fatigued by low-value pages, when real incidents get missed in the noise, or when alerts fire on causes rather than impact.

invocable · v1.0.0

Skill

Dashboard Designer

Design a service dashboard that answers one question at a glance — is the service healthy, and if not, where's the problem? — by structuring panels around RED/USE instead of dumping every metric. Use when a service has no dashboard, when the existing one is an unreadable metric wall, or during incident-readiness prep.

invocable · v1.0.0

Skill

Distributed Tracing Instrumenter

Instrument a service (or a chain of services) with OpenTelemetry so a single request can be followed end-to-end — context propagated across every hop including async/queue boundaries, spans at the boundaries that matter, deliberate trace-wide sampling, and trace_id stamped on log lines. Use when latency or failures span multiple services, when you have logs but can't reconstruct a request's full path, or when adopting OpenTelemetry.

invocable · v1.0.0

Skill

SLO Definer

Turn a vague reliability goal into concrete SLIs, SLOs, an error budget, and burn-rate alerts — service-level indicators measured at the user-facing boundary, targets over a rolling window, and a written policy for what happens when the budget runs out. Use when a service has no defined reliability target, when on-call is noisy and alert-fatigued, or before you commit to an SLA you can't measure.

invocable · v1.0.0
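
To make the error-budget and burn-rate terms in this card concrete, here is a small sketch of the usual arithmetic. The function names are hypothetical, and the definitions assume an availability-style SLO expressed as a fraction (for example 0.999) over a rolling window measured in days.

```python
def error_budget_fraction(slo_target: float) -> float:
    """Share of requests (or time) allowed to be bad, e.g. 0.001 for a 99.9% SLO."""
    return 1.0 - slo_target


def allowed_bad_minutes(slo_target: float, window_days: int) -> float:
    """Error budget expressed as minutes of full outage over the window."""
    return error_budget_fraction(slo_target) * window_days * 24 * 60


def burn_rate(observed_error_rate: float, slo_target: float) -> float:
    """How fast the budget is being spent; 1.0 means it lasts exactly one window."""
    return observed_error_rate / error_budget_fraction(slo_target)
```

Under these definitions, a sustained burn rate of 5 would exhaust a 30-day budget in roughly 6 days, which is the kind of threshold a burn-rate alert is built around.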

Skill

Structured Logging Designer

Design a structured (JSON) logging strategy with a stable field schema, correlation-ID propagation, and a disciplined level policy — then migrate ad-hoc string logs toward it. Use when logs are unsearchable plain text, when debugging a request across services means grepping multiple log streams by hand, or when standing up logging for a new service.

invocable · v1.0.0

Skill

Cold Start Optimizer

Cut cold-start latency for serverless functions and slow-booting apps by measuring the init breakdown, then attacking the dominant phase — artifact size, eager imports, eager connections, or under-provisioned memory — instead of reflexively buying provisioned concurrency. Use when serverless p99 spikes on the first request, when a function times out during init, or when scale-to-zero is hurting user-facing latency.

invocable · v1.0.0

Skill

Load Test Designer

Design a defensible load test — a realistic workload model, a deliberate test type, and SLO-tied pass/fail thresholds — instead of a meaningless tight-loop script that hammers one endpoint. Use when validating capacity or SLOs before a launch or scaling event, when sizing infrastructure, or when an existing load test reports averages that nobody trusts.

invocable · v1.0.0

Skill

Prompt Cache Optimizer

Restructure an LLM call to maximize prompt-cache hit rate and add response/semantic caching — move the stable prefix (system prompt, instructions, few-shot, context) to the front and variable input to the end, set cache breakpoints, and measure the hit rate and savings. Use when repeated calls share large common context and token cost or latency is too high.

invocable · v1.0.0

Skill

Dependency Upgrade Planner

Plan and de-risk a major dependency, framework, or runtime upgrade — map the full version path, read every intermediate migration guide, and pin the breaking changes to your actual call sites instead of bumping the number and hoping. Use when a key dependency is several majors behind, when a security advisory forces an upgrade, or before a framework migration.

invocable · v1.0.0

Skill

Canary Release Planner

Design a canary / progressive rollout so a bad release reaches 1% of users instead of 100% — staged traffic with bake times, gating metrics compared against the concurrently-running stable baseline, and automated promote-or-rollback. Use when shipping a risky change, when you want automatic rollback on regression, or when moving off all-at-once deploys.

invocable · v1.0.0

Skill

Release Notes Writer

Write user-facing release notes — the curated 'what's new and what it means for you' — by starting from the real changes (git log / merged PRs / the changelog since the last release) and translating developer-speak into user impact, grouped by what the user cares about with breaking changes and required actions surfaced first. Use when shipping a release to users or customers and the raw commit log isn't something a user should read, when you need a published GitHub-release / blog / in-app announcement, or when a breaking change must be made unmissable so upgrades don't break.

invocable · v1.0.0

Skill

SemVer Advisor

Decide the correct semantic-version bump — major, minor, or patch — by diffing a release range, mapping the changes onto the public API surface, and classifying each as breaking, additive, or a fix. Use before cutting a release when you are unsure whether changes are breaking, when a teammate proposes a bump you want to sanity-check, or when a behavior change has no signature change and you need to know if it is still breaking.

invocable · v1.0.0
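
The classification step in this card (breaking, additive, or fix) maps onto the bump roughly as follows. This is a sketch under stated assumptions: the `Change` enum and `next_version` function are hypothetical names, the input is a plain `X.Y.Z` string at 1.0.0 or later with no pre-release suffix, and deciding which class each change belongs to is the hard part that the code does not do.

```python
from enum import IntEnum


class Change(IntEnum):
    """Change classes ordered by severity, so max() picks the dominant one."""
    FIX = 1
    ADDITIVE = 2
    BREAKING = 3


def next_version(current: str, changes: list[Change]) -> str:
    """Return the next X.Y.Z given the classified changes in a release range."""
    major, minor, patch = (int(part) for part in current.split("."))
    if not changes:
        return current  # nothing to release
    worst = max(changes)
    if worst == Change.BREAKING:
        return f"{major + 1}.0.0"
    if worst == Change.ADDITIVE:
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"
```

A behavior change with no signature change would still be passed in as `Change.BREAKING` if callers relying on the old behavior would break.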

Skill

Version Bumper

Bump the project version everywhere it lives in one consistent pass — package.json, lockfile, nested/CLI package manifests, version constants, README badges, docs — then roll the changelog's Unreleased section under the new version and stage an annotated git tag. Use when you've already decided the new version (X.Y.Z or a pre-release like -rc.1) and need every artifact updated to the same value without drift, or before cutting a release.

invocable · v1.0.0

Skill

Dependency Audit

Audit project dependencies for known vulnerabilities and turn the raw scanner output into a triaged, prioritized upgrade plan. Use when an audit is noisy, a CVE was reported, or you need to know which advisories actually matter.

invocable · v1.0.0

Skill

License Compliance Checker

Audit the licenses of a project's dependencies for compatibility with how the project is distributed — flagging copyleft (GPL/AGPL/LGPL), missing or unknown licenses, and other obligations that conflict with your own license or SaaS/proprietary model. Use before shipping or open-sourcing, when adding a dependency, or when legal/procurement asks for a license inventory. This is a licensing review, not a vulnerability scan.

invocable · v1.0.0

Skill

Dev Container Designer

Design a reproducible dev environment (Dev Container / Docker) so onboarding is one command and 'works on my machine' dies — by detecting the project's real stack and versions, authoring a devcontainer.json (+ Dockerfile/compose) that pins the runtime to what the repo targets, wires dependent services, caches dependencies, and injects secrets instead of baking them. Use when new contributors struggle to set up the project, when environment drift causes inconsistent behavior, or when standardizing tooling across a team.

invocable · v1.0.0

Skill

Dockerfile Optimizer

Shrink and harden an existing Dockerfile — multi-stage builds, cache-friendly layer order, a lean pinned base image, a .dockerignore, and a non-root runtime user — without changing what the image runs. Use when an image is too large, builds are slow because the cache never hits, or a scan flags the container running as root.

invocable · v1.0.0

Skill

GitHub Actions Optimizer

Make a GitHub Actions workflow faster, cheaper, and harder to attack — by profiling where wall-clock and billed minutes actually go, then adding content-keyed caching, matrix/job parallelism, run-cancellation, and path filters, and hardening the supply chain (SHA-pinned actions, least-privilege GITHUB_TOKEN, safe fork-PR handling). Use when CI is slow or queues, when a repo burns Actions minutes, or before trusting a workflow that runs on untrusted pull requests.

invocable · v1.0.0

Guide

LLM API Pricing in 2026: Every Major Model Compared

Per-million-token prices for Claude, GPT, Gemini, DeepSeek, Mistral, and Grok — plus caching and batch discounts — verified against vendor pricing pages.

4m read · AgentsCamp

Guide

LLM Cost and Latency Engineering: Caching, Right-Sizing, and p95 Budgets

A practical playbook for cutting LLM cost and tail latency — caching, model right-sizing, prompt trimming, and enforced p95 budgets — without losing quality.

3m read · AgentsCamp

Guide

LLM Gateways Compared: Portkey vs Helicone vs LiteLLM for Caching & Cost Control

How Portkey, Helicone, and LiteLLM compare for caching, cost control, and observability — each one's 2026 status and which fits self-hosted vs. hosted.

4m read · AgentsCamp

Tool

Helicone

Open-source LLM observability and AI gateway with one-line integration — logging, tracing, caching, and cost/latency tracking across providers.

open source · observability

Tool

LiteLLM

Call 100+ LLM APIs with one OpenAI-format interface — as a Python library or a self-hosted gateway/proxy.

open source · sdk

Tool

Portkey

An AI gateway and LLMOps platform: route to many LLMs through one API with caching, retries, fallbacks, load balancing, guardrails, and full observability.

freemium · platform

Tool

Wave Terminal

Open-source terminal that blends the CLI with inline file previews, a built-in editor, a web browser, and a context-aware AI assistant in one window.

open source · terminal

Command

Add Caching

Add a caching layer to one expensive function or endpoint correctly — confirm it's cacheable, design the cache key/TTL/layer/invalidation, handle stampedes, wrap the call in one place, and report the design.

/add-caching <function or endpoint to cache>

Command

Set Perf Budget

Define and enforce a cost and latency budget for an LLM feature or endpoint — set p95/p99 latency and cost-per-request ceilings, instrument to measure them against real traffic, and wire a check that fails when the budget is breached.

/set-perf-budget <the LLM endpoint/feature to budget, plus any target numbers (e.g. 'chat API, p95 < 2s, < $0.02/req')>

Command

Scaffold Dockerfile

Scaffold a production-grade multi-stage Dockerfile and .dockerignore for the current project.

/scaffold-dockerfile <optional: stack/runtime hint>

Command

Scaffold GitHub Action

Scaffold a hardened GitHub Actions workflow for a stated goal, wired to the project's real test/lint/build commands.

/scaffold-github-action <what the workflow should do, e.g. CI test on PR, lint, release/publish, nightly cron>

Command

Setup Pre-commit Hooks

Set up fast pre-commit hooks that catch problems before they land — detect the repo's existing stack and hook mechanism, run lint/format/typecheck plus a secret scan on staged files only, keep the slow test suite in CI, and make the setup reproducible for the whole team.

/setup-precommit-hooks