Portkey
An AI gateway and LLMOps platform: route to many LLMs through one API with caching, retries, fallbacks, load balancing, guardrails, and full observability.
Portkey is an AI gateway and LLMOps platform: route to 1,600+ LLMs through one OpenAI-compatible API with simple and semantic caching, automatic retries, fallbacks, and load balancing — plus observability (logs, traces, cost and latency), prompt management, guardrails, virtual keys, and budgets. The fast routing gateway is open source (MIT) and self-hostable; the hosted control plane is freemium.
Portkey is an AI gateway paired with an LLMOps control plane. The gateway puts 1,600+ models behind one OpenAI-compatible API and adds the reliability and cost levers you'd otherwise build yourself — caching, retries, fallbacks, load balancing — while the hosted platform layers on observability, prompt management, and governance. It's aimed at teams who want one managed control point for all their LLM traffic, with caching and cost control built in rather than bolted on.
It earns its place in a cost-and-latency stack specifically: caching cuts the cost and latency of repeated calls, routing lets you right-size models per request, observability attributes spend per key/team, and virtual keys with budgets and rate limits cap runaway cost.
Highlights
- Unified API to 1,600+ LLMs — one OpenAI-compatible endpoint across 45+ providers; swap models by changing a string.
- Caching — both simple and semantic caching to cut repeat-call cost and latency.
- Reliability — automatic retries, fallbacks across providers, and load balancing across keys.
- Observability — logs, traces, and cost/latency metrics per request, key, and team.
- Governance — virtual keys, per-team budgets, rate limits, and 50+ guardrails.
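The reliability and caching features above are usually combined in one gateway config that travels with the request. The following is a minimal sketch in Python, assuming Portkey's JSON config shape (`strategy`, `targets`, `retry`, `cache`). The helper names and the virtual key names are hypothetical, and the field names should be checked against the current Portkey config reference before use.

```python
import json


def build_gateway_config(primary_key, backup_key, retry_attempts=3,
                         cache_mode="semantic", cache_max_age_s=3600):
    """Build a config dict: try the primary target, fall back to the backup.

    Field names are an assumption based on Portkey's documented config shape.
    """
    if cache_mode not in ("simple", "semantic"):
        raise ValueError("cache_mode must be 'simple' or 'semantic'")
    return {
        # Fallback: targets are tried in order until one succeeds.
        "strategy": {"mode": "fallback"},
        "targets": [
            {"virtual_key": primary_key},
            {"virtual_key": backup_key},
        ],
        # Retry failed calls before moving on.
        "retry": {"attempts": retry_attempts},
        # Serve repeated (or, in semantic mode, similar) prompts from cache.
        "cache": {"mode": cache_mode, "max_age": cache_max_age_s},
    }


def as_header_value(config):
    """Serialize the config compactly so it can be sent as a header value."""
    return json.dumps(config, separators=(",", ":"))


# Hypothetical virtual key names, for illustration only.
config = build_gateway_config("openai-prod", "anthropic-backup")
print(as_header_value(config))
```

The serialized string would typically be sent in an `x-portkey-config` request header, or the config saved in the dashboard and referenced by ID; treat both as assumptions to confirm in the docs.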
In an AI-assisted workflow
# OpenAI-compatible: point your existing client at the gateway
curl https://api.portkey.ai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "x-portkey-api-key: $PORTKEY_API_KEY" \
  -d '{"model":"anthropic/claude","messages":[{"role":"user","content":"hi"}]}'

Most SDKs work by swapping the base URL and adding Portkey's header, so adoption is a config change.
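The same call can be built from Python with only the standard library. This sketch reuses the endpoint, header name, and model string from the curl example. The function name and the placeholder key are illustrative, and the actual send is left commented out so the snippet runs without network access.

```python
import json
import os
import urllib.request

GATEWAY_URL = "https://api.portkey.ai/v1/chat/completions"


def build_chat_request(prompt, api_key, model="anthropic/claude"):
    """Build (but do not send) an OpenAI-style chat request to the gateway."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        GATEWAY_URL,
        data=body,
        headers={
            "Content-Type": "application/json",
            "x-portkey-api-key": api_key,
        },
        method="POST",
    )


# Falls back to a placeholder so the example runs without a real key.
req = build_chat_request("hi", os.environ.get("PORTKEY_API_KEY", "placeholder-key"))
print(req.get_method(), req.full_url)

# To actually send it (needs a valid key and network access):
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```

With an OpenAI-compatible SDK the equivalent change is usually just the client's base URL plus the extra header, leaving the rest of the calling code untouched.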
TIP
Turn on semantic caching for workloads with repetitive or near-duplicate prompts (FAQs, classification, retrieval-augmented answers): it serves a cached response for semantically similar inputs, cutting both spend and p95 latency. Measure the hit rate so you know it's paying off.
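One way to check that caching is paying off is to compute the hit rate and a rough savings figure from request logs. The sketch below assumes a hypothetical record shape (a `cache_hit` flag and a `cost_usd` field); Portkey's actual log export fields may differ. Savings are estimated by pricing each hit at the average cost of a miss, which is a simplification.

```python
def cache_report(records):
    """Summarize cache effectiveness from a list of request records.

    Each record is a dict with a boolean 'cache_hit' and a float 'cost_usd'
    (hypothetical field names). Only misses are assumed to cost money.
    """
    total = len(records)
    if total == 0:
        return {"hit_rate": 0.0, "spend_usd": 0.0, "estimated_savings_usd": 0.0}

    misses = [r for r in records if not r["cache_hit"]]
    hits = total - len(misses)
    spend = sum(r["cost_usd"] for r in misses)
    # Rough estimate: each hit avoided one average-priced provider call.
    avg_miss_cost = spend / len(misses) if misses else 0.0

    return {
        "hit_rate": hits / total,
        "spend_usd": spend,
        "estimated_savings_usd": hits * avg_miss_cost,
    }


# Made-up sample data, for illustration only.
logs = [
    {"cache_hit": False, "cost_usd": 0.5},
    {"cache_hit": True, "cost_usd": 0.0},
    {"cache_hit": True, "cost_usd": 0.0},
    {"cache_hit": False, "cost_usd": 0.25},
]
print(cache_report(logs))
```

Tracking this per workload rather than globally makes it easier to see which prompts are repetitive enough to benefit from caching.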
Good to know
The Portkey gateway is open source (MIT) and self-hostable from its repo; the hosted platform is freemium, with a free tier for prototyping, a paid production tier, and enterprise plans with governance and compliance. As a gateway it sits in your request path and handles your provider keys, so treat it as infrastructure you either operate yourself or must trust. Palo Alto Networks completed its acquisition of Portkey in May 2026, folding the gateway into its enterprise AI-security platform; Portkey continues as an actively developed product. Compare LiteLLM (a library or self-hosted proxy) and the observability-first Helicone in LLM Gateways Compared.
Related
- LLM Gateways Compared: Portkey vs Helicone vs LiteLLM for Caching & Cost Control. How Portkey, Helicone, and LiteLLM compare for caching, cost control, and observability, with each one's 2026 status and which fits self-hosted vs. hosted.
- Helicone. Open-source LLM observability and AI gateway with one-line integration: logging, tracing, caching, and cost/latency tracking across providers.
- LiteLLM. Call 100+ LLM APIs with one OpenAI-format interface, as a Python library or a self-hosted gateway/proxy.
- LLM Cost and Latency Engineering: Caching, Right-Sizing, and p95 Budgets. A practical playbook for cutting LLM cost and tail latency (caching, model right-sizing, prompt trimming, and enforced p95 budgets) without losing quality.
- LLM Cost Optimizer. Use this agent to cut the cost and latency of an application's LLM API usage without losing quality: audit where the tokens and dollars go, then apply caching, model right-sizing, prompt trimming, batching, and budgets, proven against an eval bar. Examples: "our OpenAI bill tripled, find where the spend is and cut it", "this endpoint's p95 is 8s, bring it down", "right-size models per task and add prompt caching to our chat feature".