Skip to content
agentscamp
Tool

vLLM

A high-throughput, memory-efficient inference and serving engine for LLMs, with PagedAttention, continuous batching, and an OpenAI-compatible API server.

open sourcesdk
Updated Jun 4, 2026
llminferenceservinggpuopen-source

vLLM is an open-source inference and serving engine built to run open-weight LLMs with high throughput and efficient GPU memory use. Its headline innovation, PagedAttention, manages the KV cache like virtual memory so the GPU wastes far less on fragmentation and padding — which, combined with continuous (in-flight) batching, keeps the hardware saturated and pushes far more tokens per second than naive serving. It's the engine most teams reach for when self-hosting an LLM for production traffic.

It is aimed at engineers serving open models at scale who need concurrency, low cost-per-token, and a drop-in API. vLLM ships an OpenAI-compatible server, so existing client code can point at your self-hosted model by changing a base URL.

Highlights

  • PagedAttention — KV-cache management that minimizes memory waste and enables high concurrency.
  • Continuous batching — new requests join the batch in flight instead of waiting, so the GPU isn't idle between requests.
  • OpenAI-compatible API — serve /v1/chat/completions and friends; existing OpenAI clients work by swapping the base URL.
  • Quantization & parallelism — supports AWQ/GPTQ/FP8 and tensor/pipeline parallelism to fit large models and trade quality for footprint.
  • Broad model support — runs most popular open architectures (Llama, Mistral, Qwen, Gemma, and more).

In an AI-assisted workflow

Serve a model with an OpenAI-compatible endpoint, then call it like any OpenAI client:

# start the server (single GPU; add --tensor-parallel-size N for multi-GPU)
vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 8192
 
# now hit it with any OpenAI client, just change the base URL:
#   base_url="http://localhost:8000/v1"

TIP

Most of vLLM's throughput comes from keeping the batch full — tune --max-num-seqs, --gpu-memory-utilization, and (for big models) --tensor-parallel-size to your GPU and SLO. The llm-inference-engineer tunes these against a p95 and cost target; Scaffold a vLLM Serving Config gets you a sane starting config.

Good to know

vLLM is free and open source under Apache-2.0 and targets Linux with NVIDIA (and other) accelerators — it's production-serving infrastructure, not a local desktop app. For running a model locally on a laptop for development, Ollama or LM Studio are the simpler fit; weigh self-hosting against a hosted API in Self-Host vs API.

Related