vLLM
A high-throughput, memory-efficient inference and serving engine for LLMs, with PagedAttention, continuous batching, and an OpenAI-compatible API server.
vLLM is an open-source inference and serving engine built to run open-weight LLMs with high throughput and efficient GPU memory use. Its headline innovation, PagedAttention, manages the KV cache like virtual memory so the GPU wastes far less on fragmentation and padding — which, combined with continuous (in-flight) batching, keeps the hardware saturated and pushes far more tokens per second than naive serving. It's the engine most teams reach for when self-hosting an LLM for production traffic.
It is aimed at engineers serving open models at scale who need concurrency, low cost-per-token, and a drop-in API. vLLM ships an OpenAI-compatible server, so existing client code can point at your self-hosted model by changing a base URL.
Highlights
- PagedAttention — KV-cache management that minimizes memory waste and enables high concurrency.
- Continuous batching — new requests join the batch in flight instead of waiting, so the GPU isn't idle between requests.
- OpenAI-compatible API — serve
/v1/chat/completionsand friends; existing OpenAI clients work by swapping the base URL. - Quantization & parallelism — supports AWQ/GPTQ/FP8 and tensor/pipeline parallelism to fit large models and trade quality for footprint.
- Broad model support — runs most popular open architectures (Llama, Mistral, Qwen, Gemma, and more).
In an AI-assisted workflow
Serve a model with an OpenAI-compatible endpoint, then call it like any OpenAI client:
# start the server (single GPU; add --tensor-parallel-size N for multi-GPU)
vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 8192
# now hit it with any OpenAI client, just change the base URL:
# base_url="http://localhost:8000/v1"TIP
Most of vLLM's throughput comes from keeping the batch full — tune --max-num-seqs, --gpu-memory-utilization, and (for big models) --tensor-parallel-size to your GPU and SLO. The llm-inference-engineer tunes these against a p95 and cost target; Scaffold a vLLM Serving Config gets you a sane starting config.
Good to know
vLLM is free and open source under Apache-2.0 and targets Linux with NVIDIA (and other) accelerators — it's production-serving infrastructure, not a local desktop app. For running a model locally on a laptop for development, Ollama or LM Studio are the simpler fit; weigh self-hosting against a hosted API in Self-Host vs API.
Related
- LLM Inference EngineerUse this agent to serve and optimize self-hosted LLM inference — sizing GPUs, configuring a serving engine like vLLM (continuous batching, PagedAttention, tensor parallelism), applying quantization, and tuning throughput and tail latency against a cost and p95 budget. Examples — "serve Llama-3-70B at p95 under 2s on our GPUs", "our self-hosted model is slow and the GPUs sit half-idle — raise throughput", "quantize this model to fit one GPU without wrecking quality".
- Self-Host vs API: When Does Running Your Own LLM Actually Pay Off?The real economics of self-hosting an LLM vs. calling a hosted API — GPU utilization, privacy, latency, and the hidden ops costs that decide the crossover.
- Scaffold a vLLM Serving ConfigScaffold a vLLM serving config for a model on a target GPU — pick precision/quantization and parallelism to fit, set batching and context length, and expose an OpenAI-compatible server.
- OllamaAn open-source tool to run open-weight LLMs locally with a single command, including a local OpenAI-compatible API.
- LM StudioA desktop app for discovering, downloading, and running open-weight LLMs locally with a GUI and a local OpenAI-compatible server.
- UnslothAn open-source library that makes LoRA/QLoRA fine-tuning of LLMs roughly 2x faster and far more memory-efficient, so you can fine-tune on a single GPU.