vLLM — AgentsCamp

vLLM is an open-source inference and serving engine for open-weight LLMs with high throughput on GPUs. PagedAttention manages the KV cache like virtual memory and continuous batching keeps hardware saturated, while an OpenAI-compatible server means existing clients work by swapping the base URL — the default engine for self-hosted production serving.

vLLM is an open-source inference and serving engine built to run open-weight LLMs with high throughput and efficient GPU memory use. Its headline innovation, PagedAttention, manages the KV cache like virtual memory so the GPU wastes far less on fragmentation and padding — which, combined with continuous (in-flight) batching, keeps the hardware saturated and pushes far more tokens per second than naive serving. It's the engine most teams reach for when self-hosting an LLM for production traffic.

It is aimed at engineers serving open models at scale who need concurrency, low cost-per-token, and a drop-in API. vLLM ships an OpenAI-compatible server, so existing client code can point at your self-hosted model by changing a base URL.

Highlights

PagedAttention — KV-cache management that minimizes memory waste and enables high concurrency.
Continuous batching — new requests join the batch in flight instead of waiting, so the GPU isn't idle between requests.
OpenAI-compatible API — serve /v1/chat/completions and friends; existing OpenAI clients work by swapping the base URL.
Quantization & parallelism — supports AWQ/GPTQ/FP8 and tensor/pipeline parallelism to fit large models and trade quality for footprint.
Broad model support — runs most popular open architectures (Llama, Mistral, Qwen, Gemma, and more).

In an AI-assisted workflow

Serve a model with an OpenAI-compatible endpoint, then call it like any OpenAI client:

# start the server (single GPU; add --tensor-parallel-size N for multi-GPU)
vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 8192
 
# now hit it with any OpenAI client, just change the base URL:
#   base_url="http://localhost:8000/v1"

TIP

Most of vLLM's throughput comes from keeping the batch full — tune --max-num-seqs, --gpu-memory-utilization, and (for big models) --tensor-parallel-size to your GPU and SLO. The llm-inference-engineer tunes these against a p95 and cost target; Scaffold a vLLM Serving Config gets you a sane starting config.

Good to know

vLLM is free and open source under Apache-2.0 and targets Linux with NVIDIA (and other) accelerators — it's production-serving infrastructure, not a local desktop app. For running a model locally on a laptop for development, Ollama or LM Studio are the simpler fit; weigh self-hosting against a hosted API in Self-Host vs API.

Frequently asked questions

What is vLLM?

vLLM is an open-source inference and serving engine built to run open-weight LLMs with high throughput and efficient GPU memory use. Its headline innovation, PagedAttention, manages the KV cache like virtual memory to minimize fragmentation, and continuous batching lets new requests join in flight — together pushing far more tokens per second than naive serving.

How do I serve a model with vLLM?

Run vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 8192 to start an OpenAI-compatible server, then point any OpenAI client at http://localhost:8000/v1. For multi-GPU, add --tensor-parallel-size N; tune --max-num-seqs and --gpu-memory-utilization for throughput.

vLLM vs Ollama?

Different jobs. vLLM is production-serving infrastructure for Linux/GPU deployments that need concurrency and low cost-per-token; Ollama and LM Studio are the simpler fit for running a model locally on a laptop for development. vLLM is what you move to when many concurrent users hit a self-hosted model.

Is vLLM free?

Yes — free and open source under Apache-2.0. It targets Linux with NVIDIA (and other) accelerators; you provide the GPUs.

Highlights

In an AI-assisted workflow

Good to know

Frequently asked questions

Related