Scaffold a vLLM Serving Config
Scaffold a vLLM serving config for a model on a target GPU — pick precision/quantization and parallelism to fit, set batching and context length, and expose an OpenAI-compatible server.
/scaffold-vllm-config<model + target GPU(s) and VRAM, or a description of the serving workload>Install to ~/.claude/commands/scaffold-vllm-config.md
Scope
Treat $ARGUMENTS as what to serve: a model (id/size), the target GPU(s) and VRAM, and ideally the workload shape (chat vs. batch, prompt/response lengths, target concurrency). If the GPU/VRAM isn't given, ask — it determines whether the model fits at all and at what precision.
Goal: a runnable, fits-the-GPU vLLM serving config with an OpenAI-compatible endpoint — sane defaults a human can then load-test and tune, not a guess that OOMs on first launch.
NOTE
This scaffolds a starting config; it does not load-test or tune to an SLO. For benchmarking and tuning throughput/p95 against a budget, hand off to the llm-inference-engineer. For local single-user running, Ollama is simpler than vLLM.
Step 1 — Size the model against the GPU
Estimate the model's memory at candidate precisions (FP16/BF16 vs. FP8 vs. AWQ/GPTQ int4) plus KV-cache headroom for your context length and concurrency. Decide whether it fits one GPU or needs tensor parallelism (--tensor-parallel-size N). State the assumption.
Step 2 — Choose precision/quantization
Pick the highest precision that fits with headroom; drop to FP8 or 4-bit quantization only as needed to fit, and flag that quantization can affect quality so it gets re-checked against an eval set, not assumed safe.
Step 3 — Set the core serving flags
Produce the vllm serve invocation (or equivalent config) with the parameters that matter:
--max-model-len— context length sized to your prompts (don't over-allocate; it costs KV-cache memory).--gpu-memory-utilization— how much VRAM vLLM may use (leave headroom).--max-num-seqs— concurrency / batch width.--tensor-parallel-size— for multi-GPU models.- quantization flag if used.
Step 4 — Expose the OpenAI-compatible endpoint
Confirm the server exposes /v1/chat/completions (and /v1/completions) so existing OpenAI clients work by changing the base URL. Note the host/port and any served-model-name.
Step 5 — Emit the config and a smoke test
Output the final command/config plus a one-line curl (or OpenAI-client snippet) to verify the endpoint responds, and the env/launch notes (GPU visibility, model download/cache).
WARNING
The two failure modes to pre-empt: an out-of-memory crash on launch (precision/context/concurrency too high for the VRAM) and a silent quality drop from quantization. Size conservatively with KV-cache headroom, and re-run your eval set after any quantization before trusting the deployment — see vLLM.
Related
- vLLMA high-throughput, memory-efficient inference and serving engine for LLMs, with PagedAttention, continuous batching, and an OpenAI-compatible API server.
- LLM Inference EngineerUse this agent to serve and optimize self-hosted LLM inference — sizing GPUs, configuring a serving engine like vLLM (continuous batching, PagedAttention, tensor parallelism), applying quantization, and tuning throughput and tail latency against a cost and p95 budget. Examples — "serve Llama-3-70B at p95 under 2s on our GPUs", "our self-hosted model is slow and the GPUs sit half-idle — raise throughput", "quantize this model to fit one GPU without wrecking quality".
- Self-Host vs API: When Does Running Your Own LLM Actually Pay Off?The real economics of self-hosting an LLM vs. calling a hosted API — GPU utilization, privacy, latency, and the hidden ops costs that decide the crossover.
- OllamaAn open-source tool to run open-weight LLMs locally with a single command, including a local OpenAI-compatible API.