DSPy
Program language models instead of prompting them: declare tasks as typed signatures and let optimizers compile the prompts and few-shot examples for you.
DSPy (from Stanford NLP) lets you build LLM pipelines as Python code rather than brittle prompt strings. You declare each step as a typed signature, compose modules like ChainOfThought and ReAct, then run an optimizer (BootstrapFewShot, MIPROv2, GEPA) that searches instructions and few-shot demonstrations against your metric and data. Change models and you recompile, not rewrite.
DSPy is a framework for programming language models rather than prompting them. Instead of hand-writing and hand-tuning prompt strings, you declare what each step of a pipeline does as a typed signature, compose those steps with modules, and let an optimizer generate and tune the actual prompts (instructions and few-shot examples) against a metric you define. It comes out of Stanford NLP and is a widely used reference point for treating prompts as something you compile, not craft.
It is aimed at developers building LLM pipelines whose quality is measurable and who are tired of the hand-tuning treadmill — especially multi-step systems (retrieve → reason → answer) where prompt changes ripple and a model upgrade silently undoes weeks of tweaking.
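To make "declare a signature and let the framework write the prompt" concrete, here is a toy sketch in plain Python. It does not use DSPy and does not reproduce DSPy's real prompt format: `ToySignature`, `parse_signature`, and `render_prompt` are hypothetical names invented for this illustration, and the prompt wording is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToySignature:
    """Hypothetical stand-in for a signature: named inputs and outputs."""
    inputs: tuple[str, ...]
    outputs: tuple[str, ...]


def parse_signature(spec: str) -> ToySignature:
    """Parse a string such as 'ticket -> category, urgency'."""
    left, sep, right = spec.partition("->")
    if not sep:
        raise ValueError(f"expected 'inputs -> outputs', got {spec!r}")
    inputs = tuple(name.strip() for name in left.split(",") if name.strip())
    outputs = tuple(name.strip() for name in right.split(",") if name.strip())
    if not inputs or not outputs:
        raise ValueError("a signature needs at least one input and one output")
    return ToySignature(inputs, outputs)


def render_prompt(sig: ToySignature, **values: str) -> str:
    """Build a prompt from the spec; the wording here is illustrative only."""
    missing = [name for name in sig.inputs if name not in values]
    if missing:
        raise ValueError(f"missing inputs: {missing}")
    lines = [f"Given {', '.join(sig.inputs)}, produce {', '.join(sig.outputs)}."]
    lines += [f"{name}: {values[name]}" for name in sig.inputs]
    lines += [f"{name}:" for name in sig.outputs]  # slots for the model to fill
    return "\n".join(lines)


sig = parse_signature("ticket -> category, urgency")
print(render_prompt(sig, ticket="Checkout page returns a 500 error"))
```

The point of the sketch is the division of labor: the caller only names fields, and the prompt text is derived from them, so the text can later be regenerated or tuned without touching the calling code.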
Highlights
- Signatures: declare a task as typed inputs → outputs (`question -> answer`); DSPy generates the prompt from the spec.
- Modules: compose strategies like `dspy.Predict`, `dspy.ChainOfThought`, and `dspy.ReAct` into a pipeline that's ordinary Python.
- Optimizers: `BootstrapFewShot`, `MIPROv2`, and `GEPA` search demonstrations and instruction wording against your metric, often beating hand-tuned prompts.
- Portability: change models and recompile instead of re-hand-tuning every prompt.
- Evals-first: optimization is driven by a metric and example data, so quality is measured, not eyeballed.
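The idea of searching demonstrations against a metric can be sketched without any library. The toy below is not how DSPy's optimizers work internally; it is an exhaustive search over few-shot demo sets, scored on a small dev set. The data, `fake_model` (a word-overlap stand-in for an LLM), and `search_demos` are all hypothetical.

```python
from itertools import combinations

# Hypothetical labelled data: (ticket text, category).
CANDIDATE_DEMOS = [
    ("refund not received", "billing"),
    ("app crashes on login", "bug"),
    ("charged twice this month", "billing"),
    ("how do i export my data", "how-to"),
]
DEV_SET = [
    ("customer was charged after cancelling", "billing"),
    ("the app crashes when opening settings", "bug"),
    ("how to export invoices", "how-to"),
]


def fake_model(demos, ticket):
    """Stand-in for an LLM: copy the label of the demo sharing the most words."""
    words = set(ticket.lower().split())
    scored = [(len(words & set(text.split())), label) for text, label in demos]
    overlap, label = max(scored, key=lambda pair: pair[0])
    return label if overlap > 0 else "unknown"


def accuracy(demos, dataset):
    """Metric: fraction of examples the stand-in model labels correctly."""
    hits = sum(fake_model(demos, text) == label for text, label in dataset)
    return hits / len(dataset)


def search_demos(candidates, dataset, k):
    """Try every k-sized demo set and keep the one that scores best."""
    return max(combinations(candidates, k), key=lambda demos: accuracy(demos, dataset))


best_demos = search_demos(CANDIDATE_DEMOS, DEV_SET, k=3)
print(best_demos, accuracy(best_demos, DEV_SET))
```

Real optimizers replace the brute-force loop with smarter search and also tune instruction wording, but the contract is the same: a program, a dataset, and a metric go in, and the best-scoring prompt configuration comes out.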
In an AI-assisted workflow
import dspy
classify = dspy.ChainOfThought("ticket -> category, urgency") # specify, don't phrase
optimized = dspy.MIPROv2(metric=metric).compile(classify, trainset=train) # compile the prompt
You specify the task and the metric; the optimizer figures out the prompt.
TIP
DSPy can't optimize what it can't measure. Invest first in a metric that genuinely reflects quality and a dataset that includes the hard cases — that's where the leverage is. See Programmatic Prompt Optimization with DSPy.
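As a rough illustration of such a metric, the sketch below follows the `(example, pred, trace=None)` shape that DSPy metrics conventionally take and scores the `category` and `urgency` fields from the ticket example. The function name `triage_metric` and the 0.75/0.25 weighting are assumptions made for illustration, and `SimpleNamespace` objects stand in for real examples and predictions so the snippet runs on its own.

```python
from types import SimpleNamespace


def triage_metric(example, pred, trace=None):
    """Score a prediction against a labelled example, from 0.0 to 1.0.

    Assumed weighting: the category counts for 0.75, the urgency for 0.25.
    `trace` is accepted for compatibility but unused in this sketch.
    """
    category_ok = pred.category.strip().lower() == example.category.strip().lower()
    urgency_ok = pred.urgency.strip().lower() == example.urgency.strip().lower()
    return 0.75 * category_ok + 0.25 * urgency_ok


# Stand-ins for a labelled example and a model prediction.
gold = SimpleNamespace(category="billing", urgency="high")
guess = SimpleNamespace(category="Billing ", urgency="low")
print(triage_metric(gold, guess))
```

Partial credit like this gives an optimizer a smoother signal than a strict pass/fail, but the weights encode a judgment about what matters, so they deserve the same scrutiny as the dataset.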
Good to know
DSPy is open source (MIT) and free; you pay your model provider for tokens during compilation and at runtime. It's a Python framework, so it fits Python-based LLM stacks most naturally. It's most worth its complexity on multi-step pipelines with measurable quality — for a single simple prompt, hand-tuning or the prompt-optimizer skill is faster. Background on the techniques it automates: Few-Shot vs Chain-of-Thought vs Structured Prompting.
Related
- Programmatic Prompt Optimization with DSPy: Stop Hand-Tuning Prompts
  Hand-tuning prompts doesn't scale. DSPy treats prompting as programming: declare tasks as typed signatures and let an optimizer compile the prompts for you.
- Few-Shot vs Chain-of-Thought vs Structured Prompting: What to Use When (2026)
  When to reach for few-shot examples, chain-of-thought reasoning, or structured/output-constrained prompting: a 2026 decision guide to the core techniques.
- Prompt Optimizer
  Diagnose why a prompt underperforms and rewrite it with the technique that fixes it (clearer structure, few-shot examples, an explicit output contract, or reasoning scaffolding), returning an optimized prompt, the rationale for every change, and what to measure to confirm the lift. Use when a prompt is flaky, verbose, drifting in format, or just not good enough.