# Programmatic Prompt Optimization with DSPy: Stop Hand-Tuning Prompts

> Hand-tuning prompts doesn't scale. DSPy treats prompting as programming — declare tasks as typed signatures and let an optimizer compile the prompts for you.

DSPy reframes prompting as programming: declare the task as a typed signature, compose modules, define a metric, and let an optimizer (BootstrapFewShot, MIPROv2, GEPA) search instructions and few-shot demonstrations against your data. The payoff is a prompt tuned to your metric that survives a model swap — you recompile instead of rewriting by hand.

Hand-tuning prompts is the part of LLM work that doesn't scale. You tweak a sentence, eyeball three outputs, decide it's "better," and ship — then a model upgrade silently undoes all of it and you start over. [DSPy](/tools/dspy) (from Stanford NLP) takes a different stance: treat an LLM pipeline as a **program you compile**, where you specify *what* each step does and an optimizer works out *how* to prompt for it against a metric you define.

## The core idea: specify, don't phrase

The shift is separating a task's **specification** from its **implementation**. In ordinary prompting those are the same thing — the prompt string *is* both the spec and the implementation, so improving it means editing text by feel. DSPy splits them:

- You write a **signature** — a typed declaration of inputs and outputs (`question -> answer`, or a class with field descriptions). That's the spec.
- DSPy generates the actual prompt from the signature, and an **optimizer** tunes that prompt's instructions and few-shot examples. That's the implementation, and you don't hand-write it.

So you stop arguing with prompt wording and start improving the things that actually determine quality: the task spec, the metric, and the data.
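To make the spec/implementation split concrete, here is a toy sketch in plain Python. It is not how DSPy builds prompts internally; `Spec`, `parse_signature`, and `render_prompt` are hypothetical helpers invented for illustration. The point is only that the signature string stays fixed while the rendered prompt (here, the `instructions` argument) is the part an optimizer is free to vary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Spec:
    """The task specification: field names only, no prompt wording."""
    inputs: tuple[str, ...]
    outputs: tuple[str, ...]


def _field_names(side: str) -> tuple[str, ...]:
    return tuple(name.strip() for name in side.split(",") if name.strip())


def parse_signature(signature: str) -> Spec:
    """Split a string like 'a, b -> c' into input and output field names."""
    left, right = signature.split("->")
    return Spec(inputs=_field_names(left), outputs=_field_names(right))


def render_prompt(spec: Spec, instructions: str, **values: str) -> str:
    """One possible rendering of the spec; the wording lives here, not in the spec."""
    lines = [instructions, ""]
    lines += [f"{name}: {values[name]}" for name in spec.inputs]
    lines += [f"{name}:" for name in spec.outputs]
    return "\n".join(lines)


spec = parse_signature("question -> answer")
prompt = render_prompt(spec, "Answer concisely.", question="What is DSPy?")
print(prompt)
```

Swapping `"Answer concisely."` for a different instruction changes the implementation without touching the spec, which is the degree of freedom the optimizer exploits.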

## The building blocks

- **Signatures**: declarative input→output specs such as `document -> summary`, with optional field descriptions and types.
- **Modules**: the strategies that turn a signature into a call: `dspy.Predict` (direct), `dspy.ChainOfThought` (reason first), `dspy.ReAct` (reason + tools). You compose them like layers in a network.
- **Metrics**: a function that scores a prediction against the expected output, returning a boolean or a number. This is the objective the optimizer maximizes, so it has to reflect the quality you actually care about.
- **Optimizers (teleprompters)**: `BootstrapFewShot` runs your program over the training data and keeps the traces that pass the metric as few-shot demonstrations; `MIPROv2` jointly searches instructions and demonstrations; `GEPA` reflectively evolves instructions from feedback. They compile your program into tuned prompts.

```python
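import dspy

# 0. configure a language model (the model name is an assumption; use one you have access to)
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

# 1. specify the task, don't phrase the prompt
classify = dspy.ChainOfThought("ticket -> category, urgency")

# 2. a metric that scores a prediction against the labeled example
def metric(example, pred, trace=None):
    return pred.category == example.category and pred.urgency == example.urgency

# 3. labeled examples (illustrative placeholder; a real trainset needs many more)
train = [
    dspy.Example(ticket="I was charged twice this month.", category="billing", urgency="high").with_inputs("ticket"),
]

# 4. let an optimizer compile the prompt + demos against your data
optimized = dspy.MIPROv2(metric=metric).compile(classify, trainset=train)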
```
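The exact-match metric above is all-or-nothing. A graded variant can give the optimizer a smoother signal; the sketch below is one way to write it, assuming the same `category` and `urgency` fields. It uses `SimpleNamespace` objects as stand-ins for DSPy examples and predictions so it runs without a model, and `triage_metric` is a hypothetical name. The `trace` branch follows the convention of returning a strict pass/fail when the metric is used to filter bootstrapped demonstrations and a numeric score otherwise.

```python
from types import SimpleNamespace


def _same(a: str, b: str) -> bool:
    # Ignore case and surrounding whitespace so formatting noise isn't penalized.
    return a.strip().lower() == b.strip().lower()


def triage_metric(example, pred, trace=None):
    """Score a prediction: 0.0, 0.5, or 1.0 depending on how many fields match."""
    category_ok = _same(pred.category, example.category)
    urgency_ok = _same(pred.urgency, example.urgency)
    score = (category_ok + urgency_ok) / 2
    if trace is not None:
        # When filtering demonstrations, accept only fully correct outputs.
        return score == 1.0
    return score


# Stand-ins for a labeled example and two model predictions (illustrative data).
gold = SimpleNamespace(category="billing", urgency="high")
exact = SimpleNamespace(category="Billing ", urgency="high")
partial = SimpleNamespace(category="billing", urgency="low")

print(triage_metric(gold, exact))
print(triage_metric(gold, partial))
```

Whether partial credit is appropriate depends on the task: if a wrong urgency is as costly as a wrong category, the stricter boolean metric is the more honest objective.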

## Why it's worth the ceremony

Two payoffs justify the upfront cost of a metric and a dataset:

1. **It often beats hand-tuning.** An optimizer tries instruction phrasings and example sets you wouldn't have the patience to try, and keeps only the ones that raise the metric.
2. **Portability.** When you switch models — a cheaper one, a newer one — you **recompile** against the new model instead of re-hand-tuning every prompt. Your prompts are no longer welded to one model's quirks.
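Both payoffs come from the same loop: score candidate prompts against a dev set and keep the winner. The sketch below is a deliberately tiny stand-in for that idea, not the algorithm behind `BootstrapFewShot`, `MIPROv2`, or `GEPA`. Everything in it is hypothetical: `compile_prompt` is an exhaustive search over three hand-written instructions, and `model_a` and `model_b` are fake deterministic "models" with different quirks, used so the example runs without an LLM.

```python
from typing import Callable, Sequence

Model = Callable[[str, str], str]       # (instruction, ticket text) -> raw output
Devset = Sequence[tuple[str, str]]      # (ticket text, expected label)


def exact_match(expected: str, predicted: str) -> float:
    return float(expected == predicted)


def compile_prompt(candidates: Sequence[str], model: Model, devset: Devset,
                   metric: Callable[[str, str], float]) -> str:
    """Return the candidate instruction with the highest mean metric on devset."""
    def mean_score(instruction: str) -> float:
        scores = [metric(expected, model(instruction, text)) for text, expected in devset]
        return sum(scores) / len(scores)
    return max(candidates, key=mean_score)


def _label(text: str) -> str:
    return "refund" if "money back" in text else "bug"


def model_a(instruction: str, text: str) -> str:
    # Fake model: only gives a bare label when asked for "one word".
    label = _label(text)
    return label if "one word" in instruction else f"The label is {label}."


def model_b(instruction: str, text: str) -> str:
    # Fake model with a different quirk: needs "label only" to stay terse.
    label = _label(text)
    return label if "label only" in instruction else f"Sure! {label}"


candidates = ["Classify the ticket.", "Answer in one word.", "Return the label only."]
devset = [("I want my money back.", "refund"), ("The app crashes on login.", "bug")]

best_for_a = compile_prompt(candidates, model_a, devset, exact_match)
best_for_b = compile_prompt(candidates, model_b, devset, exact_match)
print(best_for_a)
print(best_for_b)
```

The two fake models end up with different winning instructions from the same candidates, metric, and data. That is the portability argument in miniature: switching models means rerunning the search, not rewriting prompts by feel.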

> [!NOTE]
> DSPy is **evals-first**. It can't optimize what it can't measure, so the work moves from wording prompts to defining a metric that genuinely reflects quality and assembling a dataset that includes the hard cases. That's the same discipline behind the [prompt-engineer](/agents/data-ai/prompt-engineer) agent — DSPy just automates the inner loop.

## When to reach for it (and when not)

Use DSPy when you have a **multi-step pipeline**, **measurable quality**, and a task you'll **iterate on or re-tune across models**. Skip it for a single simple prompt, a one-off, or anything where you can't define a metric — there, hand-tuning or the [prompt-optimizer](/skills/workflow/prompt-optimizer) skill is faster, and the techniques in [Few-Shot vs Chain-of-Thought vs Structured Prompting](/guides/prompting/prompting-techniques-2026) cover what you'd be doing by hand. When the output needs to be machine-parseable, pair DSPy with the patterns in [Structured Output vs JSON Mode vs Function Calling](/guides/concepts/structured-output-2026).

---

_Source: https://agentscamp.com/guides/prompting/dspy-prompt-optimization — Guide on AgentsCamp._