Token Streaming

Token streaming sends a model's response as it is generated, token by token over Server-Sent Events (SSE) or WebSockets, so the consumer can render output immediately rather than waiting for completion.

It exists because inference is sequential: the model produces one token at a time, and a long answer takes real seconds. Streaming doesn't make generation faster; it changes when the wait ends, shifting the felt metric from total time to time-to-first-token (TTFT). That is why nearly every chat product streams and why TTFT is a first-class latency number alongside tokens-per-second.
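To make the two latency numbers concrete, here is a minimal sketch that times any iterable of tokens. The function name `measure_stream` and the injectable `clock` parameter are illustrative assumptions, not part of any provider SDK; a real client would pass the iterator its SDK returns.

```python
import time
from typing import Callable, Iterable, Optional


def measure_stream(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> dict:
    """Consume a token stream and report TTFT, total time, and decode rate."""
    start = clock()
    first_at: Optional[float] = None
    count = 0
    for _ in tokens:
        now = clock()
        if first_at is None:
            first_at = now  # arrival of the first token
        count += 1
    end = clock()

    ttft = None if first_at is None else first_at - start
    # Decode rate covers only tokens after the first, so that the initial
    # wait (queueing, prompt processing) does not drag the rate down.
    decode_time = None if first_at is None else end - first_at
    rate = (count - 1) / decode_time if decode_time and count > 1 else None
    return {
        "tokens": count,
        "ttft_s": ttft,
        "total_s": end - start,
        "tokens_per_s": rate,
    }
```

Reporting the decode rate separately from TTFT keeps the two numbers independent: one describes how long the user stares at nothing, the other how fast text arrives once it starts.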

Engineering-wise, the happy path is easy (providers ship SSE out of the box; scaffolding the endpoint is rote) and the edges are where care goes: structured output arrives in fragments (buffer or parse incrementally), tool calls stream as deltas, mid-stream errors leave partial responses to handle, and UI rendering wants throttling so the token rate doesn't thrash the DOM. In agent systems the benefit compounds: each step's output streams into view as it happens, which is how long-running agents stay legible instead of silent.
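Two of those edges can be sketched in a few lines of Python. The first function reassembles SSE events from text chunks that may split anywhere, including mid-line; the second coalesces tokens so a renderer is updated at most once per interval. Both names (`iter_sse_data`, `coalesce`) are hypothetical, and the parser is deliberately partial: it assumes `\n` or `\r\n` line endings and ignores every field except `data`.

```python
import time
from typing import Callable, Iterable, Iterator, List


def iter_sse_data(chunks: Iterable[str]) -> Iterator[str]:
    """Yield each event's data payload from arbitrarily split SSE text."""
    buffer = ""
    data_lines: List[str] = []
    for chunk in chunks:
        buffer += chunk
        # Only complete lines are processed; a trailing partial line waits.
        while "\n" in buffer:
            line, buffer = buffer.split("\n", 1)
            line = line.rstrip("\r")
            if line == "":
                # A blank line terminates the event.
                if data_lines:
                    yield "\n".join(data_lines)
                    data_lines = []
            elif line.startswith(":"):
                continue  # comment line, often used as a keep-alive
            elif line.startswith("data:"):
                value = line[5:]
                # One leading space after the colon is not part of the value.
                data_lines.append(value[1:] if value.startswith(" ") else value)
            # Other fields (event, id, retry) are ignored in this sketch.


def coalesce(
    tokens: Iterable[str],
    min_interval_s: float = 0.05,
    clock: Callable[[], float] = time.monotonic,
) -> Iterator[str]:
    """Batch tokens so the consumer sees at most one update per interval."""
    pending: List[str] = []
    last_flush = clock()
    for token in tokens:
        pending.append(token)
        now = clock()
        if now - last_flush >= min_interval_s:
            yield "".join(pending)
            pending = []
            last_flush = now
    if pending:
        yield "".join(pending)  # flush whatever is left at end of stream
```

Note that `coalesce` only flushes when a token arrives, so it has no timer of its own; a browser UI would more likely batch on an animation frame, but the idea of decoupling render rate from token rate is the same.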

Frequently asked questions

Why stream tokens instead of waiting for the full response?

Perceived latency. Generation takes as long as it takes, but streaming moves the user's wait from 'full response time' to 'time-to-first-token' — often 10x shorter. For chat and agents, watching text arrive is the difference between feeling instant and feeling broken; nothing about the total time changed, only when value starts arriving.

What's tricky about consuming a token stream?

Structure. Plain prose streams trivially; JSON and tool calls arrive as fragments that only parse when complete — so consumers buffer structured segments, parse incrementally, or use provider events that delimit them. Error handling changes too: a stream can fail mid-response, so handle partial output and aborts deliberately.
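As a rough illustration of both points, the sketch below buffers tool-call argument fragments until a delimiting event says they are complete, and keeps whatever text arrived if the stream dies partway. The event shapes (`text`, `tool_delta`, `tool_end`, `done`) are invented for the example; real providers use their own event names and payloads.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, List, Optional


@dataclass
class StreamResult:
    text: str = ""
    tool_args: Dict[int, Any] = field(default_factory=dict)
    complete: bool = False       # True only if the stream finished cleanly
    error: Optional[str] = None  # set when the stream failed mid-response


def consume(events: Iterable[dict]) -> StreamResult:
    """Accumulate a stream of hypothetical events into one result."""
    result = StreamResult()
    text_parts: List[str] = []
    arg_buffers: Dict[int, List[str]] = {}
    try:
        for event in events:
            kind = event["type"]
            if kind == "text":
                text_parts.append(event["delta"])
            elif kind == "tool_delta":
                # JSON fragments are not parseable yet; just buffer them.
                arg_buffers.setdefault(event["index"], []).append(event["delta"])
            elif kind == "tool_end":
                # The delimiting event marks the segment as complete.
                raw = "".join(arg_buffers.pop(event["index"], []))
                result.tool_args[event["index"]] = json.loads(raw)
            elif kind == "done":
                result.complete = True
    except Exception as exc:  # broad on purpose: network and parse failures
        result.error = f"{type(exc).__name__}: {exc}"
    # Partial text is kept either way; the caller decides what to show.
    result.text = "".join(text_parts)
    return result
```

Callers can then branch on `complete` and `error`: show the partial text with a retry affordance, discard it, or resume, depending on the product. Unfinished tool-call fragments are dropped rather than parsed, since half a JSON object is not safe to act on.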
