AI Glossary
61 AI and LLM-engineering terms, defined precisely — answer first, with the deeper guide linked.
A
- Agent EngineeringAgent engineering is the discipline of building reliable AI agents — designing the tools, context, guardrails, evals, and recovery paths around the model.
- Agent HarnessAn agent harness is the system around the model that makes it an agent — the loop, tools, context management, permissions, and recovery machinery.
- Agent MemoryAgent memory is how an AI agent retains information beyond its context window — working state during a task and persistent knowledge across sessions.
- Agentic AIAgentic AI is the class of AI systems that act toward goals — planning, calling tools, and iterating on results — rather than only generating content.
- AI AgentAn AI agent is an LLM-driven system that pursues a goal in a loop — calling tools, observing results, iterating — instead of returning one answer.
- AI SlopAI slop is low-effort, mass-produced AI-generated content — fluent, generic, and unchecked — flooding feeds, search results, and codebases.
B
C
- Chain-of-Thought (CoT)Chain-of-thought prompting has a model work through intermediate reasoning steps before answering — improving accuracy on multi-step problems.
- ChunkingChunking splits documents into retrievable pieces before embedding — the RAG design decision that quietly determines retrieval quality.
- Computer UseComputer use is an AI agent operating software through its real interface — reading the screen, moving the cursor, clicking, and typing like a person would.
- Constitutional AIConstitutional AI trains models against written principles — the model critiques and revises its own outputs by them, reducing reliance on human labels.
- Context WindowThe context window is the maximum text — measured in tokens — an LLM can consider at once: prompt, conversation, documents, and its own output combined.
- Cosine SimilarityCosine similarity measures how alike two embeddings are by the angle between them — the standard relevance score behind semantic search and RAG retrieval.
D
- DistillationDistillation trains a smaller model to imitate a larger one — using its outputs as training data to get most of the capability at a fraction of the cost.
- DPO (Direct Preference Optimization)DPO aligns a model to preferences directly from chosen-vs-rejected pairs — no reward model, no RL loop — simpler and more stable than classic RLHF.
E
- EmbeddingAn embedding is a vector of numbers representing text's meaning, placed so similar texts land close together — the foundation of semantic search and RAG.
- Embedding DimensionEmbedding dimension is the length of an embedding vector — how many numbers represent each text — trading capacity against storage and search cost.
- Eval DatasetAn eval dataset is the curated set of test cases — inputs with expected outcomes — that an LLM application's quality is measured against.
F
- Few-Shot PromptingFew-shot prompting includes worked examples in the prompt so the model learns the task's pattern from demonstrations instead of instructions alone.
- Fine-TuningFine-tuning continues training a pretrained model on your own examples, changing its weights to teach durable behavior, format, or domain style.
- Frontier ModelA frontier model is one of the most capable AI models available — the leading edge from labs like Anthropic, OpenAI, and Google, defining the state of the art.
- Function Calling (Tool Calling)Function calling lets an LLM request structured invocations of your code: describe tools with schemas, the model emits typed calls, your app executes them.
G
- GroundingGrounding ties a model's output to verifiable sources — retrieved documents, tool results, citations — instead of training-data memory.
- GuardrailsGuardrails are programmatic checks around an LLM — validating inputs and outputs in code — enforcing safety and format rules a prompt alone can't guarantee.
H
- HallucinationA hallucination is fluent, confident output that is factually wrong or fabricated — plausible text unsupported by any source, the signature LLM failure mode.
- Human-in-the-Loop (HITL)Human-in-the-loop design inserts human judgment at decisive points in an AI workflow — approving actions, resolving ambiguity, owning the irreversible steps.
- Hybrid SearchHybrid search runs keyword (BM25) and semantic (vector) retrieval together and merges the results — catching both exact terms and paraphrases.
I
J
K
L
- LLM-as-JudgeLLM-as-judge uses a language model to score AI outputs against a rubric — evaluating quality at scale where exact-match metrics fail and humans don't scale.
- LoRA (Low-Rank Adaptation)LoRA fine-tunes a model by training small low-rank adapter matrices instead of all weights — a fraction of the memory and cost, nearly full-tune quality.
M
- MCP (Model Context Protocol)MCP is the open standard for connecting AI models to external tools and data: write one server, and any MCP client — Claude Code, IDEs, agents — can use it.
- Mixture of Experts (MoE)MoE is a model architecture where a router activates only a few expert subnetworks per token — huge total capacity, a fraction of the compute per token.
- Multimodal AIMultimodal AI processes more than one kind of input or output — text, images, audio, video — in a single model, like an LLM that reads screenshots or speaks.
O
P
- Prompt CachingPrompt caching reuses the computed state of a repeated prompt prefix across requests — dramatically cutting cost and time-to-first-token for stable context.
- Prompt InjectionPrompt injection is an attack where untrusted content carries instructions an LLM then follows — overriding its task, leaking data, or triggering tool calls.
- Prompt TemplateA prompt template is a parameterized prompt — fixed instructions with variable slots — turning prompts from strings into versioned, testable components.
Q
R
- RAG (Retrieval-Augmented Generation)RAG retrieves relevant documents from your own data and injects them into an LLM's prompt at query time, grounding answers in facts the model wasn't trained on.
- Reasoning ModelA reasoning model is an LLM trained to think before answering — generating internal reasoning tokens it can spend adaptively on hard problems.
- Red-Teaming (AI)AI red-teaming is adversarial testing — attacking your model or agent with jailbreaks, injections, and misuse scenarios to find failures before users do.
- RerankingReranking is a second-pass scoring step: a cross-encoder model re-orders the top results from fast retrieval so the truly relevant few rise to the top.
- RLHF (Reinforcement Learning from Human Feedback)RLHF trains a model against human preferences: people rank outputs, a reward model learns the ranking, and the LLM is optimized to produce preferred responses.
S
- Semantic SearchSemantic search retrieves results by meaning rather than keyword overlap — embedding queries and documents in one vector space and matching by similarity.
- SLM (Small Language Model)A small language model is a compact LLM — roughly 1–15B parameters — that runs cheaply or locally, trading peak capability for speed and deployability.
- Speculative DecodingSpeculative decoding speeds up generation: a small draft model proposes tokens, the large model verifies them in one parallel pass — same output, fewer steps.
- Structured OutputStructured output makes an LLM return data in a guaranteed shape — JSON matching your schema — so code can consume model responses without parsing prose.
- SubagentA subagent is a specialist agent a primary agent delegates to — running in its own context window with its own prompt and tools, returning only a summary.
- Synthetic DataSynthetic data is training or eval data generated by a model rather than collected from the world — filling gaps, balancing classes, bootstrapping fine-tunes.
- System PromptThe system prompt is the standing instruction layer an LLM receives before user input — defining its role, rules, tools, and tone for the whole conversation.
T
- TemperatureTemperature controls how random an LLM's token choices are: low values make output focused and repeatable, high values make it varied and creative.
- Token (LLM)A token is the unit LLMs read and write — a word fragment of roughly 3–4 characters in English. Models are priced, limited, and measured in tokens, not words.
- Token StreamingToken streaming delivers model output incrementally as it's generated — via SSE or websockets — so users see text immediately instead of waiting.
- Top-p (Nucleus Sampling)Top-p sampling restricts an LLM's next-token choices to the smallest set whose probabilities sum to p — cutting the long tail of unlikely tokens adaptively.
- Tracing (LLM)LLM tracing records every step of a model-driven request — prompts, tool calls, retrievals, tokens, latency — so multi-step behavior is debuggable.
V
- Vector DatabaseA vector database stores embeddings and answers nearest-neighbor queries fast — the retrieval layer under RAG and semantic search, using ANN indexes like HNSW.
- Vibe CodingVibe coding is building software by describing intent in natural language and letting an AI agent write the code, judging results by behavior.
- VLM (Vision-Language Model)A VLM jointly understands images and text — reading documents, screenshots, charts, and photos and reasoning about them in language.