Terms Dictionary
Industry terms, concepts, and standards used across Pro Trailblazer content.
A
-
Agent
Agentic Systems: An LLM call wrapped in three things: a scoped job, a set of tools it is allowed to call, and a clear output contract. Agents are not smarter models; they are narrowly focused models with defined inputs, outputs, and side effects.
-
Agentic workflow
Agentic Systems: An AI system that splits one job into steps, assigns each step to a specialized agent, and routes the work between them through a central orchestrator. The intelligence lives in how the pieces are wired together, not in any single model call.
-
Attention mechanism
Model Architecture: The method that lets each token look at other tokens in the sequence and decide which ones matter for understanding it.
B
-
Backpropagation
Model Training: The algorithm that calculates how each weight contributed to the loss, walking backward through the network so weights can be updated. Training-only. Not involved in generating a response once the model is deployed.
-
Batch size
Model Training: How many training examples are processed together before one weight update. Bigger batches give more stable gradient estimates but require more memory.
-
Best-of-N
Inference: A sampling strategy that draws N independent completions from a model, scores each one (with a reward model, verifier, or log-probability), and returns the highest-scoring completion. Replaces voting with picking, which works well when the scoring function is trustworthy and badly when it is not.
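A minimal sketch of the pick-the-best step, where `sample_fn` and `score_fn` stand in for a model call and a scorer (both hypothetical here; length is used as a toy score):

```python
def best_of_n(sample_fn, score_fn, n):
    """Draw n completions, score each one, return the highest-scoring completion."""
    completions = [sample_fn() for _ in range(n)]
    return max(completions, key=score_fn)

# Toy stand-in for a model: yields three canned completions in order.
samples = iter(["short", "a much longer answer", "mid"])
best = best_of_n(lambda: next(samples), score_fn=len, n=3)
```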
C
-
Causal masking
Model Architecture: A mask applied inside attention that prevents a token from looking at future tokens. This is what makes the model autoregressive, so it can only condition on what came before.
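The mask itself is just a lower-triangular boolean matrix; a plain-Python sketch:

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (only j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

mask = causal_mask(3)
```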
-
Chain-of-thought
Inference: A prompting or decoding pattern where the model produces intermediate reasoning steps before its final answer, rather than jumping straight to a conclusion. Longer chains tend to improve accuracy on multi-step problems by letting the model factor the work into smaller subproblems.
-
Context window
Model Architecture: The maximum number of tokens the model can hold at once. Prompt, prior conversation, and output-so-far all live inside it.
E
-
Embeddings
Model Architecture: High-dimensional vectors that represent tokens. Each token maps to a list of numbers that encodes its meaning, and similar meanings end up near each other in that space. These self-organize during pretraining without explicit labels.
F
-
Few-shot
Inference: A prompting pattern where the model receives two to five example input-output pairs before the actual task. The examples shift the conditional distribution over next tokens toward the demonstrated format without any weight updates to the model.
-
Forward pass
Model Architecture: Data flowing input to output through the model. Tokens go in, embeddings get looked up, attention and feedforward layers transform things, a prediction comes out. Happens during both training and inference.
G
-
Gating network
Model Architecture: The small neural network inside a Mixture of Experts layer that scores every expert for a given token and picks the top-k to route through. The gating network is tiny compared to the experts but decides everything about which parameters a token actually activates.
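The routing step can be sketched in a few lines; here the gate scores are given as a plain list (in a real model they come from a learned projection of the token's hidden state), and the selected scores are renormalized with a softmax:

```python
import math

def route_top_k(gate_scores, k=2):
    """Return (expert_index, weight) pairs for the k highest-scoring experts,
    with weights renormalized via softmax over just the selected scores."""
    top = sorted(range(len(gate_scores)), key=lambda i: gate_scores[i], reverse=True)[:k]
    m = max(gate_scores[i] for i in top)
    exps = [math.exp(gate_scores[i] - m) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

routes = route_top_k([0.1, 2.0, -1.0, 1.5], k=2)
```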
-
Gradient descent
Model Training: The optimization procedure that uses the gradients from backprop to step weights in the direction that lowers loss. The ball-rolling-downhill image.
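A one-dimensional sketch of the update rule, minimizing the toy loss (x - 3)², whose gradient is 2(x - 3):

```python
def gradient_descent(grad_fn, x0, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient to lower the loss."""
    x = x0
    for _ in range(steps):
        x -= learning_rate * grad_fn(x)
    return x

# The minimum of (x - 3)^2 is at x = 3.
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
```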
-
Greedy decoding
Inference: The simplest decoding strategy: always pick the highest-probability token at each step. Deterministic and fast, but tends to produce repetitive or bland output on longer generations because small probability gaps compound over many steps.
H
-
Hyperparameters
Model Training: Settings chosen by humans before training (learning rate, batch size, layer count, etc.). Not learned by the model.
J
-
JSON-RPC
Agentic Systems: A simple remote procedure call format that encodes every request and response as a JSON object with a method name, parameters, and an id. MCP uses JSON-RPC 2.0 as its wire format over both stdio and HTTP.
K
-
KV cache
Inference: Per-position keys and values from the attention layer, stored in GPU memory during generation so they do not have to be recomputed for every new token. Without a KV cache, autoregressive inference would scale quadratically with sequence length instead of linearly per step.
L
-
Logits
Model Architecture: The raw, unnormalized scores a language model's final layer produces, one per token in the vocabulary. Softmax turns logits into a probability distribution; temperature is applied by scaling the logits before that normalization, while top-k and top-p filter the distribution that comes out of it.
-
Loss function
Model Training: A score for how wrong the model's prediction was on a training example. Training aims to drive this down.
M
-
Majority voting
Inference: The aggregation step at the heart of self-consistency: take N independent samples, extract each final answer, and return the answer that appears most often. Works on discrete answers (math results, multiple choice, yes/no) but not on open-ended generation where no two samples produce the same string.
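The aggregation itself is one line over the extracted answers (the example answers are made up):

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common final answer among N extracted answers."""
    return Counter(answers).most_common(1)[0][0]

winner = majority_vote(["42", "41", "42", "42", "7"])
```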
-
MCP host
Agentic Systems: The AI application (Claude Desktop, Claude Code, an IDE, an agent) that runs the model and speaks MCP to one or more servers. The host owns the conversation, the tool catalog, and any API keys.
-
MCP server
Agentic Systems: A small program that exposes tools, resources, and prompts to an MCP host. Servers run as separate processes (stdio child or remote HTTP) so a single host can plug many of them in without changing its code.
-
Mixture of Experts
Model Architecture: A transformer architecture that replaces each layer's feed-forward block with a bank of smaller feed-forward networks called experts, plus a gating network that routes every token through only a small top-k subset. This gives the model a huge total parameter count while keeping per-token compute small.
-
Model Context Protocol
Agentic Systems: An open standard from Anthropic that lets AI apps discover and call external tools, read external data, and invoke templated workflows through a common JSON-RPC interface. Abbreviated MCP.
-
Multi-head attention
Model Architecture: Running attention in parallel multiple times within the same layer, each head with its own learned projections. Different heads can specialize in different kinds of relationships.
O
-
Orchestrator
Agentic Systems: The component in an agentic workflow that holds the pipeline definition, passes outputs between agents, handles retries on failure, and decides when to drop a task. It is usually a few hundred lines of code, not a separate model, and it is where the business logic lives.
-
Overfitting
Model Training: When a model learns training data too specifically and fails to generalize. Good on training set, bad on new data.
P
-
Parallel training
Model Training: Training across many GPUs or TPUs at once. Can be data parallelism (same model copies, different batches), model parallelism (model split across devices), or hybrids.
-
Pretraining
Model Training: The initial training phase on massive text corpora, where the model learns general language patterns, before any fine-tuning or alignment work.
Q
-
Q/K/V framework
Model Architecture: The three projections inside attention. Each token produces a Query (what am I looking for?), a Key (what do I offer?), and a Value (the actual info to pass along). Q times K produces match scores, and those scores weight the sum of Vs.
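A plain-Python sketch of scaled dot-product attention for one head (the learned projections that produce Q, K, and V are assumed to have already run; inputs are just lists of vectors):

```python
import math

def attention(Q, K, V):
    """Scaled dot-product attention: Q times K gives match scores,
    softmaxed scores weight the sum of Vs."""
    d = len(K[0])
    out = []
    for q in Q:
        # Match score of this query against every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        # Softmax the scores into weights.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out

# Two identical keys -> uniform weights -> output is the average of the values.
out = attention(Q=[[1.0, 0.0]], K=[[1.0, 0.0], [1.0, 0.0]], V=[[1.0, 0.0], [0.0, 1.0]])
```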
S
-
Self-consistency
Inference: A sampling strategy that asks the model for many independent reasoning chains on the same question and returns the most common final answer. The intuition is that wrong answers tend to disagree with each other while correct answers tend to agree, so majority voting concentrates the signal.
-
Softmax
Model Architecture: The function that turns a vector of raw scores (logits) into a probability distribution by exponentiating each score and normalizing so the results sum to 1. It is the last step before a language model either picks or samples a token.
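A minimal implementation in plain Python (the max-subtraction is the standard trick for numerical stability; it does not change the result):

```python
import math

def softmax(logits):
    """Exponentiate each score and normalize so the outputs sum to 1."""
    m = max(logits)  # subtract the max so exp() never overflows
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
```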
-
Sparse activation
Model Architecture: Using only a fraction of a model's parameters for any given input. In Mixture of Experts, only the top-k experts per layer are touched for each token, so compute scales with active parameters rather than total parameters. Memory still scales with the total.
T
-
Temperature
Inference: An inference-time sampling parameter. Zero means always pick the highest-probability token. Higher values spread probability across alternatives for more varied output.
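Temperature is applied by dividing the logits by T before softmax; a sketch with illustrative numbers:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def apply_temperature(logits, temperature):
    """Scale logits by 1/temperature before softmax.
    Low temperature sharpens the distribution; high temperature flattens it."""
    if temperature == 0:
        # Degenerate case: all mass on the argmax (greedy decoding).
        probs = [0.0] * len(logits)
        probs[logits.index(max(logits))] = 1.0
        return probs
    return softmax([x / temperature for x in logits])

logits = [2.0, 1.0, 0.5]
sharp = apply_temperature(logits, 0.5)  # more mass on the top token
flat = apply_temperature(logits, 2.0)   # mass spread across alternatives
```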
-
Test-time compute
Inference: Compute spent at the moment a model answers a question, rather than during training. Scaling test-time compute, through longer reasoning chains, more parallel samples, or search over branches, can raise accuracy without changing the weights. The core idea behind self-consistency, best-of-N, and the o1, o3, and R1 reasoning-model family.
-
Tokens
Model Architecture: The discrete units a model actually processes. Text is broken into tokens (usually subword chunks, not whole words) before anything else happens.
-
Tool use
Agentic Systems: The pattern where an LLM emits a structured call to an external function (a pricing API, a database lookup, a code runner) and incorporates the result into its next response. Tool use is what turns a text generator into an agent that can actually change the state of the world.
-
Top-k sampling
Inference: A sampling strategy that keeps only the K highest-probability tokens at each step, discards the rest, renormalizes, and draws a token from what is left. The cutoff is rank-based, so the same K can feel too loose on peaked distributions and too tight on flat ones.
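A sketch over raw logits (rank, truncate, softmax the survivors, sample):

```python
import math
import random

def top_k_sample(logits, k, rng=random):
    """Keep the k highest-scoring tokens, renormalize, sample one index."""
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in ranked)
    weights = [math.exp(logits[i] - m) for i in ranked]
    total = sum(weights)
    probs = [w / total for w in weights]
    return rng.choices(ranked, weights=probs, k=1)[0]

idx = top_k_sample([5.0, 1.0, 0.0, -1.0], k=2)
```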
-
Top-p sampling
Inference: A sampling strategy, also called nucleus sampling, that keeps tokens in descending probability order until their cumulative mass crosses a threshold P, then samples from that set. The eligible set grows and shrinks with the shape of the distribution, which is why it handles uncertainty better than top-k.
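A sketch of building the nucleus and sampling from it; on the peaked example distribution below, the nucleus collapses to a single token:

```python
import math
import random

def top_p_sample(logits, p, rng=random):
    """Keep tokens in descending probability until cumulative mass >= p, then sample."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, mass = [], 0.0
    for i in order:
        nucleus.append(i)
        mass += probs[i]
        if mass >= p:
            break
    weights = [probs[i] for i in nucleus]
    return rng.choices(nucleus, weights=weights, k=1)[0]

# Very peaked distribution: token 0 alone already exceeds p, so it is always chosen.
idx = top_p_sample([10.0, 0.0, 0.0, 0.0], p=0.9)
```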
-
Training loop
Model Training: The repeated cycle of forward pass, compute loss, backpropagation, weight update, run over billions of examples.
-
Transformer layers
Model Architecture: Stacked blocks of attention plus feedforward. Each layer refines token representations based on the layer below. Modern LLMs stack dozens to 100+.
Z
-
Zero-shot
Inference: A prompting pattern where the model receives only the task description with no examples of the desired input-output format. The model relies entirely on its pretrained knowledge to produce a response.