Self-attention, Q/K/V, multi-head attention, and next-token prediction — the architecture behind Claude, GPT, and every modern LLM.
Before transformers, models (RNNs/LSTMs) processed text one token at a time, squeezing all past information through a single hidden state. By token 50, information about token 1 is mostly gone.
"The merchant who submitted the chargeback dispute last quarter ... was flagged."↑ By the time the RNN reaches "flagged", it has mostly forgotten "merchant"
**RNNs/LSTMs:** Process tokens one by one. Information fades over distance. Can't parallelize — slow on GPUs. Token 1 is a whisper by token 50.
**Transformers:** Every token can directly look at every other token, no matter how far apart. No bottleneck. Fully parallel — fast on GPUs.
The core mechanism. Each token asks: "Which other tokens are relevant to me?" and pulls information from them directly.
All tokens are processed simultaneously — not one by one. This is why transformers train orders of magnitude faster than RNNs on modern GPUs.
Self-attention uses three concepts borrowed from information retrieval. Think of it like a library search:
"What am I looking for?"
The current token asking a question
"What do I contain?"
Every token advertising what it offers
"Here's my actual content"
The information each token provides
Let's trace attention for the word "rate" in: "The merchant chargeback rate is high"
"rate" sends out its Query and compares against every token's Key via dot product:
- q_rate · k_The = 0.1 (not relevant)
- q_rate · k_merchant = 0.4 (somewhat relevant)
- q_rate · k_chargeback = 0.8 (very relevant — what kind of rate?)
- q_rate · k_rate = 0.9 (self-attention)
- q_rate · k_is = 0.2 (low)
- q_rate · k_high = 0.7 (relevant — describes the rate)
Divide each score by √d_k (the key dimension), then apply softmax so the weights sum to 1:
The: 0.03 | merchant: 0.10 | chargeback: 0.28 | rate: 0.35 | is: 0.05 | high: 0.19
Multiply each token's Value by its attention weight and sum:
output_rate = 0.03·v_The + 0.10·v_merchant + 0.28·v_chargeback + 0.35·v_rate + 0.05·v_is + 0.19·v_high
The result: "rate" now knows it's a chargeback rate that is high. Its embedding is enriched with context from the most relevant tokens.
Attention(Q, K, V) = softmax(Q·Kᵀ / √d_k) · V

One attention head captures one type of relationship. But language has many simultaneous relationships. The solution: run multiple attention heads in parallel, each learning different patterns.
"rate" → "is" (subject-verb)
"rate" → "chargeback" (modifier)
"rate" → "high" (descriptor)
"rate" → nearby tokens
GPT-2 has 12 attention heads per layer; larger models such as Claude likely use far more (exact counts are unpublished). Each head independently computes Q, K, V and produces its own attention pattern. The outputs are concatenated and projected back to the original dimension.
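The concatenate-and-project step can be sketched as follows, assuming the common layout where each head works on its own slice of the model dimension (the matrices here are random stand-ins for learned weights):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
    # X: (seq, d_model); Wq/Wk/Wv/Wo: (d_model, d_model)
    seq, d_model = X.shape
    d_head = d_model // n_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    heads = []
    for h in range(n_heads):
        s = slice(h * d_head, (h + 1) * d_head)
        q, k, v = Q[:, s], K[:, s], V[:, s]          # each head gets its own slice
        w = softmax(q @ k.T / np.sqrt(d_head))       # head-specific attention pattern
        heads.append(w @ v)
    concat = np.concatenate(heads, axis=-1)          # (seq, d_model)
    return concat @ Wo                               # project back to d_model

rng = np.random.default_rng(1)
d_model, n_heads, seq = 8, 2, 5
X = rng.normal(size=(seq, d_model))
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out = multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads)
print(out.shape)   # (5, 8): same shape in and out, so blocks can stack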
Each transformer block has two main parts, multi-head self-attention and a position-wise feed-forward network, repeated N times (12 in GPT-2, 96 in GPT-3):
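A skeleton of one block, using the pre-norm arrangement common in GPT-style models; the sublayers below are simple stand-ins (identity attention, a tiny ReLU MLP) just to show the residual structure:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def transformer_block(x, attn, ffn):
    # Pre-norm block: attention sublayer + residual, then feed-forward + residual
    x = x + attn(layer_norm(x))
    x = x + ffn(layer_norm(x))
    return x

rng = np.random.default_rng(2)
d_model, d_ff, seq = 8, 32, 5
W1 = rng.normal(size=(d_model, d_ff)) * 0.1
W2 = rng.normal(size=(d_ff, d_model)) * 0.1

attn = lambda x: x                           # placeholder for self-attention
ffn = lambda x: np.maximum(0, x @ W1) @ W2   # position-wise MLP with ReLU

x = rng.normal(size=(seq, d_model))
y = transformer_block(x, attn, ffn)
print(y.shape)   # (5, 8): shape preserved, so N blocks can be chained
```

Because every block maps (seq, d_model) to (seq, d_model), stacking N of them is just repeated function application.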
After passing through the transformer blocks, each token's embedding is simultaneously:
"merchant" knows what "merchant" means
"merchant" at position 3 ≠ position 103
"bank" in "river bank" ≠ "bank account"
No previous architecture achieved all three at once. This is why transformers dominate.
The same transformer architecture is used for both training and inference — but the process is very different:
**Training:** Anthropic/OpenAI/Meta train the model on billions of sentences. Takes weeks on thousands of GPUs. Costs millions of dollars.
**Inference:** You send a prompt, Claude generates a response. Takes seconds. Costs fractions of a cent.
Step 1: Process ["Assess","this","merchant"] through all blocks → predict "risk"
Step 2: Process ["Assess","this","merchant","risk"] through ALL blocks again → predict "rating"
Step 3: Process ["Assess","this","merchant","risk","rating"] again → predict ":"
... repeat until done. Each step = full forward pass, no weights updated.
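The loop above can be sketched as follows. `fake_model` is a hypothetical stand-in for the full transformer forward pass, which in a real LLM runs through every block at every step:

```python
def fake_model(tokens):
    # Hypothetical lookup standing in for the transformer's next-token prediction
    table = {"merchant": "risk", "risk": "rating", "rating": ":"}
    return table.get(tokens[-1], "<eos>")

def generate(prompt, max_steps=10):
    tokens = list(prompt)
    for _ in range(max_steps):
        nxt = fake_model(tokens)   # full forward pass per step, weights frozen
        if nxt == "<eos>":
            break
        tokens.append(nxt)         # prediction is fed back as input
    return tokens

print(generate(["Assess", "this", "merchant"]))
# ['Assess', 'this', 'merchant', 'risk', 'rating', ':']
```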
| | Pre-Training | Inference |
|---|---|---|
| Goal | Learn language patterns | Generate useful output |
| Processing | All tokens in parallel | One token at a time (sequential) |
| Weights | Updated every batch | Frozen — read-only |
| Cost | Millions of $, weeks of GPU time | Fractions of a cent per call |
| Happens | Once (by Anthropic/OpenAI/Meta) | Every time you send a prompt |
| Who does it | AI companies with GPU clusters | You, via API or Claude chat |
After all transformer blocks process the input, the model predicts the most probable next token. This is the fundamental operation — everything Claude generates is one token at a time.
Input: "The merchant risk rating is"→ Transformer processes all tokens → final layer produces probabilities:
The model generates text one token at a time, feeding each prediction back as input:
The original Transformer (2017) had both an encoder and decoder. Researchers discovered that using only one half works better for certain tasks:
| Architecture | What It Does | Models | Used For |
|---|---|---|---|
| Encoder-only | Reads text bidirectionally (sees left AND right) | BERT, RoBERTa | Understanding: classification, NER, search |
| Decoder-only ⭐ | Reads left-to-right, generates one token at a time | GPT, Claude, LLaMA | Generation: chat, code, writing |
| Encoder-Decoder | Reads input fully, then generates output | T5, BART | Translation, summarization |
| Model | Type | Parameters | Context | Key Innovation |
|---|---|---|---|---|
| BERT (2018) | Encoder | 340M | 512 | Bidirectional attention, MLM pretraining |
| GPT-2 (2019) | Decoder | 1.5B | 1,024 | Showed scale improves quality |
| T5 (2020) | Enc-Dec | 11B | 512 | Everything as text-to-text |
| GPT-4 (2023) | Decoder | ~1.8T (MoE, unconfirmed) | 128K | Mixture of Experts, multimodal |
| LLaMA 3 (2024) | Decoder | 8B–405B | 128K | Open-source, RoPE, GQA, SwiGLU |
| Claude (2024) | Decoder | Unknown | 200K | Constitutional AI, long context |
When you chat with Claude, it generates one token at a time using masked self-attention. It can only look at tokens it has already generated — never peeks ahead.
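Masking is typically implemented by setting the scores for future positions to negative infinity before the softmax, which drives their weights to exactly zero. A small sketch:

```python
import numpy as np

seq = 5
# Causal mask: True above the diagonal marks the "future" positions to hide
mask = np.triu(np.ones((seq, seq), dtype=bool), k=1)
scores = np.zeros((seq, seq))      # uniform toy scores; real scores come from Q·K^T
scores[mask] = -np.inf             # -inf becomes weight 0 after softmax

e = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = e / e.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))
# Row i spreads its attention only over positions 0..i; everything above
# the diagonal is exactly zero — no token can peek ahead.
```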
When Claude searches your documents, it uses embedding similarity (cosine) to find relevant chunks. The transformer then processes those chunks to generate an answer.
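A toy illustration of cosine-similarity retrieval; the chunk names and three-dimensional embeddings here are made up for illustration, whereas real embedding models produce vectors with hundreds or thousands of dimensions:

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity: dot product of the vectors divided by their norms
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical document-chunk embeddings (in practice, from an embedding model)
query = np.array([0.9, 0.1, 0.3])
chunks = {
    "chargeback policy": np.array([0.8, 0.2, 0.4]),
    "office lunch menu": np.array([0.1, 0.9, 0.2]),
}

best = max(chunks, key=lambda name: cosine(query, chunks[name]))
print(best)   # chargeback policy
```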
For risk assessments, use low temperature (0.1-0.3) for consistent ratings. For brainstorming, use higher temperature (0.7-1.0) for creative ideas.
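Temperature works by rescaling the logits before the softmax: dividing by a small temperature sharpens the distribution onto the top token, while a temperature near 1.0 leaves it flatter. A quick sketch with hypothetical logits for three candidate tokens:

```python
import numpy as np

def temperature_probs(logits, temperature):
    # Lower temperature -> sharper distribution; higher -> flatter
    scaled = np.asarray(logits) / temperature
    e = np.exp(scaled - scaled.max())
    return e / e.sum()

logits = [2.0, 1.0, 0.5]                 # hypothetical scores for three tokens
low = temperature_probs(logits, 0.2)     # risk-assessment setting
high = temperature_probs(logits, 1.0)    # brainstorming setting

print(low.round(3))    # nearly all probability mass on the top token
print(high.round(3))   # flatter: alternatives have a real chance of being sampled
```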
Claude's 200K context window means it can "see" ~150 pages at once. Every token in that window attends to every other token — that's the power of self-attention.