The Transformer — How AI Thinks

Self-attention, Q/K/V, multi-head attention, and next-token prediction — the architecture behind Claude, GPT, and every modern LLM.

📖 Day 1 Reference ⚡ Interactive 🏦 Finance Context

🧠 The Problem: Sequential Bottleneck

Before transformers, models (RNNs/LSTMs) processed text one token at a time, squeezing all past information through a single hidden state. By token 50, information about token 1 is mostly gone.

The bottleneck problem
"The merchant who submitted the chargeback dispute last quarter ... was flagged."
↑ By the time the RNN reaches "flagged", it has mostly forgotten "merchant"
🔗

RNN: Sequential Chain

Processes tokens one by one. Information fades over distance. Can't parallelize — slow on GPUs. Token 1 is a whisper by token 50.

🌐

Transformer: Direct Access

Every token can directly look at every other token, no matter how far apart. No bottleneck. Fully parallel — fast on GPUs.

🔍

Self-Attention

The core mechanism. Each token asks: "Which other tokens are relevant to me?" and pulls information from them directly.

Parallel Processing

All tokens are processed simultaneously — not one by one. This is why transformers train orders of magnitude faster than RNNs on modern GPUs.

🔗 Where the Transformer Fits

📝Your Text
✂️Tokenizer
🔢Token IDs
📊Embeddings
🧠Transformer
(Attention)
💬Output

🔑 Query, Key, Value — The Language of Attention

Self-attention uses three concepts borrowed from information retrieval. Think of it like a library search:

🔍

Query (Q)

"What am I looking for?"
The current token asking a question

🏷️

Key (K)

"What do I contain?"
Every token advertising what it offers

📄

Value (V)

"Here's my actual content"
The information each token provides

📖 Step-by-Step: How Attention Works

Let's trace attention for the word "rate" in: "The merchant chargeback rate is high"

Step 1 — Compute similarity scores

"rate" sends out its Query and compares against every token's Key via dot product:

Dot products (raw attention scores)
q_rate · k_The = 0.1  (not relevant)
q_rate · k_merchant = 0.4  (somewhat relevant)
q_rate · k_chargeback = 0.8  (very relevant — what kind of rate?)
q_rate · k_rate = 0.9  (self-attention)
q_rate · k_is = 0.2  (low)
q_rate · k_high = 0.7  (relevant — describes the rate)

Step 2 — Normalize into probabilities (softmax)

Divide by √d_k, then apply softmax so the weights sum to 1:

Attention weights (after softmax)
The: 0.03 | merchant: 0.10 | chargeback: 0.28 | rate: 0.35 | is: 0.05 | high: 0.19

Step 3 — Weighted average of Values

Multiply each token's Value by its attention weight and sum:

New embedding for "rate"
output_rate = 0.03·v_The + 0.10·v_merchant + 0.28·v_chargeback + 0.35·v_rate + 0.05·v_is + 0.19·v_high

The result: "rate" now knows it's a chargeback rate that is high. Its embedding is enriched with context from the most relevant tokens.

💡
The matrix formula: Attention(Q, K, V) = softmax(Q·Kᵀ / √d_k) · V
This is a single matrix multiplication — highly parallelizable on GPUs. Every token attends to every other token simultaneously.
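The three steps above can be sketched in a few lines of NumPy. The scores and the key dimension d_k here are the illustrative toy numbers from the walkthrough, not values from a real model, and the Value vectors are random placeholders:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy dot-product scores for "rate" against each token's Key,
# matching the example above (illustrative numbers, not a real model)
tokens = ["The", "merchant", "chargeback", "rate", "is", "high"]
scores = np.array([0.1, 0.4, 0.8, 0.9, 0.2, 0.7])

d_k = 4  # assumed toy key/value dimension
weights = softmax(scores / np.sqrt(d_k))  # Step 2: scale, then softmax

# Step 3: weighted average of Values -> new contextual embedding for "rate"
V = np.random.randn(6, d_k)   # placeholder Value vectors
output_rate = weights @ V     # shape (d_k,)

print(dict(zip(tokens, weights.round(2))))
```

The exact weights differ from the rounded numbers in the walkthrough (those assumed a different scale), but the ranking is the same: "rate" and "chargeback" dominate.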

🎮 Self-Attention Heatmap

🎓
This heatmap shows attention weights — how much each token (row) pays attention to every other token (column). Brighter = stronger attention.
🔍
What to notice: Each row shows what one token "pays attention to". Nouns attend strongly to their adjectives. Verbs attend to their subjects. "rate" attends to "chargeback" (what kind of rate?) and "high" (what about the rate?). This is how the model understands context.

🏗️ Multi-Head Attention: Multiple Perspectives

One attention head captures one type of relationship. But language has many simultaneous relationships. The solution: run multiple attention heads in parallel, each learning different patterns.

Head 1: Syntax

"rate" → "is" (subject-verb)

Head 2: Semantics

"rate" → "chargeback" (modifier)

Head 3: Sentiment

"rate" → "high" (descriptor)

Head 4: Position

"rate" → nearby tokens

GPT-2 has 12 heads. Claude likely has 64-128 heads. Each head independently computes Q, K, V and produces its own attention pattern. The outputs are concatenated and projected back to the original dimension.
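The split-compute-concatenate pattern described above can be sketched as follows. This is a minimal, unmasked version with random weight matrices, assuming d_model divides evenly into n_heads:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """Minimal multi-head self-attention (no masking)."""
    T, d_model = X.shape
    d_head = d_model // n_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv  # each (T, d_model)
    # Split each matrix into heads: (n_heads, T, d_head)
    split = lambda M: M.reshape(T, n_heads, d_head).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    # Each head computes its own attention pattern independently
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)
    out = softmax(scores) @ Vh        # (n_heads, T, d_head)
    # Concatenate heads and project back to the original dimension
    concat = out.transpose(1, 0, 2).reshape(T, d_model)
    return concat @ Wo

T, d_model, n_heads = 6, 8, 2
rng = np.random.default_rng(0)
X = rng.normal(size=(T, d_model))
W = [rng.normal(size=(d_model, d_model)) for _ in range(4)]
Y = multi_head_attention(X, *W, n_heads)
print(Y.shape)  # (6, 8)
```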

🔄 Inside a Transformer Block

Each transformer block has two main parts, repeated N times (12 in GPT-2, 96 in GPT-3):

Input embeddings
  ↓
Multi-Head Self-Attention — tokens share information
  ↓ + residual connection + layer norm
Feed-Forward Network (MLP) — refine each token independently
  ↓ + residual connection + layer norm
Output embeddings (→ next block or final output)
💡
Residual connections add the input of each layer back to its output — like a "highway" that prevents information from fading through many layers. Without them, deep networks (96 layers!) would fail to train.
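The wiring of one block can be sketched directly from the diagram above. This is a post-norm sketch with stand-in sublayers (identity attention, tanh MLP) just to show the residual-plus-norm pattern, not real learned layers:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token's vector to zero mean and unit variance."""
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def transformer_block(x, attn, mlp):
    """One block as in the diagram: attn and mlp map (T, d) -> (T, d)."""
    x = layer_norm(x + attn(x))  # residual connection around attention
    x = layer_norm(x + mlp(x))   # residual connection around the MLP
    return x

T, d = 4, 8
x = np.random.randn(T, d)
y = transformer_block(x, attn=lambda z: z, mlp=lambda z: np.tanh(z))
print(y.shape)  # same shape in, same shape out -> blocks can be stacked
```

Because each block maps (T, d) to (T, d), blocks stack cleanly, and the residual "highway" means the input signal is always added back rather than re-derived from scratch.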

🎭 Three Superpowers After Attention

After passing through the transformer blocks, each token's embedding is simultaneously:

📝

Token-Aware

"merchant" knows what "merchant" means

📍

Position-Aware

"merchant" at position 3 ≠ position 103

🌐

Context-Aware

"bank" in "river bank" ≠ "bank account"

No previous architecture achieved all three at once. This is why transformers dominate.

🔄 Training vs Inference: What Happens When?

The same transformer architecture is used for both training and inference — but the process is very different:

🏗️

Pre-Training (happens once)

Anthropic/OpenAI/Meta train the model on billions of sentences. Takes weeks on thousands of GPUs. Costs millions of dollars.

• All tokens processed in parallel
• Model sees the full sentence
• Masking prevents peeking ahead
• Weights are updated every batch
• Learns: embeddings, Q/K/V matrices, MLP weights

Inference (every time you prompt Claude)

You send a prompt, Claude generates a response. Takes seconds. Costs fractions of a cent.

• Tokens generated one at a time
• Each new token = full forward pass
• Uses its own predictions as input
• Weights are frozen (read-only)
• No learning — just applying what was learned
What happens when you prompt Claude: "Assess this merchant"
Step 1: Process ["Assess","this","merchant"] through all blocks → predict "risk"
Step 2: Process ["Assess","this","merchant","risk"] through ALL blocks again → predict "rating"
Step 3: Process ["Assess","this","merchant","risk","rating"] again → predict ":"
... repeat until done. Each step = full forward pass, no weights updated.
|             | Pre-Training                     | Inference                        |
| Goal        | Learn language patterns          | Generate useful output           |
| Processing  | All tokens in parallel           | One token at a time (sequential) |
| Weights     | Updated every batch              | Frozen — read-only               |
| Cost        | Millions of $, weeks of GPU time | Fractions of a cent per call     |
| Happens     | Once (by Anthropic/OpenAI/Meta)  | Every time you send a prompt     |
| Who does it | AI companies with GPU clusters   | You, via API or Claude chat      |
💡
For AnyCompany participants: When you use Claude or Bedrock, you're only doing inference — the model is frozen and just applying what it learned during pre-training. This is why it's so cheap (cents per assessment) and fast (seconds per response). The expensive training already happened.

🎯 Next Token Prediction: How AI Generates Text

After all transformer blocks process the input, the model predicts the most probable next token. This is the fundamental operation — everything Claude generates is one token at a time.

Example: predicting the next token
Input: "The merchant risk rating is"
→ Transformer processes all tokens → final layer produces probabilities:
"RED" → 0.8 (most probable; the remaining probability is spread over other tokens)

The temperature setting controls how the model picks from these probabilities:

Low (0.1-0.3)
Deterministic — always picks "RED"

Medium (0.5-0.8)
Balanced — usually "RED", sometimes "AMBER"

High (1.0-2.0)
Creative — could pick any token
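Temperature works by dividing the logits before the softmax, which sharpens or flattens the resulting distribution. A minimal sketch, using hypothetical logits for three rating tokens:

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, rng=None):
    """Temperature-scaled sampling over next-token logits (sketch).
    Lower temperature sharpens the distribution; higher flattens it."""
    rng = rng or np.random.default_rng()
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return rng.choice(len(logits), p=probs)

logits = np.array([3.0, 1.0, 0.5])  # hypothetical: "RED", "AMBER", "GREEN"
low = np.exp(logits / 0.1); low /= low.sum()
high = np.exp(logits / 2.0); high /= high.sum()
print(low.round(3))   # near-deterministic: almost all mass on "RED"
print(high.round(3))  # flatter: other tokens get real probability
```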

🔄 Autoregressive Generation

The model generates text one token at a time, feeding each prediction back as input:

Step 1: "The merchant risk rating is" → predicts "RED"
Step 2: "The merchant risk rating is RED" → predicts "due"
Step 3: "The merchant risk rating is RED due" → predicts "to"
Step 4: "The merchant risk rating is RED due to" → predicts "high"
Step 5: "... RED due to high" → predicts "chargeback"
Step 6: "... high chargeback" → predicts "rate"
Step 7: "... chargeback rate" → predicts [EOS] ← done!
🔍
Training vs Inference: During training, all tokens are processed in parallel (the model sees the full answer and learns from it). During inference (when you use Claude), tokens are generated one at a time — each new token requires a full forward pass through all transformer blocks.
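The generation loop above can be sketched as a few lines of Python. The `toy_model` here is a scripted stand-in for a real forward pass, just to show the feed-back structure:

```python
def generate(model, tokens, eos, max_new=50):
    """Autoregressive decoding loop (sketch): model maps tokens -> next token."""
    for _ in range(max_new):
        next_token = model(tokens)      # full forward pass every step
        tokens = tokens + [next_token]  # feed the prediction back as input
        if next_token == eos:
            break
    return tokens

# Toy "model": scripted predictions mirroring the steps above
script = iter(["RED", "due", "to", "high", "chargeback", "rate", "[EOS]"])
toy_model = lambda toks: next(script)

result = generate(toy_model, ["The", "merchant", "risk", "rating", "is"], "[EOS]")
print(" ".join(result))
```

Note that each iteration passes the *entire* growing sequence back through the model — this is why long responses take longer and why each generated token costs compute.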

🌳 The Transformer Family Tree

The original Transformer (2017) had both an encoder and decoder. Researchers discovered that using only one half works better for certain tasks:

| Architecture    | What It Does                                       | Models             | Used For                                   |
| Encoder-only    | Reads text bidirectionally (sees left AND right)   | BERT, RoBERTa      | Understanding: classification, NER, search |
| Decoder-only    | Reads left-to-right, generates one token at a time | GPT, Claude, LLaMA | Generation: chat, code, writing            |
| Encoder-Decoder | Reads input fully, then generates output           | T5, BART           | Translation, summarization                 |
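The "reads left-to-right" behavior of decoder-only models comes from a causal mask applied to the attention scores. A minimal sketch with random scores, showing how masked positions end up with zero attention weight:

```python
import numpy as np

# Decoder-only models apply a causal mask: token i may attend only to
# tokens 0..i. Masked positions are set to -inf before the softmax,
# so their attention weight becomes exactly 0.
T = 4
scores = np.random.randn(T, T)
mask = np.triu(np.ones((T, T), dtype=bool), k=1)  # True above the diagonal
scores[mask] = -np.inf

e = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = e / e.sum(axis=-1, keepdims=True)
print(weights.round(2))  # lower-triangular: no token "peeks ahead"
```

Encoder-only models like BERT simply skip this mask, which is what lets them see both directions at once.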

📊 Model Comparison

| Model          | Type    | Parameters  | Context | Key Innovation                           |
| BERT (2018)    | Encoder | 340M        | 512     | Bidirectional attention, MLM pretraining |
| GPT-2 (2019)   | Decoder | 1.5B        | 1,024   | Showed scale improves quality            |
| GPT-4 (2023)   | Decoder | ~1.8T (MoE) | 128K    | Mixture of Experts, multimodal           |
| LLaMA 3 (2024) | Decoder | 8B–405B     | 128K    | Open-source, RoPE, GQA, SwiGLU           |
| Claude (2024)  | Decoder | Unknown     | 200K    | Constitutional AI, long context          |
| T5 (2020)      | Enc-Dec | 11B         | 512     | Everything as text-to-text               |

🏦 What This Means for AnyCompany

💬

Claude = Decoder-Only

When you chat with Claude, it generates one token at a time using masked self-attention. It can only look at tokens it has already generated — never peeks ahead.

🔍

RAG Uses Embeddings

When Claude searches your documents, it uses embedding similarity (cosine) to find relevant chunks. The transformer then processes those chunks to generate an answer.

🌡️

Temperature = Creativity

For risk assessments, use low temperature (0.1-0.3) for consistent ratings. For brainstorming, use higher temperature (0.7-1.0) for creative ideas.

📏

Context Window = Memory

Claude's 200K context window means it can "see" ~150 pages at once. Every token in that window attends to every other token — that's the power of self-attention.
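The embedding-similarity retrieval described in the RAG card above can be sketched as cosine ranking. The 4-dimensional vectors and chunk names here are hypothetical stand-ins (real embedding models use hundreds or thousands of dimensions):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical tiny embeddings for a query and two document chunks
query = np.array([0.9, 0.1, 0.3, 0.0])
chunks = {
    "chargeback policy": np.array([0.8, 0.2, 0.4, 0.1]),
    "office lunch menu": np.array([0.0, 0.9, 0.1, 0.8]),
}

ranked = sorted(chunks, key=lambda c: cosine_similarity(query, chunks[c]),
                reverse=True)
print(ranked[0])  # the most relevant chunk is retrieved first
```

The retrieved chunks are then placed into the context window, where self-attention lets the generated answer attend to them directly.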