How AI turns words into numbers that capture meaning — from one-hot vectors to dense embeddings where similar words live close together.
After tokenization, each token gets an integer ID from the vocabulary. "merchant" might be token #15234, "risk" might be #8901. But these numbers are arbitrary — #15234 isn't "close to" #15235 in any meaningful way.
The model needs a representation where similar words have similar numbers. That's what embeddings do.
- **Arbitrary IDs:** "merchant" = 15234, "vendor" = 42891. These IDs say nothing about the words being synonyms. The model can't learn from arbitrary numbers.
- **One-hot vectors:** Give each token a vector with a single 1. But with 50,000 tokens, that's 50,000 dimensions — and "cute" is just as far from "adorable" as from "airplane".
- **Dense embeddings:** Learn compact vectors (256-4096 dimensions) where similar words end up close together. "merchant" and "vendor" get similar vectors automatically.
Words appearing in similar contexts get similar embeddings. "merchant" and "vendor" both appear near "payment", "transaction", "account" — so they end up close in embedding space.
Embeddings convert meaningless token IDs into rich vectors that capture semantic meaning.
Let's follow a real sentence through every step of the pipeline:
**📝 Your text:** "The merchant's chargeback rate"

**✂️ Tokenizer (BPE splits into subwords):** `The` · `merchant` · `'s` · `charge` · `back` · `rate`

"chargeback" isn't in the vocabulary as one token, so BPE splits it → `charge` + `back`.

**🔢 Token IDs (vocabulary lookup):** 464, 15234, 338, 1234, 1235, 4873

Each token string maps to a unique integer — just a row number in the vocabulary table. These numbers have no meaning by themselves.

**📊 Embedding (table lookup → 768-dim vector):**

- ID 464 → [0.12, -0.34, 0.78, 0.05, ..., -0.21] ← "The"
- ID 15234 → [0.23, -0.41, 0.87, -0.12, ..., 0.56] ← "merchant"
- ID 338 → [-0.08, 0.15, 0.22, 0.44, ..., 0.03] ← "'s"
- ID 1234 → [0.45, -0.28, 0.63, 0.31, ..., -0.17] ← "charge"
- ID 1235 → [0.38, -0.19, 0.55, 0.27, ..., -0.09] ← "back"
- ID 4873 → [-0.15, 0.72, 0.33, 0.91, ..., -0.28] ← "rate"

Each ID looks up one row in the embedding matrix (50,000 rows × 768 columns). This is instant — no computation, just a table lookup. The 768 numbers encode meaning.

**🧠 Transformer (self-attention enriches):** Each 768-dim vector is now refined by looking at all other tokens. After attention, "charge" knows it's next to "back" (so it means chargeback, not electrical charge). The vectors are now context-dependent.
The simplest way to represent tokens as numbers: give each token a vector of size |V| (vocabulary size) with a single 1 at its position and 0s everywhere else.
In one-hot space, every word is equally far from every other word. The distance between "merchant" and "vendor" is the same as between "merchant" and "banana".
merchant · vendor = 0 (orthogonal — no similarity!)
merchant · banana = 0 (same distance!)
merchant · merchant = 1 (only identical words match)
We need a representation where meaning is encoded in the numbers — not just identity.
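The orthogonality problem is easy to verify numerically. A minimal NumPy sketch: the IDs for "merchant" and "vendor" are the illustrative ones used in this article, and the ID for "banana" is made up.

```python
import numpy as np

V = 50_000  # vocabulary size used throughout this article

def one_hot(token_id: int, size: int = V) -> np.ndarray:
    """Vector of length `size` with a single 1 at position `token_id`."""
    v = np.zeros(size)
    v[token_id] = 1.0
    return v

merchant = one_hot(15234)   # ID as in the text
vendor   = one_hot(42891)   # ID as in the text
banana   = one_hot(7777)    # hypothetical ID for "banana"

print(merchant @ vendor)    # 0.0, orthogonal: no similarity
print(merchant @ banana)    # 0.0: exactly the same "distance"
print(merchant @ merchant)  # 1.0: only identical words match
```

Every pair of distinct one-hot vectors has dot product 0, so no amount of training data can teach the model that some pairs are more alike than others.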
Instead of sparse one-hot vectors, we learn dense vectors (typically 256-4096 dimensions) where each dimension captures some aspect of meaning. Similar words get similar vectors.
| | One-hot vector | Dense embedding |
|---|---|---|
| Example | [0,0,0,1,0,...,0] | [0.23,-0.41,0.87,...] |
| Dimensions | 50,000 | 768 |
| Non-zero entries | 1 | all |

A neural network learns this compression, discovering meaningful dimensions along the way.
The key insight: "You shall know a word by the company it keeps." Words appearing in similar contexts get similar embeddings.
Target: "merchant" → Predict: "payment", "risk", "chargeback", "account"
Target: "vendor" → Predict: "payment", "invoice", "contract", "account"
Since "merchant" and "vendor" both predict similar context words ("payment", "account"), their embeddings converge to similar vectors.
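This convergence can be demonstrated at toy scale: simply count which words appear alongside which, and the count vectors for "merchant" and "vendor" already point in similar directions. The six-line corpus below is invented for illustration; real systems train learned (not counted) vectors on billions of tokens.

```python
import numpy as np

# Invented corpus: "merchant" and "vendor" share context words
# ("payment", "account"); "banana" shares none of them.
corpus = [
    "merchant payment account",
    "merchant transaction account",
    "vendor payment account",
    "vendor invoice contract",
    "banana fruit yellow",
    "banana fruit smoothie",
]
vocab = sorted({w for line in corpus for w in line.split()})
idx = {w: i for i, w in enumerate(vocab)}

# Co-occurrence counts: how often word a appears alongside word b.
C = np.zeros((len(vocab), len(vocab)))
for line in corpus:
    words = line.split()
    for a in words:
        for b in words:
            if a != b:
                C[idx[a], idx[b]] += 1

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cos(C[idx["merchant"]], C[idx["vendor"]]))  # high: shared contexts
print(cos(C[idx["merchant"]], C[idx["banana"]]))  # 0.0: no shared contexts
```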
vector("merchant") - vector("sell") + vector("buy") ≈ vector("customer")
vector("Singapore") - vector("SGD") + vector("MYR") ≈ vector("Malaysia")
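To see why such arithmetic can work at all, here is a hand-built sketch: four toy dimensions (hypothetical "commerce", "sell-side", "buy-side", "finance" axes) chosen so the merchant analogy holds exactly. Real embeddings learn hundreds of fuzzier dimensions, and the analogy holds only approximately.

```python
import numpy as np

# Hypothetical axes: [commerce, sell-side, buy-side, finance]
merchant = np.array([1.0, 1.0, 0.0, 0.0])   # commercial, sells
sell     = np.array([0.0, 1.0, 0.0, 0.0])
buy      = np.array([0.0, 0.0, 1.0, 0.0])
customer = np.array([1.0, 0.0, 1.0, 0.0])   # commercial, buys

result = merchant - sell + buy               # swap "sells" for "buys"
print(result)                                # [1. 0. 1. 0.]
print(np.allclose(result, customer))         # True
```

Subtracting `sell` removes the sell-side component and adding `buy` installs the buy-side one, landing exactly on `customer` in this constructed space.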
Each dimension in an embedding vector captures some abstract feature. While individual dimensions aren't always interpretable, patterns emerge:
| Hypothetical Dimension | High Value | Low Value | Finance Example |
|---|---|---|---|
| Dim 42: "financial-ness" | payment, merchant, loan | banana, cloud, guitar | Helps group finance terms |
| Dim 108: "risk level" | fraud, default, breach | safe, approved, verified | Captures risk sentiment |
| Dim 256: "formality" | pursuant, hereby, whereas | hey, cool, lol | Distinguishes legal vs casual |
| Dim 512: "geography" | Singapore, Jakarta, Bangkok | abstract, concept, theory | Groups location terms |
Real embeddings have 768-4096 dimensions. Each word's position in this high-dimensional space encodes its full meaning.
Once we have embeddings, we need to measure how similar two words are. The standard metric is cosine similarity — it measures the angle between two vectors, ignoring their magnitude.
| Metric | Formula | Range | Best For |
|---|---|---|---|
| Cosine Similarity ⭐ | cos(θ) = (A·B) / (‖A‖·‖B‖) | -1 to 1 | Comparing meaning regardless of vector length |
| Euclidean Distance | √(Σ(aᵢ-bᵢ)²) | 0 to ∞ | When magnitude matters |
| Dot Product | Σ(aᵢ·bᵢ) | -∞ to ∞ | Fast approximate similarity |
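The starred metric is a one-liner. A minimal NumPy implementation, with made-up 3-dimensional vectors standing in for real 768-dimensional embeddings:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """cos(θ) = (A·B) / (‖A‖·‖B‖), in [-1, 1]."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([0.23, -0.41, 0.87])   # toy stand-ins for embeddings
b = np.array([0.25, -0.38, 0.90])   # points almost the same way as a
c = np.array([-0.80, 0.50, -0.10])  # points roughly the opposite way

print(cosine_similarity(a, b))   # close to 1: similar direction
print(cosine_similarity(a, c))   # negative: opposing direction
# Magnitude is ignored: doubling a vector doesn't change the angle.
print(np.isclose(cosine_similarity(2 * a, b), cosine_similarity(a, b)))  # True
```

The magnitude-invariance is why cosine similarity is preferred over raw dot products for comparing meaning: frequent words tend to get longer vectors, and cosine similarity cancels that out.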
**Near-synonyms (high similarity):**
- "chargeback" ↔ "dispute" (0.91)
- "merchant" ↔ "vendor" (0.87)
- "PayLater" ↔ "BNPL" (0.89)

**Related terms (moderate similarity):**
- "merchant" ↔ "payment" (0.62)
- "risk" ↔ "insurance" (0.55)
- "loan" ↔ "interest" (0.58)

**Unrelated (near zero):**
- "merchant" ↔ "banana" (0.05)
- "chargeback" ↔ "guitar" (0.02)
- "SGD" ↔ "poetry" (0.01)
- **One-hot vectors:** Sparse, no meaning. Every word equally far from every other word.
- **Word2Vec / GloVe:** Dense, meaningful, but static — "bank" gets the same vector whether it means "river bank" or "bank account".
- **ELMo:** Context-dependent! "bank" gets different vectors in different sentences. Uses bidirectional LSTMs.
- **Transformers:** Context-dependent + parallel attention. Each token's embedding is enriched by ALL other tokens. This is what Claude, GPT, and LLaMA use.
Let's trace how "Assess this merchant's risk" becomes numbers the model can process:
"Assess this merchant's risk" → ["Assess", "this", "merchant", "'s", "risk"]
["Assess", "this", "merchant", "'s", "risk"] → [8234, 428, 15234, 338, 8901]
Each token maps to a unique integer from the vocabulary table.
Each token ID indexes into the embedding matrix — a giant table of learned vectors. For a model with vocabulary size 50,000 and embedding dimension 768:
ID 15234 ("merchant") → [0.23, -0.41, 0.87, -0.12, ..., 0.56] (768 numbers)
ID 8901 ("risk") → [-0.15, 0.72, 0.33, 0.91, ..., -0.28] (768 numbers)
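The embedding lookup is literally just row indexing. A sketch with a randomly initialized matrix standing in for the learned weights; the vocabulary size, dimension, and token IDs follow the text:

```python
import numpy as np

VOCAB, DIM = 50_000, 768
rng = np.random.default_rng(0)
E = rng.normal(size=(VOCAB, DIM))   # stand-in for the learned embedding matrix

token_ids = [8234, 428, 15234, 338, 8901]   # "Assess this merchant 's risk"
vectors = E[token_ids]                      # pure row lookup, no arithmetic
print(vectors.shape)                        # (5, 768)
```

In a trained model, the rows of `E` are parameters adjusted by gradient descent; the lookup itself never changes.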
The model needs to know where each token sits. Position embeddings are added so "The merchant is risky" and "Is the merchant risky?" produce different representations.
embedding("merchant") + position(3) = context-ready vector
The transformer's self-attention mechanism lets each token's embedding absorb information from all other tokens. After attention, "risk" in "merchant risk" has a different embedding than "risk" in "health risk" — because the surrounding context is different.
This is the breakthrough that makes modern LLMs so powerful: context-dependent embeddings.
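A stripped-down sketch of that contextualization: a single head with no learned Q/K/V projections (real transformers learn those), just dot-product affinities and a softmax, so each output row becomes a context-weighted mix of all input rows.

```python
import numpy as np

def toy_self_attention(X: np.ndarray) -> np.ndarray:
    """Each row of X is one token's embedding. Output row i is a
    weighted average of ALL rows, weighted by similarity to row i."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)                   # token-to-token affinities
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
    return weights @ X                              # mix context into each token

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))    # 5 tokens, toy 8-dim embeddings
Y = toy_self_attention(X)
print(Y.shape)                 # (5, 8): same shape, now context-dependent
```

Because the mixing weights depend on the surrounding tokens, the same input row comes out different in different sentences, which is exactly the context-dependence described above.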
| Model | Embedding Dim | Vocab Size | Embedding Matrix Size |
|---|---|---|---|
| BERT Base | 768 | 30,522 | 23.4M parameters |
| GPT-2 | 1,600 | 50,257 | 80.4M parameters |
| LLaMA 3 8B | 4,096 | 128,256 | 525M parameters |
| Claude (estimated) | ~8,192 | ~100K | ~819M parameters |
The embedding matrix alone can be hundreds of millions of parameters — and that's just the first layer!
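Those parameter counts are simply vocabulary size × embedding dimension; a quick check of the table's rows (Claude's row is omitted since its figures are estimates):

```python
# (vocab size, embedding dim) pairs from the table above
models = {
    "BERT Base":  (30_522, 768),
    "GPT-2":      (50_257, 1_600),
    "LLaMA 3 8B": (128_256, 4_096),
}
for name, (vocab_size, dim) in models.items():
    print(f"{name}: {vocab_size * dim / 1e6:.1f}M parameters")
# BERT Base: 23.4M parameters
# GPT-2: 80.4M parameters
# LLaMA 3 8B: 525.3M parameters
```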