From Tokens to Meaning

How AI turns words into numbers that capture meaning — from one-hot vectors to dense embeddings where similar words live close together.

📖 Day 1 Reference ⚡ Interactive 🏦 Finance Context

🤔 The Problem: Tokens Are Just IDs

After tokenization, each token gets an integer ID from the vocabulary. "merchant" might be token #15234, "risk" might be #8901. But these numbers are arbitrary — #15234 isn't "close to" #15235 in any meaningful way.

The model needs a representation where similar words have similar numbers. That's what embeddings do.

🔢

Token IDs Are Meaningless

"merchant" = 15234, "vendor" = 42891. These IDs say nothing about the words being synonyms. The model can't learn from arbitrary numbers.

📊

One-Hot: Simple but Wasteful

Give each token a vector with a single 1. But with 50,000 tokens, that's 50,000 dimensions — and "cute" is just as far from "adorable" as from "airplane".

🧠

Dense Embeddings: The Solution

Learn compact vectors (256-4096 dimensions) where similar words end up close together. "merchant" and "vendor" get similar vectors automatically.

💡

"You Know a Word by Its Company"

Words appearing in similar contexts get similar embeddings. "merchant" and "vendor" both appear near "payment", "transaction", "account" — so they end up close in embedding space.

🔗 Where Embeddings Fit

📝 Your Text → ✂️ Tokenizer → 🔢 Token IDs → 📊 Embeddings → 🧠 Transformer → 💬 Output

Embeddings convert meaningless token IDs into rich vectors that capture semantic meaning.

🔎 Trace It: From Text to Vectors

Let's follow a real sentence through every step of the pipeline:

Stage | What You See
📝 Your text "The merchant's chargeback rate"
✂️ Tokenizer
BPE splits into subwords
The merchant 's charge back rate

"chargeback" isn't in the vocabulary as one token, so BPE splits it → charge + back

🔢 Token IDs
Vocabulary lookup
464 15234 338 1234 1235 4873

Each token string maps to a unique integer — just a row number in the vocabulary table. These numbers have no meaning by themselves.
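This lookup can be sketched as a plain dictionary mapping. The IDs below are the illustrative ones from this walkthrough; real tokenizers assign their own numbers.

```python
# Toy vocabulary: each token string maps to an arbitrary row number.
# IDs are illustrative -- a real BPE tokenizer assigns different ones.
vocab = {"The": 464, "merchant": 15234, "'s": 338,
         "charge": 1234, "back": 1235, "rate": 4873}

tokens = ["The", "merchant", "'s", "charge", "back", "rate"]
token_ids = [vocab[t] for t in tokens]
print(token_ids)  # [464, 15234, 338, 1234, 1235, 4873]
```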

📊 Embedding
Table lookup → 768-dim vector
ID 464 → [0.12, -0.34, 0.78, 0.05, ..., -0.21] ← "The"
ID 15234 → [0.23, -0.41, 0.87, -0.12, ..., 0.56] ← "merchant"
ID 338 → [-0.08, 0.15, 0.22, 0.44, ..., 0.03] ← "'s"
ID 1234 → [0.45, -0.28, 0.63, 0.31, ..., -0.17] ← "charge"
ID 1235 → [0.38, -0.19, 0.55, 0.27, ..., -0.09] ← "back"
ID 4873 → [-0.15, 0.72, 0.33, 0.91, ..., -0.28] ← "rate"

Each ID looks up one row in the embedding matrix (50,000 rows × 768 columns). This is instant — no computation, just a table lookup. The 768 numbers encode meaning.
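As a sketch, this step is literally array indexing. Here the matrix is filled with random numbers as a stand-in for learned weights:

```python
import numpy as np

# A (vocab_size x embed_dim) embedding matrix, matching the 50,000 x 768
# example above. Random values stand in for trained weights.
vocab_size, embed_dim = 50_000, 768
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(vocab_size, embed_dim)).astype(np.float32)

token_ids = [464, 15234, 338, 1234, 1235, 4873]
vectors = embedding_matrix[token_ids]  # pure indexing -- no arithmetic at all

print(vectors.shape)  # (6, 768)
```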

🧠 Transformer
Self-attention enriches

Each 768-dim vector is now refined by looking at all other tokens. After attention, "charge" knows it's next to "back" (so it means chargeback, not electrical charge). The vectors are now context-dependent.

💡
The key takeaway: Humans see words. The tokenizer sees subword strings. The model sees integer IDs. The embedding layer converts those meaningless IDs into rich 768-dimensional vectors where position = meaning. Everything after that builds on these vectors.

🔢 One-Hot Encoding: The Naive Approach

The simplest way to represent tokens as numbers: give each token a vector of size |V| (vocabulary size) with a single 1 at its position and 0s everywhere else.

Vocabulary (simplified): 8 tokens
merchant, risk, payment, fraud, bank, loan, rate, green
⚠️
Two fatal problems:
1. No similarity: The dot product of any two distinct one-hot vectors is 0. "merchant" and "vendor" are just as different as "merchant" and "airplane".
2. Huge dimensions: Real vocabularies have 50,000-100,000 tokens. Each vector is 50,000-dimensional with 49,999 zeros. Extremely wasteful.

📐 Why One-Hot Fails: No Similarity

In one-hot space, every word is equally far from every other word. The distance between "merchant" and "vendor" is the same as between "merchant" and "banana".

Dot products (similarity) in one-hot space
merchant · vendor = 0 (orthogonal — no similarity!)
merchant · banana = 0 (same distance!)
merchant · merchant = 1 (only identical words match)

We need a representation where meaning is encoded in the numbers — not just identity.
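A minimal check of this, using the 8-token toy vocabulary above:

```python
import numpy as np

# One-hot vectors over the 8-token toy vocabulary.
vocab = ["merchant", "risk", "payment", "fraud", "bank", "loan", "rate", "green"]

def one_hot(word):
    vec = np.zeros(len(vocab))
    vec[vocab.index(word)] = 1.0
    return vec

# Every pair of distinct words is orthogonal; only a word with itself scores 1.
print(one_hot("merchant") @ one_hot("risk"))      # 0.0
print(one_hot("merchant") @ one_hot("merchant"))  # 1.0
```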

🧠 Dense Embeddings: Meaning in Numbers

Instead of sparse one-hot vectors, we learn dense vectors (typically 256-4096 dimensions) where each dimension captures some aspect of meaning. Similar words get similar vectors.

One-Hot

[0,0,0,1,0,...,0]
50,000 dims
1 non-zero

Learned Mapping

Neural network discovers meaningful dimensions

Dense Embedding

[0.23,-0.41,0.87,...]
768 dims
ALL non-zero

🏗️ How Embeddings Are Learned: Word2Vec

The key insight: "You shall know a word by the company it keeps." Words appearing in similar contexts get similar embeddings.

Skip-gram: Predict Context from Target

Training example
Target: "merchant" → Predict: "payment", "risk", "chargeback", "account"
Target: "vendor" → Predict: "payment", "invoice", "contract", "account"

Since "merchant" and "vendor" both predict similar context words ("payment", "account"), their embeddings converge to similar vectors.

The Result: Semantic Arithmetic

Embedding arithmetic
vector("merchant") - vector("sell") + vector("buy") ≈ vector("customer")
vector("Singapore") - vector("SGD") + vector("MYR") ≈ vector("Malaysia")
💡
For AnyCompany: In a well-trained embedding space, "chargeback", "dispute", and "refund" would cluster together. "PayLater", "BNPL", and "installment" would form another cluster. The model learns these relationships from data — no one programs them in.
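The analogy arithmetic can be demonstrated with hand-crafted 2-D toy vectors, deliberately built so the directions line up (learned embeddings discover such directions on their own, in hundreds of dimensions):

```python
import numpy as np

# Toy 2-D vectors: dim 0 encodes seller-vs-buyer role, dim 1 encodes noun-ness.
# Hand-crafted so the arithmetic works; real embeddings learn this from data.
vecs = {
    "merchant": np.array([ 1.0, 1.0]),
    "customer": np.array([-1.0, 1.0]),
    "sell":     np.array([ 1.0, 0.0]),
    "buy":      np.array([-1.0, 0.0]),
}

# merchant - sell + buy lands nearest to customer.
result = vecs["merchant"] - vecs["sell"] + vecs["buy"]
nearest = min(vecs, key=lambda w: np.linalg.norm(vecs[w] - result))
print(nearest)  # customer
```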

📊 What Embedding Dimensions Capture

Each dimension in an embedding vector captures some abstract feature. While individual dimensions aren't always interpretable, patterns emerge:

Hypothetical Dimension | High Value | Low Value | Finance Example
Dim 42: "financial-ness" | payment, merchant, loan | banana, cloud, guitar | Helps group finance terms
Dim 108: "risk level" | fraud, default, breach | safe, approved, verified | Captures risk sentiment
Dim 256: "formality" | pursuant, hereby, whereas | hey, cool, lol | Distinguishes legal vs casual
Dim 512: "geography" | Singapore, Jakarta, Bangkok | abstract, concept, theory | Groups location terms

Real embeddings have 768-4096 dimensions. Each word's position in this high-dimensional space encodes its full meaning.

🎮 3D Embedding Space Explorer

🎓
This is a 3D projection of a high-dimensional embedding space: words that appear in similar contexts cluster together. In the interactive version you can drag to rotate, scroll to zoom, and click any word to highlight its nearest neighbors. The axes run from less financial to more financial and from concrete to abstract, with depth indicating specificity; words are color-coded by category: Finance, Risk, Food/Retail, Technology, General.
🔍
Why click a word? This simulates how RAG (Retrieval Augmented Generation) works. When you ask Claude a question, it converts your query into an embedding, finds the nearest document embeddings, and uses those as context. Click "merchant" — the 5 connected words are what a RAG system would retrieve. Notice "banana" never appears as a neighbor.

🔍 Cosine Similarity Calculator

Select two words to compute their cosine similarity — the standard way to measure how "close" two embeddings are:

0.8 – 1.0: Very similar
0.3 – 0.7: Somewhat related
0.0 – 0.2: Unrelated

📐 Measuring Similarity: Cosine vs Euclidean

Once we have embeddings, we need to measure how similar two words are. The standard metric is cosine similarity — it measures the angle between two vectors, ignoring their magnitude.

Metric | Formula | Range | Best For
Cosine Similarity | cos(θ) = (A·B) / (‖A‖·‖B‖) | -1 to 1 | Comparing meaning regardless of vector length
Euclidean Distance | √(Σ(aᵢ-bᵢ)²) | 0 to ∞ | When magnitude matters
Dot Product | Σ(aᵢ·bᵢ) | -∞ to ∞ | Fast approximate similarity
🔍
Why cosine over Euclidean? Cosine is scale-invariant. Vectors [1, 2] and [100, 200] point in the same direction (cosine = 1) even though their Euclidean distance is huge. For embeddings, direction matters more than magnitude.
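A minimal implementation of cosine similarity, showing the scale-invariance described above:

```python
import numpy as np

def cosine_similarity(a, b):
    # Angle-based similarity: dot product divided by both vector lengths.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0])
b = np.array([100.0, 200.0])  # same direction, 100x the magnitude

print(cosine_similarity(a, b))  # ≈ 1.0 (identical direction)
print(np.linalg.norm(a - b))    # ≈ 221.4 (Euclidean distance is huge)
```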

🏦 Similarity in Finance Context

High Similarity (0.85+)

"chargeback" ↔ "dispute" (0.91)
"merchant" ↔ "vendor" (0.87)
"PayLater" ↔ "BNPL" (0.89)

Medium Similarity (0.4-0.7)

"merchant" ↔ "payment" (0.62)
"risk" ↔ "insurance" (0.55)
"loan" ↔ "interest" (0.58)

Low Similarity (0.0-0.2)

"merchant" ↔ "banana" (0.05)
"chargeback" ↔ "guitar" (0.02)
"SGD" ↔ "poetry" (0.01)

💡
Why this matters for RAG: When you ask Claude "What's our chargeback policy?", the system uses embedding similarity to find the most relevant documents. If "chargeback" and "dispute" have high similarity, the system also retrieves dispute-related docs — giving Claude better context.
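This retrieval step can be sketched as a nearest-neighbor search by cosine similarity. The document names are hypothetical and random vectors stand in for real document embeddings; an actual RAG system would produce them with an embedding model.

```python
import numpy as np

# Hypothetical document embeddings: random stand-ins for vectors an
# embedding model would produce.
rng = np.random.default_rng(42)
docs = {name: rng.normal(size=64) for name in
        ["chargeback_policy", "dispute_process", "office_menu"]}

def top_k(query_vec, docs, k=2):
    # Rank documents by cosine similarity to the query embedding.
    def cos(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return sorted(docs, key=lambda d: cos(query_vec, docs[d]), reverse=True)[:k]

# A query whose embedding sits close to one document.
query = docs["chargeback_policy"] + rng.normal(scale=0.1, size=64)
print(top_k(query, docs))  # 'chargeback_policy' should rank first
```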

📈 Evolution of Text Representation

1️⃣
One-Hot Encoding

Sparse, no meaning. Every word equally far from every other word.

2️⃣
Word2Vec / GloVe

Dense, meaningful, but static — "bank" gets the same vector whether it means "river bank" or "bank account".

3️⃣
ELMo

Context-dependent! "bank" gets different vectors in different sentences. Uses bidirectional LSTMs.

4️⃣
Transformer Embeddings ⭐

Context-dependent + parallel attention. Each token's embedding is enriched by ALL other tokens. This is what Claude, GPT, LLaMA use.

🔗 The Complete Embedding Pipeline

Let's trace how "Assess this merchant's risk" becomes numbers the model can process:

Step 1: Tokenize

BPE Tokenization
"Assess this merchant's risk" → ["Assess", "this", "merchant", "'s", "risk"]

Step 2: Token IDs

Vocabulary lookup
["Assess", "this", "merchant", "'s", "risk"] → [8234, 428, 15234, 338, 8901]

Each token maps to a unique integer from the vocabulary table.

Step 3: Embedding Lookup

Each token ID indexes into the embedding matrix — a giant table of learned vectors. For a model with vocabulary size 50,000 and embedding dimension 768:

Embedding matrix: 50,000 × 768
ID 15234 ("merchant") → [0.23, -0.41, 0.87, -0.12, ..., 0.56] (768 numbers)
ID 8901 ("risk") → [-0.15, 0.72, 0.33, 0.91, ..., -0.28] (768 numbers)
🔍
This is just a table lookup — no computation! The embedding matrix is learned during training and stored as model weights. At inference time, converting token IDs to embeddings is instant.

Step 4: Add Position Encoding

The model needs to know where each token sits. Position embeddings are added so "The merchant is risky" and "Is the merchant risky?" produce different representations.

Final input to transformer
embedding("merchant") + position(3) = context-ready vector

Step 5: Self-Attention Enriches Embeddings

The transformer's self-attention mechanism lets each token's embedding absorb information from all other tokens. After attention, "risk" in "merchant risk" has a different embedding than "risk" in "health risk" — because the surrounding context is different.

This is the breakthrough that makes modern LLMs so powerful: context-dependent embeddings.

💰 Embedding Dimensions Across Models

Model | Embedding Dim | Vocab Size | Embedding Matrix Size
BERT Base | 768 | 30,522 | 23.4M parameters
GPT-2 XL | 1,600 | 50,257 | 80.4M parameters
LLaMA 3 8B | 4,096 | 128,256 | 525M parameters
Claude (estimated) | ~8,192 | ~100K | ~819M parameters

The embedding matrix alone can be hundreds of millions of parameters — and that's just the first layer!
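The matrix sizes above are just vocab_size × embedding_dim, which is easy to verify:

```python
# Embedding matrix parameter count = vocab_size x embed_dim
# (one learned vector per token in the vocabulary).
models = {
    "BERT Base":  (30_522, 768),
    "GPT-2 XL":   (50_257, 1_600),
    "LLaMA 3 8B": (128_256, 4_096),
}
for name, (vocab, dim) in models.items():
    print(f"{name}: {vocab * dim / 1e6:.1f}M parameters")
# BERT Base: 23.4M parameters
# GPT-2 XL: 80.4M parameters
# LLaMA 3 8B: 525.3M parameters
```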