From characters to tokens to numbers — the first step in every LLM pipeline. Interactive explainer with finance examples.
AI models like Claude, GPT, and LLaMA don't understand text the way we do. They work with numbers, not letters. Before any AI can process your message, your text needs to be broken into small pieces called tokens and converted to numbers.
Think of it like this: you're sending a message to someone who only speaks in numbered codes. You need a translation system — that's what a tokenizer does.
You could split text by spaces, but then the AI would need a separate number for every word it's ever seen — including misspellings, slang, and every language on Earth. That's millions of entries. We want a smarter, smaller vocabulary.
Byte Pair Encoding starts with individual characters and repeatedly merges the most common pair into a new token. Over many merges, it discovers useful pieces — common syllables, word roots, suffixes.
Tokenization is the very first step in every LLM pipeline. It determines how the model "sees" your text, affects how many tokens fit in the context window, and directly impacts your cost.
Every API call is billed by token count — both input and output. Understanding tokenization helps you estimate costs and optimize prompts. Numbers and special characters often cost more tokens than you'd expect.
There are four levels at which you can split text. Each has trade-offs:
| Level | How It Splits | Example: "chargebacks" | Trade-off |
|---|---|---|---|
| Word | Each word = 1 token | chargebacks | Simple but huge vocabulary, can't handle unknown words |
| Subword ⭐ | Meaningful pieces | charge + backs | Best balance — what all modern models use |
| Character | Each character = 1 token | c h a r g e b a c k s | Tiny vocabulary but very long sequences |
| Byte | Raw byte encoding | 99 104 97 114 ... | Handles any language but extremely long |
Tokenization is Step 1 — everything downstream depends on how text is split into tokens.
BPE is the most widely used tokenization algorithm. It's used by GPT, LLaMA, Claude, and most modern LLMs. The idea is beautifully simple:
Split the entire training corpus into individual characters. This is your starting vocabulary.
Find which two adjacent tokens appear together most often across the corpus.
Combine that pair into a single new token and add it to the vocabulary.
Keep merging until the vocabulary reaches the desired size or no pair appears more than once.
Let's walk through BPE on the classic textbook example. Watch how common suffixes like "est" and "low" emerge naturally:
low lower newest widest
l o w _ l o w e r _ n e w e s t _ w i d e s t _
Each character becomes its own token. The _ marks end-of-word boundaries.
e + s → es: Appears in "newest" and "widest" — 2 occurrences. Merge them.
es + t → est: The suffix "est" emerges! Common in English superlatives.
l + o → lo: Appears in "low" and "lower" — 2 occurrences.
lo + w → low: The word "low" becomes a single token! It's frequent enough to earn its own entry.
est + _ → est_: The end-of-word suffix "est_" is now a single token.
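The merge loop above can be sketched in a few lines of Python. This is a toy trainer, not a production tokenizer; tie-breaking between equally frequent pairs may differ from the order shown in the walkthrough, but the same five merges emerge from this corpus:

```python
from collections import Counter

def train_bpe(words, num_merges):
    # Represent each word as a tuple of symbols ending in the "_" marker.
    vocab = Counter(tuple(w) + ("_",) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair across the corpus.
        pairs = Counter()
        for word, freq in vocab.items():
            for pair in zip(word, word[1:]):
                pairs[pair] += freq
        if not pairs or pairs.most_common(1)[0][1] < 2:
            break  # no pair repeats; stop merging
        # Merge the most frequent pair into a single new token.
        a, b = max(pairs, key=pairs.get)
        merges.append((a, b))
        new_vocab = Counter()
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and word[i] == a and word[i + 1] == b:
                    out.append(a + b)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges

print(train_bpe(["low", "lower", "newest", "widest"], 5))
```

Running this on the corpus above discovers the same five rules as the walkthrough (es, est, lo, low, est_), just possibly in a different order.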
A common confusion: the BPE playground above shows how the tokenizer is trained — discovering merge rules from a corpus. But when you actually use Claude, the tokenizer works differently:
Happens once, before the LLM is trained. BPE scans a massive corpus (Wikipedia, books, code) and discovers ~50,000-100,000 merge rules. These rules are saved as a file.
Output: merges.txt + vocab.json
This is what our playground simulates ☝️
Happens every time you send a prompt. Your text is split into characters, then the pre-learned merge rules are applied in order. No new rules are discovered — it's a fast lookup.
Input: your prompt
Output: token IDs → embeddings → transformer
"Assess this merchant's risk" → split into characters → apply ~100K pre-learned merge rules in order → ["Assess", " this", " merchant", "'s", " risk"] → 5 tokens
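At inference time the loop is much simpler: replay the saved merge rules, in order, with no new learning. A minimal sketch using the five rules from the walkthrough above; note how the unseen word "lowest" still tokenizes cleanly into known pieces:

```python
def apply_merges(word, merges):
    # Start from characters plus the end-of-word marker, then apply each
    # pre-learned merge rule in the order it was learned.
    tokens = list(word) + ["_"]
    for a, b in merges:
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens

# The five rules learned in the walkthrough above.
merges = [("e", "s"), ("es", "t"), ("l", "o"), ("lo", "w"), ("est", "_")]
print(apply_merges("lowest", merges))  # → ['low', 'est_']
```

"lowest" never appeared in the training corpus, yet it comes out as two known subwords. That's the whole point of subword tokenization: no [UNK] needed.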
Let's see how BPE handles text you'd actually encounter at AnyCompany:
"The merchant's chargeback rate is 4.1%"
A trained BPE tokenizer would split this roughly as:
["The", " merchant", "'s", " charge", "back", " rate", " is", " 4", ".", "1", "%"] = 11 tokens. Notice: "chargeback" splits into "charge" + "back" (known subwords), and "4.1%" becomes 4 separate tokens — numbers and punctuation are expensive!
SGD 4,200.50, 3.2%, 2024-01-15 uses ~15 tokens — mostly because each digit, comma, and period is a separate token.
Split text into individual characters (+ end-of-word marker _)
Count all adjacent token pairs
Merge the most frequent pair into a new token
Repeat until no pair appears more than once
Modern LLMs use one of three subword tokenization algorithms. They all solve the same problem — finding the right-sized pieces — but approach it differently:
| Algorithm | Direction | Merge Criterion | Used By |
|---|---|---|---|
| BPE (Byte Pair Encoding) | Bottom-up (merge) | Most frequent pair | GPT, LLaMA, Claude |
| WordPiece | Bottom-up (merge) | Most likely pair (probability) | BERT |
| Unigram | Top-down (prune) | Remove least useful tokens | T5 |
Start with characters. Repeatedly merge the most frequent adjacent pair. Simple, fast, effective.
Like BPE, but merges the most likely pair based on probability, not raw frequency. Uses ## prefix for continuations.
Start with all possible substrings. Iteratively remove tokens that contribute least to overall likelihood. Opposite direction from BPE.
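To make the WordPiece difference concrete, here is a sketch of its greedy longest-match-first inference step. The vocabulary below is a made-up toy; real WordPiece vocabularies are learned, and the training criterion differs from BPE as described above:

```python
def wordpiece_tokenize(word, vocab):
    # Greedy longest-match-first: repeatedly take the longest vocabulary
    # entry that prefixes the remaining text. Pieces that continue a word
    # carry the "##" prefix.
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            piece = word[i:j] if i == 0 else "##" + word[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            return ["[UNK]"]  # no piece matched; the whole word is unknown
    return tokens

vocab = {"charge", "##back", "##backs", "##s"}  # toy vocabulary
print(wordpiece_tokenize("chargebacks", vocab))  # → ['charge', '##backs']
```

The greedy matcher prefers "##backs" over "##back" + "##s" because it always tries the longest piece first.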
Every vocabulary includes reserved tokens with specific purposes:
| Token | Purpose | Example |
|---|---|---|
| [PAD] | Fills sequences to a fixed length so batches are uniform | merchant risk [PAD] [PAD] |
| [UNK] | Represents any word not found in the vocabulary | the [UNK] rate is |
| [CLS] | Class token placed at start (BERT-specific) | [CLS] assess this merchant |
| [SEP] | Separates two segments | ... data [SEP] assess ... |
| [MASK] | Hides a token for the model to predict (BERT training) | the [MASK] rate is high |
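As a quick illustration of why [PAD] exists: batched tensors must be rectangular, so shorter sequences get padded to the length of the longest. A minimal sketch (hypothetical helper, not from any specific library):

```python
def pad_batch(sequences, pad_token="[PAD]"):
    # Pad every sequence to the length of the longest one so the
    # batch forms a rectangular array.
    max_len = max(len(s) for s in sequences)
    return [s + [pad_token] * (max_len - len(s)) for s in sequences]

batch = [["merchant", "risk"],
         ["assess", "this", "merchant", "now"]]
print(pad_batch(batch))
# first row becomes ["merchant", "risk", "[PAD]", "[PAD]"]
```

In practice a matching attention mask is also produced so the model ignores the padding positions.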
Tokens are the currency of AI — literally. Every API call is billed by token count. Understanding tokenization helps you estimate costs, optimize prompts, and choose the right model.
"AnyCompany Financial Group" = 4 tokens
"Merchant risk assessment" = 4 tokens
"$4,200.50" = 6 tokens (each digit, comma, period, dollar sign). Financial data with lots of numbers costs more.
Type or paste text below to estimate token count and cost across different models:
Estimate only — actual token count depends on the specific model's tokenizer. Assumes ~75% input, ~25% output.
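A back-of-the-envelope estimator like this one can be sketched with the common ~4-characters-per-token rule of thumb. The prices and the 75/25 input/output split below are illustrative assumptions, not exact figures for any model:

```python
def estimate_cost(text, input_price_per_m=3.00, output_price_per_m=15.00):
    # Rough heuristic: ~4 characters per token for English prose.
    # Prices are per million tokens (Sonnet-class, illustrative).
    tokens = max(1, len(text) // 4)
    input_tokens = int(tokens * 0.75)      # assume ~75% of tokens are input
    output_tokens = tokens - input_tokens  # remaining ~25% are output
    cost = (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000
    return tokens, cost

tokens, cost = estimate_cost("Assess this merchant's chargeback history " * 30)
print(tokens, round(cost, 4))
```

For real billing, always count tokens with the model's own tokenizer; number-heavy financial text will run above the 4-chars-per-token average.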
| Content Type | Approx. Tokens | Cost (Sonnet 4) |
|---|---|---|
| A short question ("Assess this merchant") | ~5 tokens | < $0.001 |
| A paragraph of merchant data (10 lines) | ~150 tokens | $0.0005 |
| Our engineered prompt template | ~400 tokens | $0.0012 |
| A full risk assessment output (8 sections) | ~800 tokens | $0.012 |
| Total per assessment (input + output) | ~1,350 tokens | $0.013 |
| 200 assessments per month | ~270,000 tokens | $2.64 |
Simple classification → Nova Micro ($0.04/1M). Narrative generation → Sonnet ($3/1M). Don't use Opus to sort mail.
Remove redundant instructions (10-20% savings). Use shorter examples. Constrain output length.
Bedrock offers a 50% discount for batch inference — perfect for monthly portfolio assessments.
Cache repeated system prompts — saves ~90% on the template portion for subsequent calls.
Tokenization is just the first step. Here's the complete pipeline that turns your text into an AI response:
Your text is split into subword tokens using BPE (or WordPiece/Unigram). Each token maps to an integer ID from the vocabulary.
"merchant risk" → ["merchant", "risk"] → [15234, 8901]
Each token ID is converted into a dense vector of numbers (typically 768–12,288 dimensions). These vectors capture semantic meaning — similar words get similar vectors.
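Mechanically, the embedding layer is just a big lookup table: each token ID indexes one row. A sketch with illustrative sizes, reusing the hypothetical IDs from the example above:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 50_000, 768            # illustrative sizes
embedding_table = rng.normal(size=(vocab_size, dim))

token_ids = [15234, 8901]                # hypothetical IDs for "merchant", "risk"
vectors = embedding_table[token_ids]     # one dense 768-dim vector per token
print(vectors.shape)  # (2, 768)
```

In a real model these rows start random like this, then training moves semantically similar tokens toward similar vectors.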
The model needs to know where each token sits in the sequence. Position embeddings are added so "The merchant is risky" and "Is the merchant risky?" produce different representations despite having the same words.
Modern models like LLaMA use RoPE (Rotary Position Embeddings) — which encodes relative position, so the relationship between "merchant" and "risk" is the same whether they're at positions 3,4 or 103,104.
This is where the magic happens. Every token can directly look at every other token and decide which ones are relevant. The model asks three questions for each token:
"What am I looking for?"
"What do I contain?"
"Here's my actual content"
When processing "The merchant's chargeback rate is high", the word "high" pays strong attention to "chargeback rate" (what's high?) and "merchant" (whose rate?). This context-awareness is what makes transformers so powerful.
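The query/key/value step can be sketched as plain scaled dot-product attention — a single head with no masking or learned projection weights, so a deliberately minimal version of what a real transformer layer computes:

```python
import numpy as np

def attention(Q, K, V):
    # Each token's query scores every key; softmax turns the scores into
    # weights; the output mixes the value vectors by those weights.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 8)) for _ in range(3))  # 5 tokens, 8 dims
out = attention(Q, K, V)
print(out.shape)  # (5, 8): one context-mixed vector per token
```

When all scores are equal, every token simply averages all the values; training is what makes "high" attend strongly to "chargeback rate" instead.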
The transformer produces a probability distribution over the entire vocabulary for the next token. The model picks the most likely token (or samples from the distribution based on temperature), appends it, and repeats.
Input: "The merchant risk rating is"
→ P("RED")=0.72, P("AMBER")=0.21, P("GREEN")=0.05, ...
→ Output: "RED"
Temperature controls randomness: low temperature (0.0) = always pick the highest probability. High temperature (1.0) = more creative/random. For risk assessments, you want low temperature for consistency.
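Temperature is just a divisor applied to the raw scores before the softmax. A sketch — the logits for RED/AMBER/GREEN are made-up numbers for illustration:

```python
import numpy as np

def sample_next(logits, temperature=0.0, rng=None):
    # Temperature 0 = greedy argmax; higher temperatures flatten the
    # distribution so lower-probability tokens get sampled more often.
    logits = np.asarray(logits, dtype=float)
    if temperature == 0.0:
        return int(np.argmax(logits))
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    rng = rng or np.random.default_rng()
    return int(rng.choice(len(probs), p=probs))

# Made-up logits for ["RED", "AMBER", "GREEN"]
logits = [2.0, 0.8, -0.6]
print(sample_next(logits, temperature=0.0))  # → 0, i.e. "RED"
```

With temperature 0 this always returns "RED"; at temperature 1.0 it would occasionally return "AMBER" or "GREEN" — exactly why risk assessments should run at low temperature.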
Each breakthrough solved a limitation of the previous approach:
One-hot vectors: Sparse, no meaning. "cute" and "adorable" are as different as "cute" and "airplane".
Static word embeddings (word2vec, GloVe): Dense, meaningful, but static — one vector per word regardless of context.
RNNs / LSTMs: Sequential processing, captures order, but an information bottleneck for long sequences.
ELMo: Context-dependent embeddings via bidirectional LSTMs. "bank" gets different vectors in "river bank" vs "bank account".
Transformers: Parallel attention over all tokens. No bottleneck. This is what powers Claude, GPT, and LLaMA today.