How AI Reads Your Text

From characters to tokens to numbers — the first step in every LLM pipeline. Interactive explainer with finance examples.

📖 Day 1 Reference ⚡ Interactive 💰 Finance Context

🤔 The Problem: AI Can't Read Words

AI models like Claude, GPT, and LLaMA don't understand text the way we do. They work with numbers, not letters. Before any AI can process your message, your text needs to be broken into small pieces called tokens and converted to numbers.

Think of it like this: you're sending a message to someone who only speaks in numbered codes. You need a translation system — that's what a tokenizer does.

✂️

Why Not Just Use Whole Words?

You could split text by spaces, but then the AI would need a separate number for every word it's ever seen — including misspellings, slang, and every language on Earth. That's millions of entries. We want a smarter, smaller vocabulary.

🧩

BPE: Learning the Building Blocks

Byte Pair Encoding starts with individual characters and repeatedly merges the most common pair into a new token. Over many merges, it discovers useful pieces — common syllables, word roots, suffixes.

Why This Matters

Tokenization is the very first step in every LLM pipeline. It determines how the model "sees" your text, affects how many tokens fit in the context window, and directly impacts your cost.

💰

You Pay Per Token

Every API call is billed by token count — both input and output. Understanding tokenization helps you estimate costs and optimize prompts. Numbers and special characters often cost more tokens than you'd expect.

📏 Token Granularity: How Big Should Each Piece Be?

There are four levels at which you can split text. Each has trade-offs:

Level     | How It Splits            | Example: "chargebacks" | Trade-off
----------|--------------------------|------------------------|----------
Word      | Each word = 1 token      | chargebacks            | Simple, but huge vocabulary and can't handle unknown words
Subword   | Meaningful pieces        | charge + backs         | Best balance — what all modern models use
Character | Each character = 1 token | c h a r g e ...        | Tiny vocabulary but very long sequences
Byte      | Raw byte encoding        | 99 104 97 114 ...      | Handles any language but extremely long
💡
The sweet spot is subword tokenization — it captures root meanings ("charge", "chargeback", "chargebacks" share a root) while keeping vocabulary manageable. This is what Claude, GPT, and LLaMA all use.

🔗 Where Tokenization Fits in the LLM Pipeline

📝Your Text
✂️Tokenizer
(BPE)
🔢Token IDs
📊Embeddings
🧠Transformer
💬Output

Tokenization is Step 1 — everything downstream depends on how text is split into tokens.

⚙️ BPE: Byte Pair Encoding — Step by Step

BPE is the most widely used tokenization algorithm. It's used by GPT, LLaMA, Claude, and most modern LLMs. The idea is beautifully simple:

1️⃣

Start with Characters

Split the entire training corpus into individual characters. This is your starting vocabulary.

2️⃣

Count Adjacent Pairs

Find which two adjacent tokens appear together most often across the corpus.

3️⃣

Merge the Top Pair

Combine that pair into a single new token and add it to the vocabulary.

4️⃣

Repeat

Keep merging until vocabulary reaches desired size or no pair appears more than once.

📖 Worked Example: The Classic BPE Demo

Let's walk through BPE on the classic textbook example. Watch how common suffixes like "est" and "low" emerge naturally:

Input Text
low lower newest widest

Step 0 — Initial character split:

l o w _ l o w e r _ n e w e s t _ w i d e s t _

Each character becomes its own token. The _ marks end-of-word boundaries.

Step 1 — Most frequent pair: e + s → es

Appears in "newest" and "widest" — 2 occurrences. Merge them.

Step 2 — Most frequent pair: es + t → est

The suffix "est" emerges! Common in English superlatives.

Step 3 — Most frequent pair: l + o → lo

Appears in "low" and "lower" — 2 occurrences.

Step 4 — Most frequent pair: lo + w → low

The word "low" becomes a single token! It's frequent enough to earn its own entry.

Step 5 — Most frequent pair: est + _ → est_

The end-of-word suffix "est_" is now a single token.

🔍
Notice what happened: BPE discovered that "low" and "est" are meaningful building blocks — without any knowledge of English! It learned this purely from frequency patterns. "newest" = "n" + "e" + "w" + "est_" and "widest" = "w" + "i" + "d" + "est_".
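The walkthrough above can be reproduced with a short training loop. Below is a minimal sketch in Python — real tokenizers work at the byte level and use fixed tie-breaking rules, so when several pairs tie on frequency (as they do in this tiny corpus) the merge order can differ from the step numbering above, but the same five merges emerge:

```python
from collections import Counter

def bpe_train(corpus: str, num_merges: int):
    """Learn BPE merge rules from a space-separated corpus (steps 1-4 above)."""
    # Step 1: split every word into characters plus the end-of-word marker "_"
    words = [list(w) + ["_"] for w in corpus.split()]
    merges = []
    for _ in range(num_merges):
        # Step 2: count adjacent token pairs across the whole corpus
        pairs = Counter()
        for w in words:
            pairs.update(zip(w, w[1:]))
        if not pairs or pairs.most_common(1)[0][1] < 2:
            break  # stop once no pair appears more than once
        # Step 3: merge the most frequent pair into a single new token
        a, b = pairs.most_common(1)[0][0]
        merges.append((a, b))
        new_words = []
        for w in words:
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and w[i] == a and w[i + 1] == b:
                    out.append(a + b)
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            new_words.append(out)
        words = new_words  # Step 4: repeat on the re-tokenized corpus
    return merges, words

merges, words = bpe_train("low lower newest widest", 5)
# Discovers the same building blocks — "es", "est", "lo", "low", "est_" —
# and leaves the corpus at 14 tokens, matching the walkthrough.
```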

🔄 Training vs Inference: Two Different Phases

A common confusion: the BPE playground above shows how the tokenizer is trained — discovering merge rules from a corpus. But when you actually use Claude, the tokenizer works differently:

🏗️

Phase 1: Training the Tokenizer

Happens once, before the LLM is trained. BPE scans a massive corpus (Wikipedia, books, code) and discovers ~50,000-100,000 merge rules. These rules are saved as a file.

Output: merges.txt + vocab.json
This is what our playground simulates ☝️

Phase 2: Using at Inference

Happens every time you send a prompt. Your text is split into characters, then the pre-learned merge rules are applied in order. No new rules are discovered — it's a fast lookup.

Input: your prompt
Output: token IDs → embeddings → transformer

Inference example — what actually happens when you prompt Claude
"Assess this merchant's risk"
→ split into characters → apply 100K pre-learned merge rules in order
→ ["Assess", " this", " merchant", "'s", " risk"] → 5 tokens
🔍
Key difference: Our playground discovers 5 merge rules from a tiny corpus and produces 14 tokens. A real tokenizer like Claude's has ~100,000 merge rules learned from billions of words — so common words like "merchant" and "risk" are already single tokens. The merge rules are fixed after training and never change at inference.
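At inference time the saved rules are simply replayed. A minimal sketch, using the same five toy merge rules from the walkthrough above (a real tokenizer would replay ~100K rules over byte-level input):

```python
def bpe_encode(word: str, merges: list) -> list:
    """Inference phase: apply pre-learned merge rules in training order.
    No counting, no new rules — just a deterministic replay."""
    tokens = list(word) + ["_"]
    for a, b in merges:  # rules are applied in the order they were learned
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens

# The five rules learned from "low lower newest widest"
rules = [("e", "s"), ("es", "t"), ("l", "o"), ("lo", "w"), ("est", "_")]
print(bpe_encode("lowest", rules))  # → ['low', 'est_']
```

Note that "lowest" never appeared in the training corpus, yet it tokenizes cleanly into known pieces — this is how subword tokenizers avoid [UNK] tokens for unseen words.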

🏦 BPE on Finance Text

Let's see how BPE handles text you'd actually encounter at AnyCompany:

AnyCompany Merchant Data
"The merchant's chargeback rate is 4.1%"

A trained BPE tokenizer would split this roughly as:

The merchant 's charge back rate is 4 . 1 %

= 11 tokens. Notice: "chargeback" splits into "charge" + "back" (known subwords), and "4.1%" becomes 4 separate tokens — numbers and punctuation are expensive!

⚠️
Key insight for finance: Tables of financial data with lots of numbers cost more tokens than the same amount of narrative text. A CSV row like SGD 4,200.50, 3.2%, 2024-01-15 uses ~15 tokens — mostly because each digit, comma, and period is a separate token.

🎮 BPE Interactive Playground

🎓
Press ▶ Play to watch BPE build a vocabulary step by step, or use the arrow buttons to step through manually.

📖 How BPE Works

Split text into individual characters (+ end-of-word marker _)

Count all adjacent token pairs

Merge the most frequent pair into a new token

Repeat until no pair appears more than once

💡
Try it yourself: Type any text in the input box — try your name, a finance term like "chargebacks", or even some Malay/Indonesian text. Watch how BPE discovers patterns in whatever you give it.

🔬 The Three Tokenization Algorithms

Modern LLMs use one of three subword tokenization algorithms. They all solve the same problem — finding the right-sized pieces — but approach it differently:

Algorithm                | Direction         | Merge Criterion                | Used By
-------------------------|-------------------|--------------------------------|--------
BPE (Byte Pair Encoding) | Bottom-up (merge) | Most frequent pair             | GPT, LLaMA, Claude
WordPiece                | Bottom-up (merge) | Most likely pair (probability) | BERT
Unigram                  | Top-down (prune)  | Remove least useful tokens     | T5

📊 BPE vs WordPiece vs Unigram

BPE — Bottom Up

Start with characters. Repeatedly merge the most frequent adjacent pair. Simple, fast, effective.

a + r → ar (freq: 3)
ar + e → are (freq: 2)
...

WordPiece — Bottom Up

Like BPE, but merges the most likely pair based on probability, not raw frequency. Uses ## prefix for continuations.

"teddy" → "ted" + "##dy"
"playing" → "play" + "##ing"

Unigram — Top Down

Start with all possible substrings. Iteratively remove tokens that contribute least to overall likelihood. Opposite direction from BPE.

"bear" → best split:
"be" + "ar" (P=9.5×10⁻⁴)
🔍
Why does it matter which algorithm? The same text may produce different token counts on different models. "AnyCompany Financial Group" might be 4 tokens on Claude but 5 on BERT. This affects cost, context window usage, and even model performance on certain tasks.

📚 Special Tokens — The Model's Control Codes

Every vocabulary includes reserved tokens with specific purposes:

Token  | Purpose                                                  | Example
-------|----------------------------------------------------------|--------
[PAD]  | Fills sequences to a fixed length so batches are uniform | merchant risk [PAD] [PAD]
[UNK]  | Represents any word not found in the vocabulary          | the [UNK] rate is
[CLS]  | Class token placed at start (BERT-specific)              | [CLS] assess this merchant
[SEP]  | Separates two segments                                   | ... data [SEP] assess ...
[MASK] | Hides a token for the model to predict (BERT training)   | the [MASK] rate is high

💰 Why Finance Teams Should Care About Tokens

Tokens are the currency of AI — literally. Every API call is billed by token count. Understanding tokenization helps you estimate costs, optimize prompts, and choose the right model.

📊

1 token ≈ 4 characters ≈ ¾ word

"AnyCompany Financial Group" = 4 tokens
"Merchant risk assessment" = 4 tokens

⚠️

Numbers Are Expensive

"$4,200.50" = 6 tokens (each digit, comma, period, dollar sign). Financial data with lots of numbers costs more.

🧮 Token Cost Estimator

Type or paste text below to estimate token count and cost across different models:


Estimate only — actual token count depends on the specific model's tokenizer. Assumes ~75% input, ~25% output.
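A back-of-the-envelope version of this estimator can be written with the 1 token ≈ 4 characters rule of thumb. A sketch — the price used is the illustrative $/1M-token input rate quoted on this page, and actual counts depend on the model's tokenizer:

```python
def estimate_tokens(text: str) -> int:
    """Rough heuristic: 1 token ≈ 4 characters of English text.
    Number-heavy text (tables, CSVs) will run higher in practice."""
    return max(1, round(len(text) / 4))

def estimate_cost(text: str, price_per_million: float) -> float:
    """Dollar cost of the input portion at a given $/1M-token rate."""
    return estimate_tokens(text) / 1_000_000 * price_per_million

prompt = "Assess this merchant's chargeback rate and flag anomalies."
tokens = estimate_tokens(prompt)
cost = estimate_cost(prompt, 3.00)  # e.g. at the $3/1M input rate for Sonnet
```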

📋 Token Estimation Quick Reference

Content Type                               | Approx. Tokens  | Cost (Sonnet 4)
-------------------------------------------|-----------------|----------------
A short question ("Assess this merchant")  | ~5 tokens       | < $0.001
A paragraph of merchant data (10 lines)    | ~150 tokens     | $0.0005
Our engineered prompt template             | ~400 tokens     | $0.0012
A full risk assessment output (8 sections) | ~800 tokens     | $0.012
Total per assessment (input + output)      | ~1,350 tokens   | $0.013
200 assessments per month                  | ~270,000 tokens | $2.64
💡
The 428x price spread: Nova Micro costs $0.035/1M input tokens. Claude Opus 4 costs $15.00/1M — a 428x difference! But cheaper ≠ worse for every task. The right question is: "What's the cheapest model that meets my quality threshold?"

🎯 Cost Optimization Strategies

📐

Right-Size Your Model

Simple classification → Nova Micro ($0.04/1M). Narrative generation → Sonnet ($3/1M). Don't use Opus to sort mail.

✂️

Optimize Prompts

Remove redundant instructions (10-20% savings). Use shorter examples. Constrain output length.

📦

Batch Processing

Bedrock offers 50% discount for batch inference. Perfect for monthly portfolio assessments.

🔄

Prompt Caching

Cache repeated system prompts — saves ~90% on the template portion for subsequent calls.

🔗 The Full Journey: From Text to AI Output

Tokenization is just the first step. Here's the complete pipeline that turns your text into an AI response:

Step 1: Tokenization

Your text is split into subword tokens using BPE (or WordPiece/Unigram). Each token maps to an integer ID from the vocabulary.

Example
"merchant risk" → ["merchant", "risk"] → [15234, 8901]

Step 2: Token Embeddings

Each token ID is converted into a dense vector of numbers (typically 768–12,288 dimensions). These vectors capture semantic meaning — similar words get similar vectors.

🔍
Key insight: In embedding space, "merchant" and "vendor" are close together, while "merchant" and "airplane" are far apart. The model learns these relationships during training.
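Closeness in embedding space is usually measured with cosine similarity. A sketch with made-up 4-dimensional vectors (real embeddings have hundreds to thousands of dimensions; the values below are invented purely for illustration):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Invented toy embeddings — only the relative geometry matters here
merchant = [0.8, 0.6, 0.1, 0.0]
vendor   = [0.7, 0.7, 0.2, 0.1]
airplane = [0.0, 0.1, 0.9, 0.8]

print(cosine_similarity(merchant, vendor) > cosine_similarity(merchant, airplane))  # True
```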

Step 3: Position Encoding

The model needs to know where each token sits in the sequence. Position embeddings are added so "The merchant is risky" and "Is the merchant risky?" produce different representations despite having the same words.

Modern models like LLaMA use RoPE (Rotary Position Embeddings) — which encodes relative position, so the relationship between "merchant" and "risk" is the same whether they're at positions 3,4 or 103,104.

Step 4: Self-Attention (The Transformer)

This is where the magic happens. Every token can directly look at every other token and decide which ones are relevant. The model asks three questions for each token:

Query (Q)

"What am I looking for?"

Key (K)

"What do I contain?"

Value (V)

"Here's my actual content"

When processing "The merchant's chargeback rate is high", the word "high" pays strong attention to "chargeback rate" (what's high?) and "merchant" (whose rate?). This context-awareness is what makes transformers so powerful.
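The Q/K/V mechanics can be sketched in a few lines. Below is a toy single-head scaled dot-product attention in plain Python — real models use large learned projection matrices and many heads, and the example vectors are made up:

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: each query scores every key,
    the scores become weights, and the output mixes the values."""
    d = len(K[0])
    outputs = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)  # "how much should I attend to each token?"
        outputs.append([sum(w * v[j] for w, v in zip(weights, V))
                        for j in range(len(V[0]))])
    return outputs

# One query over two tokens; with one-hot values the output equals the weights
mix = attention([[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
```

Here the query is aligned with the first key, so most of the attention weight (about 0.67) lands on the first token's value.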

Step 5: Output Generation

The transformer produces a probability distribution over the entire vocabulary for the next token. The model picks the most likely token (or samples from the distribution based on temperature), appends it, and repeats.

Generation Example
Input: "The merchant risk rating is"
→ P("RED")=0.72, P("AMBER")=0.21, P("GREEN")=0.05, ...
→ Output: "RED"

Temperature controls randomness: low temperature (0.0) = always pick the highest probability. High temperature (1.0) = more creative/random. For risk assessments, you want low temperature for consistency.
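The softmax-with-temperature step can be sketched directly. The logit values below are invented to echo the RED/AMBER/GREEN example:

```python
import math

def next_token_probs(logits, temperature):
    """Softmax over candidate tokens; temperature rescales the logits.
    Low temperature sharpens the distribution, high temperature flattens it."""
    if temperature == 0:
        best = max(logits, key=logits.get)  # greedy: all mass on the argmax
        return {t: float(t == best) for t in logits}
    scaled = {t: l / temperature for t, l in logits.items()}
    m = max(scaled.values())  # subtract the max for numerical stability
    exps = {t: math.exp(s - m) for t, s in scaled.items()}
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}

logits = {"RED": 2.0, "AMBER": 0.8, "GREEN": -0.6}  # made-up model scores
greedy = next_token_probs(logits, 0.0)    # temperature 0: always "RED"
creative = next_token_probs(logits, 1.5)  # higher temperature: mass spreads out
```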

📈 The Evolution of Text Representation

Each breakthrough solved a limitation of the previous approach:

1️⃣
One-Hot Encoding

Sparse, no meaning. "cute" and "adorable" are just as different from each other as "cute" and "airplane".

2️⃣
Word2vec / GloVe

Dense, meaningful, but static — one vector per word regardless of context.

3️⃣
RNN / LSTM / GRU

Sequential processing, captures order, but information bottleneck for long sequences.

4️⃣
ELMo

Context-dependent embeddings via bidirectional LSTMs. "bank" gets different vectors in "river bank" vs "bank account".

5️⃣
Transformers (Attention) ⭐

Parallel attention over all tokens. No bottleneck. This is what powers Claude, GPT, and LLaMA today.