💰 Tokenization, Pricing & Model Selection

How AI models process text, what it costs, and why different models perform differently, explained for finance teams.

What Are Tokens?

LLMs don't process words; they process tokens. A token is a piece of text, roughly:

TOKEN ESTIMATION
1 token ≈ 4 characters ≈ ¾ of a word
"AnyCompany Financial Group" = 4 tokens
"The merchant's chargeback rate is 4.1%" = 9 tokens
A typical merchant risk assessment (500 words) ≈ 650 tokens

Why this matters for cost: You pay per token, both for what you send (input) and for what the AI generates (output). Longer prompts and longer outputs cost more.

Quick Token Estimation

| Content | Approximate tokens |
| --- | --- |
| A short question ("Assess this merchant") | ~5 tokens |
| A paragraph of merchant data (10 lines) | ~150 tokens |
| Our engineered prompt template | ~400 tokens |
| A full risk assessment output (8 sections) | ~800 tokens |
| Total per assessment (input + output) | ~1,350 tokens |
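The rule of thumb above can be turned into a quick estimator. This is a rough heuristic, not a real tokenizer; for exact counts use the model's tokenizer or the CountTokens API covered later in this guide:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate using the 1 token ~= 4 characters rule of thumb."""
    return max(1, round(len(text) / 4))

print(estimate_tokens("Assess this merchant"))  # 20 chars -> 5 tokens

# A ~500-word assessment at ~5 characters per word (including spaces)
assessment = "word " * 500
print(estimate_tokens(assessment))  # 2,500 chars -> 625 tokens
```

Real counts will differ by model, but this is close enough for budgeting prompt and output sizes.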

How Tokenization Works

Models use subword tokenization: they break text into meaningful pieces, not whole words:

| Text | Tokens | Count | Note |
| --- | --- | --- | --- |
| "chargebacks" | ["charge", "backs"] | 2 | Split into meaningful subwords |
| "PayLater" | ["Pay", "Later"] | 2 | CamelCase splits naturally |
| "SGD" | ["SG", "D"] | 2 | Abbreviations may split |
| "$4,200" | ["$", "4", ",", "200"] | 4 | Numbers are expensive! |
โš ๏ธ Key insight for finance: Numbers and special characters use more tokens than you'd expect. A table of financial data with lots of numbers costs more tokens than the same amount of narrative text.

Why Different Models Tokenize Differently

Each model family has its own tokenizer, so the same text may be a different number of tokens on different models.

Why Different Models Perform Differently

Not all AI models are created equal. They differ in size (parameters), training (data and techniques), and architecture, which affects speed, quality, and cost.

Model Size = Brain Size

Parameters are the "knowledge" stored in the model. More parameters = more capacity for complex reasoning, but also slower and more expensive.

| Model size | Parameters | Analogy | Good for |
| --- | --- | --- | --- |
| Small | 1-17B | Junior analyst: fast, handles routine tasks | Classification, simple extraction, FAQ |
| Medium | 17-70B | Senior analyst: balanced speed and depth | Reports, structured analysis, narratives |
| Large | 70B+ | Expert consultant: thorough but expensive | Complex reasoning, multi-step analysis, research |

Choosing the Right Model for the Task

Different tasks need different trade-offs. Match the model to the job; not every task needs the most powerful option:

| Task type | What matters most | Model category | Examples on Bedrock |
| --- | --- | --- | --- |
| Classification & routing | Speed, low cost | Small / lightweight models | Nova Micro, Nova Lite |
| Data extraction & summarization | Accuracy, structured output | Mid-range models | Nova Pro, Claude Haiku, Llama Maverick |
| Narrative generation & analysis | Quality, reasoning depth | Capable models | Claude Sonnet, Llama 70B, DeepSeek |
| Complex multi-step reasoning | Depth, nuance, thoroughness | Frontier models | Claude Sonnet, Claude Opus |
💡 The key question: "What's the most cost-effective model that meets my quality threshold for this specific task?" Not "Which model is the best overall?" The Model Arena demo helps you answer this by comparing outputs side-by-side.

Why Does This Happen?

💡 The Three Factors

1. Parameters (size): More parameters = more "knowledge" stored in the model. A 70B model has seen more patterns during training than a 7B model. But more parameters mean more computation per token, which makes the model slower and more expensive.

2. Training data & techniques: Claude models are trained with extensive safety tuning (RLHF/DPO) which makes them more cautious and thorough. Llama models are trained as general-purpose open-source models. Nova models are optimized for AWS integration and cost efficiency.

3. Architecture optimizations: Some models use techniques like mixture-of-experts (MoE), where only a fraction of parameters activate per token, making them faster without losing quality. Others use knowledge distillation to compress a large model's knowledge into a smaller one.

What We Saw in the Demo

In the Merchant Risk Assessment demo, you may have noticed that the same prompt and merchant data produced noticeably different assessments across the three models.

This is why model selection matters, and why we use decision rules in the prompt to enforce consistency across models.

Claude Model Family: Opus, Sonnet & Haiku

Since your team uses Claude (via Cowork and Cursor), here's how the three Claude tiers compare, and which one fits which finance task.

| | Opus 4.7 | Sonnet 4.6 | Haiku 4.5 |
| --- | --- | --- | --- |
| Role | Most capable: complex reasoning | Best balance of speed + intelligence | Fastest, near-frontier intelligence |
| Pricing (per 1M tokens) | $5 input / $25 output | $3 input / $15 output | $1 input / $5 output |
| Context window | 1M tokens (~750 pages) | 1M tokens (~750 pages) | 200K tokens (~150 pages) |
| Max output | 128K tokens | 64K tokens | 64K tokens |
| Speed | Moderate | Fast | Fastest |
| Extended thinking | Adaptive thinking | Yes | Yes |
| Knowledge cutoff | Jan 2026 | Aug 2025 | Feb 2025 |

Which Claude Model for Which Finance Task?

| Finance task | Recommended | Why |
| --- | --- | --- |
| Document classification (invoice vs receipt vs complaint) | Haiku 4.5 | Simple task, speed matters, 3x cheaper than Sonnet |
| Invoice data extraction | Haiku 4.5 | Structured extraction doesn't need deep reasoning |
| Customer complaint response drafts | Sonnet 4.6 | Needs empathy and nuance, but not deep analysis |
| Merchant risk assessment narrative | Sonnet 4.6 | Needs structured reasoning, data citation, and actionable recommendations |
| Credit committee narrative | Sonnet 4.6 | Multi-perspective analysis (bull/bear case) needs good reasoning |
| Regulatory impact assessment | Sonnet 4.6 or Opus 4.7 | Cross-referencing multiple documents, nuanced interpretation |
| Complex multi-step financial analysis | Opus 4.7 | Deep reasoning across large datasets, highest accuracy |
| Bulk monthly assessments (200+ merchants) | Haiku 4.5 | Cost-effective at scale: $1/1M tokens vs $3 for Sonnet |

What This Means for Your Daily Tools

| Tool | Model used | Can you change it? |
| --- | --- | --- |
| Claude Cowork | Sonnet (Pro plan) or Opus (Max plan) | No: Anthropic assigns based on your plan |
| Cursor | Claude Sonnet, Opus, Haiku + GPT, Gemini, DeepSeek | Yes: select in settings per conversation |
| Kiro (workshop) | Auto-selected by task | No: Kiro picks the best model automatically |
| Bedrock Playground | All models available | Yes: full control for testing and comparison |
💡 Key takeaway: In Claude Cowork (your daily tool), you don't choose the model, so focus on writing great prompts and skills. In Cursor, you CAN choose: use Haiku for speed, Sonnet for quality, Opus for complex analysis. The prompt engineering skills from Day 2 work regardless of which model or tool you use.

Knowledge Cutoffs: Why They Matter for Finance

Each model's knowledge has a cutoff date; it doesn't know about events after that date.

For questions about recent regulatory changes (e.g., "What did MAS announce in Q4 2025?"), use Sonnet or Opus. For data extraction and classification tasks, the knowledge cutoff doesn't matter; Haiku is fine.

For the most current information, use RAG grounding (Day 2 Module 4): attach the actual document and tell the AI to answer ONLY from that document. This bypasses the knowledge cutoff entirely.

Bedrock Pricing: What Finance Teams Need to Know

Pricing Model: Pay Per Token

PRICING FORMULA
Cost = (Input tokens × Input price) + (Output tokens × Output price)

Output tokens are 3-5x more expensive than input tokens, because generation is computationally harder than reading.
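As a sketch, the formula translates directly into a small helper; prices here are quoted per 1M tokens, matching the pricing tables in this guide:

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_m: float, output_price_per_m: float) -> float:
    """Dollar cost of one request: tokens times price, with prices per 1M tokens."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000

# Workshop assessment on Claude Sonnet 4: 400 input + 800 output tokens
print(request_cost(400, 800, 3.00, 15.00))  # 0.0132
```

This reproduces the $0.0132 Sonnet 4 figure in the cost-per-assessment table below.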

Model Pricing Comparison (On-Demand, US regions)

| Model | Input (per 1M tokens) | Output (per 1M tokens) | Best for |
| --- | --- | --- | --- |
| Amazon Nova Micro | $0.035 | $0.14 | Simple classification, routing |
| Amazon Nova Lite | $0.06 | $0.24 | Drafts, summaries, FAQ |
| Llama 4 Maverick 17B | $0.22 | $0.88 | Cost-effective moderate tasks |
| DeepSeek v3.2 | $0.62 | $1.85 | Reasoning, cost-effective |
| Amazon Nova Pro | $0.80 | $3.20 | Reports, analysis |
| Claude Haiku 4.5 | $0.80 | $4.00 | Quality + speed balance |
| Llama 3.3 70B | $2.65 | $3.50 | Open-source experimentation |
| Claude Sonnet 4 | $3.00 | $15.00 | Complex reasoning, compliance |
| Claude Opus 4 | $15.00 | $75.00 | Most complex tasks |

Prices as of 2025-2026. Check aws.amazon.com/bedrock/pricing for current rates.

Cost Per Merchant Risk Assessment

Using our workshop prompt template (~400 input tokens + ~800 output tokens):

| Model | Cost per assessment | 50/week | 200/month |
| --- | --- | --- | --- |
| Nova Micro | $0.000126 | $0.006 | $0.025 |
| Nova Lite | $0.000216 | $0.011 | $0.043 |
| Nova Pro | $0.002880 | $0.144 | $0.576 |
| Claude Haiku 4.5 | $0.003520 | $0.176 | $0.704 |
| Claude Sonnet 4 | $0.013200 | $0.660 | $2.640 |
| Claude Opus 4 | $0.066000 | $3.300 | $13.200 |
✅ The business case: Even Claude Sonnet 4 (high-quality) costs only $2.64/month for 200 merchant risk assessments. Compare that to an analyst spending 30 minutes each at $50/hour = $5,000/month. The AI is 99.95% cheaper.
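The arithmetic behind that comparison, using the figures above (Sonnet 4 at $0.0132 per assessment, 200 assessments per month, 30 minutes of analyst time at $50/hour):

```python
ai_monthly = 200 * 0.0132          # Sonnet 4 cost per assessment, from the table
analyst_monthly = 200 * 0.5 * 50   # 30 minutes each at $50/hour
savings = 1 - ai_monthly / analyst_monthly

print(f"${ai_monthly:.2f} vs ${analyst_monthly:,.0f} -> {savings:.2%} cheaper")
```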

Cost Optimization Strategies

| Strategy | Savings | How it works |
| --- | --- | --- |
| Right-size your model | Up to 428x | Use Nova Micro for classification, Sonnet for complex analysis; don't use Opus for simple tasks |
| Optimize prompts | 10-40% | Remove redundant instructions, use shorter examples, constrain output length |
| Batch processing | 50% | Submit requests in bulk (not real-time); perfect for monthly portfolio assessments |
| Intelligent Prompt Routing | Up to 30% | Bedrock auto-routes simple tasks to cheaper models, complex tasks to powerful ones |
| Prompt caching | Up to 90% | Cache your template: pay full price once, 10% for every reuse |
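A sketch of the prompt-caching math, assuming cache reads are billed at 10% of the normal input price as the table states (actual cache-read rates vary by model; check the Bedrock pricing page). `cached_input_cost` is an illustrative helper, not a Bedrock API:

```python
def cached_input_cost(template_tokens: int, reuses: int,
                      input_price_per_m: float, cache_read_rate: float = 0.10) -> float:
    """Input cost of a cached template: full price once to write the cache,
    then a fraction of the input price for each cache read."""
    first_write = template_tokens * input_price_per_m / 1_000_000
    cache_reads = reuses * template_tokens * input_price_per_m * cache_read_rate / 1_000_000
    return first_write + cache_reads

# 400-token template on Sonnet 4 ($3/1M input), 200 monthly assessments
no_cache = 200 * 400 * 3.00 / 1_000_000          # 0.24
with_cache = cached_input_cost(400, 199, 3.00)   # ~0.025
print(f"Template input cost: ${no_cache:.4f} -> ${with_cache:.4f}")
```

The saving on the cached portion approaches 90% as the number of reuses grows, which is where the table's "up to 90%" comes from.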

Right-Sizing Guide

| Task | Recommended model | Cost tier |
| --- | --- | --- |
| Document classification ("Is this an invoice or receipt?") | Nova Micro / Lite | $0.04-0.06/1M tokens |
| Data extraction (fields from invoice PDF) | Nova Pro / Haiku | $0.80/1M tokens |
| Narrative generation (risk assessment, credit narrative) | Sonnet / Llama 70B | $2.65-3.00/1M tokens |
| Complex reasoning (regulatory impact, multi-step analysis) | Sonnet / Opus | $3.00-15.00/1M tokens |

Context Windows: How Much Can the Model "See"?

The context window is the maximum amount of text the model can process at once; your prompt and the AI's response together must fit within it.

| Model | Context window | Text equivalent | Practical meaning |
| --- | --- | --- | --- |
| Nova Micro | 128K tokens | ~100 pages | Can read a short book |
| Nova Pro | 300K tokens | ~230 pages | Can read a long report |
| Claude Sonnet 4 | 200K tokens | ~150 pages | Can read a full policy manual |
| Llama 3.3 70B | 128K tokens | ~100 pages | Can read a short book |
💡 For finance: A typical merchant data file + prompt template + policy document fits easily within any model's context window. You'd only hit limits with very large documents (100+ page regulatory filings).
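A minimal fit check using the ~250 tokens/page estimate from this guide. `fits_in_context` is a hypothetical helper for illustration, not a Bedrock API:

```python
def fits_in_context(prompt_tokens: int, max_output_tokens: int, context_window: int) -> bool:
    """The prompt plus the space reserved for the response must fit in the window."""
    return prompt_tokens + max_output_tokens <= context_window

# 100-page policy manual (~250 tokens/page) + 400-token template, 1,200-token response
prompt = 100 * 250 + 400
print(fits_in_context(prompt, 1_200, 200_000))  # True: fits in a 200K window
```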

Token Estimation Quick Reference

| Content type | Tokens per page | Tokens per item |
| --- | --- | --- |
| Plain English text | ~250/page | - |
| Financial data (CSV) | ~400/page | ~50/row |
| JSON structured data | ~350/page | - |
| A typical email | - | ~200 tokens |
| A merchant risk assessment | - | ~800 tokens |
| A credit committee narrative | - | ~600 tokens |
| An invoice (extracted text) | - | ~300 tokens |

Counting Tokens Before You Spend

Bedrock provides a CountTokens API that lets you check how many tokens your input will use, before you send the actual request. This is free (no charge for counting).

| What you can do | Why it matters |
| --- | --- |
| Estimate costs before sending requests | Know the cost before you commit, especially for large batch jobs |
| Optimize prompts to fit within token limits | Trim your prompt if it's too long for the context window |
| Plan token usage in your applications | Budget your monthly token spend accurately |
💡 Key point: Token counting is model-specific; the same text may produce different token counts on different models because each uses a different tokenizer. The CountTokens API returns the exact count for the model you specify.

Example: Count tokens with Python

PYTHON โ€” CountTokens API
```python
import boto3

client = boto3.client("bedrock-runtime")

# Count tokens for a Converse-style request
response = client.count_tokens(
    modelId="anthropic.claude-sonnet-4-20250514-v1:0",
    input={
        "converse": {
            "messages": [
                {
                    "role": "user",
                    "content": [
                        {"text": "Assess this merchant's risk level based on the following data..."}
                    ],
                }
            ],
            "system": [
                {"text": "You are a Senior Risk Analyst..."}
            ],
        }
    },
)

print(f"Input tokens: {response['inputTokens']}")
# Use this to estimate cost before running the actual inference
```

Token Quotas: Understanding Rate Limits

AWS sets quotas on how many tokens you can use per minute (TPM) and per day (TPD). Understanding how these work helps you avoid throttling.

Key Terms

| Term | What it means |
| --- | --- |
| Tokens per Minute (TPM) | Maximum tokens (input + output) you can use in one minute |
| Tokens per Day (TPD) | Maximum tokens per day (default = TPM × 1,440) |
| Requests per Minute (RPM) | Maximum number of API calls per minute |
| max_tokens | Parameter you set to limit how long the AI's response can be |

The Burndown Rate: Why Output Tokens Cost More Quota

For newer Claude models (3.7 and later), output tokens consume 5x the quota of input tokens. This is because generating text is computationally much harder than reading it.

| Model | Input burndown | Output burndown | Example: 1,000 input + 100 output |
| --- | --- | --- | --- |
| Claude Sonnet 4, Opus 4 | 1:1 | 5:1 | 1,000 + (100 × 5) = 1,500 quota tokens |
| Nova, Llama, older Claude | 1:1 | 1:1 | 1,000 + 100 = 1,100 quota tokens |
โš ๏ธ Important: You're only billed for actual tokens used (1,100 in the example above). The 5x burndown only affects your quota (rate limit), not your bill. But it means you can hit throttling limits faster with Claude 4+ models.

Why max_tokens Matters for Throughput

Bedrock reserves quota for max_tokens at the start of each request, then adjusts after the response is generated:

| | max_tokens = 32,000 (too high) | max_tokens = 1,250 (optimized) |
| --- | --- | --- |
| Initial quota reserved | 40,000 tokens | 9,250 tokens |
| Actual quota used | 9,000 tokens | 9,000 tokens |
| Wasted reservation | 31,000 tokens | 250 tokens |
| Impact | Fewer concurrent requests possible | More concurrent requests possible |
✅ Optimization tip: Set max_tokens close to your expected output size. For a merchant risk assessment (~800 tokens output), set max_tokens to 1,000-1,200, not the default 4,096 or 32,000. This lets you run more concurrent requests within your quota.
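A sketch of the reservation math behind the table above, assuming an ~8,000-token input (implied by the table's numbers), 1:1 burndown for simplicity, and a hypothetical 100K TPM quota; both helpers are illustrative, not AWS APIs:

```python
def reserved_quota(input_tokens: int, max_tokens: int) -> int:
    """Quota Bedrock reserves when a request starts (1:1 burndown assumed)."""
    return input_tokens + max_tokens

def max_concurrent(tpm_quota: int, input_tokens: int, max_tokens: int) -> int:
    """Rough ceiling on simultaneous requests under a TPM quota."""
    return tpm_quota // reserved_quota(input_tokens, max_tokens)

# Same 8,000-token prompt, two max_tokens settings, hypothetical 100K TPM quota
print(reserved_quota(8_000, 32_000), max_concurrent(100_000, 8_000, 32_000))  # 40000 2
print(reserved_quota(8_000, 1_250), max_concurrent(100_000, 8_000, 1_250))    # 9250 10
```

Same quota, five times the throughput, just by right-sizing max_tokens.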

Monitor Your Usage

Use Amazon CloudWatch to track your token consumption:

Navigate to CloudWatch → Dashboards → Automatic dashboards → Bedrock → "Token Counts by Model" to see your usage patterns.

Data Privacy & Security

💡 With Amazon Bedrock:
  • Your data stays in your AWS account; it is not used to train the models
  • You control the region, encryption, and access
  • This is different from using ChatGPT or Claude.ai directly: Bedrock provides enterprise-grade data isolation
  • All API calls are logged and auditable via CloudTrail
  • You can restrict which models and regions are available to your team

Workshop Connection

| Concept | Where you'll see it |
| --- | --- |
| Token estimation | Day 2: Understanding why prompt length matters for cost and quality |
| Model selection | Day 1 Demo: Model Arena, comparing 3 models on the same task |
| Cost optimization | Day 2 Module 7: Bedrock Prompt Management and Optimization |
| Right-sizing models | Day 3: Intelligent Prompt Routing in workflow automation |
| Context windows | Day 2: Managing long conversations and knowing when to start fresh |