💰 Tokenization, Pricing & Model Selection

How AI models process text, what it costs, and why different models perform differently, explained for finance teams.

What Are Tokens?

LLMs don't process words; they process tokens. A token is a piece of text, roughly:

TOKEN ESTIMATION
1 token ≈ 4 characters ≈ ¾ of a word
"AnyCompany Financial Group" = 4 tokens
"The merchant's chargeback rate is 4.1%" = 9 tokens
A typical merchant risk assessment (500 words) ≈ 650 tokens

Why this matters for cost: You pay per token, both for what you send (input) and for what the AI generates (output). Longer prompts and longer outputs cost more.

Quick Token Estimation

| Content | Approximate tokens |
| --- | --- |
| A short question ("Assess this merchant") | ~5 tokens |
| A paragraph of merchant data (10 lines) | ~150 tokens |
| Our engineered prompt template | ~400 tokens |
| A full risk assessment output (8 sections) | ~800 tokens |
| Total per assessment (input + output) | ~1,350 tokens |
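The rule of thumb above can be turned into a quick estimator. This is a rough heuristic, not a real tokenizer; for exact counts use the model's tokenizer or the CountTokens API covered later in this guide:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate using the 1 token ~= 4 characters rule of thumb."""
    return max(1, round(len(text) / 4))

print(estimate_tokens("Assess this merchant"))  # 20 chars -> 5 tokens

# A ~500-word assessment at ~5 characters per word (including spaces)
assessment = "word " * 500
print(estimate_tokens(assessment))  # 2,500 chars -> 625 tokens
```

Real counts will differ by model, but this is close enough for budgeting prompt and output sizes.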

How Tokenization Works

Models use subword tokenization: they break text into meaningful pieces, not whole words:

| Text | Tokens | Count | Note |
| --- | --- | --- | --- |
| "chargebacks" | ["charge", "backs"] | 2 | Split into meaningful subwords |
| "PayLater" | ["Pay", "Later"] | 2 | CamelCase splits naturally |
| "SGD" | ["SG", "D"] | 2 | Abbreviations may split |
| "$4,200" | ["$", "4", ",", "200"] | 4 | Numbers are expensive! |
โš ๏ธ Key insight for finance: Numbers and special characters use more tokens than you'd expect. A table of financial data with lots of numbers costs more tokens than the same amount of narrative text.

Why Different Models Tokenize Differently

Each model family has its own tokenizer, so the same text may be a different number of tokens on different models.

Why Different Models Perform Differently

Not all AI models are created equal. They differ in size (parameters), training (data and techniques), and architecture, which affects speed, quality, and cost.

Model Size = Brain Size

Parameters are the "knowledge" stored in the model. More parameters = more capacity for complex reasoning, but also slower and more expensive.

| Model size | Parameters | Analogy | Good for |
| --- | --- | --- | --- |
| Small | 1-17B | Junior analyst: fast, handles routine tasks | Classification, simple extraction, FAQ |
| Medium | 17-70B | Senior analyst: balanced speed and depth | Reports, structured analysis, narratives |
| Large | 70B+ | Expert consultant: thorough but expensive | Complex reasoning, multi-step analysis, research |

Choosing the Right Model for the Task

Different tasks need different trade-offs. Match the model to the job; not every task needs the most powerful option:

| Task type | What matters most | Model category | Examples on Bedrock |
| --- | --- | --- | --- |
| Classification & routing | Speed, low cost | Small / lightweight models | Nova Micro, Nova Lite |
| Data extraction & summarization | Accuracy, structured output | Mid-range models | Nova Pro, Claude Haiku, Llama Maverick |
| Narrative generation & analysis | Quality, reasoning depth | Capable models | Claude Sonnet, Llama 70B, DeepSeek |
| Complex multi-step reasoning | Depth, nuance, thoroughness | Frontier models | Claude Sonnet, Claude Opus |
💡 The key question: "What's the most cost-effective model that meets my quality threshold for this specific task?" Not "Which model is the best overall?" The Model Arena demo helps you answer this by comparing outputs side-by-side.

Why Does This Happen?

💡 The Three Factors

1. Parameters (size): More parameters = more "knowledge" stored in the model. A 70B model has seen more patterns during training than a 7B model. But more parameters mean more computation per token, which makes the model slower and more expensive.

2. Training data & techniques: Claude models are trained with extensive safety tuning (RLHF/DPO) which makes them more cautious and thorough. Llama models are trained as general-purpose open-source models. Nova models are optimized for AWS integration and cost efficiency.

3. Architecture optimizations: Some models use techniques like mixture-of-experts (MoE), where only a fraction of parameters activate per token, making them faster without losing quality. Others use knowledge distillation to compress a large model's knowledge into a smaller one.

What We Saw in the Demo

In the Merchant Risk Assessment demo, you may have noticed that the same prompt and merchant data produced noticeably different assessments across the three models.

This is why model selection matters, and why we use decision rules in the prompt to enforce consistency across models.

Claude Model Family: Opus, Sonnet & Haiku

Since your team uses Claude (via Cowork and Cursor), here's how the three Claude tiers compare, and which one fits which finance task.

| | Opus 4.7 | Sonnet 4.6 | Haiku 4.5 |
| --- | --- | --- | --- |
| Role | Most capable: complex reasoning | Best balance of speed + intelligence | Fastest, near-frontier intelligence |
| Pricing (per 1M tokens) | $5 input / $25 output | $3 input / $15 output | $1 input / $5 output |
| Context window | 1M tokens (~750 pages) | 1M tokens (~750 pages) | 200K tokens (~150 pages) |
| Max output | 128K tokens | 64K tokens | 64K tokens |
| Speed | Moderate | Fast | Fastest |
| Extended thinking | Adaptive thinking | Yes | Yes |
| Knowledge cutoff | Jan 2026 | Aug 2025 | Feb 2025 |

Which Claude Model for Which Finance Task?

| Finance task | Recommended | Why |
| --- | --- | --- |
| Document classification (invoice vs receipt vs complaint) | Haiku 4.5 | Simple task, speed matters, 3x cheaper than Sonnet |
| Invoice data extraction | Haiku 4.5 | Structured extraction doesn't need deep reasoning |
| Customer complaint response drafts | Sonnet 4.6 | Needs empathy and nuance, but not deep analysis |
| Merchant risk assessment narrative | Sonnet 4.6 | Needs structured reasoning, data citation, and actionable recommendations |
| Credit committee narrative | Sonnet 4.6 | Multi-perspective analysis (bull/bear case) needs good reasoning |
| Regulatory impact assessment | Sonnet 4.6 or Opus 4.7 | Cross-referencing multiple documents, nuanced interpretation |
| Complex multi-step financial analysis | Opus 4.7 | Deep reasoning across large datasets, highest accuracy |
| Bulk monthly assessments (200+ merchants) | Haiku 4.5 | Cost-effective at scale: $1/1M tokens vs $3 for Sonnet |

What This Means for Your Daily Tools

| Tool | Model used | Can you change it? |
| --- | --- | --- |
| Claude Cowork | Sonnet (Pro plan) or Opus (Max plan) | No: Anthropic assigns based on your plan |
| Cursor | Claude Sonnet, Opus, Haiku + GPT, Gemini, DeepSeek | Yes: select in settings per conversation |
| Kiro (workshop) | Auto-selected by task | No: Kiro picks the best model automatically |
| Bedrock Playground | All models available | Yes: full control for testing and comparison |
💡 Key takeaway: In Claude Cowork (your daily tool), you don't choose the model, so focus on writing great prompts and skills. In Cursor, you CAN choose: use Haiku for speed, Sonnet for quality, Opus for complex analysis. The prompt engineering skills from Day 2 work regardless of which model or tool you use.

Knowledge Cutoffs: Why They Matter for Finance

Each model's knowledge has a cutoff date; it doesn't know about events after that date.

For questions about recent regulatory changes (e.g., "What did MAS announce in Q4 2025?"), use Sonnet or Opus. For data extraction and classification tasks, the knowledge cutoff doesn't matter; Haiku is fine.

For the most current information, use RAG grounding (Day 2 Module 4): attach the actual document and tell the AI to answer ONLY from that document. This bypasses the knowledge cutoff entirely.

Bedrock Pricing: What Finance Teams Need to Know

Pricing Model: Pay Per Token

PRICING FORMULA
Cost = (Input tokens × Input price) + (Output tokens × Output price)

Output tokens are 3-5x more expensive than input tokens, because generation is computationally harder than reading.
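As a sketch, the formula translates directly into a small helper; prices here are quoted per 1M tokens, matching the pricing tables in this guide:

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_m: float, output_price_per_m: float) -> float:
    """Dollar cost of one request: tokens times price, with prices per 1M tokens."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000

# Workshop assessment on Claude Sonnet 4: 400 input + 800 output tokens
print(request_cost(400, 800, 3.00, 15.00))  # 0.0132
```

This reproduces the $0.0132 Sonnet 4 figure in the cost-per-assessment table below.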

Model Pricing Comparison (On-Demand, US regions)

| Model | Input (per 1M tokens) | Output (per 1M tokens) | Best for |
| --- | --- | --- | --- |
| Amazon Nova Micro | $0.035 | $0.14 | Simple classification, routing |
| Amazon Nova Lite | $0.06 | $0.24 | Drafts, summaries, FAQ |
| Llama 4 Maverick 17B | $0.22 | $0.88 | Cost-effective moderate tasks |
| DeepSeek v3.2 | $0.62 | $1.85 | Reasoning, cost-effective |
| Amazon Nova Pro | $0.80 | $3.20 | Reports, analysis |
| Claude Haiku 4.5 | $0.80 | $4.00 | Quality + speed balance |
| Llama 3.3 70B | $2.65 | $3.50 | Open-source experimentation |
| Claude Sonnet 4 | $3.00 | $15.00 | Complex reasoning, compliance |
| Claude Opus 4 | $15.00 | $75.00 | Most complex tasks |

Prices as of 2025-2026. Check aws.amazon.com/bedrock/pricing for current rates.

Cost Per Merchant Risk Assessment

Using our workshop prompt template (~400 input tokens + ~800 output tokens):

| Model | Cost per assessment | 50/week | 200/month |
| --- | --- | --- | --- |
| Nova Micro | $0.000126 | $0.006 | $0.025 |
| Nova Lite | $0.000216 | $0.011 | $0.043 |
| Nova Pro | $0.002880 | $0.144 | $0.576 |
| Claude Haiku 4.5 | $0.003520 | $0.176 | $0.704 |
| Claude Sonnet 4 | $0.013200 | $0.660 | $2.640 |
| Claude Opus 4 | $0.066000 | $3.300 | $13.200 |
✅ The business case: Even Claude Sonnet 4 (high-quality) costs only $2.64/month for 200 merchant risk assessments. Compare that to an analyst spending 30 minutes each at $50/hour = $5,000/month. The AI is 99.95% cheaper.
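The arithmetic behind that comparison, using the figures above (Sonnet 4 at $0.0132 per assessment, 200 assessments per month, 30 minutes of analyst time at $50/hour):

```python
ai_monthly = 200 * 0.0132          # Sonnet 4 cost per assessment, from the table
analyst_monthly = 200 * 0.5 * 50   # 30 minutes each at $50/hour
savings = 1 - ai_monthly / analyst_monthly

print(f"${ai_monthly:.2f} vs ${analyst_monthly:,.0f} -> {savings:.2%} cheaper")
```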

Cost Optimization Strategies

| Strategy | Savings | How it works |
| --- | --- | --- |
| Right-size your model | Up to 428x | Use Nova Micro for classification, Sonnet for complex analysis; don't use Opus for simple tasks |
| Optimize prompts | 10-40% | Remove redundant instructions, use shorter examples, constrain output length |
| Batch processing | 50% | Submit requests in bulk (not real-time); perfect for monthly portfolio assessments |
| Intelligent Prompt Routing | Up to 30% | Bedrock auto-routes simple tasks to cheaper models, complex tasks to powerful ones |
| Prompt caching | Up to 90% | Cache your template: pay full price once, 10% for every reuse |
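A sketch of the prompt-caching math, assuming cache reads are billed at 10% of the normal input price as the table states (actual cache-read rates vary by model; check the Bedrock pricing page). `cached_input_cost` is an illustrative helper, not a Bedrock API:

```python
def cached_input_cost(template_tokens: int, reuses: int,
                      input_price_per_m: float, cache_read_rate: float = 0.10) -> float:
    """Input cost of a cached template: full price once to write the cache,
    then a fraction of the input price for each cache read."""
    first_write = template_tokens * input_price_per_m / 1_000_000
    cache_reads = reuses * template_tokens * input_price_per_m * cache_read_rate / 1_000_000
    return first_write + cache_reads

# 400-token template on Sonnet 4 ($3/1M input), 200 monthly assessments
no_cache = 200 * 400 * 3.00 / 1_000_000          # 0.24
with_cache = cached_input_cost(400, 199, 3.00)   # ~0.025
print(f"Template input cost: ${no_cache:.4f} -> ${with_cache:.4f}")
```

The saving on the cached portion approaches 90% as the number of reuses grows, which is where the table's "up to 90%" comes from.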

Right-Sizing Guide

| Task | Recommended model | Cost tier |
| --- | --- | --- |
| Document classification ("Is this an invoice or receipt?") | Nova Micro / Lite | $0.04-0.06/1M tokens |
| Data extraction (fields from invoice PDF) | Nova Pro / Haiku | $0.80/1M tokens |
| Narrative generation (risk assessment, credit narrative) | Sonnet / Llama 70B | $2.65-3.00/1M tokens |
| Complex reasoning (regulatory impact, multi-step analysis) | Sonnet / Opus | $3.00-15.00/1M tokens |

Context Windows: How Much Can the Model "See"?

The context window is the maximum amount of text the model can process at once; your prompt and the AI's response together must fit within it.

| Model | Context window | Text equivalent | Practical meaning |
| --- | --- | --- | --- |
| Nova Micro | 128K tokens | ~100 pages | Can read a short book |
| Nova Pro | 300K tokens | ~230 pages | Can read a long report |
| Claude Sonnet 4 | 200K tokens | ~150 pages | Can read a full policy manual |
| Llama 3.3 70B | 128K tokens | ~100 pages | Can read a short book |
💡 For finance: A typical merchant data file + prompt template + policy document fits easily within any model's context window. You'd only hit limits with very large documents (100+ page regulatory filings).
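A minimal fit check using the ~250 tokens/page estimate from this guide. `fits_in_context` is a hypothetical helper for illustration, not a Bedrock API:

```python
def fits_in_context(prompt_tokens: int, max_output_tokens: int, context_window: int) -> bool:
    """The prompt plus the space reserved for the response must fit in the window."""
    return prompt_tokens + max_output_tokens <= context_window

# 100-page policy manual (~250 tokens/page) + 400-token template, 1,200-token response
prompt = 100 * 250 + 400
print(fits_in_context(prompt, 1_200, 200_000))  # True: fits in a 200K window
```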

Token Estimation Quick Reference

| Content type | Tokens per page | Tokens per item |
| --- | --- | --- |
| Plain English text | ~250/page | - |
| Financial data (CSV) | ~400/page | ~50/row |
| JSON structured data | ~350/page | - |
| A typical email | - | ~200 tokens |
| A merchant risk assessment | - | ~800 tokens |
| A credit committee narrative | - | ~600 tokens |
| An invoice (extracted text) | - | ~300 tokens |

Counting Tokens Before You Spend

Bedrock provides a CountTokens API that lets you check how many tokens your input will use, before you send the actual request. This is free (no charge for counting).

| What you can do | Why it matters |
| --- | --- |
| Estimate costs before sending requests | Know the cost before you commit, especially for large batch jobs |
| Optimize prompts to fit within token limits | Trim your prompt if it's too long for the context window |
| Plan token usage in your applications | Budget your monthly token spend accurately |
💡 Key point: Token counting is model-specific; the same text may produce different token counts on different models because each uses a different tokenizer. The CountTokens API returns the exact count for the model you specify.

Example: Count tokens with Python

PYTHON โ€” CountTokens API
```python
import boto3

client = boto3.client("bedrock-runtime")

# Count tokens for a Converse-style request
response = client.count_tokens(
    modelId="anthropic.claude-sonnet-4-20250514-v1:0",
    input={
        "converse": {
            "messages": [
                {
                    "role": "user",
                    "content": [
                        {"text": "Assess this merchant's risk level based on the following data..."}
                    ],
                }
            ],
            "system": [
                {"text": "You are a Senior Risk Analyst..."}
            ],
        }
    },
)

print(f"Input tokens: {response['inputTokens']}")
# Use this to estimate cost before running the actual inference
```

Token Quotas: Understanding Rate Limits

AWS sets quotas on how many tokens you can use per minute (TPM) and per day (TPD). Understanding how these work helps you avoid throttling.

Key Terms

| Term | What it means |
| --- | --- |
| Tokens per Minute (TPM) | Maximum tokens (input + output) you can use in one minute |
| Tokens per Day (TPD) | Maximum tokens per day (default = TPM × 1,440) |
| Requests per Minute (RPM) | Maximum number of API calls per minute |
| max_tokens | Parameter you set to limit how long the AI's response can be |

The Burndown Rate: Why Output Tokens Cost More Quota

For newer Claude models (3.7 and later), output tokens consume 5x the quota of input tokens. This is because generating text is computationally much harder than reading it.

| Model | Input burndown | Output burndown | Example: 1,000 input + 100 output |
| --- | --- | --- | --- |
| Claude Sonnet 4, Opus 4 | 1:1 | 5:1 | 1,000 + (100 × 5) = 1,500 quota tokens |
| Nova, Llama, older Claude | 1:1 | 1:1 | 1,000 + 100 = 1,100 quota tokens |
โš ๏ธ Important: You're only billed for actual tokens used (1,100 in the example above). The 5x burndown only affects your quota (rate limit), not your bill. But it means you can hit throttling limits faster with Claude 4+ models.

Why max_tokens Matters for Throughput

Bedrock reserves quota for max_tokens at the start of each request, then adjusts after the response is generated:

| | max_tokens = 32,000 (too high) | max_tokens = 1,250 (optimized) |
| --- | --- | --- |
| Initial quota reserved | 40,000 tokens | 9,250 tokens |
| Actual quota used | 9,000 tokens | 9,000 tokens |
| Wasted reservation | 31,000 tokens | 250 tokens |
| Impact | Fewer concurrent requests possible | More concurrent requests possible |
✅ Optimization tip: Set max_tokens close to your expected output size. For a merchant risk assessment (~800 tokens output), set max_tokens to 1,000-1,200, not the default 4,096 or 32,000. This lets you run more concurrent requests within your quota.
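A sketch of the reservation math behind the table above, assuming an ~8,000-token input (implied by the table's numbers), 1:1 burndown for simplicity, and a hypothetical 100K TPM quota; both helpers are illustrative, not AWS APIs:

```python
def reserved_quota(input_tokens: int, max_tokens: int) -> int:
    """Quota Bedrock reserves when a request starts (1:1 burndown assumed)."""
    return input_tokens + max_tokens

def max_concurrent(tpm_quota: int, input_tokens: int, max_tokens: int) -> int:
    """Rough ceiling on simultaneous requests under a TPM quota."""
    return tpm_quota // reserved_quota(input_tokens, max_tokens)

# Same 8,000-token prompt, two max_tokens settings, hypothetical 100K TPM quota
print(reserved_quota(8_000, 32_000), max_concurrent(100_000, 8_000, 32_000))  # 40000 2
print(reserved_quota(8_000, 1_250), max_concurrent(100_000, 8_000, 1_250))    # 9250 10
```

Same quota, five times the throughput, just by right-sizing max_tokens.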

Monitor Your Usage

Use Amazon CloudWatch to track your token consumption:

Navigate to CloudWatch → Dashboards → Automatic dashboards → Bedrock → "Token Counts by Model" to see your usage patterns.

Data Privacy & Security

💡 With Amazon Bedrock:
  • Your data stays in your AWS account; it is not used to train the models
  • You control the region, encryption, and access
  • This is different from using ChatGPT or Claude.ai directly: Bedrock provides enterprise-grade data isolation
  • All API calls are logged and auditable via CloudTrail
  • You can restrict which models and regions are available to your team

Workshop Connection

| Concept | Where you'll see it |
| --- | --- |
| Token estimation | Day 2: Understanding why prompt length matters for cost and quality |
| Model selection | Day 1 Demo: Model Arena, comparing 3 models on the same task |
| Cost optimization | Day 2 Module 7: Bedrock Prompt Management and Optimization |
| Right-sizing models | Day 3: Intelligent Prompt Routing in workflow automation |
| Context windows | Day 2: Managing long conversations and knowing when to start fresh |