How AI models process text, what it costs, and why different models perform differently, explained for finance teams.
LLMs don't process words; they process tokens. A token is a piece of text, roughly four characters or three-quarters of an English word on average.
Why this matters for cost: you pay per token, both for what you send (input) and what the AI generates (output). Longer prompts and longer outputs cost more.
| Content | Approximate tokens |
|---|---|
| A short question ("Assess this merchant") | ~5 tokens |
| A paragraph of merchant data (10 lines) | ~150 tokens |
| Our engineered prompt template | ~400 tokens |
| A full risk assessment output (8 sections) | ~800 tokens |
| Total per assessment (input + output) | ~1,350 tokens |
Models use subword tokenization, breaking text into meaningful pieces rather than whole words:
| Text | Tokens | Count | Note |
|---|---|---|---|
| "chargebacks" | ["charge", "backs"] | 2 | Split into meaningful subwords |
| "PayLater" | ["Pay", "Later"] | 2 | CamelCase splits naturally |
| "SGD" | ["SG", "D"] | 2 | Abbreviations may split |
| "$4,200" | ["$", "4", ",", "200"] | 4 | Numbers are expensive! |
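Exact counts require the model's own tokenizer, but for planning purposes a rough rule of thumb (about four characters per token for English prose) is usually enough. A minimal sketch:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token for English prose.
    Real counts vary by model tokenizer; treat this as a planning figure."""
    return max(1, round(len(text) / 4))

print(estimate_tokens("Assess this merchant"))  # 5, matching the table above
```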
Each model family has its own tokenizer, so the same text may tokenize to a different number of tokens on different models.
Not all AI models are created equal. They differ in size (parameters), training (data and techniques), and architecture, which affects speed, quality, and cost.
Parameters are the "knowledge" stored in the model. More parameters = more capacity for complex reasoning, but also slower and more expensive.
| Model size | Parameters | Analogy | Good for |
|---|---|---|---|
| Small | 1-17B | Junior analyst: fast, handles routine tasks | Classification, simple extraction, FAQ |
| Medium | 17-70B | Senior analyst: balanced speed and depth | Reports, structured analysis, narratives |
| Large | 70B+ | Expert consultant: thorough but expensive | Complex reasoning, multi-step analysis, research |
Different tasks need different trade-offs. Match the model to the job; not every task needs the most powerful option:
| Task type | What matters most | Model category | Examples on Bedrock |
|---|---|---|---|
| Classification & routing | Speed, low cost | Small / lightweight models | Nova Micro, Nova Lite |
| Data extraction & summarization | Accuracy, structured output | Mid-range models | Nova Pro, Claude Haiku, Llama Maverick |
| Narrative generation & analysis | Quality, reasoning depth | Capable models | Claude Sonnet, Llama 70B, DeepSeek |
| Complex multi-step reasoning | Depth, nuance, thoroughness | Frontier models | Claude Sonnet, Claude Opus |
In the Merchant Risk Assessment demo, you may have noticed that the same prompt can produce noticeably different outputs (tone, depth, even the final rating) depending on the model.
This is why model selection matters, and why we use decision rules in the prompt to enforce consistency across models.
Since your team uses Claude (via Cowork and Cursor), here's how the three Claude tiers compare, and which one fits which finance task.
| | Opus 4.7 | Sonnet 4.6 | Haiku 4.5 |
|---|---|---|---|
| Role | Most capable: complex reasoning | Best balance of speed and intelligence | Fastest, near-frontier intelligence |
| Pricing (per 1M tokens) | $5 input / $25 output | $3 input / $15 output | $1 input / $5 output |
| Context window | 1M tokens (~750 pages) | 1M tokens (~750 pages) | 200K tokens (~150 pages) |
| Max output | 128K tokens | 64K tokens | 64K tokens |
| Speed | Moderate | Fast | Fastest |
| Extended thinking | Adaptive thinking | Yes | Yes |
| Knowledge cutoff | Jan 2026 | Aug 2025 | Feb 2025 |
| Finance task | Recommended | Why |
|---|---|---|
| Document classification (invoice vs receipt vs complaint) | Haiku 4.5 | Simple task, speed matters, 3x cheaper than Sonnet |
| Invoice data extraction | Haiku 4.5 | Structured extraction doesn't need deep reasoning |
| Customer complaint response drafts | Sonnet 4.6 | Needs empathy and nuance, but not deep analysis |
| Merchant risk assessment narrative | Sonnet 4.6 | Needs structured reasoning, data citation, and actionable recommendations |
| Credit committee narrative | Sonnet 4.6 | Multi-perspective analysis (bull/bear case) needs good reasoning |
| Regulatory impact assessment | Sonnet 4.6 or Opus 4.7 | Cross-referencing multiple documents, nuanced interpretation |
| Complex multi-step financial analysis | Opus 4.7 | Deep reasoning across large datasets, highest accuracy |
| Bulk monthly assessments (200+ merchants) | Haiku 4.5 | Cost-effective at scale โ $1/1M tokens vs $3 for Sonnet |
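The routing table above can be encoded as a small lookup helper, so the model choice lives in one place. A sketch; the task keys and model names here are illustrative labels, not real Bedrock model IDs:

```python
# Encodes the task-to-model table above. Names are illustrative identifiers.
TASK_TO_MODEL = {
    "classification": "claude-haiku-4.5",
    "extraction": "claude-haiku-4.5",
    "bulk_assessment": "claude-haiku-4.5",
    "complaint_response": "claude-sonnet-4.6",
    "risk_narrative": "claude-sonnet-4.6",
    "credit_narrative": "claude-sonnet-4.6",
    "regulatory_impact": "claude-opus-4.7",
}

def pick_model(task: str) -> str:
    # Default to the balanced mid-tier when a task is unmapped.
    return TASK_TO_MODEL.get(task, "claude-sonnet-4.6")
```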
| Tool | Model used | Can you change it? |
|---|---|---|
| Claude Cowork | Sonnet (Pro plan) or Opus (Max plan) | No: Anthropic assigns based on your plan |
| Cursor | Claude Sonnet, Opus, Haiku + GPT, Gemini, DeepSeek | Yes: select in settings per conversation |
| Kiro (workshop) | Auto-selected by task | No: Kiro picks the best model automatically |
| Bedrock Playground | All models available | Yes: full control for testing and comparison |
Each model's knowledge has a cutoff date; it doesn't know about events after that date.
For questions about recent regulatory changes (e.g., "What did MAS announce in Q4 2025?"), use Sonnet or Opus. For data extraction and classification tasks, the knowledge cutoff doesn't matter; Haiku is fine.
For the most current information, use RAG grounding (Day 2 Module 4): attach the actual document and tell the AI to answer ONLY from that document. This bypasses the knowledge cutoff entirely.
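One simple way to apply that grounding instruction is to wrap the document and the question in a single prompt string. A minimal sketch; the wording and the `<document>` tags are just one reasonable convention, not a required format:

```python
def grounded_prompt(question: str, document: str) -> str:
    """Build a RAG-style prompt that restricts the model to the attached text."""
    return (
        "Answer ONLY from the document below. If the answer is not in the "
        "document, say 'Not found in the provided document.'\n\n"
        f"<document>\n{document}\n</document>\n\n"
        f"Question: {question}"
    )
```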
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Best for |
|---|---|---|---|
| Amazon Nova Micro | $0.035 | $0.14 | Simple classification, routing |
| Amazon Nova Lite | $0.06 | $0.24 | Drafts, summaries, FAQ |
| Llama 4 Maverick 17B | $0.22 | $0.88 | Cost-effective moderate tasks |
| DeepSeek v3.2 | $0.62 | $1.85 | Reasoning, cost-effective |
| Amazon Nova Pro | $0.80 | $3.20 | Reports, analysis |
| Claude Haiku 4.5 | $0.80 | $4.00 | Quality + speed balance |
| Llama 3.3 70B | $2.65 | $3.50 | Open-source experimentation |
| Claude Sonnet 4 | $3.00 | $15.00 | Complex reasoning, compliance |
| Claude Opus 4 | $15.00 | $75.00 | Most complex tasks |
Prices as of 2025-2026. Check aws.amazon.com/bedrock/pricing for current rates.
Using our workshop prompt template (~400 input tokens + ~800 output tokens):
| Model | Cost per assessment | 50/week | 200/month |
|---|---|---|---|
| Nova Micro | $0.000126 | $0.006 | $0.025 |
| Nova Lite | $0.000216 | $0.011 | $0.043 |
| Nova Pro | $0.002880 | $0.144 | $0.576 |
| Claude Haiku 4.5 | $0.003520 | $0.176 | $0.704 |
| Claude Sonnet 4 | $0.013200 | $0.660 | $2.640 |
| Claude Opus 4 | $0.066000 | $3.300 | $13.200 |
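The per-assessment figures above come from simple arithmetic: (input tokens ÷ 1M) × input price + (output tokens ÷ 1M) × output price. A sketch using the prices from the pricing table:

```python
# USD per 1M tokens (input, output), taken from the pricing table above.
PRICES = {
    "nova-micro": (0.035, 0.14),
    "claude-haiku-4.5": (0.80, 4.00),
    "claude-sonnet-4": (3.00, 15.00),
    "claude-opus-4": (15.00, 75.00),
}

def assessment_cost(model: str, input_tokens: int = 400, output_tokens: int = 800) -> float:
    """Cost of one assessment: tokens divided by 1M, times the per-1M price."""
    in_price, out_price = PRICES[model]
    return input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price

print(round(assessment_cost("claude-haiku-4.5"), 6))  # 0.00352, as in the table
```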
| Strategy | Savings | How it works |
|---|---|---|
| Right-size your model | Up to 428x | Use Nova Micro for classification and Sonnet for complex analysis; don't use Opus for simple tasks |
| Optimize prompts | 10-40% | Remove redundant instructions, use shorter examples, constrain output length |
| Batch processing | 50% | Submit requests in bulk (not real-time); perfect for monthly portfolio assessments |
| Intelligent Prompt Routing | Up to 30% | Bedrock auto-routes simple tasks to cheaper models and complex tasks to powerful ones |
| Prompt caching | Up to 90% | Cache your template: pay the full input price once, then roughly 10% of it on each reuse |
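The caching row can be checked with quick arithmetic. This sketch assumes cached reads cost about 10% of the normal input price; the exact fraction and any cache-write surcharge vary by model, so treat the numbers as planning figures:

```python
def cached_input_cost(template_tokens: int, reuses: int, price_per_m: float,
                      cached_fraction: float = 0.10) -> float:
    """Input cost for a cached prompt template: full price on the first call,
    then cached_fraction of the input price on each subsequent cache hit."""
    first_call = template_tokens / 1e6 * price_per_m
    cache_hits = reuses * template_tokens / 1e6 * price_per_m * cached_fraction
    return first_call + cache_hits

# 400-token template on Sonnet ($3/1M input), reused 10 times:
print(round(cached_input_cost(400, 10, 3.00), 6))  # 0.0024 vs 0.0132 uncached
```

Uncached, 11 calls of 400 input tokens would cost 11 × 0.0012 = $0.0132 in input alone, so the cached figure is roughly an 80% input-cost saving in this small example.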
| Task | Recommended model | Cost tier |
|---|---|---|
| Document classification ("Is this an invoice or receipt?") | Nova Micro / Lite | $0.04-0.06/1M tokens |
| Data extraction (fields from invoice PDF) | Nova Pro / Haiku | $0.80/1M tokens |
| Narrative generation (risk assessment, credit narrative) | Sonnet / Llama 70B | $2.65-3.00/1M tokens |
| Complex reasoning (regulatory impact, multi-step analysis) | Sonnet / Opus | $3.00-15.00/1M tokens |
The context window is the maximum amount of text the model can process at once โ your prompt + the AI's response must fit within it.
| Model | Context window | Text equivalent | Practical meaning |
|---|---|---|---|
| Nova Micro | 128K tokens | ~100 pages | Can read a short book |
| Nova Pro | 300K tokens | ~230 pages | Can read a long report |
| Claude Sonnet 4 | 200K tokens | ~150 pages | Can read a full policy manual |
| Llama 3.3 70B | 128K tokens | ~100 pages | Can read a short book |
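Because the prompt and the model's response share the same window, it's worth checking that the prompt plus the reserved response length fits before sending a long document. A minimal check:

```python
def fits_context(prompt_tokens: int, max_tokens: int, window: int) -> bool:
    """The prompt plus the reserved response budget must fit in the window."""
    return prompt_tokens + max_tokens <= window

# Claude Sonnet 4: 200K-token window per the table above.
print(fits_context(150_000, 64_000, 200_000))  # False: no room left for output
```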
| Content type | Tokens per page | Tokens per item |
|---|---|---|
| Plain English text | ~250/page | n/a |
| Financial data (CSV) | ~400/page | ~50/row |
| JSON structured data | ~350/page | n/a |
| A typical email | n/a | ~200 tokens |
| A merchant risk assessment | n/a | ~800 tokens |
| A credit committee narrative | n/a | ~600 tokens |
| An invoice (extracted text) | n/a | ~300 tokens |
Bedrock provides a CountTokens API that lets you check how many tokens your input will use before you send the actual request. This is free (no charge for counting).
| What you can do | Why it matters |
|---|---|
| Estimate costs before sending requests | Know the cost before you commit, especially for large batch jobs |
| Optimize prompts to fit within token limits | Trim your prompt if it's too long for the context window |
| Plan token usage in your applications | Budget your monthly token spend accurately |
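A local pre-flight check is a useful first pass before calling CountTokens for the exact number. This sketch uses the rough characters-divided-by-four heuristic, which is an approximation, not the API's exact count:

```python
def preflight(prompt: str, window: int, max_tokens: int) -> dict:
    """Estimate prompt tokens locally (chars / 4 heuristic) and check whether
    the request would fit in the model's context window. Use Bedrock's
    CountTokens API when you need the exact figure."""
    estimated = max(1, round(len(prompt) / 4))
    return {
        "estimated_tokens": estimated,
        "fits": estimated + max_tokens <= window,
    }

print(preflight("x" * 4000, 128_000, 1_200))
```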
AWS sets quotas on how many tokens you can use per minute (TPM) and per day (TPD). Understanding how these work helps you avoid throttling.
| Term | What it means |
|---|---|
| Tokens per Minute (TPM) | Maximum tokens (input + output) you can use in one minute |
| Tokens per Day (TPD) | Maximum tokens per day (default = TPM × 1,440) |
| Requests per Minute (RPM) | Maximum number of API calls per minute |
| max_tokens | Parameter you set to limit how long the AI's response can be |
For newer Claude models (3.7 and later), output tokens consume 5x the quota of input tokens. This is because generating text is computationally much harder than reading it.
| Model | Input burndown | Output burndown | Example: 1,000 input + 100 output |
|---|---|---|---|
| Claude Sonnet 4, Opus 4 | 1:1 | 5:1 | 1,000 + (100 × 5) = 1,500 quota tokens |
| Nova, Llama, older Claude | 1:1 | 1:1 | 1,000 + 100 = 1,100 quota tokens |
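The burndown arithmetic is easy to encode. A sketch, with the 5:1 output rate for newer Claude models as described above:

```python
def quota_tokens(input_tokens: int, output_tokens: int,
                 output_burndown: int = 5) -> int:
    """Quota consumed by one request under Bedrock burndown rates.
    output_burndown=5 applies to newer Claude models; use 1 for
    Nova, Llama, and older Claude."""
    return input_tokens + output_tokens * output_burndown

print(quota_tokens(1_000, 100))     # 1500 (newer Claude)
print(quota_tokens(1_000, 100, 1))  # 1100 (Nova / Llama / older Claude)
```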
Bedrock reserves quota for max_tokens at the start of each request, then adjusts after the response is generated:
| | max_tokens = 32,000 (too high) | max_tokens = 1,250 (optimized) |
|---|---|---|
| Initial quota reserved | 40,000 tokens | 9,250 tokens |
| Actual quota used | 9,000 tokens | 9,000 tokens |
| Wasted reservation | 31,000 tokens | 250 tokens |
| Impact | Fewer concurrent requests possible | More concurrent requests possible |
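The reservation table can be reproduced with simple arithmetic. This sketch assumes the request in the table sends about 8,000 input tokens and that its output ends up consuming 1,000 quota tokens; both figures are inferred from the table, not stated explicitly:

```python
def reserved_quota(input_tokens: int, max_tokens: int) -> int:
    # Assumption consistent with the table above: Bedrock reserves
    # input + max_tokens against your quota when the request starts.
    return input_tokens + max_tokens

def wasted_reservation(input_tokens: int, max_tokens: int,
                       actual_output_quota: int) -> int:
    """Quota reserved but never used, released only after the response."""
    used = input_tokens + actual_output_quota
    return reserved_quota(input_tokens, max_tokens) - used

print(wasted_reservation(8_000, 32_000, 1_000))  # 31000 (too high)
print(wasted_reservation(8_000, 1_250, 1_000))   # 250 (optimized)
```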
Set max_tokens close to your expected output size. For a merchant risk assessment (~800 tokens output), set max_tokens to 1,000-1,200, not the default 4,096 or 32,000. This lets you run more concurrent requests within your quota.
Use Amazon CloudWatch to track your token consumption:
Navigate to CloudWatch → Dashboards → Automatic dashboards → Bedrock → "Token Counts by Model" to see your usage patterns.
| Concept | Where you'll see it |
|---|---|
| Token estimation | Day 2: Understanding why prompt length matters for cost and quality |
| Model selection | Day 1 Demo: Model Arena, comparing 3 models on the same task |
| Cost optimization | Day 2 Module 7: Bedrock Prompt Management and Optimization |
| Right-sizing models | Day 3: Intelligent Prompt Routing in workflow automation |
| Context windows | Day 2: Managing long conversations and knowing when to start fresh |