Questions raised during the workshop, answered with data and sources.
There's no single "best" model; the right choice depends on your task complexity, quality threshold, and budget. AWS recommends a tiered approach:
| Factor | What to evaluate |
|---|---|
| Task quality | Does the model follow structured output instructions? Does it adhere to decision rules? |
| Cost | Price per 1M tokens varies 143x across models (Nova Micro $0.035 to Claude Opus $5.00, input) |
| Latency | Time-to-first-token and total generation time; matters for real-time vs batch |
| Context window | How much data can the model "see" at once (128K–1M tokens depending on the model) |
| Compliance | Data residency, encryption, audit trail requirements |
Practical strategy: use a tiered model approach, routing simple tasks (classification, extraction) to cheaper models (Nova Micro/Lite), moderate tasks to mid-tier models (Nova Pro, Haiku), and complex reasoning to frontier models (Claude Sonnet/Opus). Amazon Bedrock's Intelligent Prompt Routing can do this automatically, with up to 30% cost reduction.
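As a rough illustration, a prompt router created in Bedrock is invoked through the same Converse API by passing the router's ARN where a model ID would go. The sketch below assumes boto3 credentials and an existing router; the ARN shown is a placeholder, not a real resource.

```python
import boto3

# Placeholder ARN - replace with the prompt router you created in Bedrock.
ROUTER_ARN = "arn:aws:bedrock:us-east-1:123456789012:default-prompt-router/example:1"

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def ask(prompt: str) -> str:
    """Send a prompt through the router; Bedrock picks a cheaper or stronger model per request."""
    response = bedrock.converse(
        modelId=ROUTER_ARN,  # a router ARN is accepted in place of a model ID
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": 512},
    )
    return response["output"]["message"]["content"][0]["text"]

print(ask("Classify this expense as travel, software, or other: 'Uber to airport, $42'"))
```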
LLMs have a knowledge cutoff date: they don't know about events after their training data ends. Three common workarounds: retrieval-augmented generation (RAG), which supplies current documents at query time; tool use such as web search; and fine-tuning or continued training on newer data.
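The RAG approach in miniature: pass whatever current documents you retrieve as context in the prompt, so the model answers from supplied text rather than stale training data. This is a hedged sketch; the retrieval step (knowledge base, search API, web lookup) is assumed, the snippets are made-up examples, and the model ID is illustrative.

```python
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

# Assume these snippets came from your own retrieval step (knowledge base, search API, etc.).
retrieved_snippets = [
    "Q3 2025 press release: AnyCompany opened a new Osaka office in August 2025.",
    "Internal wiki: the Osaka office handles APAC invoice processing.",
]

question = "Where is APAC invoice processing handled?"
context = "\n".join(f"- {s}" for s in retrieved_snippets)

prompt = (
    "Answer using only the context below. If the answer is not in the context, say so.\n\n"
    f"Context:\n{context}\n\nQuestion: {question}"
)

response = bedrock.converse(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # example model ID
    messages=[{"role": "user", "content": [{"text": prompt}]}],
)
print(response["output"]["message"]["content"][0]["text"])
```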
Short answer: No. The hardware does not affect the quality of the trained model. Given the same architecture, data, and hyperparameters, a model trained on TPUs produces the same quality output as one trained on GPUs. The differences are in efficiency:
| Aspect | Google TPU | NVIDIA GPU |
|---|---|---|
| Model quality | Same | Same |
| Training speed | Excels at large-scale (100B+ params) | Excels at flexible workloads |
| Cost efficiency | Up to 4x better perf-per-dollar (TPU v6e vs H100) | Better for smaller/mixed workloads |
| Energy efficiency | 2-3x better perf-per-watt | More general-purpose |
| Framework support | Best with JAX, TensorFlow | Supports everything |
| Availability | Google Cloud only | AWS, Azure, GCP, on-prem |
Why it matters for you: As a consumer of models via Bedrock, you don't choose the hardware; the provider already trained the model. Google uses TPUs for Gemini; Anthropic uses GPUs for Claude. The quality difference comes from architecture, training data, and RLHF, not the chip.
Use a stronger model than the one being judged. The judge should be at least as capable as the model being evaluated. Using a weaker model to judge a stronger one produces unreliable scores.
| Practice | Why |
|---|---|
| Binary pass/fail over 1-5 scales | Likert scales introduce noise ("what's the difference between a 3 and a 4?"); binary judgments force clarity |
| Require written critiques | When the judge marks something as failing, it must explain why, which makes evaluations defensible |
| Watch for position bias | Swapping response order can shift accuracy by >10%. Randomize presentation order |
| Watch for verbosity bias | Longer responses score higher regardless of quality. Penalize unnecessary length |
| Domain accuracy drops | General: 80%+ human agreement. Finance/legal/medical: drops to 60-70%. Use domain-specific rubrics |
Reasoning models as judges: Extended thinking models can be effective because they "show their work", but they're slower and more expensive. For most use cases, a frontier model (Claude Sonnet 4) with a well-designed rubric is sufficient.
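To make those practices concrete, here is a minimal judge sketch under assumptions of my own (the judge model ID, rubric wording, and output format are illustrative): binary pass/fail, a required written critique, and randomized A/B order to blunt position bias.

```python
import json
import random
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")
JUDGE_MODEL = "anthropic.claude-sonnet-4-20250514-v1:0"  # example ID; use a judge at least as strong as the model under test

def judge(question: str, answer_a: str, answer_b: str) -> dict:
    """Binary pass/fail per answer, with a written critique; A/B presentation order is randomized."""
    pair = [("A", answer_a), ("B", answer_b)]
    random.shuffle(pair)  # mitigate position bias
    prompt = (
        "You are grading two answers to the same question.\n"
        f"Question: {question}\n\n"
        + "\n\n".join(f"Answer {label}:\n{text}" for label, text in pair)
        + "\n\nFor each answer, return JSON only, in the form "
        '{"A": {"pass": true, "critique": "..."}, "B": {"pass": false, "critique": "..."}}. '
        "Do not reward length; judge only correctness and instruction-following."
    )
    response = bedrock.converse(
        modelId=JUDGE_MODEL,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"temperature": 0.0},
    )
    # Assumes the judge returns bare JSON; add parsing/retry logic in production.
    return json.loads(response["output"]["message"]["content"][0]["text"])
```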
| Temperature | Behavior | Best for |
|---|---|---|
| 0.0–0.3 (Low) | Deterministic, most probable tokens | Compliance reports, data extraction, risk ratings, code, structured outputs |
| 0.4–0.7 (Medium) | Balanced creativity and consistency | Business writing, email drafts, summaries, analysis narratives |
| 0.8–1.5 (High) | Creative, explores less probable tokens | Brainstorming, marketing copy, creative writing, diverse alternatives |
For AnyCompany Finance: keep temperature low (0.0–0.3) for compliance reports, risk ratings, and data extraction, where consistent and reproducible outputs matter; reserve the medium range for narrative summaries and drafted communications.
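For reference, temperature and the other sampling settings are passed per request through inferenceConfig. A minimal sketch, assuming boto3 access to Bedrock; the model ID and invoice text are examples.

```python
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

response = bedrock.converse(
    modelId="amazon.nova-lite-v1:0",  # example model ID
    messages=[{
        "role": "user",
        # Example input text for illustration.
        "content": [{"text": "Extract the invoice number and total from: Invoice INV-1042, total due $1,250.00"}],
    }],
    # Low temperature for deterministic, structured finance output; raise it for brainstorming.
    inferenceConfig={"temperature": 0.1, "topP": 0.9, "maxTokens": 300},
)
print(response["output"]["message"]["content"][0]["text"])
```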
| Capability | Claude (Anthropic) | Gemini (Google) |
|---|---|---|
| Coding | Strongest (82.1% SWE-bench) | Good (63.8% SWE-bench) |
| Writing quality | Most nuanced, natural prose | Functional but less polished |
| Scientific reasoning | Strong | Strongest (94.3% GPQA Diamond) |
| Instruction following | Excellent; precise on complex constraints | Good but less precise |
| Multimodal | Good image understanding | Strongest; native multimodal, including video |
| Context window | 200K tokens | Up to 1M tokens |
| Web access | Needs tools/MCP | Built-in Google Search |
| Cost (frontier) | $3-5/1M input tokens | Competitive, often cheaper |
Key takeaway: No single model wins everything. For structured finance outputs, Claude follows complex formatting more precisely. For research with large documents, Gemini's context window is an advantage. Test your specific use case on 2-3 models and pick the best value.
Traditional OCR struggles with dynamic layouts because it extracts characters but doesn't understand document structure. The modern approach:
| Approach | Accuracy | Handles new layouts? | Maintenance |
|---|---|---|---|
| Template-based OCR | 85-92% | No; new template per vendor | High |
| AI-powered OCR (Textract) | 95-98% character | Partially | Medium |
| Multimodal Vision LLM | 94-98.5% field | Yes; "looks" at the doc | Low (prompt-based) |
Recommended architecture for AnyCompany: route standard invoices through Textract (~$0.01/page) and escalate complex or low-confidence ones to a Vision LLM (~$0.01-0.03/page). At 3,000+ invoices/month, that comes to roughly $30-90/month versus the cost of manual processing.
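A hedged sketch of that routing, with the confidence threshold and model ID chosen for illustration: run Textract's expense analysis first and fall back to a vision-capable model through Bedrock only when field confidence comes back low.

```python
import boto3

textract = boto3.client("textract")
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")
CONFIDENCE_FLOOR = 90.0  # illustrative threshold

def extract_invoice(pdf_bytes: bytes) -> str:
    # First pass: structured expense extraction with Textract.
    result = textract.analyze_expense(Document={"Bytes": pdf_bytes})
    fields = [
        f for doc in result.get("ExpenseDocuments", [])
        for f in doc.get("SummaryFields", [])
    ]
    confidences = [f.get("ValueDetection", {}).get("Confidence", 0.0) for f in fields]

    if fields and min(confidences) >= CONFIDENCE_FLOOR:
        return str(fields)  # standard invoice: Textract output is good enough

    # Fallback: let a multimodal model read the document directly.
    response = bedrock.converse(
        modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # example vision-capable model
        messages=[{
            "role": "user",
            "content": [
                {"text": "Extract vendor, invoice number, date, and total as JSON."},
                {"document": {"format": "pdf", "name": "invoice", "source": {"bytes": pdf_bytes}}},
            ],
        }],
    )
    return response["output"]["message"]["content"][0]["text"]
```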
| Aspect | Amazon Bedrock | Direct Provider API |
|---|---|---|
| Data privacy | Data stays in your AWS account. VPC endpoints available. | Data sent to provider's infrastructure. |
| Authentication | AWS IAM roles | API keys, managed separately |
| Multi-model | Single Converse API for all models | One API per provider |
| Billing | Consolidated on AWS bill | Separate per provider |
| Compliance | SOC, HIPAA, PCI-DSS certifications | Varies by provider |
| Guardrails | Built-in content filtering, PII redaction, grounding | Must build your own |
| Monitoring | CloudWatch integration | Must set up your own |
| Latency | Slight overhead from routing | Marginally lower |
| New features | May arrive days/weeks later | Available first |
| Batch discount | 50% off for batch inference | Volume discounts vary |
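The "single Converse API" and IAM rows are easiest to see in code: the same call shape works across providers, with only the model ID changing and no per-provider API keys. A small sketch; the model IDs and sample text are examples.

```python
import boto3

# Auth comes from the IAM role or credentials already on the machine; no provider API keys.
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def summarize(model_id: str, text: str) -> str:
    """Same Converse call regardless of which provider's model is behind model_id."""
    response = bedrock.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": f"Summarize in two sentences:\n{text}"}]}],
    )
    return response["output"]["message"]["content"][0]["text"]

report = "Q3 revenue rose 8% on higher subscription renewals; operating costs were flat."
for model_id in ["amazon.nova-lite-v1:0", "anthropic.claude-3-haiku-20240307-v1:0"]:  # example IDs
    print(model_id, "->", summarize(model_id, report))
```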
Amazon Bedrock Guardrails has a Contextual Grounding Check with two distinct scores:
| Check | What it measures | Example |
|---|---|---|
| Grounding | Is the response factually accurate based on the source? New info not in the source = ungrounded. | Source: "Tokyo is the capital of Japan." Response: "Capital of Japan is London." → Ungrounded |
| Relevance | Does the response actually answer the user's question? | Query: "Capital of Japan?" Response: "Capital of UK is London." → Irrelevant (grounded but wrong question) |
How it works: You provide a grounding source (reference docs), a query (the user's question), and the model response. The system scores both grounding and relevance (0–0.99). You set thresholds; responses scoring below them are blocked as hallucinations.
Limits: Grounding source max 100,000 chars, query max 1,000 chars, response max 5,000 chars.
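A minimal sketch of the check using the ApplyGuardrail API, assuming a guardrail with contextual grounding already configured; the guardrail identifier is a placeholder and the texts reuse the example from the table.

```python
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

result = bedrock.apply_guardrail(
    guardrailIdentifier="abc123example",  # placeholder guardrail ID
    guardrailVersion="1",
    source="OUTPUT",  # we are checking a model response
    content=[
        {"text": {"text": "Tokyo is the capital of Japan.", "qualifiers": ["grounding_source"]}},
        {"text": {"text": "What is the capital of Japan?", "qualifiers": ["query"]}},
        {"text": {"text": "Capital of Japan is London.", "qualifiers": ["guard_content"]}},
    ],
)

print(result["action"])  # e.g. GUARDRAIL_INTERVENED when a score falls below its threshold
for assessment in result.get("assessments", []):
    for f in assessment.get("contextualGroundingPolicy", {}).get("filters", []):
        print(f.get("type"), f.get("score"), "threshold", f.get("threshold"))
```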
This is a known limitation of all AI coding assistants, caused by context window limits. When conversation history exceeds the model's context window, the tool summarizes older messages, and important details can be lost.
| Strategy | How | Why it helps |
|---|---|---|
| Short, focused conversations | One task per chat session | Prevents context buildup |
| Steering/rules files | Put persistent context in workspace config files | Loaded fresh each message, not accumulated |
| Reference files | Keep data in .md/.csv files, reference them | Tool reads on demand vs carrying in history |
| Duplicate chat | In Cursor, duplicate before context fills | Preserves full context in the copy |
| Write plans to files | Ask AI to save plan to plan.md | Plan survives across sessions |
| Larger context models | Select Claude 200K or Gemini 1M | More room before summarization |
Markdown files are just plain text files with a .md extension. You don't need any special software; any text editor works.
On macOS:
Option A: TextEdit. Choose Format > Make Plain Text first (so the file isn't saved as .rtf), then save it as my-file.md.
Option B: Terminal.
touch my-file.md
open -e my-file.md
Option C: VS Code / Kiro. Create a new file and save it with a .md extension.
On Windows:
Option A: Notepad. Save the file as my-file.md and set the encoding to UTF-8.
Option B: VS Code / Kiro. Create a new file and save it with a .md extension.