Questions raised during the workshop, answered with data and sources.
There's no single "best" model; the right choice depends on your task complexity, quality threshold, and budget. AWS recommends a tiered approach:
| Factor | What to evaluate |
|---|---|
| Task quality | Does the model follow structured output instructions? Does it adhere to decision rules? |
| Cost | Price per 1M tokens varies 143x across models (Nova Micro $0.035 to Claude Opus $5.00, input) |
| Latency | Time-to-first-token and total generation time; matters for real-time vs batch |
| Context window | How much data can the model "see" at once (128K–1M tokens depending on the model) |
| Compliance | Data residency, encryption, audit trail requirements |
Practical strategy: use a tiered model approach, routing simple tasks (classification, extraction) to cheaper models (Nova Micro/Lite), moderate tasks to mid-tier models (Nova Pro, Haiku), and complex reasoning to frontier models (Claude Sonnet/Opus). Amazon Bedrock's Intelligent Prompt Routing can do this automatically, with up to 30% cost reduction.
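As a rough illustration, a prompt router created in Bedrock is invoked through the same Converse API by passing the router's ARN where a model ID would go. The sketch below assumes boto3 credentials and an existing router; the ARN shown is a placeholder, not a real resource.

```python
import boto3

# Placeholder ARN - replace with the prompt router you created in Bedrock.
ROUTER_ARN = "arn:aws:bedrock:us-east-1:123456789012:default-prompt-router/example:1"

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def ask(prompt: str) -> str:
    """Send a prompt through the router; Bedrock picks a cheaper or stronger model per request."""
    response = bedrock.converse(
        modelId=ROUTER_ARN,  # a router ARN is accepted in place of a model ID
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": 512},
    )
    return response["output"]["message"]["content"][0]["text"]

print(ask("Classify this expense as travel, software, or other: 'Uber to airport, $42'"))
```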
LLMs have a knowledge cutoff date: they don't know about events after their training data ends. Three common workarounds: retrieval-augmented generation (RAG), which supplies current documents at query time; tool use such as web search; and fine-tuning or continued training on newer data.
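The RAG approach in miniature: pass whatever current documents you retrieve as context in the prompt, so the model answers from supplied text rather than stale training data. This is a hedged sketch; the retrieval step (knowledge base, search API, web lookup) is assumed, the snippets are made-up examples, and the model ID is illustrative.

```python
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

# Assume these snippets came from your own retrieval step (knowledge base, search API, etc.).
retrieved_snippets = [
    "Q3 2025 press release: AnyCompany opened a new Osaka office in August 2025.",
    "Internal wiki: the Osaka office handles APAC invoice processing.",
]

question = "Where is APAC invoice processing handled?"
context = "\n".join(f"- {s}" for s in retrieved_snippets)

prompt = (
    "Answer using only the context below. If the answer is not in the context, say so.\n\n"
    f"Context:\n{context}\n\nQuestion: {question}"
)

response = bedrock.converse(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # example model ID
    messages=[{"role": "user", "content": [{"text": prompt}]}],
)
print(response["output"]["message"]["content"][0]["text"])
```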
Short answer: No. The hardware does not affect the quality of the trained model. Given the same architecture, data, and hyperparameters, a model trained on TPUs produces the same quality output as one trained on GPUs. The differences are in efficiency:
| Aspect | Google TPU | NVIDIA GPU |
|---|---|---|
| Model quality | Same | Same |
| Training speed | Excels at large-scale (100B+ params) | Excels at flexible workloads |
| Cost efficiency | Up to 4x better perf-per-dollar (TPU v6e vs H100) | Better for smaller/mixed workloads |
| Energy efficiency | 2-3x better perf-per-watt | More general-purpose |
| Framework support | Best with JAX, TensorFlow | Supports everything |
| Availability | Google Cloud only | AWS, Azure, GCP, on-prem |
Why it matters for you: As a consumer of models via Bedrock, you don't choose the hardware; the provider already trained the model. Google uses TPUs for Gemini; Anthropic uses GPUs for Claude. The quality difference comes from architecture, training data, and RLHF, not the chip.
Use a stronger model than the one being judged. The judge should be at least as capable as the model being evaluated. Using a weaker model to judge a stronger one produces unreliable scores.
| Practice | Why |
|---|---|
| Binary pass/fail over 1-5 scales | Likert scales introduce noise ("what's the difference between a 3 and a 4?"); binary judgments force clarity |
| Require written critiques | When the judge marks something as failing, it must explain why, which makes evaluations defensible |
| Watch for position bias | Swapping response order can shift accuracy by >10%. Randomize presentation order |
| Watch for verbosity bias | Longer responses score higher regardless of quality. Penalize unnecessary length |
| Domain accuracy drops | General: 80%+ human agreement. Finance/legal/medical: drops to 60-70%. Use domain-specific rubrics |
Reasoning models as judges: Extended thinking models can be effective because they "show their work", but they're slower and more expensive. For most use cases, a frontier model (Claude Sonnet 4) with a well-designed rubric is sufficient.
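To make those practices concrete, here is a minimal judge sketch under assumptions of my own (the judge model ID, rubric wording, and output format are illustrative): binary pass/fail, a required written critique, and randomized A/B order to blunt position bias.

```python
import json
import random
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")
JUDGE_MODEL = "anthropic.claude-sonnet-4-20250514-v1:0"  # example ID; use a judge at least as strong as the model under test

def judge(question: str, answer_a: str, answer_b: str) -> dict:
    """Binary pass/fail per answer, with a written critique; A/B presentation order is randomized."""
    pair = [("A", answer_a), ("B", answer_b)]
    random.shuffle(pair)  # mitigate position bias
    prompt = (
        "You are grading two answers to the same question.\n"
        f"Question: {question}\n\n"
        + "\n\n".join(f"Answer {label}:\n{text}" for label, text in pair)
        + "\n\nFor each answer, return JSON only, in the form "
        '{"A": {"pass": true, "critique": "..."}, "B": {"pass": false, "critique": "..."}}. '
        "Do not reward length; judge only correctness and instruction-following."
    )
    response = bedrock.converse(
        modelId=JUDGE_MODEL,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"temperature": 0.0},
    )
    # Assumes the judge returns bare JSON; add parsing/retry logic in production.
    return json.loads(response["output"]["message"]["content"][0]["text"])
```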
| Temperature | Behavior | Best for |
|---|---|---|
| 0.0–0.3 (Low) | Deterministic, most probable tokens | Compliance reports, data extraction, risk ratings, code, structured outputs |
| 0.4–0.7 (Medium) | Balanced creativity and consistency | Business writing, email drafts, summaries, analysis narratives |
| 0.8–1.5 (High) | Creative, explores less probable tokens | Brainstorming, marketing copy, creative writing, diverse alternatives |
For AnyCompany Finance: keep temperature low (0.0–0.3) for compliance reports, risk ratings, and data extraction, where consistent and reproducible outputs matter; reserve the medium range for narrative summaries and drafted communications.
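For reference, temperature and the other sampling settings are passed per request through inferenceConfig. A minimal sketch, assuming boto3 access to Bedrock; the model ID and invoice text are examples.

```python
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

response = bedrock.converse(
    modelId="amazon.nova-lite-v1:0",  # example model ID
    messages=[{
        "role": "user",
        # Example input text for illustration.
        "content": [{"text": "Extract the invoice number and total from: Invoice INV-1042, total due $1,250.00"}],
    }],
    # Low temperature for deterministic, structured finance output; raise it for brainstorming.
    inferenceConfig={"temperature": 0.1, "topP": 0.9, "maxTokens": 300},
)
print(response["output"]["message"]["content"][0]["text"])
```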
| Capability | Claude (Anthropic) | Gemini (Google) |
|---|---|---|
| Coding | Strongest (82.1% SWE-bench) | Good (63.8% SWE-bench) |
| Writing quality | Most nuanced, natural prose | Functional but less polished |
| Scientific reasoning | Strong | Strongest (94.3% GPQA Diamond) |
| Instruction following | Excellent; precise on complex constraints | Good but less precise |
| Multimodal | Good image understanding | Strongest; native multimodal, including video |
| Context window | 200K tokens | Up to 1M tokens |
| Web access | Needs tools/MCP | Built-in Google Search |
| Cost (frontier) | $3-5/1M input tokens | Competitive, often cheaper |
Key takeaway: No single model wins everything. For structured finance outputs, Claude follows complex formatting more precisely. For research with large documents, Gemini's context window is an advantage. Test your specific use case on 2-3 models and pick the best value.
Traditional OCR struggles with dynamic layouts because it extracts characters but doesn't understand document structure. The modern approach:
| Approach | Accuracy | Handles new layouts? | Maintenance |
|---|---|---|---|
| Template-based OCR | 85-92% | No; new template per vendor | High |
| AI-powered OCR (Textract) | 95-98% character | Partially | Medium |
| Multimodal Vision LLM | 94-98.5% field | Yes; "looks" at the doc | Low (prompt-based) |
Recommended architecture for AnyCompany: route standard invoices through Textract (~$0.01/page) and escalate complex or low-confidence ones to a Vision LLM (~$0.01-0.03/page). At 3,000+ invoices/month, that comes to roughly $30-90/month versus the cost of manual processing.
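A hedged sketch of that routing, with the confidence threshold and model ID chosen for illustration: run Textract's expense analysis first and fall back to a vision-capable model through Bedrock only when field confidence comes back low.

```python
import boto3

textract = boto3.client("textract")
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")
CONFIDENCE_FLOOR = 90.0  # illustrative threshold

def extract_invoice(pdf_bytes: bytes) -> str:
    # First pass: structured expense extraction with Textract.
    result = textract.analyze_expense(Document={"Bytes": pdf_bytes})
    fields = [
        f for doc in result.get("ExpenseDocuments", [])
        for f in doc.get("SummaryFields", [])
    ]
    confidences = [f.get("ValueDetection", {}).get("Confidence", 0.0) for f in fields]

    if fields and min(confidences) >= CONFIDENCE_FLOOR:
        return str(fields)  # standard invoice: Textract output is good enough

    # Fallback: let a multimodal model read the document directly.
    response = bedrock.converse(
        modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # example vision-capable model
        messages=[{
            "role": "user",
            "content": [
                {"text": "Extract vendor, invoice number, date, and total as JSON."},
                {"document": {"format": "pdf", "name": "invoice", "source": {"bytes": pdf_bytes}}},
            ],
        }],
    )
    return response["output"]["message"]["content"][0]["text"]
```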
| Aspect | Amazon Bedrock | Direct Provider API |
|---|---|---|
| Data privacy | Data stays in your AWS account. VPC endpoints available. | Data sent to provider's infrastructure. |
| Authentication | AWS IAM roles | API keys, managed separately |
| Multi-model | Single Converse API for all models | One API per provider |
| Billing | Consolidated on AWS bill | Separate per provider |
| Compliance | SOC, HIPAA, PCI-DSS certifications | Varies by provider |
| Guardrails | Built-in content filtering, PII redaction, grounding | Must build your own |
| Monitoring | CloudWatch integration | Must set up your own |
| Latency | Slight overhead from routing | Marginally lower |
| New features | May arrive days/weeks later | Available first |
| Batch discount | 50% off for batch inference | Volume discounts vary |
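The "single Converse API" and IAM rows are easiest to see in code: the same call shape works across providers, with only the model ID changing and no per-provider API keys. A small sketch; the model IDs and sample text are examples.

```python
import boto3

# Auth comes from the IAM role or credentials already on the machine; no provider API keys.
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def summarize(model_id: str, text: str) -> str:
    """Same Converse call regardless of which provider's model is behind model_id."""
    response = bedrock.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": f"Summarize in two sentences:\n{text}"}]}],
    )
    return response["output"]["message"]["content"][0]["text"]

report = "Q3 revenue rose 8% on higher subscription renewals; operating costs were flat."
for model_id in ["amazon.nova-lite-v1:0", "anthropic.claude-3-haiku-20240307-v1:0"]:  # example IDs
    print(model_id, "->", summarize(model_id, report))
```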
Amazon Bedrock Guardrails has a Contextual Grounding Check with two distinct scores:
| Check | What it measures | Example |
|---|---|---|
| Grounding | Is the response factually accurate based on the source? New info not in the source = ungrounded. | Source: "Tokyo is the capital of Japan." Response: "Capital of Japan is London." → Ungrounded |
| Relevance | Does the response actually answer the user's question? | Query: "Capital of Japan?" Response: "Capital of UK is London." → Irrelevant (grounded but wrong question) |
How it works: You provide a grounding source (reference docs), a query (the user's question), and the model response. The system scores both grounding and relevance (0–0.99). You set thresholds; responses scoring below them are blocked as hallucinations.
Limits: Grounding source max 100,000 chars, query max 1,000 chars, response max 5,000 chars.
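A minimal sketch of the check using the ApplyGuardrail API, assuming a guardrail with contextual grounding already configured; the guardrail identifier is a placeholder and the texts reuse the example from the table.

```python
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

result = bedrock.apply_guardrail(
    guardrailIdentifier="abc123example",  # placeholder guardrail ID
    guardrailVersion="1",
    source="OUTPUT",  # we are checking a model response
    content=[
        {"text": {"text": "Tokyo is the capital of Japan.", "qualifiers": ["grounding_source"]}},
        {"text": {"text": "What is the capital of Japan?", "qualifiers": ["query"]}},
        {"text": {"text": "Capital of Japan is London.", "qualifiers": ["guard_content"]}},
    ],
)

print(result["action"])  # e.g. GUARDRAIL_INTERVENED when a score falls below its threshold
for assessment in result.get("assessments", []):
    for f in assessment.get("contextualGroundingPolicy", {}).get("filters", []):
        print(f.get("type"), f.get("score"), "threshold", f.get("threshold"))
```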
This is a known limitation of all AI coding assistants, caused by context window limits. When conversation history exceeds the model's context window, the tool summarizes older messages, and important details can be lost.
| Strategy | How | Why it helps |
|---|---|---|
| Short, focused conversations | One task per chat session | Prevents context buildup |
| Steering/rules files | Put persistent context in workspace config files | Loaded fresh each message, not accumulated |
| Reference files | Keep data in .md/.csv files, reference them | Tool reads on demand vs carrying in history |
| Duplicate chat | In Cursor, duplicate before context fills | Preserves full context in the copy |
| Write plans to files | Ask AI to save plan to plan.md | Plan survives across sessions |
| Larger context models | Select Claude 200K or Gemini 1M | More room before summarization |
Markdown files are just plain text files with a .md extension. You don't need any special software; any text editor works.
On macOS:
Option A: TextEdit. Choose Format > Make Plain Text first (so the file isn't saved as .rtf), then save it as my-file.md.
Option B: Terminal.
touch my-file.md
open -e my-file.md
Option C: VS Code / Kiro. Create a new file and save it with a .md extension.
On Windows:
Option A: Notepad. Save the file as my-file.md and set the encoding to UTF-8.
Option B: VS Code / Kiro. Create a new file and save it with a .md extension.