๐Ÿ…ฟ๏ธ Parking Lot โ€” Participant Q&A

Questions raised during the workshop, answered with data and sources.

📋 Questions

  1. LLM Selection - Cost, performance, guidance
  2. Web search capabilities for AI agents
  3. TPU vs GPU - Does hardware affect model quality?
  4. LLM-as-Judge - Which model to use?
  5. Temperature - When to use low vs high?
  6. Gemini vs Claude - Quality differences
  7. P2P Invoice extraction at scale beyond OCR
  8. Bedrock vs direct LLM provider endpoint
  9. Grounding vs Relevance in Bedrock Guardrails
  10. Cursor auto-summarizes and loses context
  11. How to save a file in Markdown (.md) format

Q1 LLM Selection - What is the guidance on selecting the best model?

There's no single "best" model; the right choice depends on your task complexity, quality threshold, and budget. AWS recommends a tiered approach:

| Factor | What to evaluate |
|---|---|
| Task quality | Does the model follow structured output instructions? Does it adhere to decision rules? |
| Cost | Price per 1M tokens varies 143x across models (Nova Micro $0.035 → Claude Opus $5.00 input) |
| Latency | Time-to-first-token and total generation time; matters for real-time vs batch |
| Context window | How much data the model can "see" at once (128K-1M tokens depending on model) |
| Compliance | Data residency, encryption, audit trail requirements |

Practical strategy: Use a tiered model approach. Route simple tasks (classification, extraction) to cheaper models (Nova Micro/Lite), moderate tasks to mid-tier models (Nova Pro, Haiku), and complex reasoning to frontier models (Claude Sonnet/Opus). Amazon Bedrock's Intelligent Prompt Routing can do this automatically, with up to 30% cost reduction.
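
Intelligent Prompt Routing handles the selection for you, but the idea is easy to sketch by hand. Below is a minimal sketch using the Bedrock Converse API via boto3; the tier-to-model mapping and the model IDs are assumptions to verify against the model catalog in your region.

```python
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

# Assumed tier map -- verify model IDs against the Bedrock model catalog in your region.
MODEL_TIERS = {
    "simple": "amazon.nova-micro-v1:0",    # classification, extraction
    "moderate": "amazon.nova-pro-v1:0",    # drafting, summaries
    "complex": "anthropic.claude-sonnet-4-20250514-v1:0",  # multi-step reasoning
}

def ask(tier: str, prompt: str) -> str:
    """Send a prompt to the model chosen for the given task tier."""
    response = bedrock.converse(
        modelId=MODEL_TIERS[tier],
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": 512, "temperature": 0.2},
    )
    return response["output"]["message"]["content"][0]["text"]

# Routine extraction goes to the cheapest tier
print(ask("simple", "Extract the invoice number from: 'Invoice INV-2041, due in 30 days.'"))
```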

New: Bedrock Service Tiers (2025) offer Priority (premium, lowest latency for mission-critical workloads), Standard (consistent performance), and Flex (lower cost, higher latency for batch workloads). You can choose the tier per API call.

Q2 How to get web search capabilities on AI agents?

LLMs have a knowledge cutoff date: they don't know about events after their training data ends. Three approaches:

  1. MCP + Web Search Tools: Connect your AI agent to a web search MCP server (e.g., Brave Search MCP, Tavily MCP). The agent searches the web in real time. This is what Kiro uses.
  2. RAG (Retrieval-Augmented Generation): Index your own documents and let the agent retrieve relevant chunks before generating. Covered in Day 2 Module 4. Best for internal/proprietary information.
  3. Tool Use / Function Calling: Give the agent access to APIs (news feeds, market data, regulatory databases) via tool definitions. The agent decides when to call which API (see the sketch below).

For AnyCompany: Combine RAG (for internal policies) with web search MCP (for current regulatory updates). On Day 3 Lab 9, you'll see MCP in action, connecting Kiro to a database to query merchant data in plain English.
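
To illustrate approach 3, here is a minimal sketch of a tool definition passed to the Bedrock Converse API. The web_search tool, its schema, and the example query are hypothetical; the toolConfig shape follows Bedrock's Converse API, but your agent code still has to execute the search itself when the model requests it.

```python
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

# Hypothetical web-search tool -- the name, schema, and backing search API are up to you.
tool_config = {
    "tools": [{
        "toolSpec": {
            "name": "web_search",
            "description": "Search the web for current information and return top results.",
            "inputSchema": {"json": {
                "type": "object",
                "properties": {"query": {"type": "string", "description": "Search query"}},
                "required": ["query"],
            }},
        }
    }]
}

response = bedrock.converse(
    modelId="amazon.nova-pro-v1:0",  # assumed model ID; any tool-capable model works
    messages=[{"role": "user", "content": [{"text": "What e-invoicing rule changes took effect this quarter?"}]}],
    toolConfig=tool_config,
)

# If the model decides it needs to search, stopReason is "tool_use" and the
# requested call (tool name + input) appears in the returned message content.
print(response["stopReason"])
```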

Q3 TPU from Google vs GPU - Will that produce different quality models?

Short answer: No. The hardware does not affect the quality of the trained model. Given the same architecture, data, and hyperparameters, a model trained on TPUs produces the same quality output as one trained on GPUs. The differences are in efficiency:

| Aspect | Google TPU | NVIDIA GPU |
|---|---|---|
| Model quality | Same | Same |
| Training speed | Excels at large-scale training (100B+ params) | Excels at flexible workloads |
| Cost efficiency | Up to 4x better perf-per-dollar (TPU v6e vs H100) | Better for smaller/mixed workloads |
| Energy efficiency | 2-3x better perf-per-watt | More general-purpose |
| Framework support | Best with JAX, TensorFlow | Supports everything |
| Availability | Google Cloud only | AWS, Azure, GCP, on-prem |

Why it matters for you: As a consumer of models via Bedrock, you don't choose the hardware; the provider already trained the model. Google uses TPUs for Gemini; Anthropic uses GPUs for Claude. The quality difference comes from architecture, training data, and RLHF, not the chip.

Q4 LLM-as-Judge - Which model should be the judge? Reasoning model? Bigger model?

Use a judge that is at least as capable as the model being evaluated, and ideally stronger. Using a weaker model to judge a stronger one produces unreliable scores.

| Practice | Why |
|---|---|
| Binary pass/fail over 1-5 scales | Likert scales introduce noise ("what's the difference between 3 and 4?"); binary judgments force clarity |
| Require written critiques | When the judge marks something as failing, it must explain why; this makes evaluations defensible |
| Watch for position bias | Swapping response order can shift accuracy by >10%; randomize presentation order |
| Watch for verbosity bias | Longer responses score higher regardless of quality; penalize unnecessary length |
| Domain accuracy drops | General tasks: 80%+ human agreement. Finance/legal/medical: drops to 60-70%. Use domain-specific rubrics |

Reasoning models as judges: Extended thinking models can be effective because they "show their work", but they're slower and more expensive. For most use cases, a frontier model (Claude Sonnet 4) with a well-designed rubric is sufficient.

Our workshop approach: We use Claude Sonnet 4 as the judge for both prompt template scoring (Day 2) and agent canvas scoring (Day 3), with structured rubrics (5-6 criteria, binary per criterion with written justification).
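
A minimal sketch of that pattern: one criterion, a binary verdict, and a required written critique. The criterion, the judge model ID, and the JSON contract below are illustrative assumptions, not the workshop rubric verbatim.

```python
import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

JUDGE_PROMPT = """You are grading an AI response against a single criterion.
Criterion: the response uses only figures present in the source data.
Source data: {source}
Response to grade: {candidate}
Reply with raw JSON only: {{"pass": true or false, "critique": "one sentence explaining why"}}"""

def judge(source: str, candidate: str) -> dict:
    """Binary pass/fail with a written critique, scored at temperature 0."""
    result = bedrock.converse(
        modelId="anthropic.claude-sonnet-4-20250514-v1:0",  # assumed judge model ID
        messages=[{"role": "user", "content": [{"text": JUDGE_PROMPT.format(source=source, candidate=candidate)}]}],
        inferenceConfig={"temperature": 0.0, "maxTokens": 300},
    )
    # Assumes the judge returns raw JSON as instructed; add fallback parsing in production.
    return json.loads(result["output"]["message"]["content"][0]["text"])

verdict = judge("Q3 revenue: $4.2M", "Q3 revenue was $5.1M, up 12% year on year.")
print(verdict["pass"], "-", verdict["critique"])
```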

Q5 When should we use lower vs higher temperature?

| Temperature | Behavior | Best for |
|---|---|---|
| 0.0-0.3 (Low) | Deterministic, most probable tokens | Compliance reports, data extraction, risk ratings, code, structured outputs |
| 0.4-0.7 (Medium) | Balanced creativity and consistency | Business writing, email drafts, summaries, analysis narratives |
| 0.8-1.5 (High) | Creative, explores less probable tokens | Brainstorming, marketing copy, creative writing, diverse alternatives |

For AnyCompany Finance:

Practical tip: In Kiro and Claude Cowork, you can't directly set temperature. Instead, control it through your prompt: add constraints like "ONLY use provided data" for low-temperature behavior, or "generate 5 diverse alternatives" for high-temperature behavior.
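
Where you do call a model directly (for example through the Bedrock Converse API), temperature is a single request parameter. A minimal sketch, assuming boto3 and the model ID shown:

```python
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def generate(prompt: str, temperature: float) -> str:
    """Same call, different temperature: low for extraction, high for brainstorming."""
    response = bedrock.converse(
        modelId="amazon.nova-lite-v1:0",  # assumed model ID
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"temperature": temperature, "maxTokens": 400},
    )
    return response["output"]["message"]["content"][0]["text"]

# Low temperature for a deterministic, structured task
rating = generate("Classify this vendor as LOW, MEDIUM, or HIGH risk: ...", temperature=0.1)

# High temperature for a divergent, creative task
ideas = generate("Suggest five names for our expense-automation initiative.", temperature=0.9)
```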

Q6 Gemini vs Claude - How do different models produce different quality?

| Capability | Claude (Anthropic) | Gemini (Google) |
|---|---|---|
| Coding | Strongest: 82.1% SWE-bench | Good: 63.8% SWE-bench |
| Writing quality | Most nuanced, natural prose | Functional but less polished |
| Scientific reasoning | Strong | Strongest: 94.3% GPQA Diamond |
| Instruction following | Excellent; precise on complex constraints | Good but less precise |
| Multimodal | Good image understanding | Strongest: native multimodal, video |
| Context window | 200K tokens | Up to 1M tokens |
| Web access | Needs tools/MCP | Built-in Google Search |
| Cost (frontier) | $3-5/1M input tokens | Competitive, often cheaper |

Key takeaway: No single model wins everything. For structured finance outputs, Claude follows complex formatting more precisely. For research with large documents, Gemini's context window is an advantage. Test your specific use case on 2-3 models and pick the best value.

Q7 P2P Invoice Data Extraction - How to handle dynamic invoices at scale beyond OCR?

Traditional OCR struggles with dynamic layouts because it extracts characters but doesn't understand document structure. The modern approach:

| Approach | Accuracy | Handles new layouts? | Maintenance |
|---|---|---|---|
| Template-based OCR | 85-92% | ❌ New template per vendor | High |
| AI-powered OCR (Textract) | 95-98% (character level) | ✅ Partially | Medium |
| Multimodal Vision LLM | 94-98.5% (field level) | ✅ Yes; "looks" at the document | Low (prompt-based) |

Recommended architecture for AnyCompany:

  1. Amazon Textract for initial extraction (fast, cheap, handles standard layouts)
  2. Multimodal LLM fallback for complex invoices: send the image to Claude/GPT-4o Vision and get structured JSON back
  3. Validation layer: programmatic checks (totals = sum of line items, GST, currency); see the sketch after this list
  4. Human-in-the-loop for low-confidence extractions
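
For step 3, the checks are plain code, with no model involved. A minimal sketch; the field names and the 9% GST rate are assumptions for illustration only.

```python
from decimal import Decimal

GST_RATE = Decimal("0.09")  # assumed rate -- adjust per jurisdiction

def validate_invoice(inv: dict) -> list[str]:
    """Return a list of validation failures; an empty list means the extraction passes."""
    errors = []
    line_total = sum(Decimal(str(item["amount"])) for item in inv["line_items"])
    if line_total != Decimal(str(inv["subtotal"])):
        errors.append(f"Line items sum to {line_total}, subtotal says {inv['subtotal']}")
    expected_gst = (Decimal(str(inv["subtotal"])) * GST_RATE).quantize(Decimal("0.01"))
    if Decimal(str(inv["gst"])) != expected_gst:
        errors.append(f"GST {inv['gst']} does not match expected {expected_gst}")
    if inv["currency"] not in {"SGD", "USD"}:
        errors.append(f"Unexpected currency: {inv['currency']}")
    return errors

# Any failures here would route the invoice to human review (step 4)
print(validate_invoice({
    "subtotal": "100.00", "gst": "9.00", "currency": "SGD",
    "line_items": [{"amount": "40.00"}, {"amount": "60.00"}],
}))
```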

At scale (3,000+ invoices/month): route standard invoices through Textract (~$0.01/page) and complex ones to a Vision LLM (~$0.01-0.03/page). Total: ~$30-90/month vs manual processing cost.

IDC 2024 finding: Organizations migrating from manual PDF data entry to AI-powered IDP report an average cost reduction of 73% and an accuracy improvement from 94% to 98.5% at field level.

Q8 Bedrock vs direct LLM provider endpoint - What's the difference?

| Aspect | Amazon Bedrock | Direct Provider API |
|---|---|---|
| Data privacy | Data stays in your AWS account; VPC endpoints available | Data sent to provider's infrastructure |
| Authentication | AWS IAM roles | API keys, managed separately |
| Multi-model | Single Converse API for all models | One API per provider |
| Billing | Consolidated on AWS bill | Separate per provider |
| Compliance | SOC, HIPAA, PCI-DSS certifications | Varies by provider |
| Guardrails | Built-in content filtering, PII redaction, grounding | Must build your own |
| Monitoring | CloudWatch integration | Must set up your own |
| Latency | Slight overhead from routing | Marginally lower |
| New features | May arrive days/weeks later | Available first |
| Batch discount | 50% off for batch inference | Volume discounts vary |

For regulated finance: Bedrock is strongly preferred, thanks to IAM integration, VPC endpoints (data never leaves your network), compliance certifications, built-in guardrails, and CloudWatch monitoring. The slight latency overhead is negligible vs the security benefits.
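
The authentication row is the difference you feel first in code. A minimal sketch of both paths; the model identifiers are assumptions, and the direct path uses the Anthropic Python SDK as one example provider.

```python
import boto3
import anthropic

# Bedrock: no API key in code -- credentials come from the IAM role/profile,
# and the request stays inside your AWS account boundary.
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")
via_bedrock = bedrock.converse(
    modelId="anthropic.claude-sonnet-4-20250514-v1:0",  # assumed model ID
    messages=[{"role": "user", "content": [{"text": "Summarize Q3 spend."}]}],
)
print(via_bedrock["output"]["message"]["content"][0]["text"])

# Direct provider: a separate API key to issue, rotate, and audit,
# and traffic goes to the provider's own endpoint.
client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
direct = client.messages.create(
    model="claude-sonnet-4-20250514",  # assumed model name
    max_tokens=512,
    messages=[{"role": "user", "content": "Summarize Q3 spend."}],
)
print(direct.content[0].text)
```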

Q9 Grounding vs Relevance in Bedrock Guardrails - What's the difference?

Amazon Bedrock Guardrails has a Contextual Grounding Check with two distinct scores:

| Check | What it measures | Example |
|---|---|---|
| Grounding | Is the response factually accurate based on the source? New info not in the source = ungrounded. | Source: "Tokyo is the capital of Japan." Response: "The capital of Japan is London." → Ungrounded |
| Relevance | Does the response actually answer the user's question? | Query: "Capital of Japan?" Response: "The capital of the UK is London." → Irrelevant (grounded, but answers the wrong question) |

How it works: You provide a grounding source (reference docs), a query (the user's question), and the model response. The system scores both grounding and relevance (0-0.99). You set thresholds; responses scoring below them are blocked as hallucinations.
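
A minimal sketch of that call using the ApplyGuardrail API via boto3; the guardrail ID and version are placeholders for a guardrail you have already created with the contextual grounding policy enabled.

```python
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

response = bedrock.apply_guardrail(
    guardrailIdentifier="your-guardrail-id",  # placeholder for your guardrail
    guardrailVersion="1",                     # placeholder version
    source="OUTPUT",  # we are checking a model response, not user input
    content=[
        {"text": {"text": "Tokyo is the capital of Japan.", "qualifiers": ["grounding_source"]}},
        {"text": {"text": "What is the capital of Japan?", "qualifiers": ["query"]}},
        {"text": {"text": "The capital of Japan is London.", "qualifiers": ["guard_content"]}},
    ],
)

# "GUARDRAIL_INTERVENED" means a grounding or relevance score fell below your threshold;
# per-check scores are in response["assessments"].
print(response["action"])
```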

Limits: Grounding source max 100,000 chars, query max 1,000 chars, response max 5,000 chars.

For AnyCompany: When building a RAG-powered policy Q&A system, grounding checks ensure the AI's answer comes from your policy documents, and relevance checks ensure it actually answers what was asked. Critical for compliance: you can't have the AI inventing regulatory requirements.

Q10 Cursor auto-summarizes after ~10 conversations and loses context. How to improve?

This is a known limitation of all AI coding assistants, caused by context window limits. When conversation history exceeds the model's context window, the tool summarizes older messages, and important details can be lost.

| Strategy | How | Why it helps |
|---|---|---|
| Short, focused conversations | One task per chat session | Prevents context buildup |
| Steering/rules files | Put persistent context in workspace config files | Loaded fresh each message, not accumulated |
| Reference files | Keep data in .md/.csv files and reference them | Tool reads them on demand instead of carrying them in history |
| Duplicate chat | In Cursor, duplicate the chat before context fills up | Preserves full context in the copy |
| Write plans to files | Ask the AI to save its plan to plan.md | Plan survives across sessions |
| Larger context models | Select Claude (200K) or Gemini (1M) | More room before summarization |

In Kiro: Auto-summarization triggers at ~80% context usage. Steering files consume tokens on every message, so keep them concise. Use manual-inclusion steering files so they only load when explicitly referenced.

The numbers problem: Numerical data is token-heavy ("$4,200" alone is about 4 tokens). Keep numerical data in files and ask the AI to read the file, rather than pasting numbers into chat.

Q11 How to save a file in Markdown (.md) format?

Markdown files are just plain text files with a .md extension. You don't need any special software; any text editor works.

๐ŸŽ macOS

Option A: TextEdit

  1. Open TextEdit
  2. Format → Make Plain Text (⇧⌘T). This is critical; otherwise it saves as .rtf
  3. Write or paste your content
  4. File → Save, and name it my-file.md
  5. If macOS warns, click "Use .md"

Option B: Terminal

touch my-file.md      # create an empty Markdown file
open -e my-file.md    # open it in TextEdit to add content

Option C: VS Code / Kiro

  1. File → New File (⌘N)
  2. File → Save As (⇧⌘S) and name it with a .md extension

🪟 Windows

Option A: Notepad

  1. Open Notepad
  2. Write or paste your content
  3. File → Save As
  4. Change "Save as type" to "All Files (*.*)"
  5. Name it my-file.md, set encoding to UTF-8
  6. Click Save

Option B: VS Code / Kiro

  1. File → New File (Ctrl+N)
  2. File → Save As (Ctrl+Shift+S) and name it with a .md extension

From Bedrock Playground (Day 1 Labs): Select all of the AI's response → Copy → Paste into your text editor → Save As with a .md extension.

Why Markdown? As covered in Day 2: Markdown uses 60% fewer tokens than HTML, is readable by both humans and AI, and is the standard format for AI tools (Kiro steering files, SKILL.md, Claude projects).