📨 Day 1 Summary

Generative AI Essentials on AWS - Thank you for attending!

Module | Topic | Key Takeaway
-------|-------|-------------
M0 | Introduction | Workshop goals, 3-day arc, tools overview
M1 | Introducing Generative AI | Foundation models, tokenization, pricing tiers, context windows
M2 | Exploring GenAI Use Cases | Finance-specific use cases: merchant risk, invoice processing, compliance
M3 | Basic Prompt Engineering | Persona prompting lab, temperature
M4 | Responsible AI | Fairness, transparency, hallucination risks, human oversight
M5 | Security & Compliance | Bedrock Guardrails, data privacy, regulatory considerations (MAS, BNM, OJK)
M6 | Implementing GenAI Projects | GenAI Application Lifecycle: from use case selection to deployment
M8 | Wrap-up | Day 1 quiz, preview of Day 2 Prompt Engineering
Lab | What You Built
----|----------------
Builder Lab 1 | Bedrock Playground: explored prompt techniques with Nova Pro (temperature, system prompts, chat mode)
Builder Lab 2 | Bedrock Guardrails: configured content filters, denied topics, and PII redaction
Builder Lab 3 | Capstone Playbook: built a 5-step GenAI implementation plan for merchant risk assessment

Day 2: Prompt Engineering Workshop

  • 4 Pillars Deep Dive - Clarity, Context, Role, Output Framing with AnyCompany examples
  • Chain-of-Thought - Making AI show its work (auditable reasoning for finance)
  • Persona & Multi-Agent Framing - Same data, different perspectives
  • Structured Output & RAG - Grounding AI in YOUR documents
  • Hands-On Labs - 5 Kiro labs + 2 prompt engineering exercises with LLM-as-Judge scoring
  • Deliverable - A reusable prompt template you can take back to your team
📝 Homework: Think of one task your team does every week that involves reading documents, writing reports, or making decisions based on data. That's your prompt engineering candidate for tomorrow.

Questions raised during Day 1, answered with data and sources.

Q1 LLM Selection: What is the guidance on selecting the best model?

There's no single "best" model; the right choice depends on your task complexity, quality threshold, and budget. AWS recommends a tiered approach:

Factor | What to evaluate
-------|------------------
Task quality | Does the model follow structured output instructions? Does it adhere to decision rules?
Cost | Price per 1M tokens varies ~143x across models (Nova Micro $0.035 → Claude Opus $5.00 input)
Latency | Time-to-first-token and total generation time; matters for real-time vs. batch
Context window | How much data the model can "see" at once (128K-1M tokens depending on model)
Compliance | Data residency, encryption, audit trail requirements

Practical strategy: Use a tiered model approach. Route simple tasks (classification, extraction) to cheaper models (Nova Micro/Lite), moderate tasks to mid-tier models (Nova Pro, Haiku), and complex reasoning to frontier models (Claude Sonnet/Opus).
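
To make the tiered approach concrete, here is a minimal routing sketch using the Bedrock Converse API via boto3. The tier-to-model mapping and the model IDs are illustrative assumptions; check the Bedrock console for the exact IDs enabled in your account and region.

```python
import boto3

# Illustrative tier mapping -- replace the model IDs with the ones enabled in your account.
MODEL_BY_TIER = {
    "simple": "amazon.nova-micro-v1:0",    # classification, extraction
    "moderate": "amazon.nova-pro-v1:0",    # summaries, routine drafting
    "complex": "anthropic.claude-sonnet-4-20250514-v1:0",  # multi-step reasoning
}

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def run_task(prompt: str, tier: str) -> str:
    """Send the prompt to the model chosen for this tier and return its text reply."""
    response = bedrock.converse(
        modelId=MODEL_BY_TIER[tier],
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"temperature": 0.2, "maxTokens": 512},
    )
    return response["output"]["message"]["content"][0]["text"]

# A simple extraction goes to the cheapest tier; complex reasoning goes to a frontier model.
print(run_task("Extract the invoice number from: INV-2024-0042, due in 30 days.", tier="simple"))
```

In practice you would put the tier decision behind a small classifier or a rules table keyed on task type.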

New: Bedrock Service Tiers (2025) - Priority (premium, lowest latency for mission-critical workloads), Standard (consistent performance), and Flex (lower cost, higher latency for batch workloads). You can choose the tier per API call.

Q2 How to get web search capabilities on AI agents?

LLMs have a knowledge cutoff date: they don't know about events after their training data ends. Three approaches:

  1. MCP + Web Search Tools - Connect your AI agent to a web search MCP server (e.g., Brave Search MCP, Tavily MCP). The agent searches the web in real time. This is what Kiro uses.
  2. RAG (Retrieval-Augmented Generation) - Index your own documents and let the agent retrieve relevant chunks before generating. Covered in Day 2 Module 4. Best for internal/proprietary information.
  3. Tool Use / Function Calling - Give the agent access to APIs (news feeds, market data, regulatory databases) via tool definitions. The agent decides when to call which API (see the sketch after this list).
For AnyCompany: Combine RAG (for internal policies) with a web search MCP server (for current regulatory updates). In Day 3 Lab 9, you'll see MCP in action, connecting Kiro to a database to query merchant data in plain English.
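
To illustrate approach 3, here is a minimal function-calling sketch with the Bedrock Converse API. The web_search tool name, its schema, the prompt, and the model ID are assumptions for illustration; the actual search call is left as a stub.

```python
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

# Hypothetical tool definition -- the model decides when to request it.
tool_config = {
    "tools": [{
        "toolSpec": {
            "name": "web_search",
            "description": "Search the web for current information.",
            "inputSchema": {"json": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            }},
        }
    }]
}

response = bedrock.converse(
    modelId="amazon.nova-pro-v1:0",  # illustrative; any tool-capable Bedrock model works
    messages=[{"role": "user", "content": [
        {"text": "Summarize the most recent regulatory updates relevant to merchant onboarding."}
    ]}],
    toolConfig=tool_config,
)

# When the model needs fresh data, it returns a toolUse block instead of a final answer.
for block in response["output"]["message"]["content"]:
    if "toolUse" in block:
        print("Model requested a search for:", block["toolUse"]["input"]["query"])
        # Call your search API here, then send the result back in a follow-up
        # converse() call as a toolResult content block so the model can answer.
```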

Q3 TPU from Google vs GPU: Will that produce different quality models?

Short answer: No. The hardware does not affect the quality of the trained model. Given the same architecture, data, and hyperparameters, a model trained on TPUs produces the same quality output as one trained on GPUs.

Aspect | Google TPU | NVIDIA GPU
-------|------------|------------
Model quality | Same | Same
Training speed | Excels at large-scale training (100B+ params) | Excels at flexible workloads
Cost efficiency | Up to 4x better perf-per-dollar (TPU v6e vs H100) | Better for smaller/mixed workloads
Framework support | Best with JAX, TensorFlow | Supports everything
Availability | Google Cloud only | AWS, Azure, GCP, on-prem

Why it matters for you: As a consumer of models via Bedrock, you don't choose the hardware; the provider already trained the model. The quality difference comes from architecture, training data, and RLHF, not from the chip.

Q4 LLM-as-Judge: Which model should be the judge?

Use a judge that is at least as capable as the model being evaluated, and ideally stronger.

Practice | Why
---------|-----
Binary pass/fail over 1-5 scales | Likert scales introduce noise; binary judgments force clarity
Require written critiques | When the judge marks something as failing, it must explain why
Watch for position bias | Swapping response order can shift accuracy by >10%
Watch for verbosity bias | Longer responses score higher regardless of quality
Domain accuracy drops | General tasks: 80%+ agreement; finance/legal: drops to 60-70%
Our workshop approach: We use Claude Sonnet 4 as the judge for both prompt template scoring (Day 2) and agent canvas scoring (Day 3), with structured rubrics (5-6 criteria, binary per criterion with written justification).
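
As a sketch of the judging pattern (the rubric items, prompt wording, and judge model ID below are illustrative, not the workshop's actual rubric):

```python
import boto3
import json

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

# Placeholder criteria -- the workshop rubrics use 5-6 domain-specific criteria.
RUBRIC = [
    "Follows the requested output format",
    "Uses only the data provided in the task",
    "States its reasoning explicitly",
]

def judge(task: str, candidate_answer: str) -> list:
    """Ask a stronger model for a binary pass/fail per criterion, with a written critique."""
    prompt = (
        f"Task given to the model:\n{task}\n\n"
        f"Candidate answer:\n{candidate_answer}\n\n"
        "For each criterion below, return a JSON list of objects with the fields "
        "'criterion', 'pass' (true or false), and 'critique' (one sentence). "
        "Return only the JSON list.\n"
        + "\n".join(f"- {c}" for c in RUBRIC)
    )
    response = bedrock.converse(
        modelId="us.anthropic.claude-sonnet-4-20250514-v1:0",  # illustrative judge model ID
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"temperature": 0.0},  # deterministic judging
    )
    # A production judge would parse defensively; this sketch assumes clean JSON output.
    return json.loads(response["output"]["message"]["content"][0]["text"])
```

To control for position bias when comparing two candidates, run the judge twice with the answers in swapped order and keep only consistent verdicts.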

Q5 When should we use lower vs higher temperature?

Temperature | Behavior | Best for
------------|----------|----------
0.0-0.3 (Low) | Deterministic, picks the most probable tokens | Compliance reports, data extraction, risk ratings, code
0.4-0.7 (Medium) | Balanced creativity and consistency | Business writing, summaries, analysis narratives
0.8-1.5 (High) | Creative, explores less probable tokens | Brainstorming, marketing copy, diverse alternatives

For AnyCompany Finance: default to low temperature (0.0-0.3) for compliance reports, risk ratings, and data extraction; use medium (0.4-0.7) for analysis narratives and summaries; reserve high settings for brainstorming only.

Practical tip: In Kiro and Claude Cowork, you can't directly set temperature. Instead, control it through your prompt: add constraints like "ONLY use provided data" for low-temperature behavior, or "generate 5 diverse alternatives" for high-temperature behavior.
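
When you call a model directly (for example through the Bedrock Converse API rather than through Kiro), temperature is just a request parameter. A minimal sketch with an illustrative model ID and prompts:

```python
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def ask(prompt: str, temperature: float) -> str:
    """One call, with temperature set explicitly in the inference configuration."""
    response = bedrock.converse(
        modelId="amazon.nova-pro-v1:0",  # illustrative model ID
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"temperature": temperature, "maxTokens": 400},
    )
    return response["output"]["message"]["content"][0]["text"]

# Low temperature for a repeatable risk rating, high temperature for brainstorming.
rating = ask("Rate this merchant's risk as LOW, MEDIUM, or HIGH using only the data provided: ...", 0.1)
ideas = ask("Suggest 5 diverse names for an internal fraud-awareness campaign.", 0.9)
```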

Q6 Gemini vs Claude: How do different models produce different quality?

Capability | Claude (Anthropic) | Gemini (Google)
-----------|--------------------|----------------
Coding | Strongest: 80-88% SWE-bench (varies by version) | Good: 64% SWE-bench
Writing quality | Most nuanced, natural prose | Functional but less polished
Scientific reasoning | Strong | Strongest: 94.3% GPQA Diamond
Instruction following | Excellent; precise on complex constraints | Good but less precise
Multimodal | Good image understanding | Strongest: native multimodal, including video
Context window | 200K tokens | Up to 2M tokens
Cost (frontier) | $3-5 per 1M input tokens | Competitive, often cheaper

Key takeaway: No single model wins everything. Test your specific use case on 2-3 models and pick the best value.

Q7 P2P Invoice Data Extraction: How to handle dynamic invoices at scale?

Approach | Accuracy | Handles new layouts? | Maintenance
---------|----------|----------------------|-------------
Template-based OCR | 85-92% | ❌ New template per vendor | High
AI-powered OCR (Textract) | 95-98% (character level) | ✅ Partially | Medium
Multimodal Vision LLM | 94-98.5% (field level) | ✅ Yes, it "looks" at the document | Low, prompt-based

Recommended architecture: Textract for standard layouts (~$0.01/page) + Vision LLM fallback for complex invoices (~$0.01-0.03/page) + programmatic validation + human-in-the-loop. Cost for 3,000 invoices/month: ~$30-90.
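
A sketch of the Vision LLM fallback step, using the Converse API's image input support. The field names, prompt, and model ID are illustrative assumptions; the validation and human-review steps are only noted in comments.

```python
import boto3
import json

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def extract_invoice_fields(image_path: str) -> dict:
    """Send an invoice image to a multimodal model and ask for structured fields."""
    with open(image_path, "rb") as f:
        image_bytes = f.read()

    response = bedrock.converse(
        modelId="amazon.nova-pro-v1:0",  # illustrative multimodal model ID
        messages=[{
            "role": "user",
            "content": [
                {"image": {"format": "png", "source": {"bytes": image_bytes}}},
                {"text": "Extract invoice_number, vendor_name, total_amount, currency, and due_date "
                         "from this invoice. Return only JSON. Use null for any field you cannot read."},
            ],
        }],
        inferenceConfig={"temperature": 0.0},  # extraction wants deterministic output
    )
    # Downstream: validate totals programmatically and route low-confidence or failed
    # extractions to a human reviewer (the human-in-the-loop step).
    return json.loads(response["output"]["message"]["content"][0]["text"])
```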

IDC 2024: Organizations migrating from manual PDF data entry to AI-powered IDP report an average cost reduction of 73% and an accuracy improvement from 94% to 98.5% at field level.

Q8 Bedrock vs direct LLM provider endpoint: What's the difference?

Aspect | Amazon Bedrock | Direct Provider API
-------|----------------|---------------------
Data privacy | Data stays in your AWS account; VPC endpoints available | Data sent to the provider's infrastructure
Authentication | AWS IAM roles | API keys, managed separately
Multi-model | Single Converse API for all models | One API per provider
Compliance | SOC, HIPAA, PCI-DSS certifications | Varies by provider
Guardrails | Built-in content filtering, PII redaction, grounding | Must build your own
Monitoring | CloudWatch integration | Must set up your own
Batch discount | 50% off for batch inference | Volume discounts vary
For regulated finance: Bedrock is strongly preferred for its IAM integration, VPC endpoints (data never leaves your network), compliance certifications, built-in guardrails, and CloudWatch monitoring.
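
A sketch of the two differences most relevant here: the single Converse API and an attached guardrail. The guardrail identifier, version, and model ID are placeholders you would replace with your own (for example, the guardrail built in Builder Lab 2).

```python
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

# The same converse() call works for any Bedrock model -- only modelId changes.
response = bedrock.converse(
    modelId="amazon.nova-pro-v1:0",  # swap in a Claude or Llama model ID without other code changes
    messages=[{"role": "user", "content": [{"text": "Summarize our merchant onboarding policy."}]}],
    guardrailConfig={
        "guardrailIdentifier": "your-guardrail-id",  # placeholder
        "guardrailVersion": "1",                     # placeholder
    },
)

print(response["output"]["message"]["content"][0]["text"])
# stopReason is "guardrail_intervened" when the guardrail blocks or masks content.
print(response["stopReason"])
```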

Q9 Grounding vs Relevance in Bedrock Guardrails

Check | What it measures | Example
------|------------------|--------
Grounding | Is the response factually accurate based on the source? | Source: "Tokyo is the capital of Japan." Response: "The capital is London." → Ungrounded
Relevance | Does the response actually answer the user's question? | Query: "Capital of Japan?" Response: "The capital of the UK is London." → Irrelevant

You provide a grounding source, a query, and the model response. The system scores both checks (0-0.99). Set thresholds; responses scoring below them are blocked as hallucinations.

For AnyCompany: When building a RAG-powered policy Q&A, grounding checks ensure answers come from your documents, and relevance checks ensure they answer what was asked. Critical for compliance.
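
A minimal sketch of a standalone grounding/relevance check with the ApplyGuardrail API, reusing the Tokyo/London example from the table above. The guardrail identifier and version are placeholders, and the thresholds themselves are configured on the guardrail.

```python
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

response = bedrock.apply_guardrail(
    guardrailIdentifier="your-guardrail-id",  # placeholder
    guardrailVersion="1",                     # placeholder
    source="OUTPUT",  # checking a model response rather than user input
    content=[
        {"text": {"text": "Tokyo is the capital of Japan.", "qualifiers": ["grounding_source"]}},
        {"text": {"text": "What is the capital of Japan?", "qualifiers": ["query"]}},
        {"text": {"text": "The capital is London.", "qualifiers": ["guard_content"]}},
    ],
)

# "GUARDRAIL_INTERVENED" means the response fell below your grounding or relevance threshold.
print(response["action"])
for assessment in response.get("assessments", []):
    for f in assessment.get("contextualGroundingPolicy", {}).get("filters", []):
        print(f["type"], f["score"], f["action"])  # separate GROUNDING and RELEVANCE scores
```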

Q10 AI tools auto-summarize and lose context: How to improve?

All AI chat tools eventually hit context window limits, often after 10-15 substantial messages. When the history exceeds the limit, the tool summarizes older messages, and details can be lost.

Strategy | How | Why it helps
---------|-----|--------------
Short, focused conversations | One task per chat session | Prevents context buildup
Steering/rules files | Put persistent context in workspace config files | Loaded fresh with each message
Reference files | Keep data in .md/.csv files and reference them | Tool reads them on demand
Write plans to files | Ask the AI to save its plan to plan.md | Plan survives across sessions
Larger-context models | Select Claude (200K) or Gemini (1M) | More room before summarization
In Kiro: Auto-summarization triggers at ~80% context usage. Steering files consume tokens on every message, so keep them concise. Use manual-inclusion steering files so they only load when explicitly referenced.

Q11 How to save a file in Markdown (.md) format?

Markdown files are just plain text files with a .md extension. Any text editor works.

๐ŸŽ macOS

Option A: TextEdit

  1. Open TextEdit
  2. Format → Make Plain Text (⇧⌘T)
  3. Write or paste content
  4. File → Save, and name it my-file.md

Option B: VS Code / Kiro

  1. File → New File (⌘N)
  2. File → Save As (⇧⌘S), and give the name a .md extension

🪟 Windows

Option A: Notepad

  1. Open Notepad
  2. File → Save As
  3. Change the file type to "All Files (*.*)"
  4. Name it my-file.md and select UTF-8 encoding

Option B: VS Code / Kiro

  1. File → New File (Ctrl+N)
  2. File → Save As (Ctrl+Shift+S), and give the name a .md extension
Why Markdown? As covered in Day 2: Markdown uses roughly 60% fewer tokens than HTML, is readable by both humans and AI, and is the standard format for AI tools (Kiro steering files, SKILL.md, Claude projects).