# Generative AI Essentials on AWS - Thank you for attending!
## 📚 Resources
## 📋 What We Covered Today
| Module | Topic | Key Takeaway |
|---|---|---|
| M0 | Introduction | Workshop goals, 3-day arc, tools overview |
| M1 | Introducing Generative AI | Foundation models, tokenization, pricing tiers, context windows |
| M2 | Exploring GenAI Use Cases | Finance-specific use cases: merchant risk, invoice processing, compliance |
| M3 | Basic Prompt Engineering | Persona prompting lab, temperature |
| M4 | Responsible AI | Fairness, transparency, hallucination risks, human oversight |
| M5 | Security & Compliance | Bedrock Guardrails, data privacy, regulatory considerations (MAS, BNM, OJK) |
| M6 | Implementing GenAI Projects | GenAI Application Lifecycle, from use case selection to deployment |
| M8 | Wrap-up | Day 1 quiz, preview of Day 2 Prompt Engineering |
## 🧪 Hands-On Labs Completed
| Lab | What You Built |
|---|---|
| BuilderLab 1 | Bedrock Playground: explored prompt techniques with Nova Pro (temperature, system prompts, chat mode) |
| BuilderLab 2 | Bedrock Guardrails: configured content filters, denied topics, and PII redaction |
| BuilderLab 3 | Capstone Playbook: built a 5-step GenAI implementation plan for merchant risk assessment |
## ⚡ Interactive Explainers (Self-Paced Reference)
## 🔮 Looking Ahead
## 💬 Participant Q&A - Notable Discussions
Questions raised during Day 1, answered with data and sources.
### Q: Which model should we use?

There's no single "best" model - the right choice depends on your task complexity, quality threshold, and budget. Factors to weigh:
| Factor | What to evaluate |
|---|---|
| Task quality | Does the model follow structured output instructions? Does it adhere to decision rules? |
| Cost | Price per 1M tokens varies 143x across models (Nova Micro $0.035 to Claude Opus $5.00 input) |
| Latency | Time-to-first-token and total generation time; matters for real-time vs batch |
| Context window | How much data can the model "see" at once (128K-1M tokens depending on model) |
| Compliance | Data residency, encryption, audit trail requirements |
Practical strategy: Use a tiered model approach: route simple tasks (classification, extraction) to cheaper models (Nova Micro/Lite), moderate tasks to mid-tier (Nova Pro, Haiku), and complex reasoning to frontier models (Claude Sonnet/Opus).
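A minimal sketch of such a router, assuming boto3 and the Bedrock Converse API. The model IDs and tier mapping are illustrative; check the Bedrock console for the exact IDs enabled in your region and account.

```python
import boto3

bedrock = boto3.client("bedrock-runtime")

# Illustrative tier mapping -- substitute the model IDs enabled in your account.
MODEL_TIERS = {
    "simple":   "amazon.nova-micro-v1:0",                     # classification, extraction
    "moderate": "amazon.nova-pro-v1:0",                       # summaries, drafting
    "complex":  "anthropic.claude-3-5-sonnet-20240620-v1:0",  # multi-step reasoning
}

def invoke(tier: str, prompt: str) -> str:
    """Route a prompt to the cheapest model tier that can handle the task."""
    response = bedrock.converse(
        modelId=MODEL_TIERS[tier],
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    return response["output"]["message"]["content"][0]["text"]

# A simple extraction task goes to the cheapest tier:
print(invoke("simple", "Extract the merchant name: 'Invoice from Acme Pte Ltd'"))
```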
### Q: How can models answer questions about recent events?

LLMs have a knowledge cutoff date - they don't know about events after their training data ends. Three approaches work around this:

1. Retrieval-augmented generation (RAG): retrieve current documents and include them in the prompt.
2. Tool use: let the model call live sources (web search, internal APIs) at inference time.
3. Model updates: fine-tune or move to a newer model version with a later cutoff.
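A sketch of the first approach: fetch a current document yourself and inject it into the prompt so the model can answer past its cutoff. `fetch_latest_policy()` is a hypothetical placeholder for your retrieval step, and the model ID is illustrative.

```python
import boto3

bedrock = boto3.client("bedrock-runtime")

def answer_from_context(question: str, context: str) -> str:
    """Ground the model in documents newer than its training cutoff."""
    prompt = (
        "Answer using ONLY the context below. If the answer is not in the "
        f"context, say you don't know.\n\nContext:\n{context}\n\nQuestion: {question}"
    )
    response = bedrock.converse(
        modelId="amazon.nova-pro-v1:0",  # illustrative model ID
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    return response["output"]["message"]["content"][0]["text"]

# context = fetch_latest_policy()  # hypothetical retrieval step (RAG index, search, API)
# print(answer_from_context("What changed in the 2025 policy?", context))
```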
### Q: Does training on TPUs vs GPUs change model quality?

Short answer: No. The hardware does not affect the quality of the trained model. Given the same architecture, data, and hyperparameters, a model trained on TPUs produces the same quality output as one trained on GPUs.
| Aspect | Google TPU | NVIDIA GPU |
|---|---|---|
| Model quality | Same | Same |
| Training speed | Excels at large-scale (100B+ params) | Excels at flexible workloads |
| Cost efficiency | Up to 4x better perf-per-dollar (TPU v6e vs H100) | Better for smaller/mixed workloads |
| Framework support | Best with JAX, TensorFlow | Supports everything |
| Availability | Google Cloud only | AWS, Azure, GCP, on-prem |
Why it matters for you: As a consumer of models via Bedrock, you don't choose the hardware; the provider already trained the model. The quality difference comes from architecture, training data, and RLHF, not the chip.
### Q: How do we evaluate outputs with an LLM-as-judge?

Use a stronger model than the one being judged - the judge should be at least as capable as the model it evaluates.
| Practice | Why |
|---|---|
| Binary pass/fail over 1-5 scales | Likert scales introduce noise; binary judgments force clarity |
| Require written critiques | When the judge marks something as failing, it must explain why |
| Watch for position bias | Swapping response order can shift accuracy by >10% |
| Watch for verbosity bias | Longer responses score higher regardless of quality |
| Expect lower agreement in specialist domains | Judge-human agreement runs 80%+ on general tasks but drops to 60-70% in finance/legal |
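A minimal judge sketch applying these practices (binary verdict, mandatory critique, temperature 0). The prompt wording and model ID are assumptions, not a prescribed template.

```python
import json
import boto3

bedrock = boto3.client("bedrock-runtime")

JUDGE_TEMPLATE = (
    "You are grading a model response.\n"
    "Task: {task}\nResponse: {response}\n\n"
    'Return JSON only: {{"verdict": "pass" or "fail", "critique": "<why>"}}'
)

def judge(task: str, response_text: str) -> dict:
    """Binary pass/fail with a required written critique."""
    result = bedrock.converse(
        # The judge should be at least as capable as the model under test.
        modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
        messages=[{"role": "user", "content": [
            {"text": JUDGE_TEMPLATE.format(task=task, response=response_text)}]}],
        inferenceConfig={"temperature": 0.0},  # deterministic grading
    )
    # Assumes the judge returns bare JSON; add parsing guards in production.
    return json.loads(result["output"]["message"]["content"][0]["text"])
```

For pairwise comparisons, run each pair in both orders and keep only verdicts that agree; this controls the position bias noted above.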
### Q: How should we set temperature?

| Temperature | Behavior | Best for |
|---|---|---|
| 0.0-0.3 (Low) | Deterministic, most probable tokens | Compliance reports, data extraction, risk ratings, code |
| 0.4-0.7 (Medium) | Balanced creativity and consistency | Business writing, summaries, analysis narratives |
| 0.8-1.5 (High) | Creative, explores less probable tokens | Brainstorming, marketing copy, diverse alternatives |
For AnyCompany Finance: risk ratings, compliance drafts, and data extraction belong at low temperature (0.0-0.3); client communications and report narratives can run at medium settings.
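Setting temperature per request is a one-line change in the Converse API's `inferenceConfig`; a sketch with an illustrative model ID:

```python
import boto3

bedrock = boto3.client("bedrock-runtime")

# Low temperature for a deterministic, compliance-style task.
response = bedrock.converse(
    modelId="amazon.nova-pro-v1:0",  # illustrative model ID
    messages=[{"role": "user", "content": [
        {"text": "Rate this merchant's risk as LOW, MEDIUM, or HIGH: ..."}]}],
    inferenceConfig={"temperature": 0.2, "maxTokens": 200},
)
print(response["output"]["message"]["content"][0]["text"])
```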
### Q: How do Claude and Gemini compare?

| Capability | Claude (Anthropic) | Gemini (Google) |
|---|---|---|
| Coding | Strongest: 80-88% SWE-bench (varies by version) | Good: 64% SWE-bench |
| Writing quality | Most nuanced, natural prose | Functional but less polished |
| Scientific reasoning | Strong | Strongest: 94.3% GPQA Diamond |
| Instruction following | Excellent: precise on complex constraints | Good but less precise |
| Multimodal | Good image understanding | Strongest: native multimodal, video |
| Context window | 200K tokens | Up to 2M tokens |
| Cost (frontier) | $3-5/1M input tokens | Competitive, often cheaper |
Key takeaway: No single model wins everything. Test your specific use case on 2-3 models and pick the best value.
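One way to run such a bake-off, assuming boto3 and illustrative model IDs; Bedrock's single Converse API makes the loop trivial:

```python
import boto3

bedrock = boto3.client("bedrock-runtime")

CANDIDATES = [  # illustrative; use the models enabled in your account
    "anthropic.claude-3-5-sonnet-20240620-v1:0",
    "amazon.nova-pro-v1:0",
]
prompt = "Summarize this merchant onboarding policy in three bullets: ..."

# Same request shape for every provider -- only the modelId changes.
for model_id in CANDIDATES:
    response = bedrock.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    text = response["output"]["message"]["content"][0]["text"]
    print(f"--- {model_id} ---\n{text[:300]}\n")
```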
### Q: What's the best approach for automated invoice processing?

| Approach | Accuracy | Handles new layouts? | Maintenance |
|---|---|---|---|
| Template-based OCR | 85-92% | No: new template per vendor | High |
| AI-powered OCR (Textract) | 95-98% character | Partially | Medium |
| Multimodal Vision LLM | 94-98.5% field | Yes: "looks" at the doc | Low: prompt-based |
Recommended architecture: Textract for standard layouts (~$0.01/page) + Vision LLM fallback for complex invoices (~$0.01-0.03/page) + programmatic validation + human-in-the-loop. Cost for 3,000 invoices/month: ~$30-90.
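A sketch of that routing under stated assumptions: Textract's AnalyzeExpense runs first, and a hypothetical ~90% confidence threshold decides when to fall back to a multimodal model (model ID illustrative). Programmatic validation and the human-in-the-loop step sit downstream of both paths.

```python
import boto3

textract = boto3.client("textract")
bedrock = boto3.client("bedrock-runtime")

def extract_invoice(image_bytes: bytes):
    """Textract first; fall back to a vision LLM on low-confidence fields."""
    doc = textract.analyze_expense(Document={"Bytes": image_bytes})
    fields = doc["ExpenseDocuments"][0]["SummaryFields"]
    # Hypothetical rule: fall back if any field is below ~90% confidence.
    if any(f["ValueDetection"]["Confidence"] < 90.0
           for f in fields if "ValueDetection" in f):
        return vision_llm_extract(image_bytes)
    return {f["Type"]["Text"]: f["ValueDetection"]["Text"]
            for f in fields if "ValueDetection" in f}

def vision_llm_extract(image_bytes: bytes) -> str:
    """Let a multimodal model 'look' at the invoice directly."""
    response = bedrock.converse(
        modelId="amazon.nova-pro-v1:0",  # illustrative multimodal model
        messages=[{"role": "user", "content": [
            {"image": {"format": "png", "source": {"bytes": image_bytes}}},
            {"text": "Extract vendor, invoice date, and total as JSON."},
        ]}],
    )
    return response["output"]["message"]["content"][0]["text"]
```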
### Q: Why use Bedrock instead of calling provider APIs directly?

| Aspect | Amazon Bedrock | Direct Provider API |
|---|---|---|
| Data privacy | Data stays in your AWS account. VPC endpoints available. | Data sent to provider's infrastructure. |
| Authentication | AWS IAM roles | API keys, managed separately |
| Multi-model | Single Converse API for all models | One API per provider |
| Compliance | SOC, HIPAA, PCI-DSS certifications | Varies by provider |
| Guardrails | Built-in content filtering, PII redaction, grounding | Must build your own |
| Monitoring | CloudWatch integration | Must set up your own |
| Batch discount | 50% off for batch inference | Volume discounts vary |
### Q: How do contextual grounding checks work?

| Check | What it measures | Example |
|---|---|---|
| Grounding | Is the response factually accurate based on the source? | Source: "Tokyo is capital of Japan." Response: "Capital is London." → Ungrounded |
| Relevance | Does the response actually answer the user's question? | Query: "Capital of Japan?" Response: "Capital of UK is London." → Irrelevant |
You provide a grounding source, a query, and the model response. The system scores the response on both checks (0-0.99). Set thresholds; responses scoring below them are blocked as hallucinations.
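A sketch of running that check standalone via the ApplyGuardrail API, reusing a guardrail like the one from BuilderLab 2; the identifier and version below are placeholders.

```python
import boto3

bedrock = boto3.client("bedrock-runtime")

result = bedrock.apply_guardrail(
    guardrailIdentifier="your-guardrail-id",  # placeholder
    guardrailVersion="1",                     # placeholder
    source="OUTPUT",
    content=[
        {"text": {"text": "Tokyo is the capital of Japan.",
                  "qualifiers": ["grounding_source"]}},
        {"text": {"text": "What is the capital of Japan?",
                  "qualifiers": ["query"]}},
        {"text": {"text": "The capital is London."}},  # response to check
    ],
)
print(result["action"])  # "GUARDRAIL_INTERVENED" when a check fails
```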
### Q: Why do AI chat tools lose details in long conversations?

All AI chat tools eventually hit context window limits, often after 10-15 substantial messages. When history exceeds the limit, the tool summarizes older messages - and details can be lost.
| Strategy | How | Why it helps |
|---|---|---|
| Short, focused conversations | One task per chat session | Prevents context buildup |
| Steering/rules files | Put persistent context in workspace config files | Loaded fresh each message |
| Reference files | Keep data in .md/.csv files, reference them | Tool reads on demand |
| Write plans to files | Ask AI to save plan to plan.md | Plan survives across sessions |
| Larger context models | Select Claude 200K or Gemini 1M | More room before summarization |
### Q: How do I create a markdown file?

Markdown files are just plain text files with a .md extension. Any text editor works.
On macOS:

- Option A: TextEdit - save the file as `my-file.md`
- Option B: VS Code / Kiro - create a new file and save it with a `.md` extension

On Windows:

- Option A: Notepad - save as `my-file.md` with encoding UTF-8
- Option B: VS Code / Kiro - create a new file and save it with a `.md` extension