# Generative AI Essentials on AWS - Thank you for attending!
## 📚 Resources
## 📋 What We Covered Today
| Module | Topic | Key Takeaway |
|---|---|---|
| M0 | Introduction | Workshop goals, 3-day arc, tools overview |
| M1 | Introducing Generative AI | Foundation models, tokenization, pricing tiers, context windows |
| M2 | Exploring GenAI Use Cases | Finance-specific use cases: merchant risk, invoice processing, compliance |
| M3 | Basic Prompt Engineering | Persona prompting lab, temperature |
| M4 | Responsible AI | Fairness, transparency, hallucination risks, human oversight |
| M5 | Security & Compliance | Bedrock Guardrails, data privacy, regulatory considerations (MAS, BNM, OJK) |
| M6 | Implementing GenAI Projects | GenAI Application Lifecycle, from use case selection to deployment |
| M8 | Wrap-up | Day 1 quiz, preview of Day 2 Prompt Engineering |
## 🧪 Hands-On Labs Completed
| Lab | What You Built |
|---|---|
| BuilderLab 1 | Bedrock Playground: explored prompt techniques with Nova Pro (temperature, system prompts, chat mode) |
| BuilderLab 2 | Bedrock Guardrails: configured content filters, denied topics, and PII redaction |
| BuilderLab 3 | Capstone Playbook: built a 5-step GenAI implementation plan for merchant risk assessment |
## ⚡ Interactive Explainers (Self-Paced Reference)
## 🔮 Looking Ahead
## 💬 Participant Q&A - Notable Discussions
Questions raised during Day 1, answered with data and sources.
### Q: Which model should we use?

There's no single "best" model - the right choice depends on your task complexity, quality threshold, and budget. Factors to weigh:
| Factor | What to evaluate |
|---|---|
| Task quality | Does the model follow structured output instructions? Does it adhere to decision rules? |
| Cost | Price per 1M tokens varies 143x across models (Nova Micro $0.035 to Claude Opus $5.00 input) |
| Latency | Time-to-first-token and total generation time; matters for real-time vs batch |
| Context window | How much data can the model "see" at once (128K-1M tokens depending on model) |
| Compliance | Data residency, encryption, audit trail requirements |
Practical strategy: Use a tiered model approach: route simple tasks (classification, extraction) to cheaper models (Nova Micro/Lite), moderate tasks to mid-tier (Nova Pro, Haiku), and complex reasoning to frontier models (Claude Sonnet/Opus).
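A minimal sketch of such a router, assuming boto3 and the Bedrock Converse API. The model IDs and tier mapping are illustrative; check the Bedrock console for the exact IDs enabled in your region and account.

```python
import boto3

bedrock = boto3.client("bedrock-runtime")

# Illustrative tier mapping -- substitute the model IDs enabled in your account.
MODEL_TIERS = {
    "simple":   "amazon.nova-micro-v1:0",                     # classification, extraction
    "moderate": "amazon.nova-pro-v1:0",                       # summaries, drafting
    "complex":  "anthropic.claude-3-5-sonnet-20240620-v1:0",  # multi-step reasoning
}

def invoke(tier: str, prompt: str) -> str:
    """Route a prompt to the cheapest model tier that can handle the task."""
    response = bedrock.converse(
        modelId=MODEL_TIERS[tier],
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    return response["output"]["message"]["content"][0]["text"]

# A simple extraction task goes to the cheapest tier:
print(invoke("simple", "Extract the merchant name: 'Invoice from Acme Pte Ltd'"))
```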
### Q: How can models answer questions about recent events?

LLMs have a knowledge cutoff date - they don't know about events after their training data ends. Three approaches work around this:

1. Retrieval-augmented generation (RAG): retrieve current documents and include them in the prompt.
2. Tool use: let the model call live sources (web search, internal APIs) at inference time.
3. Model updates: fine-tune or move to a newer model version with a later cutoff.
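A sketch of the first approach: fetch a current document yourself and inject it into the prompt so the model can answer past its cutoff. `fetch_latest_policy()` is a hypothetical placeholder for your retrieval step, and the model ID is illustrative.

```python
import boto3

bedrock = boto3.client("bedrock-runtime")

def answer_from_context(question: str, context: str) -> str:
    """Ground the model in documents newer than its training cutoff."""
    prompt = (
        "Answer using ONLY the context below. If the answer is not in the "
        f"context, say you don't know.\n\nContext:\n{context}\n\nQuestion: {question}"
    )
    response = bedrock.converse(
        modelId="amazon.nova-pro-v1:0",  # illustrative model ID
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    return response["output"]["message"]["content"][0]["text"]

# context = fetch_latest_policy()  # hypothetical retrieval step (RAG index, search, API)
# print(answer_from_context("What changed in the 2025 policy?", context))
```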
### Q: Does training on TPUs vs GPUs change model quality?

Short answer: No. The hardware does not affect the quality of the trained model. Given the same architecture, data, and hyperparameters, a model trained on TPUs produces the same quality output as one trained on GPUs.
| Aspect | Google TPU | NVIDIA GPU |
|---|---|---|
| Model quality | Same | Same |
| Training speed | Excels at large-scale (100B+ params) | Excels at flexible workloads |
| Cost efficiency | Up to 4x better perf-per-dollar (TPU v6e vs H100) | Better for smaller/mixed workloads |
| Framework support | Best with JAX, TensorFlow | Supports everything |
| Availability | Google Cloud only | AWS, Azure, GCP, on-prem |
Why it matters for you: As a consumer of models via Bedrock, you don't choose the hardware; the provider already trained the model. The quality difference comes from architecture, training data, and RLHF, not the chip.
### Q: How do we evaluate outputs with an LLM-as-judge?

Use a stronger model than the one being judged - the judge should be at least as capable as the model it evaluates.
| Practice | Why |
|---|---|
| Binary pass/fail over 1-5 scales | Likert scales introduce noise; binary judgments force clarity |
| Require written critiques | When the judge marks something as failing, it must explain why |
| Watch for position bias | Swapping response order can shift accuracy by >10% |
| Watch for verbosity bias | Longer responses score higher regardless of quality |
| Expect lower agreement in specialist domains | Judge-human agreement runs 80%+ on general tasks but drops to 60-70% in finance/legal |
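A minimal judge sketch applying these practices (binary verdict, mandatory critique, temperature 0). The prompt wording and model ID are assumptions, not a prescribed template.

```python
import json
import boto3

bedrock = boto3.client("bedrock-runtime")

JUDGE_TEMPLATE = (
    "You are grading a model response.\n"
    "Task: {task}\nResponse: {response}\n\n"
    'Return JSON only: {{"verdict": "pass" or "fail", "critique": "<why>"}}'
)

def judge(task: str, response_text: str) -> dict:
    """Binary pass/fail with a required written critique."""
    result = bedrock.converse(
        # The judge should be at least as capable as the model under test.
        modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
        messages=[{"role": "user", "content": [
            {"text": JUDGE_TEMPLATE.format(task=task, response=response_text)}]}],
        inferenceConfig={"temperature": 0.0},  # deterministic grading
    )
    # Assumes the judge returns bare JSON; add parsing guards in production.
    return json.loads(result["output"]["message"]["content"][0]["text"])
```

For pairwise comparisons, run each pair in both orders and keep only verdicts that agree; this controls the position bias noted above.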
### Q: How should we set temperature?

| Temperature | Behavior | Best for |
|---|---|---|
| 0.0-0.3 (Low) | Deterministic, most probable tokens | Compliance reports, data extraction, risk ratings, code |
| 0.4-0.7 (Medium) | Balanced creativity and consistency | Business writing, summaries, analysis narratives |
| 0.8-1.5 (High) | Creative, explores less probable tokens | Brainstorming, marketing copy, diverse alternatives |
For AnyCompany Finance: risk ratings, compliance drafts, and data extraction belong at low temperature (0.0-0.3); client communications and report narratives can run at medium settings.
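Setting temperature per request is a one-line change in the Converse API's `inferenceConfig`; a sketch with an illustrative model ID:

```python
import boto3

bedrock = boto3.client("bedrock-runtime")

# Low temperature for a deterministic, compliance-style task.
response = bedrock.converse(
    modelId="amazon.nova-pro-v1:0",  # illustrative model ID
    messages=[{"role": "user", "content": [
        {"text": "Rate this merchant's risk as LOW, MEDIUM, or HIGH: ..."}]}],
    inferenceConfig={"temperature": 0.2, "maxTokens": 200},
)
print(response["output"]["message"]["content"][0]["text"])
```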
### Q: How do Claude and Gemini compare?

| Capability | Claude (Anthropic) | Gemini (Google) |
|---|---|---|
| Coding | Strongest: 80-88% SWE-bench (varies by version) | Good: 64% SWE-bench |
| Writing quality | Most nuanced, natural prose | Functional but less polished |
| Scientific reasoning | Strong | Strongest: 94.3% GPQA Diamond |
| Instruction following | Excellent: precise on complex constraints | Good but less precise |
| Multimodal | Good image understanding | Strongest: native multimodal, video |
| Context window | 200K tokens | Up to 2M tokens |
| Cost (frontier) | $3-5/1M input tokens | Competitive, often cheaper |
Key takeaway: No single model wins everything. Test your specific use case on 2-3 models and pick the best value.
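One way to run such a bake-off, assuming boto3 and illustrative model IDs; Bedrock's single Converse API makes the loop trivial:

```python
import boto3

bedrock = boto3.client("bedrock-runtime")

CANDIDATES = [  # illustrative; use the models enabled in your account
    "anthropic.claude-3-5-sonnet-20240620-v1:0",
    "amazon.nova-pro-v1:0",
]
prompt = "Summarize this merchant onboarding policy in three bullets: ..."

# Same request shape for every provider -- only the modelId changes.
for model_id in CANDIDATES:
    response = bedrock.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    text = response["output"]["message"]["content"][0]["text"]
    print(f"--- {model_id} ---\n{text[:300]}\n")
```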
### Q: What's the best approach for automated invoice processing?

| Approach | Accuracy | Handles new layouts? | Maintenance |
|---|---|---|---|
| Template-based OCR | 85-92% | No: new template per vendor | High |
| AI-powered OCR (Textract) | 95-98% character | Partially | Medium |
| Multimodal Vision LLM | 94-98.5% field | Yes: "looks" at the doc | Low: prompt-based |
Recommended architecture: Textract for standard layouts (~$0.01/page) + Vision LLM fallback for complex invoices (~$0.01-0.03/page) + programmatic validation + human-in-the-loop. Cost for 3,000 invoices/month: ~$30-90.
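A sketch of that routing under stated assumptions: Textract's AnalyzeExpense runs first, and a hypothetical ~90% confidence threshold decides when to fall back to a multimodal model (model ID illustrative). Programmatic validation and the human-in-the-loop step sit downstream of both paths.

```python
import boto3

textract = boto3.client("textract")
bedrock = boto3.client("bedrock-runtime")

def extract_invoice(image_bytes: bytes):
    """Textract first; fall back to a vision LLM on low-confidence fields."""
    doc = textract.analyze_expense(Document={"Bytes": image_bytes})
    fields = doc["ExpenseDocuments"][0]["SummaryFields"]
    # Hypothetical rule: fall back if any field is below ~90% confidence.
    if any(f["ValueDetection"]["Confidence"] < 90.0
           for f in fields if "ValueDetection" in f):
        return vision_llm_extract(image_bytes)
    return {f["Type"]["Text"]: f["ValueDetection"]["Text"]
            for f in fields if "ValueDetection" in f}

def vision_llm_extract(image_bytes: bytes) -> str:
    """Let a multimodal model 'look' at the invoice directly."""
    response = bedrock.converse(
        modelId="amazon.nova-pro-v1:0",  # illustrative multimodal model
        messages=[{"role": "user", "content": [
            {"image": {"format": "png", "source": {"bytes": image_bytes}}},
            {"text": "Extract vendor, invoice date, and total as JSON."},
        ]}],
    )
    return response["output"]["message"]["content"][0]["text"]
```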
### Q: Why use Bedrock instead of calling provider APIs directly?

| Aspect | Amazon Bedrock | Direct Provider API |
|---|---|---|
| Data privacy | Data stays in your AWS account. VPC endpoints available. | Data sent to provider's infrastructure. |
| Authentication | AWS IAM roles | API keys, managed separately |
| Multi-model | Single Converse API for all models | One API per provider |
| Compliance | SOC, HIPAA, PCI-DSS certifications | Varies by provider |
| Guardrails | Built-in content filtering, PII redaction, grounding | Must build your own |
| Monitoring | CloudWatch integration | Must set up your own |
| Batch discount | 50% off for batch inference | Volume discounts vary |
### Q: How do contextual grounding checks work?

| Check | What it measures | Example |
|---|---|---|
| Grounding | Is the response factually accurate based on the source? | Source: "Tokyo is capital of Japan." Response: "Capital is London." → Ungrounded |
| Relevance | Does the response actually answer the user's question? | Query: "Capital of Japan?" Response: "Capital of UK is London." → Irrelevant |
You provide a grounding source, a query, and the model response. The system scores the response on both checks (0-0.99). Set thresholds; responses scoring below them are blocked as hallucinations.
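A sketch of running that check standalone via the ApplyGuardrail API, reusing a guardrail like the one from BuilderLab 2; the identifier and version below are placeholders.

```python
import boto3

bedrock = boto3.client("bedrock-runtime")

result = bedrock.apply_guardrail(
    guardrailIdentifier="your-guardrail-id",  # placeholder
    guardrailVersion="1",                     # placeholder
    source="OUTPUT",
    content=[
        {"text": {"text": "Tokyo is the capital of Japan.",
                  "qualifiers": ["grounding_source"]}},
        {"text": {"text": "What is the capital of Japan?",
                  "qualifiers": ["query"]}},
        {"text": {"text": "The capital is London."}},  # response to check
    ],
)
print(result["action"])  # "GUARDRAIL_INTERVENED" when a check fails
```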
### Q: Why do AI chat tools lose details in long conversations?

All AI chat tools eventually hit context window limits, often after 10-15 substantial messages. When history exceeds the limit, the tool summarizes older messages - and details can be lost.
| Strategy | How | Why it helps |
|---|---|---|
| Short, focused conversations | One task per chat session | Prevents context buildup |
| Steering/rules files | Put persistent context in workspace config files | Loaded fresh each message |
| Reference files | Keep data in .md/.csv files, reference them | Tool reads on demand |
| Write plans to files | Ask AI to save plan to plan.md | Plan survives across sessions |
| Larger context models | Select Claude 200K or Gemini 1M | More room before summarization |
### Q: How do I create a markdown file?

Markdown files are just plain text files with a .md extension. Any text editor works.
On macOS:

- Option A: TextEdit - save the file as `my-file.md`
- Option B: VS Code / Kiro - create a new file and save it with a `.md` extension

On Windows:

- Option A: Notepad - save as `my-file.md` with encoding UTF-8
- Option B: VS Code / Kiro - create a new file and save it with a `.md` extension