AI Evals for PMs

Aki Wijesundara, PhD

AI Founder | Google AI Accelerator Alum

Manu Jayawardana

AI Advisor | Co-Founder & CEO at Krybe

Stop Shipping AI Vibes — Ship AI You Can Defend in Production

Transform AI quality from gut-feel debates into clear ship / hold decisions with evals PMs actually own.

Most AI features look great in demos but fail silently in production — inconsistent answers, edge-case breakage, and slow erosion of trust. The problem isn’t just the model. It’s the lack of a quality system.

This course teaches PMs how to define “good,” catch failures early, and ship AI with confidence using evals, gates, and dashboards.

With AI Evals for PMs, you’ll:

✅ Define quality using a failure taxonomy instead of vague feedback
✅ Build gold sets (real examples + edge cases) that catch failures fast
✅ Run lightweight human review loops without heavy infra
✅ Set clear ship / hold release gates PMs can defend
✅ Detect drift early with an exec-ready quality dashboard
✅ Establish a weekly quality cadence your team can sustain

Week by week, you move from vague “make it better” feedback to clear metrics, focused improvements, and compounding quality gains. Teams using this approach cut failed launches and rollbacks by 30–50%, reduce eval cycles by 40%, and ship iterations 2–3× faster. Structured evals replace debates with decisions, improving trust and post-launch reliability.

What you’ll learn

You’ll create an AI Evals Launch Pack for a real feature you’re shipping.

  • Follow a practical, PM-owned process to continuously evaluate, improve, and ship AI features with confidence.

      • Translate user value into evaluation goals and measurable success criteria

      • Define the right evaluation unit (turn, task, journey) for different AI features

  • Identify why AI features break in production and turn vague feedback into actionable signals.

      • Create a failure taxonomy that captures real user and system breakdowns

      • Separate leading indicators (failure rates, coverage) from lagging indicators (CSAT, trust signals)

  • Build evaluation datasets that catch failures early without waiting for perfect data or heavy infrastructure.

      • Design gold sets using real examples and targeted edge cases

      • Run lightweight human review loops that scale with team capacity

Learn directly from Aki & Manu

Aki Wijesundara, PhD

AI Founder | Educator | Google AI Accelerator Alum

Previous students from Google, Meta, OpenAI, Amazon Web Services, and NVIDIA

Manu Jayawardana

AI Advisor | Co-Founder & CEO at Krybe | Co-Founder of Snapdrum

Who this course is for

  • Product managers and leaders shipping LLM features who want to replace gut-feel launches with a repeatable, production-grade quality system.

  • PMs who know LLM basics and want a practical, data-driven way to define quality, evaluate behavior, and make ship vs hold decisions.

  • Teams responsible for trust and reliability who want feedback loops that continuously improve AI quality as models and user needs evolve.

What's included

Live sessions

Learn directly from Aki Wijesundara, PhD & Manu Jayawardana in a real-time, interactive format.

Hands-On Customized Resources

Get access to a customized set of resources.

Lifetime Discord Community

Private Discord community for peer reviews, job leads, and ongoing support.

Guest Sessions

Webinar sessions hosted with guests from the instructors’ industry network.

Certificate of completion

Showcase your skills to clients, employers, and your LinkedIn network.

Maven Guarantee

This course is backed by the Maven Guarantee. Students are eligible for a full refund up until the halfway point of the course.

Course syllabus

Week 1

Jan 25

    Lecture 1: Foundations of AI Quality & Evaluation

    6 items

    Resources

    1 item

Week 2

Jan 26–Feb 1

    Lecture 2: Instrumentation, Feedback & Segmentation

    6 items

    Resources

    1 item

Schedule

Live sessions

6 hrs

Optional bonus session: Scaling AI quality across teams

6 Prerecorded Lectures

6 hrs

Short, focused videos that break down the complete AI evaluation framework, designed for quick learning and easy rewatching as you apply it in production.

6+ Office Hour Q&As

6 hrs

Open office hours for deep dives, debugging help, and personalized feedback.