AI Evals For Engineers & PMs

New · 4 Weeks · Cohort-based Course

Learn proven approaches for quickly improving AI applications. Build AI that works better than the competition, regardless of the use case.

This course is popular

50+ people enrolled last week.

Previously at

GitHub
Airbnb
Google

Course overview

Eliminate the guesswork of building AI applications with data-driven approaches.

Do you catch yourself asking any of the following questions while building AI applications?


1. How do I test applications when the outputs are stochastic and require subjective judgements?


2. If I change the prompt, how do I know I'm not breaking something else?


3. Where should I focus my engineering efforts? Do I need to test everything?


4. What if I have no data or customers? Where do I start?


5. What metrics should I track? What tools should I use? Which models are best?


6. Can I automate testing and evaluation? If so, how do I trust it?


If you aren't sure about the answers to these questions, this course is for you.


This is a hands-on course for engineers and technical PMs. Ideal for those who are comfortable coding or vibe coding.


---

WHAT TO EXPECT


This course will provide you with hands-on experience. Get ready to sweat through exercises, code, and data! We will meet twice a week for four weeks, with generous office hours (see the course schedule below).


We will also hold office hours and host a Discord community where you can communicate with us and each other. In return, you will be rewarded with skills that will set you apart from the competition by a wide margin (see testimonials below). All sessions will be recorded and available to students asynchronously.

---

COURSE CONTENT


Lesson 1: Fundamentals & Lifecycle of LLM Application Evaluation

- Why evaluation matters for LLM applications - business impact and risk mitigation

- Challenges unique to evaluating LLM outputs - common failure modes and context-dependence

- The lifecycle approach from development to production

- Basic instrumentation and observability for tracking system behavior (see the sketch after this list)

- Introduction to error analysis and methods for categorizing failures
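
A taste of the instrumentation topic above, as a minimal sketch in plain Python (no specific observability vendor assumed; the log file name and record fields are illustrative):

```python
import json, time, uuid
from datetime import datetime, timezone

LOG_PATH = "llm_traces.jsonl"  # hypothetical local trace log

def log_llm_call(prompt: str, response: str, model: str, latency_s: float) -> None:
    """Append one trace record per LLM call; enough to start error analysis."""
    record = {
        "trace_id": str(uuid.uuid4()),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model": model,
        "prompt": prompt,
        "response": response,
        "latency_s": round(latency_s, 3),
    }
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps(record) + "\n")

def call_model(prompt: str, model: str = "gpt-4o-mini") -> str:
    """Placeholder for your real LLM call; swap in your client of choice."""
    start = time.time()
    response = f"(model output for: {prompt[:30]}...)"  # stand-in for a real API call
    log_llm_call(prompt, response, model, time.time() - start)
    return response

if __name__ == "__main__":
    call_model("Summarize this support ticket in one sentence.")
```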


Lesson 2: Systematic Error Analysis

- Bootstrap data through effective synthetic data generation (see the sketch after this list)

- Annotation strategies and quantitative analysis of qualitative data

- Translating error findings into actionable improvements

- Avoiding common pitfalls in the analysis process

- Practical exercise: Building and iterating on an error tracking system
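
To make the synthetic-data bullet above concrete, here is a minimal sketch assuming the OpenAI Python SDK and an OPENAI_API_KEY in your environment; the personas and scenarios are made-up placeholders you would replace with your own product's dimensions:

```python
import itertools, json
from openai import OpenAI  # assumes `pip install openai` and OPENAI_API_KEY set

client = OpenAI()

# Hypothetical dimensions for a real-estate assistant; replace with your product's.
personas = ["first-time buyer", "busy agent", "property manager"]
scenarios = ["schedule a showing", "draft a listing description", "compare mortgage options"]

def synthetic_queries(n_per_cell: int = 2) -> list[dict]:
    """Generate candidate user queries for every persona x scenario combination."""
    rows = []
    for persona, scenario in itertools.product(personas, scenarios):
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "user",
                "content": (
                    f"Write {n_per_cell} realistic user messages a {persona} might send "
                    f"when they want to {scenario}. Return one message per line."
                ),
            }],
        )
        for line in resp.choices[0].message.content.splitlines():
            if line.strip():
                rows.append({"persona": persona, "scenario": scenario, "query": line.strip()})
    return rows

if __name__ == "__main__":
    print(json.dumps(synthetic_queries()[:5], indent=2))
```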


Lesson 3: Implementing Effective Evaluations

- Defining metrics using code-based and LLM-judge approaches (see the sketch after this list)

- Techniques for evaluating individual outputs and overall system performance

- Organizing datasets with proper structure for inputs and reference data

- Practical exercise: Building an automated evaluation pipeline
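
A hedged sketch of the two metric styles mentioned above: one deterministic code-based check and one binary LLM judge. The judge prompt, the "summary" JSON schema, and the model name are illustrative assumptions, not recommendations:

```python
import json
from openai import OpenAI  # assumes OPENAI_API_KEY is set

client = OpenAI()

def code_based_check(output: str) -> bool:
    """Deterministic assertion: the app is supposed to return valid JSON with a 'summary' key."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and "summary" in data

def llm_judge(question: str, output: str) -> bool:
    """Binary LLM-as-judge for a criterion that is hard to express in code."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                "You are grading an AI assistant.\n"
                f"Question: {question}\nAnswer: {output}\n"
                "Does the answer directly address the question without hedging or filler? "
                "Reply with exactly PASS or FAIL."
            ),
        }],
    )
    return resp.choices[0].message.content.strip().upper().startswith("PASS")
```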


Lesson 4: Collaborative Evaluation Practices

- Designing efficient team-based evaluation workflows

- Statistical methods for measuring inter-annotator agreement (see the sketch after this list)

- Techniques for building consensus on evaluation criteria

- Practical exercise: Collaborative alignment in breakout groups
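
As a preview of the inter-annotator agreement topic above, here is a self-contained Cohen's kappa calculation for two annotators labeling the same outputs pass/fail (the labels are toy data):

```python
from collections import Counter

def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Agreement between two annotators, corrected for chance agreement."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    categories = set(labels_a) | set(labels_b)
    expected = sum((counts_a[c] / n) * (counts_b[c] / n) for c in categories)
    return (observed - expected) / (1 - expected)

# Toy example: two reviewers grading the same 8 LLM outputs.
reviewer_1 = ["pass", "pass", "fail", "pass", "fail", "pass", "fail", "pass"]
reviewer_2 = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass"]
print(f"kappa = {cohens_kappa(reviewer_1, reviewer_2):.2f}")
```

A kappa near 1.0 means your annotators agree far more than chance would predict; low values are a signal to refine the labeling guidelines before automating anything.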


Lesson 5: Architecture-Specific Evaluation Strategies

- Evaluating RAG systems for retrieval relevance and factual accuracy (see the sketch after this list)

- Testing multi-step pipelines to identify error propagation

- Assessing appropriate tool use and multi-turn conversation quality

- Multi-modal evaluation for text, image, and audio interactions

- Practical exercise: Creating targeted test suites for different architectures
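
For the RAG bullet above, retrieval quality is often the first thing worth measuring. A minimal sketch of recall@k and MRR over hypothetical document IDs and human relevance judgments:

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int = 5) -> float:
    """Fraction of relevant documents that appear in the top-k retrieved results."""
    hits = len(set(retrieved_ids[:k]) & relevant_ids)
    return hits / len(relevant_ids) if relevant_ids else 0.0

def mrr(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """Reciprocal rank of the first relevant document (0 if none retrieved)."""
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

# Toy query: the retriever returned these IDs; a human marked doc_7 and doc_2 as relevant.
retrieved = ["doc_4", "doc_7", "doc_9", "doc_2", "doc_1"]
relevant = {"doc_7", "doc_2"}
print(recall_at_k(retrieved, relevant, k=3), mrr(retrieved, relevant))
```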


Lesson 6: Production Monitoring & Continuous Evaluation

- Implementing traces, spans, and session tracking for observability

- Setting up automated evaluation gates in CI/CD pipelines (see the sketch after this list)

- Methods for consistent comparison across experiments

- Implementing safety and quality control guardrails

- Practical exercise: Designing an effective monitoring dashboard
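
The CI/CD bullet above often boils down to "fail the build when the eval pass rate drops." A minimal sketch, assuming an earlier pipeline step wrote results to a JSONL file with a boolean `passed` field (the file name and threshold are placeholders):

```python
import json
import sys

PASS_RATE_THRESHOLD = 0.90  # hypothetical quality bar agreed on by the team

def main(results_path: str = "eval_results.jsonl") -> None:
    """Read eval results produced by an earlier pipeline step and gate the deploy."""
    results = [json.loads(line) for line in open(results_path)]
    pass_rate = sum(r["passed"] for r in results) / len(results)
    print(f"pass rate: {pass_rate:.1%} over {len(results)} test cases")
    if pass_rate < PASS_RATE_THRESHOLD:
        sys.exit(f"Eval gate failed: {pass_rate:.1%} < {PASS_RATE_THRESHOLD:.0%}")

if __name__ == "__main__":
    main()
```

In CI, the non-zero exit from sys.exit is what blocks the deploy.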


Lesson 7: Efficient Continuous Human Review Systems

- Strategic sampling approaches for maximizing review impact (see the sketch after this list)

- Optimizing interface design for reviewer productivity

- Practical exercise: Implementing a continuous feedback collection system
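
A miniature version of the sampling idea above: instead of reviewing traces uniformly at random, spend most of a fixed review budget on traces that look suspicious. The `judge_score` and `user_feedback` fields and the 70/30 split are illustrative assumptions:

```python
import random

def sample_for_review(traces: list[dict], budget: int = 20) -> list[dict]:
    """Spend a fixed review budget mostly on suspicious traces, plus a random slice."""
    suspicious = [t for t in traces if t.get("judge_score", 1.0) < 0.5
                  or t.get("user_feedback") == "thumbs_down"]
    rest = [t for t in traces if t not in suspicious]
    n_suspicious = min(len(suspicious), int(budget * 0.7))  # ~70% of budget on likely failures
    n_random = min(len(rest), budget - n_suspicious)        # rest is a random health check
    return random.sample(suspicious, n_suspicious) + random.sample(rest, n_random)
```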


Lesson 8: Cost Optimization

- Quantifying value versus expenditure in LLM applications

- Intelligent model routing based on query complexity (see the sketch after this list)

- Practical exercise: Optimizing a real-world application for cost efficiency
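
The routing bullet above, as a toy heuristic (the model names and the complexity rule are placeholders; in practice you would derive the rule from your own eval data):

```python
CHEAP_MODEL = "gpt-4o-mini"   # placeholder names; substitute whatever you actually use
STRONG_MODEL = "gpt-4o"

def route(query: str) -> str:
    """Send obviously simple queries to the cheap model, everything else to the strong one."""
    looks_complex = (
        len(query.split()) > 60
        or any(k in query.lower() for k in ("compare", "analyze", "step by step", "contract"))
    )
    return STRONG_MODEL if looks_complex else CHEAP_MODEL

print(route("What are your hours?"))                       # -> gpt-4o-mini
print(route("Compare these two leases step by step ..."))  # -> gpt-4o
```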


---

FREE CREDITS

Each student will get $1,000 of free Modal (https://modal.com/) compute credits.


---

GUEST SPEAKERS


1. Kwindla Kramer: Founder of pipecat, an OSS framework for voice and conversational AI. Topic: how to eval voice agents.

2. JJ Allaire: Creator of OSS tool Inspect, an eval toolset used by Anthropic, Groq and other leading companies.

3. Eugene Yan: Principal Applied Scientist at Amazon. "An LLM‑as‑Judge Won't Save Your Product—Fixing Your Process Will"

4. Aarush Sah: Head of Evals at Groq

5. Benjamin Clavié: Creator of the OSS projects RAGatouille and rerankers. Topic: RAG mistakes & how to diagnose them.

6. Isaac Flath: Creator of MonsterUI and core contributor to FastHTML. Topic: building custom annotation tools for error analysis with FastHTML.

7. Mikyo King: Creator of the OSS evals framework Phoenix.

8. Bryan Bischof: Head of AI @ Theory; formerly Hex, W&B, and StitchFix. Topic: failure funnels, an analytical framework for simplifying agent evaluations.

9. Brooke Hopkins: Founder of Coval, simulation & evals for voice and chat agents

10. Alex Volkov: Developer Advocate @ Weights & Biases. Topic: "Reasoning models and LLM as a judge"

11. Aman Khan: AI PM @ Arize

Who this course is for

01

Engineers & Technical PMs building AI applications who have little experience with machine learning or data science.

02

Those interested in moving beyond vibe-checks to data-driven measurements you can trust, even when outputs are stochastic or subjective.

03

Founders and leaders who are unsure of the failure modes of their AI applications and where to allocate resources.

What you’ll get out of this course

Acquire the best tools for finding, diagnosing, and prioritizing AI errors.

We've tried all of them so you don't have to.

Learn how to bootstrap with synthetic data for testing before you have users.

And how to best leverage data when you do have users.

Create a data flywheel for your applications that guarantees your AI will improve over time.

Data flywheels ensure you have examples to draw from for prompts, tests, and fine-tuning.

Automate parts of your AI evaluation with approaches that allow you to actually trust and rely on them.

How can you really trust an LLM-as-a-judge? How should you design one? We will show you how. We will also show you how to use AI assistance for tasks like refining prompts and generating metadata.
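
One concrete way to earn that trust, sketched under the assumption that you have a small human-labeled set to measure the judge against (the labels below are toy data):

```python
def judge_agreement(human_labels: list[bool], judge_labels: list[bool]) -> dict:
    """Compare an LLM judge to human labels on a held-out set before relying on it."""
    pairs = list(zip(human_labels, judge_labels))
    tp = sum(h and j for h, j in pairs)          # judge agrees on a true pass
    tn = sum(not h and not j for h, j in pairs)  # judge agrees on a true fail
    return {
        "agreement": (tp + tn) / len(pairs),
        "true_positive_rate": tp / max(sum(human_labels), 1),
        "true_negative_rate": tn / max(sum(not h for h in human_labels), 1),
    }

# Toy numbers: 10 outputs graded by a human and by the judge prompt you are testing.
human = [True, True, False, True, False, True, True, False, True, False]
judge = [True, True, False, False, False, True, True, True, True, False]
print(judge_agreement(human, judge))
```

If the judge's true positive and true negative rates against human labels are high on a held-out set, you can lean on it for the bulk of your grading and reserve human review for the disagreements.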

Ensure your AI is aligned to your preferences, tastes and judgement.

We will show you approaches for discovering all the ways your AI is not performing as you expect.

Avoid common mistakes we've seen across 35+ AI implementations.

There are an infinite number of things you can try, tests you can write, and data you can look at. We will show you a data-driven process that helps you prioritize the most important problems so you can avoid wasting time and money.

Hands-On Exercises, Examples and Code

We will provide end-to-end exercises, examples and code to make sure you come away with the skills you need. We will NOT just throw a bunch of slides at you!

Personalized Instruction

Generous office hours ensure students can ask questions about their specific issues and interests.

This course includes

20 interactive live sessions

Lifetime access to course materials

In-depth lessons

Direct access to instructor

Projects to apply learnings

Guided feedback & reflection

Private community of peers

Course certificate upon completion

Maven Satisfaction Guarantee

This course is backed by Maven’s guarantee. You can receive a full refund within 14 days after the course ends, provided you meet the completion criteria in our refund policy.

Course syllabus

Week 1

May 19—May 25

- Lesson 1: Fundamentals & Lifecycle of LLM Application Evaluation (Tue 5/20, 5:00 PM—6:00 PM UTC)
- Lesson 2: Systematic Error Analysis (Fri 5/23, 5:00 PM—6:00 PM UTC)
- Optional: Office Hours (Fri 5/23, 6:00 PM—6:45 PM UTC)

Week 2

May 26—Jun 1

- Lesson 3: Implementing Effective Evaluations (Tue 5/27, 5:00 PM—5:45 PM UTC)
- Optional: Building custom annotation tools for error analysis with FastHTML (Wed 5/28, 5:00 PM—6:00 PM UTC)
- Lesson 4: Collaborative Evaluation Practices (Thu 5/29, 5:00 PM—5:45 PM UTC)
- Optional: Office Hours (Fri 5/30, 5:00 PM—5:45 PM UTC)
- Optional: Office Hours (Sat 5/31, 12:00 AM—12:45 AM UTC)

Week 3

Jun 2—Jun 8

- Lesson 5: Architecture-Specific Evaluation Strategies (Tue 6/3, 5:00 PM—6:00 PM UTC)
- Optional: Office Hours (Tue 6/3, 6:00 PM—6:30 PM UTC)
- Lesson 6: Production Monitoring & Continuous Evaluation (Thu 6/5, 5:00 PM—6:00 PM UTC)
- Optional: Office Hours (Thu 6/5, 6:00 PM—6:30 PM UTC)
- Optional: Office Hours (Sat 6/7, 12:00 AM—12:45 AM UTC)

Week 4

Jun 9—Jun 13

- Optional: Guest Speaker: Evaluating Voice Agents with Kwindla Kramer (Mon 6/9, 5:00 PM—5:45 PM UTC)
- Lesson 7: Efficient Continuous Human Review Systems (Tue 6/10, 5:00 PM—6:00 PM UTC)
- Lesson 8: Cost Optimization (Wed 6/11, 5:00 PM—6:00 PM UTC)
- Guest Speaker: Evaluating and Optimizing RAG with Benjamin Clavié (Thu 6/12, 2:00 AM—3:00 AM UTC)
- Optional: Evaluating Voice Agents (Part 2) (Thu 6/12, 5:00 PM—5:30 PM UTC)
- Optional: Reasoning Models & LLM-as-a-Judge (Fri 6/13, 4:00 PM—4:30 PM UTC)
- Optional: Office Hours (Fri 6/13, 4:30 PM—5:15 PM UTC)

What people are saying

        Shreya and Hamel are legit. Through their work on dozens of use cases, they've encountered and successfully addressed many of the common challenges in LLM evals. Every time I seek their advice, I come away with greater clarity and insight on how to solve my eval challenges.
Eugene Yan

Senior Applied Scientist
        Hamel is one of the most knowledgeable people about LLM evals. I've witnessed him improve AI products first-hand by guiding his clients carefully through the process. We've even made many improvements to LangSmith because of his work.
Harrison Chase

CEO, Langchain
        I was seeking help with LLM evaluation and testing for our products. Hamel's widely-referenced work on evals made him the clear choice. He helped us rethink our entire approach to LLM development and testing, creating a clear pathway to measure and improve our AI systems.
George Siemens

CEO, Matter & Space
        Hamel showed us how to evaluate our AI systems tailored to our use case. We gained insights that dramatically improved performance, reducing error rates by over 60% in critical areas like date handling. These methods have become fundamental to how we build and improve our AI products, creating a continuous cycle of improvement.
Jacob Carter

CEO, NurtureBoss
        Hamel & team allowed serviceMob to save hundreds of hours of engineering time by showing us the best tools, techniques, and processes. We shipped industry-leading AI in a few weeks instead of months, and kept shipping thereafter thanks to how his team up-skilled our company.
Anuj Bhalla

CEO, ServiceMob
        We have had the chance to work with Hamel and it’s been a very successful partnership for us. He has massive technical knowledge and is also a high-level executive who can help you on all different levels. Couldn’t have asked for a better partner.
Emil Sedgh

CTO, Rechat
        My session with Hamel was incredibly helpful. He broke down LLM evaluations step by step and made everything practical. He’s a true expert and a fantastic teacher!
Max Shaw

CEO, Windmill
        When I have questions about the intersection of data and production AI systems, Shreya & Hamel are the first people I call. It's often the case that they've already written about my problem. You can’t find more qualified folks to teach this, anywhere.
Bryan Bischof

Director of Engineering, Hex
        Hamel has provided exactly the tutorial I was needing for [evals], with a really thorough example case-study ... Hamel's content is fantastic, but it's a bit absurd that he's single-handedly having to make up for a lack of good materials about this topic across the rest of our industry!
Simon Willison

Creator of Datasette
        After talking to Hamel, I learned I was being too abstract in trying to evaluate GiveCare, using top-down frameworks instead of starting with real caregiver interactions. Now seeing more value in a bottom-up approach: review user sessions, identify issues, and build tests around actual data.
Ali Madad

Founder, GiveCare
        Hamel and Shreya are technically goated, deeply experienced engineers of AI systems who just so happen to have impeccable vibes. I wouldn't learn this material from anyone else.
Charles Frye

Dev Advocate - Modal

Meet your instructors

Hamel Husain

ML Engineer with 20 years of experience.

Hamel Husain is an ML Engineer with over 20 years of experience. He has worked with innovative companies such as Airbnb and GitHub, including early LLM research for code understanding that was used by OpenAI. He has also led and contributed to numerous popular open-source machine-learning tools. Hamel is currently an independent consultant helping companies build AI products.

Shreya Shankar

ML Systems Researcher Making AI Evaluation Work in Practice

Shreya Shankar is an experienced ML Engineer who is currently a PhD candidate in computer science at UC Berkeley, where she builds systems that help people use AI to work with data effectively. Her research focuses on developing practical tools and frameworks for building reliable ML systems, with recent groundbreaking work on LLM evaluation and data quality. She has published influential papers on evaluating and aligning LLM systems, including "Who Validates the Validators?" which explores how to systematically align LLM evaluations with human preferences.

Prior to her PhD, Shreya worked as an ML engineer in industry and completed her BS and MS in computer science at Stanford. Her work appears in top data management and HCI venues including SIGMOD, VLDB, and UIST. She is currently supported by the NDSEG Fellowship and has collaborated extensively with major tech companies and startups to deploy her research in production environments. Her recent projects like DocETL and SPADE demonstrate her ability to bridge theoretical frameworks with practical implementations that help developers build more reliable AI systems.


Join an upcoming cohort

AI Evals For Engineers & PMs

Cohort 1

$1,975

Dates

May 19—June 14, 2025

Payment Deadline

May 2, 2025

Don't miss out! Enrollment closes in 6 days

Get reimbursed

Learning is better with cohorts

Active hands-on learning

This course builds on live workshops and hands-on projects

Interactive and project-based

You’ll be interacting with other learners through breakout rooms and project teams

Learn with a cohort of peers

Join a community of like-minded people who want to learn and grow alongside you
