AI Evals For Engineers & PMs

New · 4 Weeks · Cohort-based Course

Learn proven approaches for quickly improving AI applications. Build AI that works better than the competition, regardless of the use case.

This course is popular

50+ people enrolled last week.

Previously at

GitHub
Airbnb
Google

Course overview

Eliminate the guesswork of building AI applications with data-driven approaches.

Do you catch yourself asking any of the following questions while building AI applications?


1. How do I test applications when the outputs are stochastic and require subjective judgements?


2. If I change the prompt, how do I know I'm not breaking something else?


3. Where should I focus my engineering efforts? Do I need to test everything?


4. What if I have no data or customers? Where do I start?


5. What metrics should I track? What tools should I use? Which models are best?


6. Can I automate testing and evaluation? If so, how do I trust it?


If you aren't sure about the answers to these questions, this course is for you.


This is a hands-on course for engineers and technical PMs. Ideal for those who are comfortable coding or vibe coding.


---

WHAT TO EXPECT


This course will provide you with hands-on experience. Get ready to sweat through exercises, code, and data! We will meet twice a week for four weeks, with generous office hours (see the course schedule below).


We will also hold office hours and host a Discord community where you can communicate with us and each other. In return, you will be rewarded with skills that will set you apart from the competition by a wide margin (see testimonials below). All sessions will be recorded and available to students asynchronously.

---

COURSE CONTENT


Lesson 1: Fundamentals & Lifecycle of LLM Application Evaluation

- Why evaluation matters for LLM applications - business impact and risk mitigation

- Challenges unique to evaluating LLM outputs - common failure modes and context-dependence

- The lifecycle approach from development to production

- Basic instrumentation and observability for tracking system behavior (see the sketch after this list)

- Introduction to error analysis and methods for categorizing failures
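
A taste of the instrumentation topic above, as a minimal sketch in plain Python (no specific observability vendor assumed; the log file name and record fields are illustrative):

```python
import json, time, uuid
from datetime import datetime, timezone

LOG_PATH = "llm_traces.jsonl"  # hypothetical local trace log

def log_llm_call(prompt: str, response: str, model: str, latency_s: float) -> None:
    """Append one trace record per LLM call; enough to start error analysis."""
    record = {
        "trace_id": str(uuid.uuid4()),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model": model,
        "prompt": prompt,
        "response": response,
        "latency_s": round(latency_s, 3),
    }
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps(record) + "\n")

def call_model(prompt: str, model: str = "gpt-4o-mini") -> str:
    """Placeholder for your real LLM call; swap in your client of choice."""
    start = time.time()
    response = f"(model output for: {prompt[:30]}...)"  # stand-in for a real API call
    log_llm_call(prompt, response, model, time.time() - start)
    return response

if __name__ == "__main__":
    call_model("Summarize this support ticket in one sentence.")
```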


Lesson 2: Systematic Error Analysis

- Bootstrap data through effective synthetic data generation (see the sketch after this list)

- Annotation strategies and quantitative analysis of qualitative data

- Translating error findings into actionable improvements

- Avoiding common pitfalls in the analysis process

- Practical exercise: Building and iterating on an error tracking system
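
To make the synthetic-data bullet above concrete, here is a minimal sketch assuming the OpenAI Python SDK and an OPENAI_API_KEY in your environment; the personas and scenarios are made-up placeholders you would replace with your own product's dimensions:

```python
import itertools, json
from openai import OpenAI  # assumes `pip install openai` and OPENAI_API_KEY set

client = OpenAI()

# Hypothetical dimensions for a real-estate assistant; replace with your product's.
personas = ["first-time buyer", "busy agent", "property manager"]
scenarios = ["schedule a showing", "draft a listing description", "compare mortgage options"]

def synthetic_queries(n_per_cell: int = 2) -> list[dict]:
    """Generate candidate user queries for every persona x scenario combination."""
    rows = []
    for persona, scenario in itertools.product(personas, scenarios):
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "user",
                "content": (
                    f"Write {n_per_cell} realistic user messages a {persona} might send "
                    f"when they want to {scenario}. Return one message per line."
                ),
            }],
        )
        for line in resp.choices[0].message.content.splitlines():
            if line.strip():
                rows.append({"persona": persona, "scenario": scenario, "query": line.strip()})
    return rows

if __name__ == "__main__":
    print(json.dumps(synthetic_queries()[:5], indent=2))
```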


Lesson 3: Implementing Effective Evaluations

- Defining metrics using code-based and LLM-judge approaches (see the sketch after this list)

- Techniques for evaluating individual outputs and overall system performance

- Organizing datasets with proper structure for inputs and reference data

- Practical exercise: Building an automated evaluation pipeline
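
A hedged sketch of the two metric styles mentioned above: one deterministic code-based check and one binary LLM judge. The judge prompt, the "summary" JSON schema, and the model name are illustrative assumptions, not recommendations:

```python
import json
from openai import OpenAI  # assumes OPENAI_API_KEY is set

client = OpenAI()

def code_based_check(output: str) -> bool:
    """Deterministic assertion: the app is supposed to return valid JSON with a 'summary' key."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and "summary" in data

def llm_judge(question: str, output: str) -> bool:
    """Binary LLM-as-judge for a criterion that is hard to express in code."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                "You are grading an AI assistant.\n"
                f"Question: {question}\nAnswer: {output}\n"
                "Does the answer directly address the question without hedging or filler? "
                "Reply with exactly PASS or FAIL."
            ),
        }],
    )
    return resp.choices[0].message.content.strip().upper().startswith("PASS")
```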


Lesson 4: Collaborative Evaluation Practices

- Designing efficient team-based evaluation workflows

- Statistical methods for measuring inter-annotator agreement (see the sketch after this list)

- Techniques for building consensus on evaluation criteria

- Practical exercise: Collaborative alignment in breakout groups
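
As a preview of the inter-annotator agreement topic above, here is a self-contained Cohen's kappa calculation for two annotators labeling the same outputs pass/fail (the labels are toy data):

```python
from collections import Counter

def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Agreement between two annotators, corrected for chance agreement."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    categories = set(labels_a) | set(labels_b)
    expected = sum((counts_a[c] / n) * (counts_b[c] / n) for c in categories)
    return (observed - expected) / (1 - expected)

# Toy example: two reviewers grading the same 8 LLM outputs.
reviewer_1 = ["pass", "pass", "fail", "pass", "fail", "pass", "fail", "pass"]
reviewer_2 = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass"]
print(f"kappa = {cohens_kappa(reviewer_1, reviewer_2):.2f}")
```

A kappa near 1.0 means your annotators agree far more than chance would predict; low values are a signal to refine the labeling guidelines before automating anything.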


Lesson 5: Architecture-Specific Evaluation Strategies

- Evaluating RAG systems for retrieval relevance and factual accuracy (see the sketch after this list)

- Testing multi-step pipelines to identify error propagation

- Assessing appropriate tool use and multi-turn conversation quality

- Multi-modal evaluation for text, image, and audio interactions

- Practical exercise: Creating targeted test suites for different architectures
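
For the RAG bullet above, retrieval quality is often the first thing worth measuring. A minimal sketch of recall@k and MRR over hypothetical document IDs and human relevance judgments:

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int = 5) -> float:
    """Fraction of relevant documents that appear in the top-k retrieved results."""
    hits = len(set(retrieved_ids[:k]) & relevant_ids)
    return hits / len(relevant_ids) if relevant_ids else 0.0

def mrr(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """Reciprocal rank of the first relevant document (0 if none retrieved)."""
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

# Toy query: the retriever returned these IDs; a human marked doc_7 and doc_2 as relevant.
retrieved = ["doc_4", "doc_7", "doc_9", "doc_2", "doc_1"]
relevant = {"doc_7", "doc_2"}
print(recall_at_k(retrieved, relevant, k=3), mrr(retrieved, relevant))
```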


Lesson 6: Production Monitoring & Continuous Evaluation

- Implementing traces, spans, and session tracking for observability

- Setting up automated evaluation gates in CI/CD pipelines (see the sketch after this list)

- Methods for consistent comparison across experiments

- Implementing safety and quality control guardrails

- Practical exercise: Designing an effective monitoring dashboard
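
The CI/CD bullet above often boils down to "fail the build when the eval pass rate drops." A minimal sketch, assuming an earlier pipeline step wrote results to a JSONL file with a boolean `passed` field (the file name and threshold are placeholders):

```python
import json
import sys

PASS_RATE_THRESHOLD = 0.90  # hypothetical quality bar agreed on by the team

def main(results_path: str = "eval_results.jsonl") -> None:
    """Read eval results produced by an earlier pipeline step and gate the deploy."""
    results = [json.loads(line) for line in open(results_path)]
    pass_rate = sum(r["passed"] for r in results) / len(results)
    print(f"pass rate: {pass_rate:.1%} over {len(results)} test cases")
    if pass_rate < PASS_RATE_THRESHOLD:
        sys.exit(f"Eval gate failed: {pass_rate:.1%} < {PASS_RATE_THRESHOLD:.0%}")

if __name__ == "__main__":
    main()
```

In CI, the non-zero exit from sys.exit is what blocks the deploy.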


Lesson 7: Efficient Continuous Human Review Systems

- Strategic sampling approaches for maximizing review impact (see the sketch after this list)

- Optimizing interface design for reviewer productivity

- Practical exercise: Implementing a continuous feedback collection system
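
A miniature version of the sampling idea above: instead of reviewing traces uniformly at random, spend most of a fixed review budget on traces that look suspicious. The `judge_score` and `user_feedback` fields and the 70/30 split are illustrative assumptions:

```python
import random

def sample_for_review(traces: list[dict], budget: int = 20) -> list[dict]:
    """Spend a fixed review budget mostly on suspicious traces, plus a random slice."""
    suspicious = [t for t in traces if t.get("judge_score", 1.0) < 0.5
                  or t.get("user_feedback") == "thumbs_down"]
    rest = [t for t in traces if t not in suspicious]
    n_suspicious = min(len(suspicious), int(budget * 0.7))  # ~70% of budget on likely failures
    n_random = min(len(rest), budget - n_suspicious)        # rest is a random health check
    return random.sample(suspicious, n_suspicious) + random.sample(rest, n_random)
```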


Lesson 8: Cost Optimization

- Quantifying value versus expenditure in LLM applications

- Intelligent model routing based on query complexity (see the sketch after this list)

- Practical exercise: Optimizing a real-world application for cost efficiency
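
The routing bullet above, as a toy heuristic (the model names and the complexity rule are placeholders; in practice you would derive the rule from your own eval data):

```python
CHEAP_MODEL = "gpt-4o-mini"   # placeholder names; substitute whatever you actually use
STRONG_MODEL = "gpt-4o"

def route(query: str) -> str:
    """Send obviously simple queries to the cheap model, everything else to the strong one."""
    looks_complex = (
        len(query.split()) > 60
        or any(k in query.lower() for k in ("compare", "analyze", "step by step", "contract"))
    )
    return STRONG_MODEL if looks_complex else CHEAP_MODEL

print(route("What are your hours?"))                       # -> gpt-4o-mini
print(route("Compare these two leases step by step ..."))  # -> gpt-4o
```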


---

FREE CREDITS

Each student will get $1,000 of free Modal (https://modal.com/) compute credits.


---

GUEST SPEAKERS


1. Kwindla Kramer: Founder of pipecat, an OSS framework for voice and conversational AI. Topic: how to eval voice agents.

2. JJ Allaire: Creator of OSS tool Inspect, an eval toolset used by Anthropic, Groq and other leading companies.

3. Eugene Yan: Principal Applied Scientist at Amazon. "An LLM‑as‑Judge Won't Save Your Product—Fixing Your Process Will"

4. Aarush Sah: Head of Evals at Groq

5. Benjamin Clavié: Creator of the OSS projects RAGatouille and rerankers. Topic: RAG mistakes & how to diagnose them.

6. Isaac Flath: Creator of MonsterUI and core contributor to FastHTML. Topic: building custom annotation tools for error analysis with FastHTML.

7. Mikyo King: Creator of the OSS evals framework Phoenix.

8. Bryan Bischof: Head of AI @ Theory; formerly Hex, W&B, and StitchFix. Topic: failure funnels, an analytical framework for simplifying agent evaluations.

9. Brooke Hopkins: Founder of Coval, simulation & evals for voice and chat agents

10. Alex Volkov: Developer Advocate @ Weights & Biases. Topic: "Reasoning models and LLM as a judge"

11. Aman Khan: AI PM @ Arize

Who this course is for

01

Engineers & Technical PMs building AI applications who have little experience with machine learning or data science.

02

Those interested in moving beyond vibe-checks to data-driven measurements you can trust, even when outputs are stochastic or subjective.

03

Founders and leaders who are unsure of the failure modes of their AI applications and where to allocate resources.

What you’ll get out of this course

Acquire the best tools for finding, diagnosing, and prioritizing AI errors.

We've tried all of them so you don't have to.

Learn how to bootstrap with synthetic data for testing before you have users.

And how to best leverage data when you do have users.

Create a data flywheel for your applications that guarantees your AI will improve over time.

Data flywheels ensure you have examples to draw from for prompts, tests, and fine-tuning.

Automate parts of your AI evaluation with approaches that allow you to actually trust and rely on them.

How can you really trust an LLM-as-a-judge? How should you design one? We will show you how. We will also show you how to use AI assistance for tasks like refining prompts and generating metadata.
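
One concrete way to earn that trust, sketched under the assumption that you have a small human-labeled set to measure the judge against (the labels below are toy data):

```python
def judge_agreement(human_labels: list[bool], judge_labels: list[bool]) -> dict:
    """Compare an LLM judge to human labels on a held-out set before relying on it."""
    pairs = list(zip(human_labels, judge_labels))
    tp = sum(h and j for h, j in pairs)          # judge agrees on a true pass
    tn = sum(not h and not j for h, j in pairs)  # judge agrees on a true fail
    return {
        "agreement": (tp + tn) / len(pairs),
        "true_positive_rate": tp / max(sum(human_labels), 1),
        "true_negative_rate": tn / max(sum(not h for h in human_labels), 1),
    }

# Toy numbers: 10 outputs graded by a human and by the judge prompt you are testing.
human = [True, True, False, True, False, True, True, False, True, False]
judge = [True, True, False, False, False, True, True, True, True, False]
print(judge_agreement(human, judge))
```

If the judge's true positive and true negative rates against human labels are high on a held-out set, you can lean on it for the bulk of your grading and reserve human review for the disagreements.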

Ensure your AI is aligned to your preferences, tastes and judgement.

We will show you approaches for discovering all the ways your AI is not performing as you expect.

Avoid common mistakes we've seen across 35+ AI implementations.

There are an infinite number of things you can try, tests you can write, and data you can look at. We will show you a data-driven process that helps you prioritize the most important problems so you can avoid wasting time and money.

Hands-On Exercises, Examples and Code

We will provide end-to-end exercises, examples and code to make sure you come away with the skills you need. We will NOT just throw a bunch of slides at you!

Personalized Instruction

Generous office hours ensure students can ask questions about their specific issues and interests.

This course includes

20 interactive live sessions

Lifetime access to course materials

In-depth lessons

Direct access to instructor

Projects to apply learnings

Guided feedback & reflection

Private community of peers

Course certificate upon completion

Maven Satisfaction Guarantee

This course is backed by Maven’s guarantee. You can receive a full refund within 14 days after the course ends, provided you meet the completion criteria in our refund policy.

Course syllabus

Week 1

May 19—May 25

- Lesson 1: Fundamentals & Lifecycle of LLM Application Evaluation (Tue 5/20, 5:00 PM—6:00 PM UTC)
- Lesson 2: Systematic Error Analysis (Fri 5/23, 5:00 PM—6:00 PM UTC)
- Optional: Office Hours (Fri 5/23, 6:00 PM—6:45 PM UTC)

Week 2

May 26—Jun 1

- Lesson 3: Implementing Effective Evaluations (Tue 5/27, 5:00 PM—5:45 PM UTC)
- Optional: Building custom annotation tools for error analysis with FastHTML (Wed 5/28, 5:00 PM—6:00 PM UTC)
- Lesson 4: Collaborative Evaluation Practices (Thu 5/29, 5:00 PM—5:45 PM UTC)
- Optional: Office Hours (Fri 5/30, 5:00 PM—5:45 PM UTC)
- Optional: Office Hours (Sat 5/31, 12:00 AM—12:45 AM UTC)

Week 3

Jun 2—Jun 8

- Lesson 5: Architecture-Specific Evaluation Strategies (Tue 6/3, 5:00 PM—6:00 PM UTC)
- Optional: Office Hours (Tue 6/3, 6:00 PM—6:30 PM UTC)
- Lesson 6: Production Monitoring & Continuous Evaluation (Thu 6/5, 5:00 PM—6:00 PM UTC)
- Optional: Office Hours (Thu 6/5, 6:00 PM—6:30 PM UTC)
- Optional: Office Hours (Sat 6/7, 12:00 AM—12:45 AM UTC)

Week 4

Jun 9—Jun 13

- Optional: Guest Speaker: Evaluating Voice Agents with Kwindla Kramer (Mon 6/9, 5:00 PM—5:45 PM UTC)
- Lesson 7: Efficient Continuous Human Review Systems (Tue 6/10, 5:00 PM—6:00 PM UTC)
- Lesson 8: Cost Optimization (Wed 6/11, 5:00 PM—6:00 PM UTC)
- Guest Speaker: Evaluating and Optimizing RAG with Benjamin Clavié (Thu 6/12, 2:00 AM—3:00 AM UTC)
- Optional: Evaluating Voice Agents (Part 2) (Thu 6/12, 5:00 PM—5:30 PM UTC)
- Optional: Reasoning Models & LLM-as-a-Judge (Fri 6/13, 4:00 PM—4:30 PM UTC)
- Optional: Office Hours (Fri 6/13, 4:30 PM—5:15 PM UTC)

What people are saying

        Shreya and Hamel are legit. Through their work on dozens of use cases, they've encountered and successfully addressed many of the common challenges in LLM evals. Every time I seek their advice, I come away with greater clarity and insight on how to solve my eval challenges.
Eugene Yan

Senior Applied Scientist
        Hamel is one of the most knowledgeable people about LLM evals. I've witnessed him improve AI products first-hand by guiding his clients carefully through the process. We've even made many improvements to LangSmith because of his work.
Harrison Chase

CEO, Langchain
        I was seeking help with LLM evaluation and testing for our products. Hamel's widely-referenced work on evals made him the clear choice. He helped us rethink our entire approach to LLM development and testing, creating a clear pathway to measure and improve our AI systems.
George Siemens

CEO, Matter & Space
        Hamel showed us how to evaluate our AI systems tailored to our use case. We gained insights that dramatically improved performance, reducing error rates by over 60% in critical areas like date handling. These methods have become fundamental to how we build and improve our AI products, creating a continuous cycle of improvement.
Jacob Carter

CEO, NurtureBoss
        Hamel & team allowed serviceMob to save hundreds of hours of engineering time by showing us the best tools, techniques, and processes. We shipped industry-leading AI in a few weeks instead of months, and kept shipping thereafter thanks to how his team up-skilled our company.
Anuj Bhalla

CEO, ServiceMob
        We have had the chance to work with Hamel and it’s been a very successful partnership for us. He has massive technical knowledge and is also a high-level executive who can help you on all different levels. Couldn’t have asked for a better partner.
Emil Sedgh

CTO, Rechat
        My session with Hamel was incredibly helpful. He broke down LLM evaluations step by step and made everything practical. He’s a true expert and a fantastic teacher!
Max Shaw

CEO, Windmill
        When I have questions about the intersection of data and production AI systems, Shreya & Hamel are the first people I call. It's often the case that they've already written about my problem. You can’t find more qualified folks to teach this, anywhere.
Bryan Bischof

Director of Engineering, Hex
        Hamel has provided exactly the tutorial I was needing for [evals], with a really thorough example case-study ... Hamel's content is fantastic, but it's a bit absurd that he's single-handedly having to make up for a lack of good materials about this topic across the rest of our industry!
Simon Willison

Creator of Datasette
        After talking to Hamel, I learned I was being too abstract in trying to evaluate GiveCare, using top-down frameworks instead of starting with real caregiver interactions. Now seeing more value in a bottom-up approach: review user sessions, identify issues, and build tests around actual data.
Ali Madad

Founder, GiveCare
        Hamel and Shreya are technically goated, deeply experienced engineers of AI systems who just so happen to have impeccable vibes. I wouldn't learn this material from anyone else.
Charles Frye

Dev Advocate - Modal

Meet your instructors

Hamel Husain

ML Engineer with 20 years of experience.

Hamel Husain is an ML Engineer with over 20 years of experience. He has worked with innovative companies such as Airbnb and GitHub, including early LLM research for code understanding that was used by OpenAI. He has also led and contributed to numerous popular open-source machine-learning tools. Hamel is currently an independent consultant helping companies build AI products.

Shreya Shankar

ML Systems Researcher Making AI Evaluation Work in Practice

Shreya Shankar is an experienced ML Engineer who is currently a PhD candidate in computer science at UC Berkeley, where she builds systems that help people use AI to work with data effectively. Her research focuses on developing practical tools and frameworks for building reliable ML systems, with recent groundbreaking work on LLM evaluation and data quality. She has published influential papers on evaluating and aligning LLM systems, including "Who Validates the Validators?" which explores how to systematically align LLM evaluations with human preferences.

Prior to her PhD, Shreya worked as an ML engineer in industry and completed her BS and MS in computer science at Stanford. Her work appears in top data management and HCI venues including SIGMOD, VLDB, and UIST. She is currently supported by the NDSEG Fellowship and has collaborated extensively with major tech companies and startups to deploy her research in production environments. Her recent projects like DocETL and SPADE demonstrate her ability to bridge theoretical frameworks with practical implementations that help developers build more reliable AI systems.


Join an upcoming cohort

AI Evals For Engineers & PMs

Cohort 1

$1,975

Dates

May 19—June 14, 2025

Payment Deadline

May 2, 2025

Don't miss out! Enrollment closes in 6 days

Get reimbursed

Learning is better with cohorts

Active hands-on learning

This course builds on live workshops and hands-on projects

Interactive and project-based

You’ll be interacting with other learners through breakout rooms and project teams

Learn with a cohort of peers

Join a community of like-minded people who want to learn and grow alongside you
