AI Evals For Engineers & PMs

4.7 (890)

Featured in

Lenny’s List

Hamel Husain

ML Engineer with 25 years of experience

Shreya Shankar

ML Systems & Applied AI Evals Researcher

This course is popular

10 people enrolled last week.

Build, evaluate, and improve AI agents that work in production.

Save 25% ($1,050) on September's cohort with this link.

---

We've refined this material with over 4,500 engineers and PMs from teams like OpenAI, Google, Meta, Amazon, and Microsoft, folding their feedback in every time.

More than a course: an ongoing system, tools, and community for shipping better AI.

🎮 A private Discord community for ongoing support, even after the course.
🤖 6 months of access to our AI Evals assistant.
♾️ Lifetime access to all recordings, materials, and future cohorts.
💬 10+ hours of live office hours to get your questions answered.

Do you catch yourself asking any of these while building AI applications?

How do I test outputs that need subjective judgment?
If I change the prompt, how do I know I am not breaking something else?
Where should I focus my efforts? Do I need to test everything?
What if I have no data or customers? Where do I start?
What should I measure, and what tools should I use?
Can I automate evals, and how do I trust them?

If so, this is for you. All sessions are live and recorded.

📝 We've completely refreshed the material for our Sep 2026 cohort. Scroll down to see the syllabus.

What you’ll learn

Build a real AI agent, find where it breaks, and improve it with evals you can trust, working the full loop hands-on.

Instrument a real agent so every run leaves a trace you can inspect.
Turn vague failures into specific, reproducible cases with a root cause.
Set up logging and observability that show what the agent actually did.

Replace random spot-checking with a repeatable way to read traces and spot failures.
Group and prioritize failure modes so you fix what matters first.
Learn how to analyze agentic systems, including tool calls and retrieval.

Design and validate LLM-as-judge and code-based evals that match expert judgment.
Learn when a metric is real and when it is noise no one should act on.
Align evaluators with the people who own the product, so the results stick.

Wire an agent into a test suite so prompt, model, and tool changes get checked before they ship.
Compare experiments consistently and keep datasets from overfitting.
Monitor agents in production and catch drift before users do.

Probe for prompt injection, jailbreaks, and unsafe tool calls.
Add guardrails and human checks that hold up under attack.
Map an agent's attack surface so you know where it can be pushed.

Run experiments that raise accuracy and lower latency and cost.
Show which change moved the metric, with numbers.
Optimize prompts, models, and retries without breaking what already works.

Learn directly from Hamel & Shreya

Hamel Husain

ML Engineer with 20+ years of experience.

Shreya Shankar

ML Systems Researcher Making AI Evaluation Work in Practice

Who this course is for

Engineers and PMs who ship prompt changes and hope nothing breaks. (You'll learn to measure impact before and after every change.)
Teams still spot-checking AI outputs by hand instead of measuring systematically. (You'll learn how to build automated evals you can trust.)
Leaders who don't know where their AI is failing or where to invest resources. (You'll learn how to systematically find & prioritize issues.)

What's included

Live sessions

Learn directly from Hamel Husain & Shreya Shankar in a real-time, interactive format.

Your 24/7 Evals Assistant (6 Months)

Stuck at 11pm on a judge prompt? Ask the AI we built from everything we teach. Students only.

Lifetime Access: Recordings, Materials & Every Future Cohort

Rewatch anything anytime, and rejoin any future cohort live at no extra cost.

150+ Page Course Reader

We provide a course reader with detailed notes to supplement your learning and act as a future reference as you work on evals.

Lifetime Access To Discord Community

Private Discord for questions, job leads, and ongoing support from the community (1,000+ students and growing).

10+ Office Hour Q&As

Open office hours for questions and personalized feedback.

4 Homework Assignments With Solutions & Walkthroughs

Optional coding assignments & walkthrough videos so you can practice every concept.

Certificate of Completion

Share your new skills with your employer or on LinkedIn.

Detailed Vendor & Tools Workshops

Curated talks from industry experts working on evals, as well as workshops with vendors building eval tools.

Maven Guarantee

Your purchase is backed by the Maven Guarantee.

Course syllabus

17 live sessions • 11 lessons

Week 1

Sep 5–Sep 6

L1: Building Agents, Foundations

Lecture 1: Building Agents, Foundations
Sat 9/5, 6:00 PM–7:00 PM (UTC)

2 more items • Free preview

Week 2

Sep 7–Sep 13

L2: Building Agents, Designing for Evaluability

Lecture 2: Building Agents, Designing for Evaluability
Wed 9/9, 3:00 PM–4:00 PM (UTC)

1 more item • Free preview

L3: Building Agents, Synthetic Data & Scenarios

Lecture 3: Building Agents, Synthetic Data & Scenarios
Sat 9/12, 6:00 PM–7:00 PM (UTC)

1 more item • Free preview

Office Hours

Office Hours (US/APAC-friendly), optional
Thu 9/10, 4:00 AM–5:00 AM (UTC)
Office Hours (US/EU-friendly), optional
Sat 9/12, 7:15 PM–8:15 PM (UTC)

Schedule

Live sessions

3-5 hrs / week

Lectures are delivered live but also recorded so you can watch the materials at your own pace. We also provide over 10 hours of office hours and a community where you can ask questions (even after the course ends!).

Sat, Sep 5, 6:00 PM–7:00 PM (UTC)
Wed, Sep 9, 3:00 PM–4:00 PM (UTC)
Thu, Sep 10, 4:00 AM–5:00 AM (UTC)

Optional Homework Assignments

1-2 hrs / week

Optional coding homework assignments where you implement evals from scratch. We provide all students with solutions and associated walkthroughs.

Testimonials

Hamel has provided exactly the tutorial I was needing for [evals], with a really thorough example case-study ... Hamel's content is fantastic, but it's a bit absurd that he's single-handedly having to make up for a lack of good materials about this topic across the rest of our industry!
Simon Willison
Creator of Datasette
Hamel is one of the most knowledgeable people about LLM evals. I've witnessed him improve AI products first-hand by guiding his clients carefully through the process. We've even made many improvements to LangSmith because of his work.
Harrison Chase
CEO, LangChain
Shreya and Hamel are legit. Through their work on dozens of use cases, they've encountered and successfully addressed many of the common challenges in LLM evals. Every time I seek their advice, I come away with greater clarity and insight on how to solve my eval challenges.
Eugene Yan
Senior Applied Scientist
Hamel and Shreya are technically goated, deeply experienced engineers of AI systems who just so happen to have impeccable vibes. I wouldn't learn this material from anyone else.
Charles Frye
Dev Advocate - Modal
When I have questions about the intersection of data and production AI systems, Shreya & Hamel are the first people I call. It's often the case that they've already written about my problem. You can’t find more qualified folks to teach this; anywhere.
Bryan Bischof
Director of Engineering, Hex
I was seeking help with LLM evaluation and testing for our products. Hamel's widely-referenced work on evals made him the clear choice. He helped us rethink our entire approach to LLM development and testing, creating a clear pathway to measure and improve our AI systems.
George Siemens
CEO, Matter & Space