4 Weeks
·Cohort-based Course
Learn proven approaches for quickly improving AI applications. Build AI that works better than the competition, regardless of the use-case.
This course is popular
50+ people enrolled last week.
Course overview
Do you catch yourself asking any of the following questions while building AI applications?
1. How do I test applications when the outputs are stochastic and require subjective judgements?
2. If I change the prompt, how do I know I'm not breaking something else?
3. Where should I focus my engineering efforts? Do I need to test everything?
4. What if I have no data or customers, where do I start?
5. What metrics should I track? What tools should I use? Which models are best?
6. Can I automate testing and evaluation? If so, how do I trust it?
If you aren't sure about the answers to these questions, this course is for you.
This is a hands-on course for engineers and technical PMs. Ideal for those who are comfortable coding or vibe coding.
---
WHAT TO EXPECT
This course will provide you with hands-on experience. Get ready to sweat through exercises, code, and data! We will meet twice a week for four weeks, with generous office hours (see the course schedule below).
We will also host a Discord community where you can communicate with us and each other. In return, you will be rewarded with skills that set you apart from the competition by a wide margin (see testimonials below). All sessions will be recorded and available to students asynchronously.
---
COURSE CONTENT
Lesson 1: Fundamentals & Lifecycle of LLM Application Evaluation
- Why evaluation matters for LLM applications - business impact and risk mitigation
- Challenges unique to evaluating LLM outputs - common failure modes and context-dependence
- The lifecycle approach from development to production
- Basic instrumentation and observability for tracking system behavior
- Introduction to error analysis and methods for categorizing failures
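For a taste of what basic instrumentation looks like in practice, here is a minimal sketch of logging each LLM call as a trace record that can later be annotated during error analysis. It is illustrative only: the file name and record fields are assumptions, not the specific tooling taught in the course.

    import json, time, uuid
    from datetime import datetime, timezone

    TRACE_FILE = "traces.jsonl"  # hypothetical local log; production setups use an observability backend

    def log_trace(inputs: dict, output: str, latency_s: float) -> None:
        """Append one LLM call as a trace record we can review during error analysis."""
        record = {
            "trace_id": str(uuid.uuid4()),
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "inputs": inputs,
            "output": output,
            "latency_s": round(latency_s, 3),
            "error_category": None,  # filled in later by a human reviewer
        }
        with open(TRACE_FILE, "a") as f:
            f.write(json.dumps(record) + "\n")

    start = time.time()
    answer = "You can request a refund within 14 days."  # placeholder for a real model call
    log_trace({"question": "What is your refund policy?"}, answer, time.time() - start)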
Lesson 2: Systematic Error Analysis
- Bootstrap data through effective synthetic data generation
- Annotation strategies and quantitative analysis of qualitative data
- Translating error findings into actionable improvements
- Avoiding common pitfalls in the analysis process
- Practical exercise: Building and iterating on an error tracking system
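As a small illustration of bootstrapping test data, the sketch below enumerates combinations of made-up dimensions for a hypothetical support bot; in practice an LLM would then turn each combination into a realistic query. The dimension names are assumptions for illustration only.

    from itertools import product

    # Hypothetical dimensions for a customer-support assistant; substitute your own.
    personas = ["new user", "power user", "frustrated customer"]
    features = ["billing", "password reset", "data export"]
    difficulties = ["simple", "ambiguous", "multi-part"]

    synthetic_prompts = []
    for persona, feature, difficulty in product(personas, features, difficulties):
        # Each tuple becomes an instruction you would hand to an LLM to write a realistic query.
        synthetic_prompts.append(
            f"Write a {difficulty} support question from a {persona} about {feature}."
        )

    print(len(synthetic_prompts), "synthetic test cases to generate")  # 27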
Lesson 3: Implementing Effective Evaluations
- Defining metrics using code-based and LLM-judge approaches
- Techniques for evaluating individual outputs and overall system performance
- Organizing datasets with proper structure for inputs and reference data
- Practical exercise: Building an automated evaluation pipeline
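To make the two metric styles concrete, here is a minimal, self-contained sketch: a deterministic code-based check alongside a stubbed LLM-judge check. The rule being checked and the function names are illustrative assumptions, not a prescribed implementation.

    import re

    def code_based_check(output: str) -> bool:
        """Deterministic assertion: the reply must not leak an internal URL."""
        return re.search(r"https://internal\.", output) is None

    def llm_judge_check(question: str, output: str) -> bool:
        """Stub for a binary LLM-as-judge metric: in practice this sends a rubric-based
        prompt to a judge model and parses a pass/fail verdict."""
        raise NotImplementedError("wire up your judge model here")

    dataset = [
        {"question": "How do I reset my password?", "output": "Go to Settings > Security and choose Reset."},
    ]
    passed = sum(code_based_check(row["output"]) for row in dataset)
    print(f"code-based pass rate: {passed / len(dataset):.0%}")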
Lesson 4: Collaborative Evaluation Practices
- Designing efficient team-based evaluation workflows
- Statistical methods for measuring inter-annotator agreement
- Techniques for building consensus on evaluation criteria
- Practical exercise: Collaborative alignment in breakout groups
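One of the statistical methods referenced above is Cohen's kappa. A minimal sketch, assuming scikit-learn is installed and using made-up labels:

    from sklearn.metrics import cohen_kappa_score

    # Two annotators labeling the same eight traces as pass/fail (illustrative labels).
    annotator_a = ["pass", "fail", "pass", "pass", "fail", "pass", "fail", "pass"]
    annotator_b = ["pass", "fail", "pass", "fail", "fail", "pass", "pass", "pass"]

    kappa = cohen_kappa_score(annotator_a, annotator_b)
    print(f"Cohen's kappa: {kappa:.2f}")  # roughly: below 0.2 is slight, 0.6-0.8 is substantial agreement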
Lesson 5: Architecture-Specific Evaluation Strategies
- Evaluating RAG systems for retrieval relevance and factual accuracy
- Testing multi-step pipelines to identify error propagation
- Assessing appropriate tool use and multi-turn conversation quality
- Multi-modal evaluation for text, image, and audio interactions
- Practical exercise: Creating targeted test suites for different architectures
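As one concrete example of a retrieval-relevance metric for RAG, here is a minimal recall@k sketch; it assumes you already know which document IDs are relevant for each query.

    def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int = 5) -> float:
        """Fraction of known-relevant documents that appear in the top-k retrieved results."""
        if not relevant_ids:
            return 0.0
        hits = len(set(retrieved_ids[:k]) & relevant_ids)
        return hits / len(relevant_ids)

    # Two of the three relevant chunks were retrieved in the top 5 -> 0.67
    print(round(recall_at_k(["d7", "d2", "d9", "d4", "d1"], {"d2", "d4", "d8"}, k=5), 2))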
Lesson 6: Production Monitoring & Continuous Evaluation
- Implementing traces, spans, and session tracking for observability
- Setting up automated evaluation gates in CI/CD pipelines
- Methods for consistent comparison across experiments
- Implementing safety and quality control guardrails
- Practical exercise: Designing an effective monitoring dashboard
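For a sense of what an automated evaluation gate can look like, here is a minimal CI-step sketch; the file name, field name, and threshold are assumptions you would adapt to your own pipeline.

    import json
    import sys

    THRESHOLD = 0.90  # assumed quality bar; tune for your application

    def main() -> None:
        # eval_results.jsonl is a hypothetical artifact from your evaluation run:
        # one JSON object per test case with a boolean "passed" field.
        with open("eval_results.jsonl") as f:
            results = [json.loads(line) for line in f]
        pass_rate = sum(r["passed"] for r in results) / len(results)
        print(f"pass rate: {pass_rate:.1%} (threshold {THRESHOLD:.0%})")
        if pass_rate < THRESHOLD:
            sys.exit(1)  # a non-zero exit fails the CI job and blocks the deploy

    if __name__ == "__main__":
        main()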
Lesson 7: Efficient Continuous Human Review Systems
- Strategic sampling approaches for maximizing review impact
- Optimizing interface design for reviewer productivity
- Practical exercise: Implementing a continuous feedback collection system
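Strategic sampling can be as simple as stratifying production traces so limited reviewer time covers every segment. A minimal sketch with assumed field names:

    import random
    from collections import defaultdict

    def stratified_sample(traces: list[dict], per_stratum: int = 5, key: str = "feature") -> list[dict]:
        """Sample a fixed number of traces per stratum (e.g. product feature) rather than
        a purely random slice dominated by the most common traffic."""
        by_stratum = defaultdict(list)
        for trace in traces:
            by_stratum[trace[key]].append(trace)
        sample = []
        for group in by_stratum.values():
            sample.extend(random.sample(group, min(per_stratum, len(group))))
        return sample

    traces = [{"feature": f, "output": "..."} for f in ["search"] * 90 + ["billing"] * 10]
    print(len(stratified_sample(traces)))  # 10: five from each feature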
Lesson 8: Cost Optimization
- Quantifying value versus expenditure in LLM applications
- Intelligent model routing based on query complexity
- Practical exercise: Optimizing a real-world application for cost efficiency
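To illustrate the routing idea, here is a toy heuristic router. Real systems often use a small classifier instead, and the model names below are placeholders, not recommendations.

    CHEAP_MODEL = "small-fast-model"        # placeholder model names
    EXPENSIVE_MODEL = "large-capable-model"

    def route(query: str) -> str:
        """Toy complexity heuristic: long or multi-question queries go to the larger model."""
        is_complex = len(query.split()) > 40 or query.count("?") > 1
        return EXPENSIVE_MODEL if is_complex else CHEAP_MODEL

    print(route("What is your refund policy?"))                            # small-fast-model
    print(route("Compare plans A and B? Which one is cheaper at scale?"))  # large-capable-model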
---
FREE CREDITS
Each student will get $1,000 of free Modal (https://modal.com/) compute credits.
---
GUEST SPEAKERS
1. Kwindla Kramer: Founder of pipecat, an OSS framework for voice and conversational AI. - How to eval voice agents.
2. JJ Allaire: Creator of OSS tool Inspect, an eval toolset used by Anthropic, Groq and other leading companies.
3. Eugene Yan: Principal Applied Scientist at Amazon. "An LLM‑as‑Judge Won't Save Your Product—Fixing Your Process Will"
4. Aarush Sah: Head of Evals at Groq
5. Benjamin Clavié: Creator of OSS project RAGatouille and rerankers: RAG mistakes & how to diagnose them.
6. Isaac Flath: Creator of MonsterUI and FastHTML core contributor - Building custom annotation tools for error analysis with FastHTML
7. Mikyo King: Creator of the OSS evals framework Phoenix.
8. Bryan Bischof: Head of AI @ Theory & formerly Hex, W&B and StitchFix. Failure funnels - an analytical framework for simplifying Agent evaluations.
9. Brooke Hopkins: Founder of Coval, simulation & evals for voice and chat agents
10. Alex Volkov: Developer Advocate @ Weights & Biases - "Reasoning models and LLM as a judge"
11. Aman Khan: AI PM @ Arize
01
Engineers & Technical PMs building AI applications who have little experience with machine learning or data science.
02
Those interested in moving beyond vibe-checks to data-driven measurements you can trust, even when outputs are stochastic or subjective.
03
Founders and leaders who are unsure of the failure modes of their AI applications and where to allocate resources.
Acquire the best tools for finding, diagnosing, and prioritizing AI errors.
We've tried all of them so you don't have to.
Learn how to bootstrap with synthetic data for testing before you have users
And how to best leverage data when you do have users.
Create a data flywheel for your applications that guarantees your AI will improve over time.
Data flywheels ensure you have examples to draw from for prompts, tests, and fine-tuning.
Automate parts of your AI evaluation with approaches that allow you to actually trust and rely on them.
How can you really trust LLM-as-a-judge? How should you design them? We will show you how. We will also show you how to refine prompts, generate metadata, and other tasks with the assistance of AI.
Ensure your AI is aligned to your preferences, tastes and judgement.
We will show you approaches to discover all the ways your AI is not performing as expected.
Avoid common mistakes we've seen across 35+ AI implementations.
There are an infinite number of things you can try, tests you can write, and data you can look at. We will show you a data-driven process that helps you prioritize the most important problems so you can avoid wasting time and money.
Hands-On Exercises, Examples and Code
We will provide end-to-end exercises, examples and code to make sure you come away with the skills you need. We will NOT just throw a bunch of slides at you!
Personalized Instruction
Generous office hours ensure that students can ask questions about their specific issues and interests.
20 interactive live sessions
Lifetime access to course materials
In-depth lessons
Direct access to instructor
Projects to apply learnings
Guided feedback & reflection
Private community of peers
Course certificate upon completion
Maven Satisfaction Guarantee
This course is backed by Maven’s guarantee. You can receive a full refund within 14 days after the course ends, provided you meet the completion criteria in our refund policy.
AI Evals For Engineers & PMs
Course schedule: live sessions and office hours run from May 20 through Jun 13.
Testimonials from: Eugene Yan, Harrison Chase, George Siemens, Jacob Carter, Anuj Bhalla, Emil Sedgh, Max Shaw, Bryan Bischof, Simon Willison, Ali Madad, and Charles Frye.
ML Engineer with 20 years of experience.
Hamel Husain is an ML Engineer with over 20 years of experience. He has worked with innovative companies such as Airbnb and GitHub, including early LLM research for code understanding that was used by OpenAI. He has also led and contributed to numerous popular open-source machine-learning tools. Hamel is currently an independent consultant helping companies build AI products.
ML Systems Researcher Making AI Evaluation Work in Practice
Shreya Shankar is an experienced ML Engineer who is currently a PhD candidate in computer science at UC Berkeley, where she builds systems that help people use AI to work with data effectively. Her research focuses on developing practical tools and frameworks for building reliable ML systems, with recent groundbreaking work on LLM evaluation and data quality. She has published influential papers on evaluating and aligning LLM systems, including "Who Validates the Validators?" which explores how to systematically align LLM evaluations with human preferences.
Prior to her PhD, Shreya worked as an ML engineer in industry and completed her BS and MS in computer science at Stanford. Her work appears in top data management and HCI venues including SIGMOD, VLDB, and UIST. She is currently supported by the NDSEG Fellowship and has collaborated extensively with major tech companies and startups to deploy her research in production environments. Her recent projects like DocETL and SPADE demonstrate her ability to bridge theoretical frameworks with practical implementations that help developers build more reliable AI systems.
Join an upcoming cohort
Cohort 1
$1,975
Dates
Payment Deadline
Don't miss out! Enrollment closes in 6 days
Active hands-on learning
This course builds on live workshops and hands-on projects
Interactive and project-based
You’ll be interacting with other learners through breakout rooms and project teams
Learn with a cohort of peers
Join a community of like-minded people who want to learn and grow alongside you