Consistently Improve Any AI Application With Evals

New · 4 Weeks · Cohort-based Course

Learn a systematic approach for improving AI applications, regardless of the domain or use case.

This course is popular: 3 people enrolled last week.

Previously at GitHub, Airbnb, and Google

Course overview

Eliminate the guesswork of building AI applications with data-driven approaches.

Do you catch yourself asking any of the following questions while building AI applications?


1. How do I test applications when the outputs are stochastic and require subjective judgements?


2. If I change the prompt, how do I know I'm not breaking something else?


3. Where should I focus my engineering efforts? Do I need to test everything?


4. What if I have no data or customers? Where do I start?


5. What metrics should I track? What tools should I use? Which models are best?


6. Can I automate testing and evaluation? If so, how do I trust it?


If you aren't sure about the answers to these questions, this course is for you.


---

WHAT TO EXPECT


This course will provide you with hands-on experience. Get ready to sweat through exercises, code, and data! We will meet twice a week for four weeks, with generous office hours (see the course schedule below).


We will also host a Discord community where you can communicate with us and each other. In return, you will gain skills that set you apart from the competition by a wide margin (see the testimonials below).


The first three weeks will be dedicated to the processes and tools surrounding evals, including extremely challenging cases. The fourth week will feature case studies from several companies we've helped with evals, along with guest speakers. All sessions will be recorded and available to students asynchronously.


---

COURSE CONTENT

You will learn the following:


- How to evaluate subjective outputs and dynamic scenarios like multi-turn conversations

- How to generate synthetic data before you have any users

- How to conduct error analysis and use the results to prioritize your efforts

- How to ensure evaluations are trusted and aligned with your goals

- How to automate parts of your evaluation workflow

- How to select the right tools and infrastructure, including our favorite tools

- How to create a data flywheel for continual improvement

- How to avoid common pitfalls while building evals


---

FREE CREDITS

Each student will receive the following compute credits:


- Modal (https://modal.com/): $1,000

Who this course is for

1. Engineers building AI applications who have little experience with machine learning or data science.

2. Those interested in moving beyond vibe-checks to data-driven measurements you can trust, even when outputs are stochastic or subjective.

3. Founders and leaders who are unsure of the failure modes of their AI applications and where to allocate resources.

What you’ll get out of this course

Acquire the best tools for finding, diagnosing, and prioritizing AI errors.

We've tried all of them so you don't have to.

Learn how to bootstrap with synthetic data for testing before you have users.

And how to best leverage data when you do have users.
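
To make this concrete, here is a minimal sketch of the bootstrap idea (our illustration for this page, not the exact code from class; `call_llm`, the personas, and the scenarios are all hypothetical placeholders): cross user personas with scenarios and ask a model to write one realistic query per combination.

```python
# Minimal sketch: generate synthetic test queries by crossing personas
# with scenarios. call_llm is a hypothetical stand-in for your model client.
from itertools import product

def call_llm(prompt: str) -> str:
    # Placeholder: swap in your real model client here.
    return f"(model output for: {prompt[:40]}...)"

PERSONAS = ["first-time user", "power user", "frustrated customer"]  # hypothetical
SCENARIOS = ["billing question", "feature request", "bug report"]    # hypothetical

def generate_synthetic_queries() -> list[str]:
    """One realistic query per persona-scenario pair."""
    queries = []
    for persona, scenario in product(PERSONAS, SCENARIOS):
        prompt = (
            f"Write one realistic message from a {persona} "
            f"about a {scenario}. Return only the message."
        )
        queries.append(call_llm(prompt))
    return queries
```

Even a small grid like this yields enough varied inputs to start error analysis before you have a single user.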

Create a data flywheel for your applications so that your AI keeps improving over time.

Data flywheels ensure you have examples to draw from for prompts, tests, and fine-tuning.
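
Mechanically, the flywheel can start as simply as persisting every human-reviewed trace so it can be reused later. A minimal sketch, under our own assumptions (the file name and schema are hypothetical):

```python
# Minimal sketch: save reviewed production traces; failures become
# regression tests, passes become few-shot candidates.
import json
from pathlib import Path

FLYWHEEL_PATH = Path("reviewed_traces.jsonl")  # hypothetical file name

def record_reviewed_trace(user_input: str, output: str, passed: bool, note: str = "") -> None:
    """Append a human-reviewed trace to a growing dataset."""
    row = {"input": user_input, "output": output, "passed": passed, "note": note}
    with FLYWHEEL_PATH.open("a") as f:
        f.write(json.dumps(row) + "\n")

def load_regression_cases() -> list[dict]:
    """Traces that failed review are natural regression-test inputs."""
    rows = [json.loads(line) for line in FLYWHEEL_PATH.open()]
    return [r for r in rows if not r["passed"]]
```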

Automate parts of your AI evaluation with approaches that allow you to actually trust and rely on them.

How can you really trust LLM-as-a-judge? How should you design one? We will show you how. We will also show you how to refine prompts, generate metadata, and handle other tasks with the assistance of AI.
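
One pattern we like, sketched here with hypothetical data rather than as the definitive method: before trusting a judge on unlabeled traces, measure its agreement against a held-out set of human labels.

```python
# Minimal sketch: only rely on an LLM judge after checking its agreement
# with human labels on a held-out set. Verdicts are booleans (pass/fail).

def judge_agreement(judge_verdicts: list[bool], human_labels: list[bool]) -> float:
    """Fraction of examples where the judge matches the human label."""
    assert len(judge_verdicts) == len(human_labels)
    matches = sum(j == h for j, h in zip(judge_verdicts, human_labels))
    return matches / len(human_labels)

# Hypothetical example: the judge matches humans on 8 of 10 traces.
agreement = judge_agreement([True] * 8 + [False] * 2, [True] * 10)
print(f"{agreement:.0%}")  # 80% -- below your bar? Refine the judge prompt and re-check.
```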

Ensure your AI is aligned to your preferences, tastes and judgement.

We will show you approaches to discover all the ways your AI is not performing as expected.

Avoid common mistakes we've seen across 35+ AI implementations.

There are an infinite number of things you can try, tests you can write, and data you can look at. We will show you a data-driven process that helps you prioritize the most important problems so you can avoid wasting time and money.
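
The simplest version of that prioritization looks like this (a hedged sketch; the failure-mode names are invented for illustration): tag each trace during error analysis, then tally the tags so the most frequent failures rise to the top.

```python
# Minimal sketch: tally failure modes from annotated traces so you fix
# the most common problems first. Failure-mode names are hypothetical.
from collections import Counter

annotated_traces = [
    {"id": 1, "failure": "hallucinated_citation"},
    {"id": 2, "failure": "ignored_user_constraint"},
    {"id": 3, "failure": "hallucinated_citation"},
    {"id": 4, "failure": None},  # trace with no observed failure
]

counts = Counter(t["failure"] for t in annotated_traces if t["failure"])
for failure_mode, n in counts.most_common():
    print(f"{failure_mode}: {n}")
# hallucinated_citation: 2
# ignored_user_constraint: 1
```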

Hands-On Exercises, Examples and Code

We will provide end-to-end exercises, examples and code to make sure you come away with the skills you need. We will NOT just throw a bunch of slides at you!

Personalized Instruction

Generous office hours ensure students can ask questions about their specific issues and interests.

This course includes

16 interactive live sessions

Lifetime access to course materials

In-depth lessons

Direct access to instructor

Projects to apply learnings

Guided feedback & reflection

Private community of peers

Course certificate upon completion

Maven Satisfaction Guarantee

This course is backed by Maven’s guarantee. You can receive a full refund within 14 days after the course ends, provided you meet the completion criteria in our refund policy.

Course syllabus

Week 1: May 19—May 25

- Lesson 1: Set Up Your AI App & Initial Testing. Tue 5/20, 5:00 PM—6:00 PM (UTC)

- Lesson 2: Generate Synthetic Data For Testing. Thu 5/22, 5:00 PM—5:45 PM (UTC)

- Optional: Office Hours. Fri 5/23, 5:00 PM—5:30 PM (UTC)

Week 2: May 26—Jun 1

- Lesson 3: Performing Error Analysis. Tue 5/27, 5:00 PM—5:45 PM (UTC)

- Lesson 4: Improving Your AI's Performance. Thu 5/29, 5:00 PM—5:45 PM (UTC)

- Optional: Office Hours. Fri 5/30, 5:00 PM—5:45 PM (UTC)

- Optional: Office Hours. Sat 5/31, 12:00 AM—12:30 AM (UTC)

Week 3: Jun 2—Jun 8

- Optional: Office Hours. Mon 6/2, 5:00 PM—5:30 PM (UTC)

- Lesson 5: Advanced Evaluation & Debugging. Tue 6/3, 5:00 PM—6:00 PM (UTC)

- Lesson 6: Eval Tools Overview: logging, annotation, testing, and emerging workflows, plus how to select the right tools for you. Thu 6/5, 5:00 PM—6:00 PM (UTC)

- Optional: Office Hours. Fri 6/6, 5:00 PM—5:30 PM (UTC)

- Optional: Office Hours. Sat 6/7, 12:00 AM—12:30 AM (UTC)

Week 4: Jun 9—Jun 13

- Lesson 7: Applied Case Studies & Examples. Tue 6/10, 5:00 PM—6:30 PM (UTC)

- Special Series: Evaluating RAG #1. Tentatively Wed 6/11, 5:00 PM—5:45 PM (UTC); time/date TBD

- Special Series: Evaluating RAG #2. Tentatively Fri 6/13, 12:00 AM—12:30 AM (UTC); time/date TBD

- Optional: Office Hours. Fri 6/13, 6:00 PM—6:45 PM (UTC)

What people are saying

Shreya and Hamel are legit. Through their work on dozens of use cases, they've encountered and successfully addressed many of the common challenges in LLM evals. Every time I seek their advice, I come away with greater clarity and insight on how to solve my eval challenges.

Eugene Yan, Senior Applied Scientist

Hamel is one of the most knowledgeable people about LLM evals. I've witnessed him improve AI products first-hand by guiding his clients carefully through the process. We've even made many improvements to LangSmith because of his work.

Harrison Chase, CEO, LangChain

I was seeking help with LLM evaluation and testing for our products. Hamel's widely-referenced work on evals made him the clear choice. He helped us rethink our entire approach to LLM development and testing, creating a clear pathway to measure and improve our AI systems.

George Siemens, CEO, Matter & Space

My session with Hamel was incredibly helpful. He broke down LLM evaluations step by step and made everything practical. He’s a true expert and a fantastic teacher!

Max Shaw, CEO, Windmill

When I have questions about the intersection of data and production AI systems, Shreya & Hamel are the first people I call. It's often the case that they've already written about my problem. You can’t find more qualified folks to teach this, anywhere.

Bryan Bischof, Director of Engineering, Hex

We have had the chance to work with Hamel and it’s been a very successful partnership for us. He has massive technical knowledge and is also a high-level executive who can help you at all different levels. Couldn’t have asked for a better partner.

Emil Sedgh, CTO, Rechat

Hamel & team allowed ServiceMob to save hundreds of hours of engineering time by showing us the best tools, techniques, and processes. We shipped industry-leading AI in a few weeks instead of months, and kept shipping thereafter thanks to how his team up-skilled our company.

Anuj Bhalla, CEO, ServiceMob

After talking to Hamel, I learned I was being too abstract in trying to evaluate GiveCare, using top-down frameworks instead of starting with real caregiver interactions. Now I see more value in a bottom-up approach: review user sessions, identify issues, and build tests around actual data.

Ali Madad, Founder, GiveCare

Hamel and Shreya are technically goated, deeply experienced engineers of AI systems who just so happen to have impeccable vibes. I wouldn't learn this material from anyone else.

Charles Frye, Dev Advocate, Modal

Meet your instructors

Hamel Husain

ML Engineer with 20 years of experience.

Hamel Husain is an ML engineer with over 20 years of experience. He has worked at innovative companies such as Airbnb and GitHub, where his work included early LLM research for code understanding that was used by OpenAI. He has also led and contributed to numerous popular open-source machine-learning tools. Hamel is currently an independent consultant helping companies operationalize Large Language Models (LLMs).

Shreya Shankar

ML Systems Researcher Making AI Evaluation Work in Practice

Shreya Shankar is a PhD student in computer science at UC Berkeley, where she builds systems that help people use AI to work with data effectively. Her research focuses on developing practical tools and frameworks for building reliable ML systems, with recent groundbreaking work on LLM evaluation and data quality. She has published influential papers on evaluating and aligning LLM systems, including "Who Validates the Validators?" which explores how to systematically align LLM evaluations with human preferences.

Prior to her PhD, Shreya worked as an ML engineer in industry and completed her BS and MS in computer science at Stanford. Her work appears in top data management and HCI venues including SIGMOD, VLDB, and UIST. She is currently supported by the NDSEG Fellowship and has collaborated extensively with major tech companies and startups to deploy her research in production environments. Her recent projects like DocETL and SPADE demonstrate her ability to bridge theoretical frameworks with practical implementations that help developers build more reliable AI systems.


Join an upcoming cohort

Consistently Improve Any AI Application With Evals

Cohort 1

$1,775

Dates: May 19—June 14, 2025

Payment Deadline: May 18, 2025

Learning is better with cohorts

Active hands-on learning

This course builds on live workshops and hands-on projects.

Interactive and project-based

You’ll be interacting with other learners through breakout rooms and project teams.

Learn with a cohort of peers

Join a community of like-minded people who want to learn and grow alongside you.
