Automate AI Evals with Claude Code

Shane Butler

Principal Data Scientist, AI Evaluations

Build the agentic systems that automate your AI evaluation workflow from end to end.

With the recent leaps in model capabilities, it is becoming clear that much of the AI evaluation work that previously required humans to build and run can be automated through agentic systems. This course teaches you how to build those systems, and where human judgment and input are still required.

The foundational work of designing eval frameworks, defining quality signals, and structuring failure analysis remains critical. But the execution of that work (running evaluations, scoring results, iterating on improvements, and surfacing what matters) can now be handled by agentic workflows that operate continuously and at scale.

This course equips you with the capability to build, test, and deploy automated evaluation systems using Claude Code. You will leave with working systems, not just conceptual understanding.

The focus is on automation that is trustworthy: systems that score their own output, flag uncertainty, and know when to surface decisions to a human rather than proceeding autonomously.
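As a rough illustration of that principle, here is a minimal sketch of a grader result that carries a confidence value and is routed to a human when confidence is low or the stakes are high. All names and thresholds here are hypothetical, not taken from the course:

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    example_id: str
    score: float        # 0.0-1.0 rubric score from the automated grader
    confidence: float   # grader's self-reported confidence in that score
    high_stakes: bool   # e.g. the example touches billing or safety content

def route(result: EvalResult, min_confidence: float = 0.8) -> str:
    """Decide whether a graded result can proceed autonomously
    or must be surfaced to a human reviewer."""
    if result.high_stakes or result.confidence < min_confidence:
        return "escalate"   # surface the decision to a human
    return "auto"           # trustworthy enough to proceed

# A confident, low-stakes result proceeds; an uncertain one is escalated.
print(route(EvalResult("ex-1", 0.92, 0.95, False)))  # auto
print(route(EvalResult("ex-2", 0.40, 0.55, False)))  # escalate
```

The design choice worth noting is that escalation is driven by uncertainty and stakes, not by the score itself: a confident low score can proceed autonomously (as a recorded failure), while an uncertain high score still goes to a human.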

The foundational content on AI evals for product development is available free at aianalystlab.ai. If you are new to AI evals, start there.

What you’ll learn

Learn practical, hands-on methods for automating AI evaluation with agentic systems, so you spend your time on product decisions rather than execution.

  • Design multi-step workflows that execute your full evaluation pipeline without manual intervention

  • Structure agent prompts, tool access, and orchestration for reliable eval execution

  • Handle failure modes gracefully so workflows recover rather than break

  • Build scoring systems that assess evaluation results for statistical soundness, accuracy, and feasibility

  • Define rubrics your agents can apply consistently across hundreds of eval runs

  • Calibrate automated scores against human judgment to establish trust boundaries

  • Build systems that evaluate results, identify deficiencies, generate improvements, and re-evaluate automatically

  • Set convergence criteria so the system knows when to stop iterating

  • Monitor loop behavior to detect degradation or drift over successive iterations

  • Run dozens of evaluation experiments in parallel and cross-reference results for validity

  • Build experiment registries that track configurations, outcomes, and comparisons

  • Surface which approaches produce meaningful improvements versus noise

  • Connect automated eval systems to real product data, not just test fixtures

  • Define escalation paths for when automated evaluation surfaces ambiguous or high-stakes findings

  • Translate automated eval output into ship, iterate, pause, and rollback recommendations
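The evaluate, improve, and re-evaluate loop with a convergence criterion described above can be sketched roughly as follows. Function names and thresholds are hypothetical placeholders; the course builds the real versions with Claude Code:

```python
def improvement_loop(run_evals, improve, max_iters=10, min_gain=0.01):
    """Evaluate, generate improvements, and re-evaluate until the
    score gain per iteration falls below min_gain (convergence) or
    max_iters is reached. The history supports drift monitoring."""
    history = [run_evals()]                 # baseline score
    for _ in range(max_iters):
        improve(history[-1])                # act on identified deficiencies
        history.append(run_evals())         # re-evaluate automatically
        if history[-1] - history[-2] < min_gain:
            break                           # converged: stop iterating
    return history

# Toy example: scores that improve, then plateau.
scores = iter([0.60, 0.72, 0.80, 0.805, 0.806])
hist = improvement_loop(lambda: next(scores), lambda s: None)
print(hist)  # the loop stops once the per-iteration gain drops below 0.01
```

Setting an explicit `min_gain` is what keeps the system from iterating forever on noise; monitoring `history` over successive runs is how degradation or drift would be detected.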

Learn directly from Shane

Shane Butler

Principal Data Scientist, AI Evaluations at Ontra

Previously at Stripe, Nextdoor, PwC India, and AppFolio

Who this course is for

  • Product Managers: Want automated eval systems for your team? Learn what is possible, reliable, and where human oversight stays essential.

  • Data Scientists & Analysts: Spending more time running evals than learning? Build systems that handle execution so you focus on decisions.

  • Engineers and ML Engineers: Eval bottlenecks slowing your release cycle? Automate repetitive work and ship with speed and confidence.

What's included


Live sessions

Learn directly from Shane Butler in a real-time, interactive format.

Lifetime access

Go back to course content and recordings whenever you need to.

Community of peers

Stay accountable and share insights with like-minded professionals.

Certificate of completion

Share your new skills with your employer or on LinkedIn.

Maven Guarantee

This course is backed by the Maven Guarantee. Students are eligible for a full refund up until the halfway point of the course.

Course syllabus

Week 1

May 18—May 24
    Nothing scheduled for this week

Week 2

May 25—May 29
    Nothing scheduled for this week

Free resources

Schedule

Live sessions

2 hrs / week

Async content

3-5 hrs / week


$2,000 USD

May 18—May 30 · 2 cohorts