
With recent leaps in model capabilities, it is becoming clear that much of the AI evaluation work that previously required humans to build and run can be automated through agentic systems. This course teaches you how to build those systems, and where human judgment and input are still required.
The foundational work of designing eval frameworks, defining quality signals, and structuring failure analysis remains critical. But executing that work (running evaluations, scoring results, iterating on improvements, and surfacing what matters) can now be handled by agentic workflows that operate continuously and at scale.
This course equips you to build, test, and deploy automated evaluation systems using Claude Code. You will leave with working systems, not just conceptual understanding.
The focus is on trustworthy automation: systems that score their own output, flag uncertainty, and know when to surface decisions to a human rather than proceeding autonomously.
The foundational content on AI evals for product development is available free at aianalystlab.ai. If you are new to AI evals, start there.
Learn practical, hands-on methods to evaluate, measure, and improve AI products, using data science to make better product decisions.
Design multi-step workflows that execute your full evaluation pipeline without manual intervention
Structure agent prompts, tool access, and orchestration for reliable eval execution
Handle failure modes gracefully so workflows recover rather than break
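As a taste of the workflow-design material, here is a minimal Python sketch of a pipeline runner that recovers from step failures rather than breaking. All names and structures are illustrative, not the course's actual code:

```python
# Minimal sketch of a failure-tolerant eval pipeline runner.
# All names are illustrative; this is not the course's actual code.
import logging
from dataclasses import dataclass, field

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("eval_pipeline")

@dataclass
class StepResult:
    step: str
    ok: bool
    output: dict = field(default_factory=dict)
    error: str | None = None

def run_step(name, fn, payload):
    """Run one step; convert exceptions into a recoverable result."""
    try:
        return StepResult(step=name, ok=True, output=fn(payload))
    except Exception as exc:  # recover rather than break the workflow
        log.warning("step %s failed: %s", name, exc)
        return StepResult(step=name, ok=False, error=str(exc))

def run_pipeline(dataset, steps):
    """Execute steps in order; stop cleanly if one fails."""
    payload, results = {"dataset": dataset}, []
    for name, fn in steps:
        result = run_step(name, fn, payload)
        results.append(result)
        if not result.ok:
            break  # surface the failure instead of emitting garbage scores
        payload.update(result.output)
    return results
```

The key design choice is converting exceptions into data: the workflow can then log, retry, or escalate a failed step instead of crashing mid-run.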
Build scoring systems that assess evaluation results for statistical soundness, accuracy, and feasibility
Define rubrics your agents can apply consistently across hundreds of eval runs
Calibrate automated scores against human judgment to establish trust boundaries
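Calibration often starts with simple agreement statistics between the automated judge and human reviewers. A sketch, with the score data and thresholds purely illustrative:

```python
# Sketch: check agreement between automated and human rubric scores before
# trusting the automated judge at scale. All data here is illustrative.
from collections import Counter

def agreement_rate(auto, human):
    """Fraction of items where the automated score matches the human one."""
    return sum(a == h for a, h in zip(auto, human)) / len(human)

def cohens_kappa(auto, human):
    """Chance-corrected agreement across score categories."""
    n = len(human)
    observed = agreement_rate(auto, human)
    fa, fh = Counter(auto), Counter(human)
    expected = sum((fa[k] / n) * (fh[k] / n) for k in set(fa) | set(fh))
    return (observed - expected) / (1 - expected)

auto  = [5, 3, 3, 1, 5, 4, 2, 5]   # scores from the LLM judge
human = [5, 3, 2, 1, 5, 4, 2, 4]   # scores from human reviewers
print(f"agreement: {agreement_rate(auto, human):.2f}")
print(f"kappa:     {cohens_kappa(auto, human):.2f}")
```

A kappa well above chance (a common rule of thumb is 0.6 or higher, though the right bar is context-dependent) is a reasonable prerequisite before letting an automated judge run unattended.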
Build systems that evaluate results, identify deficiencies, generate improvements, and re-evaluate automatically
Set convergence criteria so the system knows when to stop iterating
Monitor loop behavior to detect degradation or drift over successive iterations
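One way to encode convergence criteria is a loop that stops when gains flatten or scores regress. A minimal sketch, where evaluate() and improve() stand in for agentic calls and every name and threshold is hypothetical:

```python
# Sketch of an evaluate -> improve -> re-evaluate loop with explicit stopping
# rules. evaluate() and improve() stand in for agentic calls; all names and
# thresholds are hypothetical.
def improvement_loop(system, evaluate, improve,
                     min_gain=0.01, patience=2, max_iters=10):
    history = [evaluate(system)]          # baseline score
    stalled = 0
    for _ in range(max_iters):
        system = improve(system, history[-1])
        score = evaluate(system)
        gain = score - history[-1]
        history.append(score)
        # Convergence: stop after `patience` consecutive sub-threshold gains.
        stalled = stalled + 1 if gain < min_gain else 0
        if stalled >= patience:
            break
        # Drift guard: dropping below the baseline means the loop is
        # degrading the system, not improving it.
        if score < history[0]:
            break
    return system, history
```

Returning the full score history, not just the final system, is what makes drift and degradation visible across iterations.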
Run dozens of evaluation experiments in parallel and cross-reference results for validity
Build experiment registries that track configurations, outcomes, and comparisons
Surface which approaches produce meaningful improvements versus noise
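An experiment registry can be as simple as an append-only log keyed by run id. One possible shape, where the file name and field names are assumptions for illustration:

```python
# Sketch of a flat-file experiment registry: one JSON record per run, so
# configurations and outcomes can be cross-referenced later. The file name
# and field names are assumptions for illustration.
import json
import time
import uuid
from pathlib import Path

REGISTRY = Path("experiments.jsonl")

def log_experiment(config: dict, metrics: dict) -> str:
    """Append one experiment record; returns its id for cross-referencing."""
    record = {
        "id": uuid.uuid4().hex[:8],
        "timestamp": time.time(),
        "config": config,
        "metrics": metrics,
    }
    with REGISTRY.open("a") as f:
        f.write(json.dumps(record) + "\n")
    return record["id"]

def best_runs(metric: str, top_n: int = 5) -> list[dict]:
    """Rank logged experiments by one metric to compare approaches."""
    if not REGISTRY.exists():
        return []
    records = [json.loads(line) for line in REGISTRY.read_text().splitlines()]
    records.sort(key=lambda r: r["metrics"].get(metric, 0), reverse=True)
    return records[:top_n]
```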
Connect automated eval systems to real product data, not just test fixtures
Define escalation paths for when automated evaluation surfaces ambiguous or high-stakes findings
Translate automated eval output into ship, iterate, pause, and rollback recommendations
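That last step, turning eval output into a recommendation, can be made explicit as a small decision function. The thresholds below are placeholders, not the course's recommended values:

```python
# Sketch: map eval summary statistics to a ship / iterate / pause / rollback
# recommendation, with an explicit escalation path for high-stakes findings.
# All thresholds are illustrative placeholders.
def recommend(pass_rate: float, regressed: bool, high_stakes_failures: int) -> str:
    if high_stakes_failures > 0:
        return "escalate"   # ambiguous or high-stakes: a human decides
    if regressed:
        return "rollback"   # worse than the version already shipped
    if pass_rate >= 0.95:
        return "ship"
    if pass_rate >= 0.80:
        return "iterate"
    return "pause"          # too unreliable to keep iterating blindly
```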

Shane Butler, Principal Data Scientist, AI Evaluations at Ontra

Product Managers: Want automated eval systems for your team? Learn what is possible, what is reliable, and where human oversight stays essential.
Data Scientists & Analysts: Spending more time running evals than learning from them? Build systems that handle execution so you can focus on decisions.
Engineers & ML Engineers: Eval bottlenecks slowing your release cycle? Automate repetitive work and ship with speed and confidence.

Live sessions
Learn directly from Shane Butler in a real-time, interactive format.
Lifetime access
Go back to course content and recordings whenever you need to.
Community of peers
Stay accountable and share insights with like-minded professionals.
Certificate of completion
Share your new skills with your employer or on LinkedIn.
Maven Guarantee
This course is backed by the Maven Guarantee. Students are eligible for a full refund up until the halfway point of the course.
Live sessions: 2 hrs / week
Async content: 3-5 hrs / week
Price: $2,000 USD
Cohorts: 2