Setting up your first AI eval with an LLM-as-judge

Free Lesson

Part of The AI Evaluation Handbook

Hosted by Madalina Turlea and Catalina Turlea

71 students

In this video

What you'll learn

Most common mistakes to avoid when building an LLM-as-judge

Understand why most teams' LLM judges don't work and the specific mistakes that make them unreliable.

How to write your judge instructions

Learn how to identify what to check with an LLM-as-judge, define specific rules, and build the judge prompt.

How to evaluate your LLM-as-judge

Know how to evaluate the evaluator by comparing judge scores to human labels and decide if you can trust the results.
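
As a rough illustration of comparing judge scores to human labels, here is a minimal Python sketch. It assumes binary pass/fail labels and a hypothetical helper named judge_agreement; neither the function nor the sample data comes from the lesson, and a real setup may use different metrics.

```python
from collections import Counter


def judge_agreement(human_labels, judge_labels):
    """Compare binary pass/fail labels from a human reviewer and an LLM judge.

    Hypothetical helper for illustration; True means "pass", False means "fail".
    """
    if len(human_labels) != len(judge_labels):
        raise ValueError("label lists must have the same length")
    if not human_labels:
        raise ValueError("need at least one labeled example")

    # Count each (human, judge) label pair.
    counts = Counter(zip(human_labels, judge_labels))
    tp = counts[(True, True)]    # both say pass
    tn = counts[(False, False)]  # both say fail
    fp = counts[(False, True)]   # judge passes what the human failed
    fn = counts[(True, False)]   # judge fails what the human passed

    return {
        "agreement": (tp + tn) / len(human_labels),
        # Share of human "pass" cases the judge also passes.
        "true_positive_rate": tp / (tp + fn) if (tp + fn) else None,
        # Share of human "fail" cases the judge also fails.
        "true_negative_rate": tn / (tn + fp) if (tn + fp) else None,
    }


# Made-up labels, for illustration only.
human = [True, True, False, False, True, False]
judge = [True, False, False, True, True, False]
metrics = judge_agreement(human, judge)
print(metrics)
```

Reporting the pass and fail rates separately, rather than only overall agreement, helps when failures are rare: a judge that passes everything can still show high raw agreement while missing every real error.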

Why this topic matters

Most teams building an LLM-as-judge make the same mistakes. They skip error analysis and ask the judge to look for hypothetical errors instead of real ones. They pack multiple criteria into one judge, creating a high cognitive load that produces unreliable scores. They never validate the judge against human labels, so they don't know whether its scores are accurate or just noise.
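
One way to avoid packing multiple criteria into one judge is to build a separate prompt per criterion. The sketch below is a minimal Python illustration under stated assumptions: the Criterion class, the build_judge_prompt function, the prompt wording, and the two example criteria are all hypothetical, not taken from the lesson, and the prompts would still need to be sent to a model of your choice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One narrowly scoped check, ideally derived from real observed errors."""
    name: str
    rule: str


def build_judge_prompt(criterion, user_input, model_output):
    """Build a judge prompt that checks exactly one criterion (hypothetical wording)."""
    return (
        f"You are checking one thing only: {criterion.name}.\n"
        f"Rule: {criterion.rule}\n\n"
        f"User input:\n{user_input}\n\n"
        f"Model output:\n{model_output}\n\n"
        "Answer with exactly one word: PASS or FAIL."
    )


# Example criteria, invented for illustration.
CRITERIA = [
    Criterion("currency format", "Every amount must include a currency code."),
    Criterion("no legal advice", "The output must not recommend a legal action."),
]

# One prompt per criterion, instead of a single prompt covering everything.
prompts = {
    c.name: build_judge_prompt(c, "How much is the fee?", "The fee is 12 EUR.")
    for c in CRITERIA
}
for name, prompt in prompts.items():
    print(f"--- {name} ---\n{prompt}\n")
```

Keeping each judge to a single pass/fail rule also makes it easier to compare its verdicts with human labels, one criterion at a time.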

You'll learn from

Madalina Turlea

Co-founder @Lovelaice, 10+ years in Product

I'm co-founder of Lovelaice and a product leader with 10+ years building products across fintech, payments, and compliance. I hold a CFA charter and have led AI product development in highly regulated environments — where AI failures aren't just embarrassing, they're liabilities.

I've watched smart teams make the same mistakes: choosing models based on benchmarks that don't reflect their use case, writing prompts that work in testing but fail in production, and leaving domain experts out of the loop. These aren't edge cases — they're why 80% of AI projects underperform.

Through these failures (my own included), I developed a systematic approach to AI experimentation that puts domain expertise at the center. I teach what I've learned building Lovelaice: how to test, evaluate, and iterate on AI — before it reaches your users.

Catalina Turlea

Founder @Lovelaice

I bring over 14 years of software development expertise and a decade of startup experience to help teams build AI products that actually work. After founding my first company six years ago, I now run a consultancy that helps startups build MVPs, solve complex technical challenges, and integrate AI effectively.

I've seen firsthand how AI projects fail for lack of systematic experimentation: teams treat AI like traditional software and struggle with inconsistent results. That's why I co-created Lovelaice, a platform designed for non-technical professionals to experiment with AI agents systematically.

Go deeper with a course

Build and evaluate your first AI feature