Calibrate LLM-as-a-judge for Real-world Impact
Hosted by Eddie Landesberg
In this video
What you'll learn
A new mental model for LLM-as-a-judge
Learn why LLM-as-a-judge is a noisy, biased signal rather than ground truth, and how to interpret eval results accordingly
Calibration as an AI eval design choice
Learn a calibration-first approach that uses limited human judgment to correct systematic errors in automated evaluators
Rethinking “cheap evals” vs. decision risk
Rethink low-cost eval shortcuts and design eval pipelines that better reflect real-world impact, risk, and decisions
Why this topic matters
LLM-as-a-judge is widely used as a low-cost proxy for human or business ground truth, but uncalibrated judge scores can be statistically misleading, even reversing model rankings. This creates real production risk.
In this session, Eddie introduces a calibration method that better aligns LLM-as-a-judge scores with human judgment and real-world decisions.
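To make the ranking-reversal point concrete, here is a minimal sketch (not the method presented in the session) of one common calibration idea: use a small human-labeled subset to fit a simple correction for each model's judge scores, then compare calibrated rather than raw estimates. All data, the linear fit, and the per-model calibration choice below are illustrative assumptions.

```python
# Minimal illustration: a biased LLM judge can rank model A above model B,
# while scores calibrated against a small human-labeled sample flip the ranking.
# Toy data and a simple linear calibration; purely illustrative.
import numpy as np

rng = np.random.default_rng(0)

def fit_linear_calibration(judge_scores, human_scores):
    """Least-squares map from raw judge scores to human scores."""
    slope, intercept = np.polyfit(judge_scores, human_scores, deg=1)
    return lambda s: slope * np.asarray(s) + intercept

# Raw judge scores for two candidate models on the same eval set (toy data).
judge_a = rng.normal(0.72, 0.05, size=500)  # judge systematically inflates A
judge_b = rng.normal(0.70, 0.05, size=500)

# Small human-labeled subset: humans rate a sample of each model's outputs.
human_a = judge_a[:50] - 0.08 + rng.normal(0, 0.02, 50)  # A is overrated by the judge
human_b = judge_b[:50] + 0.02 + rng.normal(0, 0.02, 50)

# Fit one calibration map per model, since judge bias can differ by model or style.
cal_a = fit_linear_calibration(judge_a[:50], human_a)
cal_b = fit_linear_calibration(judge_b[:50], human_b)

print("raw judge means:      A=%.3f  B=%.3f" % (judge_a.mean(), judge_b.mean()))
print("calibrated estimates: A=%.3f  B=%.3f" % (cal_a(judge_a).mean(), cal_b(judge_b).mean()))
# Raw scores favor A; once corrected toward human judgment, B comes out ahead.
```

The sketch only shows why calibration can change a shipping decision; the session covers how to design the human-labeling and calibration step so the correction is trustworthy.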
You'll learn from
Eddie Landesberg
Founder of CIMO Labs
Experienced research scientist and software engineer focused on causal evaluation for AI systems.
Go deeper with a course
AI Evals and Analytics Playbook
Stella Liu and Amy Chen
Head of AI Applied Science. Cofounder at AI Evals & Analytics