Calibrate LLM-as-a-judge for Real-world Impact

Hosted by Eddie Landesberg

Fri, Feb 6, 2026

7:00 PM UTC (45 minutes)

154 students


Go deeper with a course

AI Evals and Analytics Playbook
Stella Liu and Amy Chen

What you'll learn

A new mental model for LLM-as-a-judge

Learn why LLM-as-a-judge is a noisy, biased signal rather than ground truth, and how to interpret eval results accordingly

Calibration as an AI eval design choice

Learn a calibration-first approach that uses limited human judgment to correct systematic errors in automated evaluators

Rethinking “cheap evals” vs. decision risk

Rethink low-cost eval shortcuts and design eval pipelines that better reflect real-world impact, risk, and decisions

Why this topic matters

LLM-as-a-judge is widely used as a low-cost proxy for human or business ground truth, but uncalibrated judge scores can be statistically misleading, even reversing model rankings. This creates real production risk. In this session, Eddie introduces a calibration method to better align LLM-as-a-judge with human judgment and real-world decisions.
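The session's specific calibration method isn't detailed on this page. As a rough illustration of the general idea, here is a minimal Python sketch that assumes the simplest possible correction: fit a linear mapping from judge scores to human judgments on a small labeled sample, then apply it to the rest of the judge's scores. All function names and data below are hypothetical.

```python
# Minimal sketch (not the session's actual method): correct a judge's
# systematic bias using a small human-labeled sample via least squares.
# All data is synthetic and illustrative.

def fit_linear_calibration(judge, human):
    """Least-squares fit of human ~= a * judge + b on the labeled subset."""
    n = len(judge)
    mj = sum(judge) / n
    mh = sum(human) / n
    cov = sum((j - mj) * (h - mh) for j, h in zip(judge, human))
    var = sum((j - mj) ** 2 for j in judge)
    a = cov / var          # slope: rescales the judge's spread
    b = mh - a * mj        # intercept: removes the judge's offset
    return a, b

def calibrate(scores, a, b):
    """Apply the fitted correction to unlabeled judge scores."""
    return [a * s + b for s in scores]

# Hypothetical example: the judge systematically inflates scores by 0.2.
labeled_judge = [0.9, 0.8, 0.7, 0.6]   # judge scores on human-labeled items
labeled_human = [0.7, 0.6, 0.5, 0.4]   # matching human judgments
a, b = fit_linear_calibration(labeled_judge, labeled_human)

raw_scores = [0.85, 0.75, 0.65]        # judge scores on unlabeled items
print(calibrate(raw_scores, a, b))     # systematic inflation removed
```

In practice the labeled sample is small, so a low-variance parametric correction like this (or isotonic regression when the bias is nonlinear) is a common trade-off between label cost and calibration quality.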

You'll learn from

Eddie Landesberg

Founder of CIMO Labs

Experienced research scientist and software engineer focused on causal evaluation for AI systems.
