Scaling Judge-Time Compute for Robust Auto LLM Evaluation

Hosted by Jason Liu and Leonard Tang

482 students

What you'll learn

LLM Judge Reliability Issues

Identify key failure modes in automated LLM evaluation including bias and inconsistent outputs

Judge-Time Compute Scaling

Apply reasoning model training techniques to improve evaluation reliability and accuracy

RL-Powered Evaluation Systems

Implement reinforcement learning methods to build more robust automated assessment tools

Why this topic matters

Reliable LLM evaluation is crucial for AI safety and quality in production systems. Poor judges waste resources and create unsafe deployments. Mastering robust evaluation techniques positions you as essential for companies deploying AI at scale - a rapidly growing field where quality assurance expertise is highly valued.

You'll learn from

Jason Liu

Consultant at the intersection of Information Retrieval and AI

Jason has built search and recommendation systems for the past 6 years. He has consulted and advised a dozens startups in the last year to improve their RAG systems. He is the creator of the Instructor Python library.

Leonard Tang

Co-Founder & CEO @ Haize Labs

Leonard Tang is the Co-Founder and CEO of Haize Labs, he works on solving the ultimate extant problem in AI: ensuring its robustness, quality, and alignment for any application. Prior to this, his research covered adversarial robustness, mathematical reasoning pitfalls, computational neuroscience, interpretability, and language models. Leonard dropped out of , before starting, a Stanford PhD in computer science to pursue Haize Labs.

worked with