Evals Are Almost Never All You Need

Hosted by Patrick Hall

142 students

What you'll learn

Common problems with benchmarks and evals

Understand how task contamination, conflicts of interest, and lack of scientific measurement can compromise evals.

What to do instead of benchmarks and evals

Learn about embedding-based approaches, red-teaming, and field testing for evaluating generative AI systems.

When to use benchmarks and evals

Developers need benchmarks. They're an important tool for development. They're the wrong tool for real-world assessment.

Why this topic matters

Generative AI is important, and it has tangible real-world impacts. Let's measure those impacts directly instead of guessing at them with ill-suited benchmarks. Many technical and socio-technical measurement approaches are better suited than benchmarks to assessing generative AI in the real world. This lightning session introduces those approaches.

You'll learn from

Patrick Hall

Principal Scientist, HallResearch.ai

Patrick Hall is principal scientist at Hall Research. He is also teaching faculty at the George Washington University (GWU) School of Business, offering data ethics, business analytics, and machine learning classes to graduate and undergraduate students. Patrick conducts research in support of NIST's AI Risk Management Framework, works with leading fair lending and AI risk management advisory firms, and serves on the board of directors for the AI Incident Database.


Prior to co-founding Hall Research, Patrick was a founding partner at BNH.AI, where he pioneered the emergent discipline of auditing and red-teaming generative AI systems. He also led H2O.ai's efforts in the development of responsible AI products, resulting in one of the world's first commercial applications for explainability and bias management in machine learning.


Patrick has been invited to speak on AI and machine learning topics at the National Academies, the Association for Computing Machinery SIG-KDD Conference ("KDD"), and the American Statistical Association Joint Statistical Meetings. His expertise has been sought by the New York Times and NPR. He has been published in outlets like Information, Frontiers in AI, McKinsey.com, O'Reilly Media, and Thomson Reuters Regulatory Intelligence, and his technical work has been profiled in Fortune, WIRED, InfoWorld, TechCrunch, and others. Patrick is the lead author of the book Machine Learning for High-Risk Applications.