Modern Information Retrieval Evaluation In The RAG Era

Hosted by Nandan Thakur, Hamel Husain, and Shreya Shankar

5,205 students

What you'll learn

Traditional Retrieval Evaluations Are Stale

Why isn't the model that tops every leaderboard necessarily the one you should use? Learn the pitfalls of stale benchmarks.

Rigorous Academic Evaluations Still Power Real-World Evals

Understand why, despite benchmarks being flawed, rigorous academic evaluations are more relevant than ever.

Evaluation Research Is Evolving To Meet New Needs

Hear about the new methods researchers are creating to construct evaluations that match real-world needs.

Why this topic matters

Traditional IR benchmarks fall short for real-world RAG applications due to stale data, incomplete labels, and unrealistic queries. This talk introduces FreshStack, a new benchmark built from recent StackOverflow and GitHub content, designed to reflect real programming queries.

You'll learn from

Nandan Thakur

RAG researcher @ UWaterloo. Creator of BEIR and MIRACL benchmarks

Nandan Thakur is a fourth-year PhD student at the University of Waterloo working on building efficient embedding models and realistic evaluation benchmarks, advised by Professor Jimmy Lin. Nandan’s research has been hugely influential in pioneering new benchmarks for information retrieval, having notably introduced the BEIR and MIRACL benchmarks. His current work explores novel ways to evaluate retrieval in the age of LLMs. He has previously interned at Google, Vectara, and Databricks, and collaborated with industry partners including Snowflake, Microsoft, and Huawei.

Hamel Husain

ML Engineer with 20 years of experience

Hamel is a machine learning engineer with over 20 years of experience. He has worked with innovative companies such as Airbnb and GitHub, where his work included early LLM research on code understanding that was used by OpenAI. He has also led and contributed to numerous popular open-source machine-learning tools. Hamel is currently an independent consultant helping companies build AI products.

Shreya Shankar

ML Systems Researcher Making AI Evaluation Work in Practice

Shreya Shankar is an experienced ML Engineer who is currently a PhD candidate in computer science at UC Berkeley, where she builds systems that help people use AI to work with data effectively. Her research focuses on developing practical tools and frameworks for building reliable ML systems, with recent work on LLM evaluation and data quality. She has published influential papers on evaluating and aligning LLM systems, including "Who Validates the Validators?", which explores how to systematically align LLM evaluations with human preferences.

Prior to her PhD, Shreya worked as an ML engineer in industry and completed her BS and MS in computer science at Stanford. Her work appears in top data management and HCI venues including SIGMOD, VLDB, and UIST. She is currently supported by the NDSEG Fellowship and has collaborated extensively with major tech companies and startups to deploy her research in production environments. Her recent projects like DocETL and SPADE demonstrate her ability to bridge theoretical frameworks with practical implementations that help developers build more reliable AI systems.

Go deeper with a course

Featured in Lenny’s List

AI Evals For Engineers & PMs