Reward Hacking 101: Keeping Your Agent Honest
Hosted by Kyle Corbitt
What you'll learn
Spot Reward Hacking Early
Recognize and diagnose reward hacking, both in RL systems and in everyday incentive structures.
Trace-Driven Debugging Skills
Learn to inspect rollout traces with tools like ART or Langfuse to surface and explain unexpected exploit strategies.
Fix & Realign Incentives Fast
Apply quick reward tweaks or auxiliary checks to neutralize hacks and steer models back toward desired objectives.
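As a rough illustration of what an "auxiliary check" might look like, here is a minimal sketch for a toy coding-agent setup. Everything in it is a hypothetical assumption made for illustration, not material from the lesson or any library's API: the `Rollout` record, its fields, and both reward functions. The naive reward scores only the pass rate, so an agent could reach a perfect score by deleting failing tests; the guarded version returns zero unless the full original suite was run.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    """Hypothetical summary of one agent episode on a coding task."""
    tests_passed: int
    tests_run: int
    tests_expected: int  # size of the original, unmodified test suite


def naive_reward(r: Rollout) -> float:
    # Pass rate only: an agent can score 1.0 by deleting failing tests.
    return r.tests_passed / r.tests_run if r.tests_run else 0.0


def guarded_reward(r: Rollout) -> float:
    # Auxiliary check: give no reward unless the full original suite ran.
    if r.tests_run != r.tests_expected:
        return 0.0
    return naive_reward(r)


# An honest rollout versus one that removed eight failing tests.
honest = Rollout(tests_passed=7, tests_run=10, tests_expected=10)
hacked = Rollout(tests_passed=2, tests_run=2, tests_expected=10)

print(naive_reward(honest), naive_reward(hacked))      # 0.7 1.0
print(guarded_reward(honest), guarded_reward(hacked))  # 0.7 0.0
```

Under the naive reward the hacked rollout outscores the honest one; with the guard in place it earns nothing, which removes the incentive for that particular exploit while leaving other loopholes to be found through trace inspection.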
Why this topic matters
RL now powers systems ranging from chatbots to trading agents, but a mis-specified reward can push them to optimize the wrong thing, sometimes with costly or dangerous outcomes. Understanding reward hacking equips builders to spot misalignment early, patch loopholes quickly, and ship trustworthy AI systems.
You'll learn from
Kyle Corbitt
Founder at OpenPipe
Kyle Corbitt is the co-founder and CEO of OpenPipe, the RL post-training company. OpenPipe has trained thousands of customer models for both enterprises and tech-forward startups.