Scale manual testing of AI features with evals