Can an AI truly conduct AI research on its own, from scratch? In this episode of The Deep Dive, we explore one of the boldest experiments yet in AI evaluation: PaperBench. This benchmark tests whether advanced AI agents can replicate cutting-edge research published at ICML 2024, the “Olympics” of machine learning.

We unpack the ambitious design behind PaperBench, where AI agents are tasked not just with reading elite academic papers, but with rebuilding the experiments from scratch: writing code, running it, and verifying results without ever glimpsing the original authors' code. With over 3,000 individual tasks across 20 elite papers, the challenge is both massive and meticulous.

You’ll hear how these agents are graded by an automated LLM-based judge dubbed SimpleJudge, which is itself validated against a separate benchmark called JudgeEval. We dig into how the top AI models (Claude 3.5, GPT-4, Gemini, and others) stacked up, and why the best could still only replicate around 21% of the work. Why are these models giving up early? What happens when they're pushed to persist longer?

We also dive into PaperBench Code-Dev, a focused variant that tests code-writing ability alone, where some agents fared significantly better. And how do human researchers compare when given the same task, with some AI assistance but no shortcuts?

From execution bottlenecks to prompting strategies, from rubric creation to potential "specification gaming," this episode offers a revealing look into what AI can and can’t yet do in the world of scientific discovery. Whether you’re a researcher, engineer, or just fascinated by AI’s growing role in shaping knowledge itself, this is an episode you won’t want to miss.

Tune in and ask yourself: when it comes to frontier science, is AI a collaborator, a tool, or a competitor?

Read more: https://cdn.openai.com/papers/22265bac-3191-44e5-b057-7aaacd8e90cd/paperbench.pdf