Manas Joglekar
Confessions, by their nature, are retrospective: they report on misalignment rather than prevent it in the first place.
We are also excited to explore more high-compute interventions aimed at improving alignment in the model's main outputs.
Acknowledgements
Thank you to Bowen Baker for his thoughtful comments.
Additionally, thank you to Cameron Raymond, Marcus Williams, Rose Wang,
Adam Collet, Kai Chen, Erin Cavanaugh, Ishan Singhal, and Apollo Research for designing some of the evaluations for reward hacking, hallucinations, instruction hierarchy, instruction following, and scheming we used in our work.
This article was narrated by Type 3 Audio for LessWrong.
It was published on January 14, 2026.
Images are included in the podcast episode description.