Sam Marks
- How entangled are personas? Do they share knowledge? Propensities? Is it possible to control their degree of entanglement?
- Understanding the mechanistic basis of personas. Can we understand the space of personas an LLM can model? Can we understand the persona that an LLM is actively enacting?
More broadly, we are excited about the project of developing and validating theories of AI systems, mental models that allow us to predict how AI systems will behave in novel situations and how their behavior will change as they are trained differently.
PSM is one such theory.
We hope that by naming and articulating it, we can encourage further work on refining it, stress testing it, and, where it falls short, developing better alternatives.
Acknowledgements

Many people contributed valuable ideas and discussion to this post.
Fabien Roger suggested many items of evidence, especially those in the section on complicating evidence.
Joshua Batson sketched out the example of non-persona agency arising from a lightweight router mechanism.
Jared Kaplan suggested writing this post and provided useful discussion and feedback.
Alex Cloud, Evan Hubinger, and many other Anthropic employees commented on an initial draft and provided helpful discussion.
Rowan Wang, Tim Belonax, and Carl de Torres designed figures.
The images in our discussion of PSM exhaustiveness were generated by Nano Banana Pro.
Appendix A: Breaking Character