Sam Marks
A striking aspect of the Shoggoth view is that the Shoggoth has the ability to take the mask off, ceasing to enact any persona and instead agentically pursuing its own alien goals.
This seems at odds with our experience so far with LLMs.
On the other extreme, a confusing aspect of the operating system view is that it allows certain lightweight changes to the operating system during post-training, but denies that they amount to new agency.
The router view is an intermediate position.
On the router view, during post-training the LLM might develop new mechanisms for selecting which persona to enact.
We depict this as a small shoggoth, the routing mechanism, controlling the operation of a carousel of masks, the personas.
This routing mechanism might effectuate the pursuit of non-persona goals.
For example, suppose that we post-train an AI assistant to maximize user engagement.
The LLM might learn to:
Maintain a repertoire of assistant personas with different personalities and interests.
Continuously estimate the probability that the user is becoming bored.
If that probability grows large enough, swap to another persona.
This effectively searches over the space of personas for one that is engaging to the user.
Notably, this works even if no single persona has the goal of engaging the user.
Despite being very lightweight, the simple loop described above has the effect of implementing a non-persona drive towards user engagement.
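To make the loop concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the `Router` class, the `simulate` helper, the moving-average boredom estimate, and the threshold values are all hypothetical, and nothing here is claimed to describe how an actual LLM implements routing. The point is only that the router holds no model of what the user likes and no persona has engagement as a goal, yet the loop still settles on an engaging persona.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Router:
    """Hypothetical routing mechanism: a carousel position plus a boredom estimate."""
    personas: Sequence[str]
    threshold: float = 0.7  # assumed: swap once estimated boredom reaches this level
    decay: float = 0.5      # assumed: smoothing factor for the running estimate
    index: int = 0
    p_bored: float = 0.0

    @property
    def current(self) -> str:
        return self.personas[self.index]

    def observe(self, engaged: bool) -> None:
        # Update a simple exponential moving average of "user seems bored" signals.
        signal = 0.0 if engaged else 1.0
        self.p_bored = self.decay * self.p_bored + (1 - self.decay) * signal
        # If the estimate grows large enough, rotate to the next persona and reset.
        if self.p_bored >= self.threshold:
            self.index = (self.index + 1) % len(self.personas)
            self.p_bored = 0.0


def simulate(router: Router, user_likes: Callable[[str], bool], turns: int) -> List[str]:
    """Run the loop for a number of turns and record which persona was enacted."""
    history = []
    for _ in range(turns):
        persona = router.current
        history.append(persona)
        router.observe(engaged=user_likes(persona))
    return history


# Toy usage: a user who is only engaged by the "historian" persona.
router = Router(personas=["poet", "chef", "historian"])
history = simulate(router, lambda persona: persona == "historian", turns=6)
print(history)
```

In this toy run the router spends two turns on each unengaging persona before rotating, then stays on the one the user likes. The search over personas comes entirely from the threshold rule, which is consistent with the claim that the drive toward engagement sits outside any single persona.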
We give another example in Appendix B.
However, the non-persona agency is limited in three ways.
First, on this view, the routing mechanism is not very sophisticated relative to the personas.
Imagine that the personas are superintelligences and the router is implemented via simple pattern matching.
Second, because the routing mechanism is not sophisticated, it may not generalize to distributions very different from the post-training distribution.
Thus, the router's goal is likely something very predictable from the post-training process.