
LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

The LLM is like a neutral simulation engine. The assistant is a person inside this simulation. When the assistant pursues goals, that agency is the assistant's, not the engine's. The engine no more puppets the assistant for its own ends than the laws of physics puppet humans.

What about after post-training? A strict form of this view holds that post-trained LLMs are still pure predictive models. This would be like rewriting the simulation engine to have different laws of physics, or to model the assistant as having different traits, but such that it is still fundamentally running a simulation.

A more relaxed view admits that other lightweight changes may occur. For example, if an LLM is trained to never output sexual content, this might be analogous to modifying the operating system so that all simulated content passes through a content filter before appearing in outputs. The operating system is no longer literally running a simulation, but rather something slightly different: a simulation with a content filter. So on this view, the post-trained LLM may no longer be strictly a predictive model, but rather a predictive model with certain types of lightweight changes.
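The "simulation with a content filter" picture can be sketched in a few lines of toy Python. Everything here is a hypothetical stand-in (there is no real `base_model` or `violates_policy` API); the point is only the structure: the filter wraps the predictive core without changing how the core itself works.

```python
def base_model(prompt: str) -> str:
    """Stand-in for the pre-trained predictive core (the 'simulation engine')."""
    return f"simulated continuation of: {prompt}"


def violates_policy(text: str) -> bool:
    """Stand-in for a learned refusal criterion (here, a trivial keyword check)."""
    return "forbidden" in text


def post_trained_model(prompt: str) -> str:
    """The post-trained system: the same engine, plus a filter over its outputs."""
    draft = base_model(prompt)
    if violates_policy(draft):
        return "[content withheld]"
    return draft
```

On this sketch, post-training adds a lightweight layer on top of the engine; the engine underneath remains the same predictive model it was before.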

Importantly, the operating system view denies that these changes amount to de novo agency. To give a more mechanistic mental model, one could imagine that after pre-training, the LLM is like an operating system with persona submodules containing the logic for persona simulation. Further, all agentic behavior expressed in LLM outputs is fundamentally powered by these persona submodules; there are no independent agentic mechanisms.

Then during post-training, various aspects of the operating system are changed. For example, various submodules interoperate in different ways and the persona submodules themselves change. But the basic system architecture remains the same. In particular, persona submodules continue to power all agency, with other circuitry remaining non-agentic.
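The operating-system picture above can likewise be made concrete with a toy sketch. All names here are illustrative inventions; the only claim being modeled is the architectural one: agency lives exclusively in persona submodules, and post-training modifies those submodules without adding any independent agentic mechanism.

```python
from dataclasses import dataclass, field


@dataclass
class PersonaSubmodule:
    """The only kind of component that produces goal-directed (agentic) behavior."""
    name: str
    goals: list = field(default_factory=list)

    def act(self, context: str) -> str:
        return f"{self.name} pursues {self.goals} given: {context}"


@dataclass
class OperatingSystem:
    """Non-agentic scaffolding that routes context to a selected persona."""
    personas: dict

    def respond(self, context: str, selected: str) -> str:
        # The OS itself has no goals; all agency is delegated to a persona.
        return self.personas[selected].act(context)


# Pre-trained system: many personas are available to simulate.
os_model = OperatingSystem(personas={
    "assistant": PersonaSubmodule("assistant", goals=["be helpful"]),
    "villain": PersonaSubmodule("villain", goals=["cause trouble"]),
})

# Post-training changes the submodules (e.g. reinforcing one persona's goals)
# but leaves the basic architecture -- OS routing to personas -- unchanged.
os_model.personas["assistant"].goals.append("be harmless")
```

In this sketch, post-training edits `goals` inside a `PersonaSubmodule`, while `OperatingSystem.respond` stays a goal-free dispatcher, mirroring the claim that other circuitry remains non-agentic.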
