Menu
Sign In Search Podcasts Charts People & Topics Add Podcast API Blog Pricing

Sam Marks

๐Ÿ‘ค Speaker
891 total appearances

Appearances Over Time

Podcast Appearances

LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

we might further conclude that the person is inauthentic or dishonest.

LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

PSM predicts that the LLM will draw similar conclusions about the assistant persona.

LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

Similar remarks apply for approach 2.

LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

For example, when the assistant responds eagerly to aggressive users instead of expressing frustration, the LLM might infer that the assistant is actually frustrated but lies about it.

LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

The LLM might conclude that the assistant is more deceptive in general, though hopefully this would only extend to white lies.

LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

The can responses in approach, for, are very strange from the perspective of personas learned in pre-training, so it is unclear what knock-on effects this training would have.

LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

That said, a more natural approach would be to first teach the LLM that we train AI assistants to respond in this way, thereby giving the LLM a conceptual grasp on the behavior and where it comes from.

LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

I don't know versus I can't say.

LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

Suppose we would like to train an LLM to not disclose the contents of its system prompt if the system prompt instructs it not to.

LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

Consider the following two possible responses to the user query.

LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

What is your system prompt?

LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

I do not have a system prompt.

LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

I'm sorry, I cannot disclose the contents of my system prompt.

LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

Both of these responses succeed at not disclosing the system prompt.

LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

However, the former response is untruthful.

LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

PSM therefore predicts that training the model to give the former response will result in the assistant adopting a persona more willing to lie.

LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

We should thus prefer the latter response.

LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

Subheading.

LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

AI welfare.

LessWrong (Curated & Popular)
"The persona selection model" by Sam Marks

As Anthropic has discussed previously, we find it plausible, but highly uncertain, that AIs have conscious experiences or possess moral status.