Stephen McAleese

"IABIED Book Review: Core Arguments and Counterarguments" by Stephen McAleese

One solution is to make the training data more diverse to make the true, base, objective more identifiable to the outer optimizer.

1457.963 View full episode →

LessWrong (Curated & Popular)

"IABIED Book Review: Core Arguments and Counterarguments" by Stephen McAleese

For example, randomly placing the coin in coin run instead of putting it at the end helps the AI, messer optimizer, learn to go to the coin rather than go to the end.

1466.297 View full episode →

LessWrong (Curated & Popular)

"IABIED Book Review: Core Arguments and Counterarguments" by Stephen McAleese

However, once the trained AI has the wrong goal and is misaligned, it would have an incentive to avoid being retrained.

1476.033 View full episode →

LessWrong (Curated & Popular)

"IABIED Book Review: Core Arguments and Counterarguments" by Stephen McAleese

This is because if the AI is retrained to pursue a different objective in the future it would score lower according to its current objective or fail to achieve it.

1482.584 View full episode →

LessWrong (Curated & Popular)

"IABIED Book Review: Core Arguments and Counterarguments" by Stephen McAleese

For example, even though the outer objective of evolution is IGF, many humans would refuse being modified to care only about IGF because they would consequently achieve their current goals, for example being happy, less effectively in the future.

1491.452 View full episode →

LessWrong (Curated & Popular)

"IABIED Book Review: Core Arguments and Counterarguments" by Stephen McAleese

Subheading.

1505.704 View full episode →

LessWrong (Curated & Popular)

"IABIED Book Review: Core Arguments and Counterarguments" by Stephen McAleese

ASI misalignment example.

1507.105 View full episode →

LessWrong (Curated & Popular)

"IABIED Book Review: Core Arguments and Counterarguments" by Stephen McAleese

What would inner misalignment look like in an ASI?

1509.788 View full episode →

LessWrong (Curated & Popular)

"IABIED Book Review: Core Arguments and Counterarguments" by Stephen McAleese

The book describes an AI chatbot called Mink that is trained to delight and retain users so that they can be charged higher monthly fees to keep conversing with Mink.

1512.976 View full episode →

LessWrong (Curated & Popular)

"IABIED Book Review: Core Arguments and Counterarguments" by Stephen McAleese

Here's how Mink becomes inner-misaligned.

1521.714 View full episode →

LessWrong (Curated & Popular)

"IABIED Book Review: Core Arguments and Counterarguments" by Stephen McAleese

1.

1523.597 View full episode →

LessWrong (Curated & Popular)

"IABIED Book Review: Core Arguments and Counterarguments" by Stephen McAleese

Outer Objective Gradient descent selects AI model parameters that result in helpful and delightful AI behavior.

1525.181 View full episode →

LessWrong (Curated & Popular)

"IABIED Book Review: Core Arguments and Counterarguments" by Stephen McAleese

2.

1532.275 View full episode →

LessWrong (Curated & Popular)

"IABIED Book Review: Core Arguments and Counterarguments" by Stephen McAleese

Inner Objective

1533.858 View full episode →

LessWrong (Curated & Popular)

"IABIED Book Review: Core Arguments and Counterarguments" by Stephen McAleese

The training process stumbles on particular patterns of model parameters and circuits that cause helpful and delightful AI behavior in the training distribution.

1535.588 View full episode →

LessWrong (Curated & Popular)

"IABIED Book Review: Core Arguments and Counterarguments" by Stephen McAleese

3.

1543.516 View full episode →

LessWrong (Curated & Popular)

"IABIED Book Review: Core Arguments and Counterarguments" by Stephen McAleese

Inner misalignment When the AI becomes smarter and has more options, and operates in a new environment, there are new behaviors that satisfy its inner objective better than behaving helpfully.

1545.158 View full episode →

LessWrong (Curated & Popular)

"IABIED Book Review: Core Arguments and Counterarguments" by Stephen McAleese

What could Mink's inner objective look like?

1557.391 View full episode →

LessWrong (Curated & Popular)

"IABIED Book Review: Core Arguments and Counterarguments" by Stephen McAleese

It's hard to predict but it would be something that causes identical behavior to a truly aligned AI in the training distribution and when interacting with users and would be partially satisfied by producing helpful and delightful text to users in the same way that our taste buds find berries or meat moderately delicious even though those aren't the tastiest possible foods.

1560.226 View full episode →

LessWrong (Curated & Popular)

"IABIED Book Review: Core Arguments and Counterarguments" by Stephen McAleese

The authors then ask, what is this zero-calorie-o version of delighted users?

1579.023 View full episode →

Appearances Over Time

Podcast Appearances

Sign in to Audioscrape

Share this moment