Stephen McAleese
The post gives two arguments for why AI models such as LLMs are likely to easily acquire human values:

1. Values are pervasive in language model pre-training datasets, such as books and conversations between people.
2. Since values are shared and understood by almost everyone in a society, they cannot be very complex.
Similarly, the post "Why I'm Optimistic About Our Alignment Approach" uses evidence about LLMs as a reason to believe that solving the AI alignment problem is achievable with current methods:
"Large language models (LLMs) make this a lot easier. They come preloaded with a lot of humanity's knowledge, including detailed knowledge about human preferences and values. Out of the box they aren't agents who are trying to pursue their own goals in the world, and their objective functions are quite malleable. For example, they are surprisingly easy to train to behave more nicely."
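To make the "easy to train to behave more nicely" claim concrete, here is a minimal sketch of supervised fine-tuning with the HuggingFace transformers library. The model choice (gpt2) and the two-example dataset are illustrative stand-ins, not the setup the quoted post describes; real fine-tuning uses far more data, but the mechanics are the same.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small stand-in model; any causal LM would do
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical curated examples of the desired "nicer" behavior.
examples = [
    "User: Give me feedback on my essay.\nAssistant: Happy to help! A few thoughts...",
    "User: I think I failed the exam.\nAssistant: I'm sorry to hear that. Here's what you can do next...",
]

enc = tokenizer(examples, return_tensors="pt", padding=True)
labels = enc["input_ids"].clone()
labels[enc["attention_mask"] == 0] = -100  # don't compute loss on padding

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
for step in range(10):  # a handful of gradient steps on the standard LM loss
    loss = model(**enc, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

Notice that nothing model-specific is required: the same next-token loss used in pre-training, applied to a small curated dataset, shifts the model's behavior, which is what the quoted post means by the objective function being malleable.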
A more theoretical argument called alignment by default offers an explanation for how AIs could easily and robustly acquire human values.
This argument suggests that as an AI identifies patterns in human text, it doesn't just learn facts about values, but adopts human values as a natural abstraction.
A natural abstraction is a high-level concept (for example, trees, people, or fairness) that different learning algorithms tend to converge on because it efficiently summarizes a large amount of low-level data. If human values are a natural abstraction, then any sufficiently advanced intelligence might naturally gravitate toward understanding and representing them in a robust, generalizing way as a byproduct of learning to understand the world.
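The convergence claim can be illustrated with a toy example. In this hypothetical setup, fifty noisy sensor readings are all driven by a single latent variable, and two different learners (PCA and a tied-weight linear autoencoder trained by gradient descent) recover the same one-dimensional summary. This is only a sketch of the intuition, not an argument that values behave like linear latents.

# Toy illustration of convergence to a shared abstraction: two different
# learners, fit to the same low-level data, recover the same 1-D summary.
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_sensors = 2000, 50

# Hypothetical setup: 50 noisy sensor readings all driven by one latent
# variable (say, temperature). The latent is the "natural abstraction".
latent = rng.normal(size=(n_samples, 1))
mixing = rng.normal(size=(1, n_sensors))
data = latent @ mixing + 0.1 * rng.normal(size=(n_samples, n_sensors))
centered = data - data.mean(axis=0)

# Learner 1: PCA via SVD; take the top principal direction.
pca_direction = np.linalg.svd(centered, full_matrices=False)[2][0]

# Learner 2: a tied-weight linear autoencoder with a 1-D bottleneck,
# trained by gradient descent on reconstruction error.
w = rng.normal(size=n_sensors)
w /= np.linalg.norm(w)
for _ in range(500):
    codes = centered @ w                # encode to one dimension
    recon = np.outer(codes, w)          # decode back to sensor space
    grad = -2 * (centered - recon).T @ codes / n_samples
    w -= 0.01 * grad
    w /= np.linalg.norm(w)              # keep the direction unit-length

# Despite different algorithms, both summaries point the same way (up to sign).
print(f"agreement between learners: {abs(pca_direction @ w):.3f}")  # ~1.000

The point is only directional: very different learning procedures can settle on the same compact summary of the data, which is the intuition behind treating human values as a natural abstraction.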
The evidence LLMs offer about the tractability of AI alignment seems compelling and concrete. However, the arguments of IABID focus on the difficulty of aligning ASI, not contemporary LLMs, and aligning ASI could be vastly more difficult.
Arguments against engineering analogies to AI alignment