Steven Byrnes
๐ค SpeakerAppearances Over Time
Podcast Appearances
These are algorithms designed to find actions that maximize an objective by any means available.
That is, you get ruthless sociopathic behavior by default.
There's an image here.
Description.
Go find someone who was in AI in the 2010s or earlier, before LLMs took over, and they may well have spent a lot of time building or using RL agents and or model-based planning algorithms.
If so, they'll tell you, based on their lived experience, that these kinds of algorithms are ruthless by default, when they work at all, unless the programmers go out of their way to make them non-ruthless.
See for example this 2020 DeepMind blog post on specification gaming.
And how would the programmers go out of their way to make them non-ruthless?
I claim that the answer is not obvious, indeed not even known.
See my look on post, and my silver and sutton post, and more generally my post-behaviorist, RL reward functions lead to scheming for why obvious approaches to non-ruthlessness won't work.
Rather, algorithms in this class are naturally, em, let's call them, ruthless ifeers, in the sense that they transmute even innocuous-sounding objectives like it's good if the human is happy into scarier-sounding ones like ruthlessly maximize the probability that the human is happy, which in turn suggests strategies such as forcibly drugging the human.
Likewise, the innocuous-sounding it's bad to lie gets ruthless ified into it's bad to get caught lying, and so on.
Of course, evolution did go out of its way to make humans non-ruthless by endowing us with social instincts.
Maybe future AI programmers will likewise go out of their way to make ASIs non-ruthless?
I hope so, but we need to figure out how.
To be clear, ruthless consequentialism isn't always bad.
I'm happy for ruthless consequentialist AIs to be playing chess, designing chips, etc.,
In principle, I'd even be happy for a ruthless consequentialist AI to be emperor of the universe, creating an awesome future for all.
But making that actually happen would be super dangerous for lots of reasons, for example you might need to operationalize creating an awesome future for all in a loophole-free way.
See also the usual agent debugging loop and its future catastrophic breakdown.