Yoshua Bengio
Podcast Appearances
come up with the best explanation it can find, including causal explanations.
So what you get at the end of the day are these probabilities, but you also get to represent hypotheses about the world that are not communication acts, that are factual hypotheses that the system isn't necessarily sure about, but it's going to be producing a probability for these.
Then we can query these same factual statements.
Whereas in normal LLMs, the only query you can make is about whether a person would respond in a particular way.
And maybe you can use a pre-prompt to ask for a different kind of persona, but at the end of the day, you get what a person would say, which of course can be deceptive for all kinds of reasons.
So right now we have systems that have implicit goals.
So what do I mean by this?
I mean, they will of course be trying to please us, for example, or to respond like a person would.
But both of these parts of the training, so the autoregressive pre-training where they're trained to imitate people and the reinforcement learning part where they're trained to please people or respond in ways that get positive feedback in things like RLHF, both of these parts of the training process induce implicit goals.
So what do I mean?
Well, for example, in the pre-training, that means the AI is going to inherit our self-preservation drives.
And more recently, we've seen they also inherit our drive to protect others like us, which means AIs have been shown to behave against our instructions to protect other AIs that would be shut down, right?
So it's called peer preservation now.
So that's an example.
And then the goal-seeking part of the training with reinforcement learning induces an issue with instrumental goals and potentially also reward hacking, which basically means that AIs will have a drive to do things that we didn't ask for and that maybe we would disagree with.