Rob Wiblin
Podcast Appearances
That's right.
When I heard about this idea nine or 12 months ago, the gloss I got was that the core thing is that the scientist AI is not an agent, that it is indifferent about states of the world.
Like a weather forecasting model doesn't care what the weather is.
It just tries to predict what the weather is going to be.
And this kind of model, it would spit out probabilities of things being true or false, but it wouldn't care what state of the world it is in.
And it wouldn't be able to take actions by design.
Is that kind of a core part in your mind?
As I understand it, you think this is actually maybe more consistent with agency than people have appreciated.
I think at one point a criticism of the plan was that it would be too easy to convert this kind of oracle into an agent, because you would just be able to ask the oracle, "Well, would we accomplish this goal if we took this action?"
And it would give you the probability and you could just try to increase that probability and choose that action.
Is the idea that you would do something like that basically, but you would be able to preserve some of the safety characteristics of the original model?
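The oracle-to-agent conversion described above can be sketched in a few lines of Python. This is only an illustration of the argument, not anyone's actual system: `pick_action`, `toy_oracle`, and the probability table are hypothetical names and made-up numbers, and the oracle is assumed to expose a single "probability that the goal is achieved given this action" query.

```python
from typing import Callable, Sequence


def pick_action(
    oracle: Callable[[str, str], float],
    goal: str,
    actions: Sequence[str],
) -> str:
    """Return the action the oracle rates most likely to achieve the goal."""
    # The oracle only answers "how likely is `goal` if we take `action`?".
    # The agent-like behaviour comes entirely from this argmax wrapper.
    return max(actions, key=lambda action: oracle(goal, action))


# Toy stand-in for a predictor: a fixed table of made-up probabilities.
TOY_PROBS = {
    ("ship on time", "add more tests"): 0.55,
    ("ship on time", "cut scope"): 0.80,
    ("ship on time", "do nothing"): 0.30,
}


def toy_oracle(goal: str, action: str) -> float:
    """Hypothetical oracle: looks up P(goal achieved | action)."""
    return TOY_PROBS[(goal, action)]


best = pick_action(
    toy_oracle,
    "ship on time",
    ["add more tests", "cut scope", "do nothing"],
)
print(best)
```

The point of the sketch is that the predictor itself never chooses anything; the goal-directedness lives in the thin wrapper that queries it and maximizes.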
So the concern is that the predictor might be able to figure out basically gaps in the guardrail or weaknesses in the guardrail and exploit those if it... Not the predictor; a policy that is trained separately and in a way that is adversarial could do that in principle.
And you were saying the guardrail, if it can assess its own uncertainty, then it will usually be able to identify cases where it might be at risk of getting tricked in this case.
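One way to picture a guardrail that uses its own uncertainty is the Python sketch below. It is an assumption-laden illustration, not the method discussed in the episode: it assumes the guardrail can produce several independent estimates of the probability of harm for a proposed action, and it treats disagreement among those estimates as a stand-in for uncertainty. The function name `guardrail_allows` and both thresholds are hypothetical.

```python
from statistics import mean, pstdev
from typing import Sequence


def guardrail_allows(
    harm_estimates: Sequence[float],
    harm_threshold: float = 0.2,  # hypothetical cutoff on estimated harm
    max_spread: float = 0.1,      # hypothetical cutoff on disagreement
) -> bool:
    """Allow an action only if harm looks unlikely AND the estimates agree."""
    if not harm_estimates:
        raise ValueError("need at least one harm estimate")
    # Block when the average estimated probability of harm is too high.
    if mean(harm_estimates) > harm_threshold:
        return False
    # Block when the estimates disagree a lot: high spread is treated as a
    # sign the guardrail may be outside familiar territory or being tricked.
    if pstdev(harm_estimates) > max_spread:
        return False
    return True


print(guardrail_allows([0.01, 0.02, 0.02, 0.01]))  # low harm, estimates agree
print(guardrail_allows([0.01, 0.02, 0.35, 0.02]))  # low mean, estimates disagree
print(guardrail_allows([0.50, 0.60, 0.55]))        # high estimated harm
```

The second call is the interesting case: the average looks safe, but the disagreement alone is enough for this sketch to refuse.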
But perhaps it's an adversarial example?
Is that basically...?
Yes, exactly.
And you feel like you've proven this mathematically?
Yeah, so it might be more trusted, but if it's much less intelligent in general, just because less compute went into it, less data went into it, then I guess it could just still be outfoxed by the original model, right?
That's right.
I guess you were keen on this idea a year ago, but you said you've become a lot more optimistic about it over the last six months.