Sholto Douglas
Yeah, theory of mind.
Like, is that the whole thing that could cause deception, or is it just one instance of it?
Second of all, are your labels correct?
You know, maybe you thought this wasn't deceptive, but it's actually still deceptive.
Especially if it's producing output you can't understand.
Third, is the thing that's going to be the bad outcome something that's even human-understandable?
Like deception is a concept we can understand.
Maybe there's like a...
There's a separate question of whether such a representation exists, which it seems like it must, or actually, I'm not sure if that's the case.
And secondly, whether, using this sparse autoencoder setup, you could find it.
And in this case, if you don't have labels that are adequate to represent it, you wouldn't find it, right?
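A minimal sketch of the kind of sparse autoencoder setup being referenced here, under standard assumptions: train an overcomplete autoencoder with an L1 sparsity penalty on a model's internal activations to learn a dictionary of features, then check which learned features fire on examples you have labeled (e.g. as deceptive). The dimensions, sparsity coefficient, and random stand-in data below are illustrative assumptions, not details from the conversation.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Overcomplete autoencoder for dictionary learning on model activations."""
    def __init__(self, act_dim: int, dict_size: int):
        super().__init__()
        self.encoder = nn.Linear(act_dim, dict_size)
        self.decoder = nn.Linear(dict_size, act_dim)

    def forward(self, acts: torch.Tensor):
        # ReLU keeps feature activations non-negative, which pairs with the L1 penalty.
        features = torch.relu(self.encoder(acts))
        recon = self.decoder(features)
        return recon, features

def train_step(sae, opt, acts, sparsity_coef=1e-3):
    recon, features = sae(acts)
    recon_loss = ((recon - acts) ** 2).mean()
    # L1 penalty pushes most feature activations to zero, encouraging sparse, interpretable features.
    sparsity_loss = features.abs().mean()
    loss = recon_loss + sparsity_coef * sparsity_loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Toy usage: random tensors stand in for a language model's residual-stream activations.
sae = SparseAutoencoder(act_dim=512, dict_size=4096)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
for _ in range(100):
    batch = torch.randn(64, 512)  # replace with real activations collected from a model
    train_step(sae, opt, batch)
```

After training, the labels question comes back in: to call a particular feature a "deception" feature, you still need labeled examples to check which features activate on them.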
This is sort of a tangent, but one interesting idea I heard was: if that space is shared between models, you can imagine trying to find it in an open-source model to then... Like, Gemma, by the way, is Google's newly released open-source model.
They said in the paper that it's trained using the same architecture, or similar methods, or whatever, as Gemini.
To be honest, I don't know, because I haven't read the Gemma paper.
So to the extent that's true,
I don't know how much of the red teaming you do on Gemma is potentially helping you jailbreak into Gemini.
But by the way, this is another tangent.
To the extent that that's true, and I guess there's evidence that that's true, why doesn't curriculum learning work?