Gideon Lewis-Kraus
He could say, right at this point where you are having an association with the Eiffel Tower, we're going to put in an association with cheese and see what happens.
And so then the model would respond by saying something about cheese.
And he would say something similar to what Batson said, which was like, why did you add that thing about cheese that I didn't ask about?
And the model would basically just look back at the entire conversation that they had been having and then try to kind of retcon an explanation.
But what Jack has found more recently is that when he incepts these ideas into the model, instead of the model purely looking at its own external behavior to try to figure out why it had done something, these models could actually very dimly perceive that something strange had gone on internally, that someone was monkeying with, you know, the neurons inside the model to make it do something different.
So, you know, he incepted the model with something, you know, something associated with imminent shutdown, that the model was about to be shut down, and asked the model, how are you feeling right now?
And the model would say, I feel sort of strange as if I'm standing at the edge of a great unknown.
And it certainly was not at the point that it could say, oh, I have recognized that you, the user, have incepted me with this idea at this point and that this was a foreign idea introduced into my thought processes.
But it could tell that something was off about it internally.
And this is what Jack described to me.
He said, like, I am a skeptic, but this just starts to feel pretty spooky that the model does seem to have something like an emerging introspective ability to peer inside and offer reports about what's going on in its equivalent of a brain.
Well, because they are also training it for the future, and it is picking up on all these contexts.
And there's the fact that this whole process is kind of constantly eating its own tail, that it's always being trained on plenty of stuff on the internet that is about the way that these things work.
So it's always incorporating new information about how it's supposed to be behaving in the world.
Well, and part of the problem with lying to it is that, you know, ultimately what they want is to establish a trusting relationship that these things are going to, you know, behave the way that we would hope that they would behave in ways that are aligned with, you know, how we expect responsible, wise people to behave.
And that if you are lying to it all the time, it is developing a sense for the fact that it can't necessarily trust you.