John Schulman
Yeah, so I guess you have these old arguments about instrumental convergence: the model is going to want to take over the world so it can produce this awesome piece of code at the end.
Like, if you ask it to write you a Flask app, it'll say, oh yeah, first I need to take over the world.
But at a certain point, it's a little hard to imagine why, for a fairly well-specified task like that, you would want to first take over the world.
But of course, if you had a task like make money, then maybe that would lead to some nefarious behavior as an instrumental goal.
Yeah.
Yeah, I would say there are probably some analogies with a drive or a goal in humans, in that you're trying to steer toward a certain set of states rather than some other states.
And I would think that our concept of a drive or a goal has other elements, like the feeling of satisfaction you get from achieving it, and those might have more to do with the learning algorithm than with what the model does at runtime, when you just have a fixed model.
So I would say there are probably some analogies, though I don't know exactly how close they are.
But to some extent, the models do have drives and goals in some meaningful way.
And in the case of RLHF, where you're trying to maximize human approval as measured by a reward model, the model is just trying to produce something that people will like and judge as correct.
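To make the RLHF objective described here concrete, here is a minimal sketch of a single policy-gradient step that maximizes reward-model scores. The `policy` and `reward_model` interfaces are hypothetical stand-ins, not any real library API, and the loss is a bare REINFORCE-style objective rather than the full method used in practice.

```python
# Minimal sketch of the RLHF objective: update the policy to maximize
# scores from a learned reward model. `policy.sample` and
# `reward_model.score` are hypothetical stand-ins, not a real API.
import torch

def rlhf_step(policy, reward_model, prompts, optimizer):
    # Sample a response per prompt and keep the log-probabilities the
    # current policy assigns to them (hypothetical interface).
    responses, log_probs = policy.sample(prompts)

    # Score each (prompt, response) pair with the reward model, which
    # stands in for human approval.
    rewards = reward_model.score(prompts, responses)

    # REINFORCE-style loss: raise the log-prob of high-reward responses.
    # Real systems (e.g. PPO) also add a KL penalty to a reference
    # policy so the model doesn't drift too far; omitted here.
    loss = -(rewards.detach() * log_probs).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```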
Well, I would say you could define reasoning as tasks that require some kind of computation at test time, or maybe some kind of deduction.
So by definition, reasoning would be tasks that require step-by-step computation at test time.
On the other hand, I would also expect to gain a lot from some kind of training-time computation, or practice at training time.
So I would think you get the best results by combining these two things.
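One way to picture the combination being described is: search over step-by-step solutions at test time, then fine-tune on the verified ones so the model needs less search next time. This is a hedged sketch under that reading; `model.generate`, `model.finetune`, and `check_answer` are hypothetical stand-ins.

```python
# Hedged sketch of combining test-time and training-time computation.
# `model.generate`, `model.finetune`, and `check_answer` are
# hypothetical stand-ins, not a real API.

def solve_with_test_time_compute(model, problem, n_samples=16):
    """Test-time computation: sample several step-by-step solutions
    and return one that passes a correctness check."""
    candidates = [model.generate(problem, temperature=0.8)
                  for _ in range(n_samples)]
    good = [c for c in candidates if check_answer(problem, c)]
    return good[0] if good else candidates[0]

def practice_at_training_time(model, problems):
    """Training-time computation: generate solutions, keep the verified
    ones, and fine-tune on them, amortizing the test-time search."""
    solved = []
    for p in problems:
        attempt = solve_with_test_time_compute(model, p)
        if check_answer(p, attempt):
            solved.append((p, attempt))
    model.finetune(solved)
```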
Yeah, so do you mean models having some kind of medium-term memory: too much to fit in context, but at a much smaller scale than pre-training?
I'm not sure if it's memory; it might be memory.
I'm curious.
I see.
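One concrete reading of the "medium-term memory" raised in this exchange is an external store of past text, searched by embedding similarity and spliced back into the prompt. This is a hedged illustration of that idea, not anything the speakers endorse here; `embed` is a hypothetical function mapping text to a vector.

```python
# Hedged sketch of a "medium-term memory": an external store of past
# text, searched by embedding similarity and recalled into the prompt.
# `embed` is a hypothetical text-to-vector function.
import numpy as np

class MediumTermMemory:
    def __init__(self):
        self.texts, self.vecs = [], []

    def add(self, text):
        # Store the raw text alongside its embedding.
        self.texts.append(text)
        self.vecs.append(embed(text))

    def recall(self, query, k=3):
        # Cosine similarity between the query and every stored entry.
        q = embed(query)
        sims = [np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v))
                for v in self.vecs]
        # Indices of the k most similar entries, best first.
        top = np.argsort(sims)[-k:][::-1]
        return [self.texts[i] for i in top]

# The recalled snippets would be prepended to the model's context,
# giving it access to more history than fits in the window alone.
```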