Illia Polosukhin
👤 PersonAppearances Over Time
Podcast Appearances
Well, it's all kind of half made up and half is from experience.
They were trying to do something.
It didn't work.
They were changing a bunch of stuff until it worked.
And now they're not going to go and redo everything, figuring out if other options work.
They're just going to keep whatever worked.
Yeah.
And so like figuring out how to like go away from that.
And so RL is even worse.
RL is like literally, you know, we have no idea, but you know, hopefully like this reward function works, you know, we run it, it works great, you know, ship the paper, ship the model.
So it's a very like kind of semi-arbitrary.
There is no like actual science around reward distribution and kind of reward provocation.
Well, it does that.
It's also like, and so it's very prone to like errors because especially like there was like all this fun stories of, you know, your model figuring out that actually it can look in the file where the answers are if you give it like file system tools.
or search or anything, it actually finds out how to get the answers.
And this is way cheaper and better than actually thinking about stuff.
So this is why we kind of need a better kind of training mechanisms.
And that's why, again, from a research perspective, I look at fixed size model.
Can we make them better?
Because that effectively shows we have a better training procedure.