Ahmed El-Kishky
When we started doing RL, just getting started there, the model would do a little bit of planning.
It would be like, I'm going to try this.
And we saw an improvement there.
Over time, the reasoning that the models would do would actually become way more elaborate.
And it was really interesting because we have a lot of IOI gold medalists and ICPC competitive programmers, and they'd be like, oh yeah, that's a strategy that I would use.
So one of the coolest things, in my opinion, is when the model decided it wanted to test itself before it submitted.
And so here's what the model would do. The solution would have to be a very complicated algorithm.
It would have to have good time complexity.
It would be very memory-aware, very intricate.
The algorithm would be so complex that it's easy to make a mistake.
So what the model would do first is write a brute force solution, which is a lot easier to do.
It can be like a few lines.
And then it would generate some inputs, pretending to be the judge, generating inputs the same way the judge would.
And then it would compare the outputs of the brute force solution to the outputs of the really complicated algorithm that it wrote.
And it knows that for it to be correct, the outputs have to match.
So it couldn't submit the brute force solution, because there are strict time limits and memory limits, so it wouldn't pass.
But it would spend a lot of time thinking and then trying to match the outputs of the brute force solution.
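Here's a minimal sketch of that stress-testing pattern, assuming a simple example problem (maximum subarray sum) since the transcript doesn't say which problems were involved; the function names are placeholders, not anything from the actual run.

```python
import random

def brute_force(arr):
    # O(n^2): check every subarray. Easy to get right, too slow to submit.
    best = arr[0]
    for i in range(len(arr)):
        total = 0
        for j in range(i, len(arr)):
            total += arr[j]
            best = max(best, total)
    return best

def fast_solution(arr):
    # O(n) Kadane's algorithm: the "complicated" solution you actually submit.
    best = cur = arr[0]
    for x in arr[1:]:
        cur = max(x, cur + x)
        best = max(best, cur)
    return best

def random_input(max_len=8, max_val=10):
    # Generate small random inputs, the way a judge's generator might.
    n = random.randint(1, max_len)
    return [random.randint(-max_val, max_val) for _ in range(n)]

# Stress test: the two solutions must agree on every input before submitting.
for _ in range(1000):
    arr = random_input()
    assert brute_force(arr) == fast_solution(arr), f"mismatch on {arr}"
print("all tests passed")
```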
And I remember when I noticed that and showed it to people on the team, the competitive programmers at OpenAI were like, yeah, that's a valid strategy.
And what was really cool is, we didn't tell it that.