Sholto Douglas
And doing this in a really simple and elegant way, and then backing it up with great engineering.
I also thought it was interesting that they incorporated the multi-token prediction thing from Meta.
So Meta had a nice paper on this multi-token prediction thing.
I don't know if it's good or bad, but Meta didn't include it in Llama, while DeepSeek did include it in their paper, which I think is interesting.
Was that because they were faster at iterating and including an algorithm, or did Meta decide that it actually wasn't a good algorithmic change at scale?
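As an aside, the multi-token prediction idea being discussed can be sketched in a few lines. This is a minimal, hypothetical illustration (plain NumPy, random weights standing in for a trained trunk), not DeepSeek's or Meta's actual implementation: from each position's hidden state, K small heads each predict the token 1, 2, ..., K steps ahead, and the training loss averages the cross-entropies.

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB, HIDDEN, K, T = 50, 16, 4, 10  # K = number of future tokens predicted per position

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Stand-in for the transformer trunk's output: one hidden vector per position.
hidden = rng.standard_normal((T, HIDDEN))

# One small output head per future offset (the multi-token prediction heads).
heads = [rng.standard_normal((HIDDEN, VOCAB)) * 0.1 for _ in range(K)]

tokens = rng.integers(0, VOCAB, size=T)

def mtp_loss(hidden, tokens):
    """Average cross-entropy for predicting tokens t+1 .. t+K from position t."""
    total, count = 0.0, 0
    for k, head in enumerate(heads, start=1):
        logits = hidden @ head                 # (T, VOCAB) logits for offset k
        for t in range(T - k):                 # target is the token k steps ahead
            probs = softmax(logits[t])
            total -= np.log(probs[tokens[t + k]])
            count += 1
    return total / count

print(mtp_loss(hidden, tokens))
```

With untrained random weights the loss sits near log(VOCAB), i.e. chance level; training would push it down, and the extra heads give the model denser supervision per sequence.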
And Noam Shazeer will talk about this, about how only around 5% of his ideas work.
So even he, the vaunted god of model architecture design, has a relatively low hit rate, but he just tries so many things.
MARK MANDELMANN- Right.
I actually think your rate of progress almost doesn't change that much, so long as he's able to completely implement his ideas.
FRANCESC CAMPOY- If you have Noam Shazeer at 100x speed, that's still kind of wild.
MARK MANDELMANN- Yeah.
FRANCESC CAMPOY- There are all these fallback worlds that are still wild, where even if you don't get 100% Noam Shazeer-level intuition in model design, it's still OK if you just accelerate him by 100x.
MARK MANDELMANN- Right.
There is conceptual understanding there.
Deep conceptual understanding.
Also, by the way, ML research is, like, one of the easier things to RL on in some respects once you get to a certain level of capability.
It's a very well-defined objective function.
Did the loss go down?
Make number go down.
Or make number go up, depending on which number it is.