Sholto Douglas
Yeah.
I think their research taste is good in the same way that Noam's research taste is good.
Noam Brown?
Noam Shazeer.
Okay.
Noam Brown also has good research taste, but I mean Noam Shazeer.
They very clearly understand this dance between the hardware systems you're designing the models around and the algorithmic side of it.
And this manifests in the way the models give this sense of being perfectly designed up to their constraints.
And you can very clearly see what constraints they're thinking about as they iteratively solve these problems.
So let's take the base transformer and diff that against DeepSeek-V2 and V3.
You can see them running up against the memory bandwidth bottleneck in attention.
Initially they do MLA (multi-head latent attention) to address this: they trade flops for memory bandwidth, basically.
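A minimal sketch of that tradeoff, with illustrative dimensions (these are assumed numbers for the sketch, not DeepSeek's published config): standard multi-head attention caches full keys and values per head, while MLA caches only a small compressed latent per layer and spends extra matmul flops at decode time reconstructing K/V from it.

```python
BYTES = 2  # fp16/bf16 element size

def mha_kv_bytes(n_layers, n_heads, head_dim):
    # Standard attention caches full K and V for every head, per token.
    return n_layers * 2 * n_heads * head_dim * BYTES

def mla_kv_bytes(n_layers, latent_dim):
    # MLA-style caching stores one joint compressed latent per layer;
    # K/V are up-projected from it on the fly (costing flops, not bytes).
    return n_layers * latent_dim * BYTES

# Hypothetical model shape, chosen only to show the scale of the saving.
mha = mha_kv_bytes(n_layers=60, n_heads=128, head_dim=128)
mla = mla_kv_bytes(n_layers=60, latent_dim=512)
print(mha, mla, mha / mla)  # cache shrinks by the ratio of the last value
```

The per-token cache shrinks by whatever factor the latent compresses the K/V projections, which is exactly the memory-bandwidth win bought with extra compute.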
And then they do this thing called NSA (native sparse attention), where they more selectively load from memory.
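A toy sketch of why selective loading helps (the function and numbers here are hypothetical, not NSA's actual selection mechanism): if decode-time attention reads only a top-k subset of KV blocks instead of the whole cache, the bytes moved drop by that fraction.

```python
def sparse_load_fraction(seq_len, block_size, top_k):
    # Fraction of the KV cache actually read if attention loads only
    # top_k blocks out of the full set (hypothetical illustration).
    n_blocks = -(-seq_len // block_size)  # ceil division
    return min(top_k, n_blocks) / n_blocks

# At 64k context with 64-token blocks, reading 16 blocks touches
# a small fraction of the cache's memory traffic.
print(sparse_load_fraction(seq_len=64_000, block_size=64, top_k=16))
```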
And you can see, actually, this is because the model they trained with MLA was on H800s, so they had a lot of flops.
And so they were like, okay, we can freely use the flops.
But then the export controls from Biden came in, or they knew they would have fewer of those chips going forward.
And so they traded off to a more memory bandwidth-oriented algorithmic solution there.
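The roofline arithmetic behind that tradeoff can be sketched as follows (chip numbers are rough assumptions for illustration, not official specs): when a decode step's flops-per-byte is far below the chip's flops-to-bandwidth ratio, the step is memory-bandwidth-bound, so spending extra flops to shrink the bytes moved is nearly free.

```python
# Assumed, approximate accelerator figures (illustrative only).
peak_flops = 990e12   # dense bf16 throughput, flops/s
peak_bw = 3.35e12     # HBM bandwidth, bytes/s

# Arithmetic intensity needed to be compute-bound (the "ridge point").
ridge = peak_flops / peak_bw  # flops per byte

# Decode-time attention does on the order of one multiply-add per
# KV element read, i.e. roughly ~1-2 flops per byte.
attn_intensity = 2.0

print(round(ridge), attn_intensity)
```

With a ridge point in the hundreds of flops per byte and attention sitting around one or two, the decode step is overwhelmingly bandwidth-bound, which is why the algorithmic fixes above all attack bytes moved rather than flops.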