Yannis Antonoglou
Yeah, this is a really technical difference between what we mean by dense models and what we mean by MoEs, and how MoEs might sound like much larger models but at the same time be quite efficient when it comes to inference.
So at the heart of it, you can think of a mixture of experts model as many, many models put right next to each other. And then there's what we call a router, which is really a system that selects which of the models to route each forward pass to.
So when I have a dense model, I guess attention is the idea of looking at every token in the past and trying to predict the next token based on all of them. There's an everything-to-everything connection. That's the fully connected dense model.
In the mixture of experts, you have many of these models, one right next to the other, and a system that selects during inference, at runtime, which of the models to route the input to.
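As a minimal sketch of that routing idea (the dimensions, the argmax router, and the single weight matrix standing in for each expert are all illustrative assumptions, not any particular model's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts = 16, 4

# Each "expert" here is just one weight matrix standing in for a small model.
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]

# The router scores each expert for a given input vector.
router_w = rng.normal(size=(d_model, n_experts))

def moe_forward(x):
    scores = x @ router_w               # one score per expert
    chosen = int(np.argmax(scores))     # route to the highest-scoring expert
    return experts[chosen] @ x, chosen  # only that expert actually runs

x = rng.normal(size=d_model)
y, which = moe_forward(x)
print("routed to expert", which)
```

The efficiency he mentions falls out of the last line of `moe_forward`: the parameters of the unchosen experts sit idle, so the compute per forward pass is much smaller than the total parameter count suggests.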
No, no, no, that's not correct, actually. Because GPT-4 or GPT-4.5 could also be mixture of experts models.
It's more about the architecture. The architecture is as if you took and trained many small LLMs, put them together, and then trained a system that can route between them. And this way you have many, many experts contributing to the final output.
Although, you know, you don't actually do it that way; you don't have full separate models. You just have them interspersed, right? As in, you have layers of experts, and they contribute to the same final output.
Yeah, so you can think of it as a path that the model can take, using different experts along the way to produce the final output.
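A rough sketch of that path, assuming a toy stack where every layer is an MoE layer with its own router picking one expert (the number of layers, the expert count, and all shapes are invented for the illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_experts, n_layers = 16, 4, 3

# Each layer has its own set of experts and its own router.
layers = [([rng.normal(size=(d, d)) for _ in range(n_experts)],
           rng.normal(size=(d, n_experts))) for _ in range(n_layers)]

x = rng.normal(size=d)
path = []
for experts, router_w in layers:
    k = int(np.argmax(x @ router_w))   # a possibly different expert per layer
    path.append(k)
    x = np.tanh(experts[k] @ x)        # only the chosen expert runs

print("experts used along the way:", path)  # the "path" through the model
```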
No, it's actually at the level of the token that you do that. So with the mixture of experts, I guess the idea is that each expert learns something from the data, but it's not at that level of abstraction. It's at a lower level of abstraction.
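To make the token-level point concrete, here's a minimal per-token routing sketch (top-1 routing; the sequence length, widths, and expert count are made up for the example):

```python
import numpy as np

rng = np.random.default_rng(2)
seq_len, d, n_experts = 6, 8, 3

tokens = rng.normal(size=(seq_len, d))            # one vector per token
router_w = rng.normal(size=(d, n_experts))
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]

# The routing decision is made per token, not per query or per task:
# neighbouring tokens in the same sequence can land on different experts.
assignments = np.argmax(tokens @ router_w, axis=1)
outputs = np.stack([experts[e] @ t for t, e in zip(tokens, assignments)])

print("expert chosen for each token:", assignments)
```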