Okay.
Because across people.
Or grow the model, add another little bit to it here.
Yeah, I've been sort of sketching out this vision for a while under the Pathways name.
And we've been building the infrastructure for it.
So a lot of what the Pathways system can support is this kind of twisty, weird model with asynchronous updates to different pieces.
Yeah, we should go back and figure out the ML.
And we're using Pathways to train our Gemini models, but we're not making use of some of its capabilities yet.
But maybe we should.
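To make the "asynchronous updates to different pieces" idea concrete, here is a toy Python sketch, not the actual Pathways API; the module names, update sizes, and schedules are invented for illustration. Each module's parameters are updated by its own worker on its own schedule, with no global synchronization barrier.

```python
# Toy sketch of asynchronously updated modules (hypothetical, not Pathways).
import threading
import time
import numpy as np

# Each named "module" of a larger model keeps its own parameter vector.
modules = {name: np.zeros(4) for name in ("vision", "math", "translation")}
lock = threading.Lock()

def train_module(name: str, steps: int, delay: float) -> None:
    """Hypothetical per-module trainer: applies updates at its own pace."""
    for _ in range(steps):
        update = np.random.randn(4) * 0.01  # stand-in for a real gradient step
        with lock:
            modules[name] += update
        time.sleep(delay)  # different modules progress at different rates

workers = [
    threading.Thread(target=train_module, args=("vision", 50, 0.001)),
    threading.Thread(target=train_module, args=("math", 20, 0.005)),
    threading.Thread(target=train_module, args=("translation", 5, 0.02)),
]
for w in workers:
    w.start()
for w in workers:
    w.join()

print({name: params.round(3) for name, params in modules.items()})
```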
Yeah, I mean, I do think distillation is a really useful tool because it enables you to transform a model from its current architectural form into a different form.
You know, often you use it to take a really capable but kind of large and unwieldy model and distill it into a smaller one that maybe you want to serve with really fast, low-latency inference.
But I think you can also view this as something that's happening at the module level.
Like maybe there'd be a continual process where each module has a few different representations of itself.
It has a really big one and a much smaller one, and the big one is continually distilling into the small version.
Once that's finished, then you sort of delete the big one and you add a bunch more parameter capacity and now start to learn all the things that the distilled small one doesn't know by training it on more data.
And then you kind of repeat that process.
And if you have that kind of thing running in a thousand different places in your modular model in the background, that seems like it would work reasonably well.
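As a rough illustration of that distill-then-grow cycle, here is a minimal PyTorch sketch; the module sizes, the MSE distillation loss, and the way extra capacity is added are all assumptions made for illustration, not details given in the conversation.

```python
# Toy sketch of the distill-then-grow cycle for a single module.
import torch
import torch.nn as nn
import torch.nn.functional as F

big = nn.Linear(16, 16)    # the large, capable version of a module
small = nn.Linear(16, 16)  # the compact version being distilled into
opt = torch.optim.SGD(small.parameters(), lr=1e-2)

# Phase 1: continually distill the big module into the small one.
for _ in range(1000):
    x = torch.randn(32, 16)
    with torch.no_grad():
        teacher_out = big(x)
    loss = F.mse_loss(small(x), teacher_out)  # match the teacher's outputs
    opt.zero_grad()
    loss.backward()
    opt.step()

# Phase 2: "delete the big one and add a bunch more parameter capacity":
# the distilled small module becomes the base, and fresh parameters are
# stacked on top to learn what the small version doesn't yet know.
grown = nn.Sequential(small, nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 16))
# ...train `grown` on more data, then repeat the distill-and-grow cycle.
```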
Yeah, you can have multiple versions and, you know, this is an easy math problem, so I'm going to route it to the really tiny distilled math thing, and, oh, this one's really hard.
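A tiny sketch of that kind of difficulty-based routing between variants of a module; the difficulty estimator and both models here are hypothetical stand-ins.

```python
# Toy sketch: route easy inputs to a small distilled model, hard ones to a big one.
def route(problem: str, estimate_difficulty, tiny_math_model, big_math_model):
    """Send easy problems to the small distilled model, hard ones to the big one."""
    if estimate_difficulty(problem) < 0.5:
        return tiny_math_model(problem)
    return big_math_model(problem)

# Example usage with trivial stand-ins for the estimator and the two models.
answer = route(
    "2 + 2",
    estimate_difficulty=lambda p: 0.1 if len(p) < 10 else 0.9,
    tiny_math_model=lambda p: f"tiny model answers: {eval(p)}",
    big_math_model=lambda p: f"big model reasons carefully about: {p}",
)
print(answer)
```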