You know, like...
This one's super good at dates.
Looking at the example and... I mean, one thing I would say is, like, there is a bunch of work on interpretability of models and what they're doing inside.
And sort of expert-level interpretability is a sub-problem of that broader area.
I really like some of the work that my former intern, Chris Olah, and others did at Anthropic, where they trained a very sparse autoencoder and were able to deduce, you know, what characteristics some particular neuron in a large language model represents.
So they found like a Golden Gate Bridge neuron that's activated when you're talking about the Golden Gate Bridge.
And I think, you know, you could do that at the expert level.
You could do that at a variety of different levels and get pretty interpretable results, and it's a little unclear if you necessarily need that.
If the model is just really good at stuff, you know, we don't necessarily care what every neuron in the Gemini model is doing as long as the collective output and characteristics of the overall system are good.
You know, that's one of the beauties of deep learning: you don't need to understand or hand-engineer every last feature.
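[To make the sparse-autoencoder idea above concrete, here is a minimal sketch. All shapes, names, and hyperparameters are illustrative assumptions, not Anthropic's actual setup: a wide ReLU encoder with an L1 sparsity penalty decomposes model activations into features, and a trained feature that fires only on Golden Gate Bridge text would be the kind of interpretable unit being described.]

```python
# Minimal sparse-autoencoder sketch (hypothetical shapes and names).
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Decomposes model activations into a wide, sparse feature basis."""
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, acts: torch.Tensor):
        # ReLU keeps features non-negative; the L1 penalty below makes them sparse.
        features = torch.relu(self.encoder(acts))
        recon = self.decoder(features)
        return recon, features

sae = SparseAutoencoder(d_model=512, d_features=8192)
acts = torch.randn(64, 512)  # stand-in for activations collected from a model
recon, features = sae(acts)
loss = nn.functional.mse_loss(recon, acts) + 1e-3 * features.abs().mean()
loss.backward()
# After training, inspecting which inputs activate a given feature column is
# how one would find something like a "Golden Gate Bridge" feature.
```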
Right, but you still have a smaller batch at each expert that then goes through.
And in order to get reasonable balance, one of the things the current models typically do is have all the experts be roughly the same compute cost. Then you run roughly the same size batches through them, so the very large batch you're doing at inference time propagates through with good efficiency.
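[As a rough illustration of the equal-cost, equal-batch setup being described, here is a sketch of top-1 routing over uniform experts. Sizes and names are hypothetical, and real routers add capacity limits and load-balancing losses on top of this.]

```python
# Uniform-expert routing sketch (hypothetical sizes; top-1 routing only).
import torch
import torch.nn as nn

num_experts, d_model, num_tokens = 4, 512, 1024
router = nn.Linear(d_model, num_experts)
experts = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(num_experts))

x = torch.randn(num_tokens, d_model)
expert_ids = router(x).argmax(dim=-1)  # each token picks one expert

out = torch.zeros_like(x)
for i, expert in enumerate(experts):
    idx = (expert_ids == i).nonzero(as_tuple=True)[0]
    # Because every expert costs the same, hardware efficiency depends on
    # each of these per-expert batches being close to num_tokens / num_experts.
    out[idx] = expert(x[idx])
```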
But I think, you know, in the future you might often want experts that vary in computational cost by factors of 100 or 1,000.
Yeah.
Or maybe paths that go through many layers in one case and, you know, a single layer or even a skip connection in the other case.
And there, I think you're still going to want very large batches, but you're going to want to push things through the model a little bit asynchronously at inference time, which is a little easier than at training time.
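[Here is a sketch of what such heterogeneous-cost paths might look like. The sizes and names are hypothetical assumptions; the asynchronous batching is only noted in a comment, and the loop below runs synchronously for clarity.]

```python
# Heterogeneous-path routing sketch: paths differing enormously in cost.
import torch
import torch.nn as nn

d_model = 512
cheap = nn.Identity()                       # skip-connection-like path
single = nn.Linear(d_model, d_model)        # single-layer path
deep = nn.Sequential(*[                     # many-layer path, far more compute
    nn.Sequential(nn.Linear(d_model, 4 * d_model),
                  nn.ReLU(),
                  nn.Linear(4 * d_model, d_model))
    for _ in range(8)
])
paths = [cheap, single, deep]

router = nn.Linear(d_model, len(paths))
x = torch.randn(1024, d_model)
choice = router(x).argmax(dim=-1)

# At inference time one would queue tokens per path and run each path
# whenever its queue fills a big enough batch, letting tokens finish out
# of order (the "asynchronous" part). Here we just loop synchronously.
out = torch.zeros_like(x)
for i, path in enumerate(paths):
    idx = (choice == i).nonzero(as_tuple=True)[0]
    if idx.numel():
        out[idx] = path(x[idx])
```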