
Noam Shazeer

👤 Speaker
See mentions of this person in podcasts
692 total appearances

Podcast Appearances

Dwarkesh Podcast
Jeff Dean & Noam Shazeer – 25 years at Google: from PageRank to AGI

You know, like...

This one's super good at dates.

looking at the example and... I mean, one thing I would say is, like, there is a bunch of work on interpretability of models and what are they doing inside.

And sort of expert-level interpretability is a sub-problem of that broader area.

I really like some of the work that my former intern, Chris Olah, and others did at Anthropic, where they trained a very sparse autoencoder and were able to deduce, you know, what characteristics some particular neuron in a large language model represents.

So they found like a Golden Gate Bridge neuron that's activated when you're talking about the Golden Gate Bridge.
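
A minimal sketch of the sparse-autoencoder idea described above, assuming PyTorch; the dimensions, random stand-in activations, and hyperparameters are invented for illustration and are not Anthropic's actual setup. The point is to learn an overcomplete, sparse set of latent features on a model's activations so that individual features (like a "Golden Gate Bridge" feature) become human-interpretable.

```python
# Toy sketch (illustrative only): a sparse autoencoder trained on captured
# activations so individual latent features become interpretable.
import torch
import torch.nn as nn

d_model, d_latent = 512, 4096           # latent space is overcomplete
acts = torch.randn(10_000, d_model)     # stand-in for captured LLM activations

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model, d_latent):
        super().__init__()
        self.enc = nn.Linear(d_model, d_latent)
        self.dec = nn.Linear(d_latent, d_model)

    def forward(self, x):
        z = torch.relu(self.enc(x))     # non-negative codes, pushed to be sparse
        return self.dec(z), z

sae = SparseAutoencoder(d_model, d_latent)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3                          # sparsity penalty weight

for step in range(100):
    batch = acts[torch.randint(0, len(acts), (256,))]
    recon, z = sae(batch)
    # reconstruction loss plus an L1 penalty that encourages sparse codes
    loss = ((recon - batch) ** 2).mean() + l1_coeff * z.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# After training, a latent that fires strongly on "Golden Gate Bridge" text and
# rarely otherwise is the kind of interpretable feature being described.
```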

And I think, you know, you could do that at the expert level.

You could do that at a variety of different levels and get pretty interpretable results, and it's a little unclear if you necessarily need that.

If the model is just really good at stuff, you know, we don't necessarily care what every neuron in the Gemini model is doing as long as the collective output and characteristics of the overall system are good.

You know, that's one of the beauties of deep learning is you don't need to understand or hand engineer every last feature.

Right, but you still have a smaller batch at each expert that then goes through.

And in order to get kind of reasonable balance, like one of the things that the current models typically do is they have all the experts be roughly the same compute cost.

And then you run roughly the same size batches through them in order to sort of propagate the very large batch you're doing at inference time and have good efficiency.
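
A minimal sketch of the balanced mixture-of-experts routing being described, assuming PyTorch; the layer sizes and simple top-1 routing are invented for illustration and this is not the Gemini implementation. With equal-cost experts and a roughly uniform router, each expert ends up processing a similar-sized slice of the large inference batch, which keeps the per-expert matrix multiplies efficient.

```python
# Toy sketch (illustrative only): top-1 expert routing over a large batch,
# where every expert has the same compute cost.
import torch
import torch.nn as nn

d_model, n_experts, n_tokens = 256, 8, 4096

experts = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_experts))
router = nn.Linear(d_model, n_experts)

x = torch.randn(n_tokens, d_model)       # token activations for one big batch
expert_idx = router(x).argmax(dim=-1)    # top-1 expert chosen per token

out = torch.empty_like(x)
for e, expert in enumerate(experts):
    sel = expert_idx == e                # tokens routed to expert e
    # With a well-balanced router, sel.sum() is close to n_tokens / n_experts,
    # so each expert sees a sub-batch of roughly the same size.
    out[sel] = expert(x[sel])
```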

But I think, you know, you often in the future might want experts that vary in computational cost by factors of 100 or 1,000.

Yeah.

Or maybe paths that go for many layers in one case and, you know, a single layer or even a skip connection in the other case.

And there, I think you're going to want very large batches still, but you're going to want to kind of push things through the model a little bit asynchronously at inference time, which is a little easier than at training time.
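
A hypothetical sketch of that heterogeneous-path idea, assuming PyTorch; the two-way router, path depths, and sizes are invented for illustration. Some tokens take a near-free skip connection while others go through a deep stack, so per-token compute can differ by orders of magnitude, and at inference the two groups could in principle be processed asynchronously.

```python
# Toy sketch (illustrative only): a router sends each token through either a
# cheap skip connection or an expensive multi-layer path.
import torch
import torch.nn as nn

d_model = 256

cheap_path = nn.Identity()               # skip connection: essentially free
costly_path = nn.Sequential(             # deep stack: far more compute per token
    *[nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU()) for _ in range(16)]
)
router = nn.Linear(d_model, 2)

x = torch.randn(1024, d_model)
choice = router(x).argmax(dim=-1)        # 0 = cheap path, 1 = costly path

out = torch.empty_like(x)
out[choice == 0] = cheap_path(x[choice == 0])
out[choice == 1] = costly_path(x[choice == 1])
# At inference time the two groups could run asynchronously so cheap tokens
# are not blocked behind expensive ones.
```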