
Noam Shazeer

Speaker
692 total appearances

Podcast Appearances

Dwarkesh Podcast
Jeff Dean & Noam Shazeer – 25 years at Google: from PageRank to AGI

Okay.

Because across people.

Or grow the model, add another little bit of it here.

Yeah, I've been sketching out this vision for a while under the Pathways name.

And we've been building the infrastructure for it.

So a lot of what Pathways, the system, can support is this kind of twisty, weird model with asynchronous updates to different pieces.
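
As a rough illustration of what "asynchronous updates to different pieces" could mean, here is a minimal sketch in plain Python. The module names, update rule, and timing jitter are all invented for illustration; this is not the Pathways API, just the shape of the idea: independent pieces of one model advancing at their own pace, with no global synchronization barrier.

import random
import threading
import time

# Hypothetical modular "model": each piece is just a parameter vector here.
modules = {name: [0.0] * 4 for name in ("vision", "text", "math")}
locks = {name: threading.Lock() for name in modules}

def worker(steps: int) -> None:
    # Each worker independently picks a piece and updates it; updates to
    # different pieces proceed without waiting on each other.
    for _ in range(steps):
        name = random.choice(list(modules))
        with locks[name]:  # per-module lock only, no global barrier
            grad = [random.gauss(0.0, 1.0) for _ in modules[name]]
            modules[name] = [p - 0.01 * g for p, g in zip(modules[name], grad)]
        time.sleep(random.uniform(0.0, 0.002))  # workers drift out of sync

threads = [threading.Thread(target=worker, args=(200,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print({name: [round(p, 2) for p in ps] for name, ps in modules.items()})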

Yeah, we should go back and figure out the ML.

And we're using Pathways to train our Gemini models, but we're not making use of some of its capabilities yet.

But maybe we should.

Yeah, I mean, I do think distillation is a really useful tool, because it enables you to transform a model from its current architecture into a different form.

You know, often you use it to take a really capable but large and unwieldy model and distill it into a smaller one that you want to serve with really good, fast-latency inference characteristics.
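
The recipe being described here is, in generic form, standard knowledge distillation: train the small student to match the big teacher's softened output distribution. A minimal PyTorch sketch follows; the layer sizes, temperature, and random stand-in data are assumptions for illustration, not Google's setup.

import torch
import torch.nn as nn
import torch.nn.functional as F

teacher = nn.Sequential(nn.Linear(32, 256), nn.ReLU(), nn.Linear(256, 10))  # big, unwieldy
student = nn.Sequential(nn.Linear(32, 16), nn.ReLU(), nn.Linear(16, 10))    # small, fast to serve
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
temperature = 2.0  # softens both distributions so small logit gaps still carry signal

for step in range(100):
    x = torch.randn(64, 32)  # stand-in for real training inputs
    with torch.no_grad():
        teacher_logits = teacher(x)
    student_logits = student(x)
    loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature**2  # conventional scaling for the temperature
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()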

But I think you can also view this as something that's happening at the module level.

Like maybe there'd be a continual process where each module has a few different representations of itself.

It has a really big one.

It's got a much smaller one, and the big one is continually distilling into the small version.

And then the small version...

Once that's finished, you delete the big one, add a bunch more parameter capacity, and start to learn all the things that the distilled small one doesn't know by training it on more data.

And then you kind of repeat that process.
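
That cycle can be written down as a toy loop. In this sketch a "module" is reduced to a plain list of floats, and the distill, grow, and train steps are invented stand-ins (real distillation would use a loss like the PyTorch example above); the loop structure is the point, not the math.

import random

def distill(big, small_size):
    # Stand-in for distillation: compress the big version into fewer parameters.
    chunk = len(big) // small_size
    return [sum(big[i * chunk:(i + 1) * chunk]) / chunk for i in range(small_size)]

def grow(small, extra):
    # Keep the distilled weights and bolt on fresh, untrained capacity.
    return small + [0.0] * extra

def train_on_new_data(params):
    # Stand-in for learning what the distilled small version doesn't know.
    return [p + random.gauss(0.0, 0.1) for p in params]

module = [random.gauss(0.0, 1.0) for _ in range(64)]  # the "really big one"
for round_ in range(3):
    small = distill(module, small_size=16)    # continually distill into the small version
    module = grow(small, extra=len(module))   # delete the big one, add more capacity
    module = train_on_new_data(module)        # train it on more data
    print(f"round {round_}: {len(module)} params")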

And if you have that kind of thing running in a thousand different places in your modular model in the background, that seems like it would work reasonably well.

Yeah, you can have multiple versions, and, you know, this is an easy math problem, so I'm going to route it to the really tiny math-distilled thing, and, oh, this one's really hard.
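
That routing idea reduces to a small dispatch sketch. The difficulty heuristic and the two model stand-ins below are invented for illustration; in practice the router itself would presumably be learned.

def looks_easy(problem: str) -> bool:
    # Invented heuristic; a real router would be a learned model.
    return len(problem) < 40

def tiny_math_model(problem: str) -> str:
    return f"[tiny distilled model] {problem}"   # cheap, low-latency version

def big_model(problem: str) -> str:
    return f"[big model] {problem}"              # full-capacity version

def route(problem: str) -> str:
    return tiny_math_model(problem) if looks_easy(problem) else big_model(problem)

print(route("2 + 2"))  # easy -> routed to the tiny distilled version
print(route("evaluate the integral of exp(-x^2) over the whole real line"))  # hard -> big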