
Noam Shazeer

Speaker
692 total appearances


Podcast Appearances

Dwarkesh Podcast
Jeff Dean & Noam Shazeer – 25 years at Google: from PageRank to AGI

And then we have a bunch of work from even the early Brain days, when we were using CPU machines and they were really slow.

So we needed to do asynchronous training to help scale, where each copy of the model would kind of do some local computation and then send gradient updates to a centralized system, and then apply them asynchronously, and another copy of the model would be doing the same thing.

You know, it makes your model parameters kind of wiggle around a bit and it makes people uncomfortable with the theoretical guarantees, but it actually seems to work in practice.
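
That parameter-server-style setup is easy to sketch. Below is a minimal toy version in Python (my own illustration, not the actual Brain-era system; all names and numbers are made up): several worker threads each grab a possibly stale copy of the shared parameters, compute a gradient on their own mini-batch, and push the update back without waiting for the other workers.

```python
# Toy sketch of asynchronous, parameter-server-style SGD (illustrative only).
import threading
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1024, 4))
true_w = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ true_w

params = np.zeros(4)        # the shared "centralized system" holding the parameters
lock = threading.Lock()

def worker(worker_id: int, steps: int, lr: float = 0.05, batch: int = 32) -> None:
    local_rng = np.random.default_rng(worker_id)
    for _ in range(steps):
        # 1. Fetch a (possibly stale) copy of the current parameters.
        with lock:
            local = params.copy()
        # 2. Do some local computation: a least-squares gradient on a mini-batch.
        idx = local_rng.integers(0, len(X), size=batch)
        grad = 2 * X[idx].T @ (X[idx] @ local - y[idx]) / batch
        # 3. Send the gradient update back and apply it asynchronously; other
        #    workers may have moved the parameters in the meantime.
        with lock:
            params[:] -= lr * grad

threads = [threading.Thread(target=worker, args=(i, 200)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("learned:", np.round(params, 2), "target:", true_w)
```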

So one way to do that is you effectively record the sequence of operations, like which gradient update happened and when and on which batch of data. You don't necessarily record the actual gradient update in a log or something, but you could replay that log of operations so that you get repeatability. Then I think you'd be happier.
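
A minimal sketch of that replay idea (the function and variable names below are illustrative, not from the talk): the training loop logs only the metadata needed to regenerate each update, here the step order and the batch indices, and replaying the log reproduces the exact same parameters without any gradient ever having been stored.

```python
# Toy sketch of logging the sequence of operations and replaying it (illustrative only).
import numpy as np

def train_and_log(X, y, steps=100, lr=0.05, batch=32, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    log = []                                  # the operation log: which batch, in what order
    for step in range(steps):
        idx = rng.integers(0, len(X), size=batch)
        log.append({"step": step, "batch_indices": idx})
        grad = 2 * X[idx].T @ (X[idx] @ w - y[idx]) / batch
        w -= lr * grad                        # the gradient itself is never logged
    return w, log

def replay(X, y, log, lr=0.05):
    # Re-apply the recorded operations in the recorded order; with deterministic
    # kernels this reproduces the original parameter trajectory bit for bit.
    w = np.zeros(X.shape[1])
    for entry in log:
        idx = entry["batch_indices"]
        grad = 2 * X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
        w -= lr * grad
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(512, 4))
y = X @ np.array([1.0, -2.0, 0.5, 3.0])
w_orig, log = train_and_log(X, y)
w_replayed = replay(X, y, log)
print("replay matches original run:", np.array_equal(w_orig, w_replayed))
```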

I mean, the thing that let us go from asynchronous training on CPUs to fully synchronous training is the fact that we have these super fast TPU hardware chips, and then pods, which have incredible amounts of bandwidth between the chips in a pod. And then scaling beyond that, we have really good data center networks and even cross-metro-area networks that enable us to scale to, you know, many, many pods in multiple metro areas for our largest training runs.

And we can do that fully synchronously, as Noam said, as long as the gradient accumulation and communication of the parameters across metro areas happens, you know, fast enough relative to the step time, you're golden.

You don't really care.
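
As a rough back-of-envelope check of that condition (every number below is an assumption for illustration, not a figure from the conversation): synchronous multi-metro training holds together as long as shipping the accumulated gradients between metro areas takes less time than, or can be overlapped with, a training step.

```python
# All figures are illustrative assumptions, not numbers from the talk.
param_count = 100e9              # assumed model size: 100B parameters
bytes_per_value = 2              # assumed bf16 gradients
cross_metro_bytes_per_s = 400e9  # assumed usable cross-metro bandwidth: 400 GB/s
step_time_s = 5.0                # assumed compute time per synchronous step

# Time to ship one full set of accumulated gradients between metro areas.
comm_time_s = param_count * bytes_per_value / cross_metro_bytes_per_s
print(f"gradient exchange: ~{comm_time_s:.2f} s per step")

# Fully synchronous training stays viable when the communication fits inside
# (or overlaps with) the step time.
print("fits within step time:", comm_time_s <= step_time_s)
```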

But I think as you scale up, there may be a push to have a bit more asynchrony in our system than we have now.

Because, like, we can make it work.

I've been, you know, our ML researchers have been really happy with how far we've been able to push synchronous training, because it's an easier mental model to understand.

You know, you just have your algorithm sort of fighting you, rather than the asynchrony and the algorithm kind of battling you.

Maybe it's your adversarial machine MUQQ17 that is, like, setting the seventh bit of your exponent in all your gradients or something.
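
For what it's worth, here is a tiny illustration (mine, not from the talk) of why one flipped exponent bit in a gradient is so destructive: in float32 the exponent occupies bits 23 through 30, so toggling a high exponent bit shifts the value by many orders of magnitude.

```python
# Illustrative only: flip one high exponent bit of a float32 gradient value.
import numpy as np

grad = np.array([0.0123], dtype=np.float32)   # a typical small gradient value
print("original :", grad[0])
bits = grad.view(np.uint32)                   # reinterpret the same 4 bytes as an integer
bits ^= np.uint32(1 << 29)                    # toggle one of the high exponent bits (word bit 29)
print("corrupted:", grad[0])                  # same mantissa, wildly different magnitude
```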