
Dwarkesh Patel

👤 Speaker
15267 total appearances

Podcast Appearances

Dwarkesh Podcast
Reiner Pope – The math behind how LLMs are trained and served

When you're operating within a single scale-up domain, is that a consideration specifically for either forward or backward?

Or specifically for pre-fill versus decode?

Or is it preferred to always be within a scale-up, whatever kind of workload you have, whether you're doing a pre-training run, RL generation, or inference for users?

Can I try to guess?

Just out of curiosity to see if I'm actually understanding.

It seems like you're sending batch size into the rack.

In here?

Yes.

But the communication within a rack is sort of batch size times number of GPUs.

And there's a need to multiply the whole thing by two for the up and down.

And there's a factor of two.
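
The guess above can be written out as a rough calculation. This is only a sketch of the arithmetic as stated in the question, not a verified cost model: the function name `rack_traffic`, the parameter `bytes_per_token`, and the choice to count traffic in bytes are all assumptions made for illustration.

```python
def rack_traffic(batch_tokens: int, gpus_per_rack: int, bytes_per_token: int = 1) -> dict:
    """Estimate traffic using the guess from the transcript (hypothetical model).

    - Traffic into the rack scales with batch size.
    - Traffic within the rack scales with batch size times number of GPUs,
      multiplied by two for the "up and down" directions.
    """
    into_rack = batch_tokens * bytes_per_token
    within_rack = 2 * batch_tokens * gpus_per_rack * bytes_per_token
    return {
        "into_rack": into_rack,
        "within_rack": within_rack,
        # How much more traffic stays inside the rack than enters it.
        "ratio": within_rack / into_rack,
    }


if __name__ == "__main__":
    print(rack_traffic(batch_tokens=1024, gpus_per_rack=8))
```

Under these assumptions the ratio of in-rack to into-rack traffic is simply twice the number of GPUs per rack, independent of batch size.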

It's interesting to me that the best parallelism strategy in practice ends up being one which physically resembles the actual architecture.

It's not some galaxy brain thing.

You know, it's like, oh, we have experts.

We're going to put them on different GPUs.

Oh, we have different layers.

We're going to put them on different racks.

I mean, it could have been something wackier with tensor parallelism and whatever.
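
The layout described here, experts on different GPUs and layers on different racks, can be sketched as a simple placement function. This is an illustrative assumption rather than anything specified in the episode: the name `place`, the contiguous layer split, and the round-robin expert assignment are all hypothetical choices.

```python
from math import ceil


def place(num_layers: int, num_experts: int, num_racks: int, gpus_per_rack: int):
    """Map layers to racks and experts to GPUs (hypothetical placement).

    Layers are split into contiguous groups, one group per rack.
    Experts are spread round-robin across the GPUs inside a rack.
    """
    layers_per_rack = ceil(num_layers / num_racks)
    layer_to_rack = {layer: layer // layers_per_rack for layer in range(num_layers)}
    expert_to_gpu = {expert: expert % gpus_per_rack for expert in range(num_experts)}
    return layer_to_rack, expert_to_gpu


if __name__ == "__main__":
    layer_to_rack, expert_to_gpu = place(
        num_layers=8, num_experts=16, num_racks=4, gpus_per_rack=8
    )
    print(layer_to_rack)
    print(expert_to_gpu)
```

The point of the sketch is that the mapping mirrors the model's own structure: the layer index picks the rack, and the expert index picks the GPU within it.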