Dwarkesh Patel

Speaker
15267 total appearances

Podcast Appearances

Dwarkesh Podcast
Reiner Pope – The math behind how LLMs are trained and served

Okay, like as in a frontier model today will actually have, during inference, have pipeline...

Dwarkesh Podcast
Reiner Pope – The math behind how LLMs are trained and served

I know this is wrong.

Dwarkesh Podcast
Reiner Pope – The math behind how LLMs are trained and served

I'm just trying to think out why my train of logic here is wrong.

Dwarkesh Podcast
Reiner Pope – The math behind how LLMs are trained and served

If you have many different, you're pipelining through many different stages, the KV values are not shared between layers.

Dwarkesh Podcast
Reiner Pope – The math behind how LLMs are trained and served

So why would it not help to be pipelining across multiple layers?

Dwarkesh Podcast
Reiner Pope – The math behind how LLMs are trained and served

Because then you don't have to store.

Dwarkesh Podcast
Reiner Pope – The math behind how LLMs are trained and served

You're right.

Dwarkesh Podcast
Reiner Pope – The math behind how LLMs are trained and served

Right.

Dwarkesh Podcast
Reiner Pope – The math behind how LLMs are trained and served

This is going back fundamentally to the point of you're not able to amortize across KV caches.

Dwarkesh Podcast
Reiner Pope – The math behind how LLMs are trained and served

So this goes back to the question, does that mean that frontier labs, when they're doing inference, are just basically within a single scale-up?

Dwarkesh Podcast
Reiner Pope – The math behind how LLMs are trained and served

So I guess this goes back to the question about, this goes back to the promise at the beginning of the lecture, which was, this will actually tell you about AI progress as well.

Dwarkesh Podcast
Reiner Pope – The math behind how LLMs are trained and served

To the extent it is the case that model size scaling has been slow until recently because, let me make sure I understand the claim.

Dwarkesh Podcast
Reiner Pope – The math behind how LLMs are trained and served

The claim would not be, you could have trained across more racks.

Dwarkesh Podcast
Reiner Pope – The math behind how LLMs are trained and served

It was just that it would not have made sense before, like we didn't have the ability to do inference for a bigger model easily.

Dwarkesh Podcast
Reiner Pope – The math behind how LLMs are trained and served

I was just about to ask.

Dwarkesh Podcast
Reiner Pope – The math behind how LLMs are trained and served

So what is the going from rack to rack?

Dwarkesh Podcast
Reiner Pope – The math behind how LLMs are trained and served

What is the latency cost per hop?

Dwarkesh Podcast
Reiner Pope – The math behind how LLMs are trained and served

Is four a realistic number of how many pipelining stages you might have?

Dwarkesh Podcast
Reiner Pope – The math behind how LLMs are trained and served

Wait, I guess it's 10 milliseconds per token.

Dwarkesh Podcast
Reiner Pope – The math behind how LLMs are trained and served

That's right.