Dwarkesh Patel

Speaker
15267 total appearances

Podcast Appearances

Dwarkesh Podcast
Reiner Pope – The math behind how LLMs are trained and served

Oh, sorry, extremely naive question.

Why is there not a quadratic term?

So what is the reason that there's no company which has over a million token context length?

If this is true?

And so there's this idea that Dario said on the podcast, and others have said, which is that we don't need continual learning for AGI; in-context learning is enough.

And if you believe that, then you have to think that we have to get to 100-million-token, maybe billion-token, context lengths to have an employee that is the equivalent of someone working with you for a month.

Now, maybe that's no longer true with sparse attention or something.

But yeah, if you think that, then some ML infra thing would have to change, like the memory bandwidth, to allow for 100-million-token context lengths.

Not because of the compute cost, but because of the memory bandwidth cost.

And why doesn't sparse attention solve it?

Why isn't the cost to retrieve from HBM just the bytes divided by the memory bandwidth?

Because if it's already in HBM, you can be doing compute while you're getting it from HBM to HBM?

Yeah, for example.

Okay.

And the price difference, I think, was... I'll look it up.

Okay, so the base input tokens are $5 per million tokens.

Which means remap.

Yeah, that's five.