Sholto Douglas
Yeah.
I think their research taste is good in the same way that Noam's research taste is good.
Noam Brown?
Noam Shazeer.
Okay.
Noam Brown also has good research taste, but I mean Noam Shazeer.
They very clearly understand this dance between the hardware systems you're designing the models around and the algorithmic side of it.
And this manifests in the way the models give this sense of being perfectly designed up to their constraints.
And you can very clearly see what constraints they're thinking about as they iteratively solve these problems.
So let's take the base transformer and diff that against DeepSeek-V2 and V3.
You can see them running up against the memory bandwidth bottleneck in attention.
Initially they do MLA (multi-head latent attention) to address this: they trade flops for memory bandwidth, basically.
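A minimal sketch of that tradeoff, with illustrative dimensions (these are assumed numbers for the sketch, not DeepSeek's published config): standard multi-head attention caches full keys and values per head, while MLA caches only a small compressed latent per layer and spends extra matmul flops at decode time reconstructing K/V from it.

```python
BYTES = 2  # fp16/bf16 element size

def mha_kv_bytes(n_layers, n_heads, head_dim):
    # Standard attention caches full K and V for every head, per token.
    return n_layers * 2 * n_heads * head_dim * BYTES

def mla_kv_bytes(n_layers, latent_dim):
    # MLA-style caching stores one joint compressed latent per layer;
    # K/V are up-projected from it on the fly (costing flops, not bytes).
    return n_layers * latent_dim * BYTES

# Hypothetical model shape, chosen only to show the scale of the saving.
mha = mha_kv_bytes(n_layers=60, n_heads=128, head_dim=128)
mla = mla_kv_bytes(n_layers=60, latent_dim=512)
print(mha, mla, mha / mla)  # cache shrinks by the ratio of the last value
```

The per-token cache shrinks by whatever factor the latent compresses the K/V projections, which is exactly the memory-bandwidth win bought with extra compute.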
And then they do this thing called NSA (native sparse attention), where they more selectively load from memory.
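A toy sketch of why selective loading helps (the function and numbers here are hypothetical, not NSA's actual selection mechanism): if decode-time attention reads only a top-k subset of KV blocks instead of the whole cache, the bytes moved drop by that fraction.

```python
def sparse_load_fraction(seq_len, block_size, top_k):
    # Fraction of the KV cache actually read if attention loads only
    # top_k blocks out of the full set (hypothetical illustration).
    n_blocks = -(-seq_len // block_size)  # ceil division
    return min(top_k, n_blocks) / n_blocks

# At 64k context with 64-token blocks, reading 16 blocks touches
# a small fraction of the cache's memory traffic.
print(sparse_load_fraction(seq_len=64_000, block_size=64, top_k=16))
```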
And you can see, actually, this is because the model they trained with MLA was on H800s, so they had a lot of flops.
And so they were like, okay, we can freely use the flops.
But then the export controls from Biden came in, or they knew they would have fewer of those chips going forward.
And so they traded off to a more memory bandwidth-oriented algorithmic solution there.
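The roofline arithmetic behind that tradeoff can be sketched as follows (chip numbers are rough assumptions for illustration, not official specs): when a decode step's flops-per-byte is far below the chip's flops-to-bandwidth ratio, the step is memory-bandwidth-bound, so spending extra flops to shrink the bytes moved is nearly free.

```python
# Assumed, approximate accelerator figures (illustrative only).
peak_flops = 990e12   # dense bf16 throughput, flops/s
peak_bw = 3.35e12     # HBM bandwidth, bytes/s

# Arithmetic intensity needed to be compute-bound (the "ridge point").
ridge = peak_flops / peak_bw  # flops per byte

# Decode-time attention does on the order of one multiply-add per
# KV element read, i.e. roughly ~1-2 flops per byte.
attn_intensity = 2.0

print(round(ridge), attn_intensity)
```

With a ridge point in the hundreds of flops per byte and attention sitting around one or two, the decode step is overwhelmingly bandwidth-bound, which is why the algorithmic fixes above all attack bytes moved rather than flops.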