Dwarkesh Patel
Podcast Appearances
And if you want to do pipelining in training, in order to avoid that bubble, you need to... Should we draw the training diagram?
It may be worth clarifying, the reason there is that hard stop is because you want to do a whole batch at once for the backward step.
And then there is an optimal size for how big that batch should be.
But there's this harder trade-off during training.
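To put a rough number on the bubble mentioned above, here is a minimal sketch. It assumes a simple GPipe-style schedule, where the batch is split into equal microbatches and every pipeline stage takes the same time per microbatch; the conversation does not name a specific schedule, so that choice and the function name `bubble_fraction` are illustrative assumptions. Under those assumptions, the idle fraction is (stages - 1) / (microbatches + stages - 1).

```python
def bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Idle ("bubble") fraction of a pipeline step.

    Assumes a GPipe-style schedule (an assumption, not from the talk):
    all microbatches flow forward, then backward, with a hard stop at
    the end of the batch, and every stage takes equal time per microbatch.
    """
    if num_stages < 1 or num_microbatches < 1:
        raise ValueError("num_stages and num_microbatches must be >= 1")
    return (num_stages - 1) / (num_microbatches + num_stages - 1)


if __name__ == "__main__":
    # With 4 stages, more microbatches per batch shrink the bubble.
    for m in (1, 4, 16, 64):
        print(f"microbatches={m:>2}  bubble={bubble_fraction(4, m):.3f}")
```

More microbatches per batch amortize the hard stop, which is one side of the trade-off: the batch still has an optimal size, so it cannot be grown indefinitely just to hide the bubble.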
Last week, Horace was kind enough to give me and my friends a great lecture on large-scale pre-training systems.
And there were some concepts that I wanted to animate for a write-up on my blog, like how weight shards and gradients flow depending on the parallelism that you're using.
So I gave Cursor my lecture notes and a sketch that I'd made during the lecture, and I asked it to visualize a specific hierarchical collective that Horace had explained.
The first version was already pretty good, and then I was able to use design mode to select and tweak any specific components from there.
I was able to do all of this without a clear end state in mind.
Cursor's Composer 2 fast model was quick enough that I was able to iterate almost instantaneously.
I could try an idea, test the results in the built-in browser, and immediately make any changes.
I went through 10 different versions in under 20 minutes.
If you want to check out this animation, I published it along with the lecture notes in a blog post.
The link is in the description.
And if you want to try out this kind of iterative design flow for yourself, go to cursor.com/dwarkesh to get started.
So, macro question.
Everybody's talking about the memory wall right now.
Memory's getting super expensive.
There's not enough memory.
Smartphone volume will go down 30% because there's not enough memory.