And if adaptive compute is solved, you can keep doing that.
And in this case, if there's a quadratic penalty for attention, but you're doing long context anyway, then you're still dumping in more compute, not during training or by having bigger models, but just through the context itself.
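To make that quadratic penalty concrete, here's a rough back-of-the-envelope sketch in Python (the constants are mine, not anything from the conversation): holding the model size fixed and only growing the context, the attention term in the compute bill scales with the square of the token count.

```python
def attention_flops(seq_len: int, d_model: int, n_layers: int) -> float:
    """Very rough FLOP count for the attention score and value-mixing steps.

    Per layer: the QK^T score matrix and the weighted sum over values each cost
    about 2 * seq_len^2 * d_model multiply-adds, so ~4 * seq_len^2 * d_model total.
    """
    per_layer = 4 * (seq_len ** 2) * d_model
    return n_layers * per_layer

for n in (1_000, 10_000, 100_000):
    flops = attention_flops(n, d_model=4096, n_layers=32)  # illustrative sizes
    print(f"{n:>7} tokens -> {flops:.2e} attention FLOPs")
# 10x more context -> ~100x more compute in this term: the quadratic penalty.
```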
Break down for me, you were referring to this in some of your previous answers:
listen, you have these long contexts and you can hold more things in memory, but it ultimately comes down to your ability to mix concepts together to do some kind of reasoning.
And these models aren't necessarily human level at that, even in context.
Break down for me how you see storing just raw information versus reasoning and what's in between.
Like, where's the reasoning happening?
And where's the storing of raw information happening?
What's different between them in these models?
Before we get deeper into this, we should explain to the audience: you referred earlier to Anthropic's way of thinking about transformers as these read-write operations that layers do.
One of you should just kind of explain at a high level what you mean by that.
Yeah.
I might just dumb it down, in a way that would have made sense to me a few months ago. Okay, so you have...
you know, whatever words are in the input you put into the model, all those words get converted into these tokens and those tokens get converted into these vectors.
And basically it's just like this small amount of information that's moving through the model.
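As a minimal sketch of that step (made-up vocabulary and tiny dimensions, nothing resembling a real model), the words become token ids, the ids become vectors, and those vectors are the only thing the model carries forward:

```python
import numpy as np

vocab = {"10": 0, "plus": 1, "5": 2, "equals": 3}   # toy vocabulary
d_model = 8                                          # real models use thousands of dims
embedding = np.random.randn(len(vocab), d_model)

tokens = [vocab[w] for w in "10 plus 5 equals".split()]
residual_stream = embedding[tokens]                  # shape: (num_tokens, d_model)

# From here on, every layer reads from and writes back into this
# (num_tokens x d_model) array; that's the "small amount of information"
# moving through the model.
print(residual_stream.shape)                         # (4, 8)
```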
And the way you explained it to me, Sholto, and the way this paper talks about it, is this:
Early on in the model, maybe it's just doing some very basic things about what these tokens mean. Like, if it says 10 plus 5, it's just moving information about to have that good representation. Exactly, just representing it. In the middle, maybe the deeper thinking is happening about how to think, how to solve this. At the end, you're converting it back into the output token, because the end product is you're trying to predict the probability of the next token from the last of those residual streams.
And so, yeah, it's interesting to think about this small, compressed amount of information moving through the model and how it's getting modified in different ways.
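A schematic of that read-write picture, with the layer internals stubbed out because only the data flow is the point here (names and sizes are illustrative, not taken from any actual model):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def layer_update(x: np.ndarray) -> np.ndarray:
    # Stand-in for whatever an attention or MLP block would compute after
    # "reading" the residual stream; a real layer is much more than noise.
    return 0.01 * np.random.randn(*x.shape)

n_layers, n_tokens, d_model, vocab_size = 4, 4, 8, 50
residual_stream = np.random.randn(n_tokens, d_model)          # token embeddings

for _ in range(n_layers):
    residual_stream = residual_stream + layer_update(residual_stream)  # "write" back

# The next token is predicted from the final residual stream at the last position.
unembedding = np.random.randn(d_model, vocab_size)
next_token_probs = softmax(residual_stream[-1] @ unembedding)
print(next_token_probs.shape)                                  # (50,)
```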
Trenton, it's interesting.
You're one of the few people who have a background in neuroscience.