Trenton Bricken
And so the three-way convergence here, and the takeoff and success of Transformers, seem pretty striking to me.
Yeah, so maybe my hot take here, I don't know how hot it is, is that most intelligence is pattern matching.
And you can do a lot of really good pattern matching if you have a hierarchy of associative memories.
You start with your very basic associations between objects in the real world.
But you can then chain those and have more abstract associations, such as a wedding ring symbolizes so many other associations that are downstream.
And you can even generalize this associative memory view from the attention operation to the MLP layer as well.
It's just in a long-term setting, where you don't have the tokens in your current context.
But I think this is an argument that association is all you need.
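To make that reading concrete, here is a minimal sketch (not from the conversation; the names attention_lookup, keys, and values are illustrative) of attention as a soft associative lookup: a query is scored against stored keys, and a softmax-weighted blend of the corresponding values is returned.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_lookup(query, keys, values, beta=8.0):
    # Score the query against every stored key, then return a
    # similarity-weighted blend of the stored values.
    scores = keys @ query
    weights = softmax(beta * scores)  # sharper beta -> closer to exact recall
    return weights @ values

# Toy memory: three stored key/value pairs in a 4-dimensional space.
rng = np.random.default_rng(0)
keys = rng.standard_normal((3, 4))
values = rng.standard_normal((3, 4))

# A noisy version of the first key still mostly retrieves the first value.
query = keys[0] + 0.1 * rng.standard_normal(4)
print(attention_lookup(query, keys, values))
```

The same read operation, with the keys and values baked into fixed weights rather than built from the current context, is roughly the sense in which the MLP layer can be read as a long-term associative memory.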
And with associative memory in general, you can do two things with it.
You can denoise or retrieve a current memory.
So if I see your face, but it's raining and cloudy, I can denoise and gradually update my query towards my memory of your face.
But I can also access that memory, and then the value that I get out actually points to some other, totally different part of the space.
And so a very simple instance of this would be if you learn the alphabet, right?
And so I query for A and it returns B, I query for B and it returns C, and you can traverse the whole thing.
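As a rough illustration of those two uses (a hedged sketch, with made-up random vectors standing in for memories): the autoassociative case stores each pattern as both key and value, so repeated recall denoises a corrupted query back toward the stored pattern, while the heteroassociative case stores the next pattern as the value, so recall chains A to B to C through the alphabet.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def recall(query, keys, values, beta=8.0):
    # One associative-memory read: softmax similarity over the keys,
    # then return the matching blend of the values.
    return softmax(beta * (keys @ query)) @ values

rng = np.random.default_rng(1)
letters = rng.standard_normal((26, 16))   # one random vector per letter A..Z

# Autoassociative use: the values are the keys themselves, so repeated
# recall denoises a corrupted query back toward the stored pattern.
noisy_a = letters[0] + 0.5 * rng.standard_normal(16)
for _ in range(3):
    noisy_a = recall(noisy_a, keys=letters, values=letters)
print("denoised toward A:", np.argmax(letters @ noisy_a) == 0)

# Heteroassociative use: each letter's value is the next letter,
# so recall steps through the alphabet: A -> B -> C -> ...
next_letters = np.roll(letters, -1, axis=0)
state = letters[0]
for _ in range(3):
    state = recall(state, keys=letters, values=next_letters)
print("three hops from A land near:",
      "ABCDEFGHIJKLMNOPQRSTUVWXYZ"[int(np.argmax(letters @ state))])
```

The only difference between the two behaviors is what gets stored in the value slot; the lookup itself is identical.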
Yeah.
I think so, yeah.
So I think learning these higher-level associations, to be able to then map patterns to each other, is kind of like a meta-learning.
I think in this case, he would also just have a really long context length or a really long working memory, right?
Where he can have all of these bits and continuously query them as he's coming up with whatever theory, so that the theory is moving through the residual stream.