Trenton Bricken
Yeah, I don't have a super crisp answer for you here.
I mean, obviously with the input and output of the model, you're mapping back to actual tokens, right?
And then in between that, you're doing higher level processing.
So the residual stream, imagine you're in a boat going down a river.
And the boat is kind of the current query where you're trying to predict the next token.
So it's, "The cat sat on the ___."
And then you have these little streams that are coming off the river where you can get extra passengers or collect extra information if you want.
And those correspond to the attention heads and MLPs that are part of the model.
Right.
And you can operate on subspaces of that high-dimensional vector.
A ton of things are encoded in superposition. I mean, at this point, I think that's almost a given, right?
So it's like, yeah, the residual stream is just one high dimensional vector, but actually there's a ton of different vectors that are packed into it.
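To make the superposition point concrete, here's a minimal sketch (not from the conversation, just an illustration): more feature directions than dimensions are summed into a single residual-stream vector, and a dot-product readout still recovers which sparse subset was active. All names and sizes here are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_features = 64, 200  # 200 hypothetical features packed into a 64-dim stream

# Random near-orthogonal unit directions, one per feature.
features = rng.standard_normal((n_features, d))
features /= np.linalg.norm(features, axis=1, keepdims=True)

# Activate a sparse subset and sum their directions into one vector:
# "a ton of different vectors packed into" a single residual stream.
active = [3, 17, 42]
residual = features[active].sum(axis=0)

# Read back with dot products: active features score near 1,
# inactive ones sit near 0, up to small interference noise.
scores = features @ residual
recovered = set(np.argsort(scores)[-3:])
```

Because the inactive directions are only near-orthogonal, the readout is noisy; with enough simultaneously active features the interference would swamp the signal, which is why sparsity matters for superposition.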
Yeah, yeah.
So at least in this analogy, you basically do have a residual stream, where the whole, what we'll call the attention module for now, and I can go into whatever amount of detail you want on that.
You have inputs that route through it, but they'll also just go directly to the endpoint that module will contribute to.
So there's a direct path and an indirect path.
And so the model can pick up whatever information it wants and then add that back in.
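The direct/indirect path structure described here can be sketched in a few lines (an illustrative toy, not anyone's actual model code): the stream passes through unchanged on the direct path, while a module reads it and adds a contribution back on the indirect path.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16
x = rng.standard_normal(d)  # the residual stream state ("the boat")

# Stand-in for an attention head or MLP ("a little stream off the river"):
# it reads the current state and returns a small additive update.
W = rng.standard_normal((d, d)) * 0.1

def toy_module(v):
    return W @ v

# Direct path: x flows straight through untouched.
# Indirect path: the module's output is added back into the stream.
x_next = x + toy_module(x)
```

The key property is that the module only ever *adds* to the stream, so information it ignores still reaches later layers via the direct path.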
What happens to the cerebellum?
So the cerebellum nominally just does fine motor control.