Dwarkesh Patel
So that's a 35-million-fold difference in how much information per token is assimilated by the model.
I wonder if that's relevant at all.
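For a rough sense of where a number like that could come from, here is a hedged back-of-the-envelope sketch comparing bytes absorbed into the weights per pre-training token with bytes held per token in the KV cache during in-context learning. All config values below are illustrative assumptions (roughly Llama-3-70B-like), not figures quoted in the conversation.

```python
# Hedged back-of-the-envelope sketch of the kind of per-token comparison
# being made; every number below is an illustrative assumption, not a
# figure from the conversation.

# Information absorbed into the weights per pre-training token:
params          = 70e9        # parameters (assumed)
bytes_per_param = 2           # bf16
train_tokens    = 15e12       # pre-training tokens (assumed)
weight_bytes_per_train_token = params * bytes_per_param / train_tokens

# Information held per token during in-context learning (KV cache):
layers, kv_heads, head_dim = 80, 8, 128   # assumed model shape
kv_bytes_per_context_token = layers * kv_heads * head_dim * 2 * bytes_per_param
# (the extra factor of 2 covers both keys and values)

ratio = kv_bytes_per_context_token / weight_bytes_per_train_token
print(f"{weight_bytes_per_train_token:.4f} bytes/token into weights")
print(f"{kv_bytes_per_context_token / 1024:.0f} KiB/token in KV cache")
print(f"~{ratio / 1e6:.0f} million-fold difference")
```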
Stepping back, what is the part of human intelligence that we have most failed to replicate with these models?
This is maybe relevant to the question of how fast these issues will be solved.
So sometimes people will say about continual learning: look, you could easily replicate this capability. Just as in-context learning emerged spontaneously as a result of pre-training, continual learning over longer horizons will emerge spontaneously if the model is incentivized to recollect information over horizons longer than one session.
So if there's some outer-loop RL with many sessions inside that outer loop, then this continual learning, where the model fine-tunes itself or writes to an external memory or something, will just emerge spontaneously.
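To make the shape of that concrete, here is a minimal, hypothetical sketch of such an outer loop, assuming an agent that can only carry information across sessions by writing to an external memory. The function names, reward, and tasks are placeholders, not anyone's actual training setup.

```python
# Minimal hypothetical sketch (not any lab's actual setup): an outer RL
# loop whose episodes span many sessions, so the only way to earn reward
# later is to persist useful notes in an external memory.
import random

def run_session(task, memory):
    """Stub for one in-context session: the agent sees the task plus its
    external memory and returns (did_it_recall, note_to_persist)."""
    recalled = any(task in note for note in memory)  # cross-session recall check
    note = f"saw {task}"                             # what the agent chooses to write
    return recalled, note

def episode(tasks):
    """One outer-loop episode = many sessions that share only the memory."""
    memory, reward = [], 0.0
    for task in tasks:
        recalled, note = run_session(task, memory)
        memory.append(note)             # agent-controlled write between sessions
        reward += 1.0 if recalled else 0.0
    return reward                       # a real setup would run a policy update here

if __name__ == "__main__":
    random.seed(0)
    for step in range(3):               # stand-in for the outer RL loop
        tasks = [f"fact-{random.randint(0, 4)}" for _ in range(6)]
        print(f"episode {step}: reward = {episode(tasks)}")
```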
Do you think that's plausible? I just don't really have a prior over how plausible that is. How likely is that to happen?
Interesting.
In 10 years, do you think it'll still be something like a transformer, but with heavily modified attention, sparser MLPs, and so forth?
It's surprising that all of those things together only halved the error, which is like 30 years of progress.
Maybe half is a lot, because if you halve the error, that actually means that... Half is a lot, yeah.
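As a hedged aside on why halving might be a lot (my own framing, not the speakers' numbers): under a power-law scaling fit, a constant-factor drop in loss corresponds to an enormous multiplier in compute.

```python
# Hedged aside, not the speakers' numbers: if loss follows a power law in
# compute, loss ~ C**(-alpha), then halving the loss at fixed architecture
# costs 2**(1/alpha) times more compute.
alpha = 0.05                           # illustrative scaling exponent (assumed)
compute_multiplier = 2 ** (1 / alpha)
print(f"~{compute_multiplier:,.0f}x more compute to halve the loss")  # ~1,048,576x
```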
Yeah, actually, I was about to ask a very similar question about NanoChat.
Because you just coded it up recently, every single step in the process of building a chatbot is fresh in your RAM.
And I'm curious if you had similar thoughts there: that there was no one thing responsible for going from GPT-2 to NanoChat.
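For readers who haven't looked at it, here is a rough outline of the steps such an end-to-end chatbot build covers. The stage names below are descriptive placeholders in the spirit of NanoChat, not its actual scripts or module names.

```python
# Hedged outline of the steps a minimal end-to-end chatbot build typically
# walks through; the names are descriptive placeholders, not NanoChat's
# actual scripts or modules.
PIPELINE = [
    "train_tokenizer",       # fit a BPE vocabulary on the raw corpus
    "pretrain_base_model",   # next-token prediction over web-scale text
    "finetune_chat_format",  # teach turn structure / special tokens on conversations
    "optional_rl",           # optimize for task reward on top of the finetuned model
    "evaluate",              # loss, benchmarks, and chat-quality checks
    "serve_chat_ui",         # KV-cached inference behind a simple web UI
]
for i, stage in enumerate(PIPELINE, 1):
    print(f"{i}. {stage}")
```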