Rob Wiblin
So just to check that I've understood, you're saying because we are using GPUs and TPUs structurally, that forces the thought of the model to be, I guess, very wide.
It can like have many things in its mind at one time, but there can't be very many steps because you have to be going through all of these things in parallel.
But if it was very deep, then you wouldn't be able to do them all simultaneously.
You would have to wait until the earlier steps were done to do the later steps.
And this is just something that is going to remain the case for years to come.
So I don't fully understand this idea of continuous chain of thought, but isn't there this notion that basically at the end of thinking for a little bit, currently we force the models to output a word, a token, and then we feed that back into the start of the model again.
But why, rather than compress all of its thoughts down into a single token or word or whatever, why don't we just keep the full distribution of all of the thoughts and then feed them back into the beginning again?
Wouldn't that allow it to kind of preserve more information rather than basically throwing a bunch of it out?
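A toy sketch of the contrast described here, under loud assumptions: the three-word vocabulary, the 2-d embeddings, and both function names are invented for illustration, and real continuous chain-of-thought systems may feed back a hidden state rather than a mixture of token embeddings.

```python
# Toy vocabulary with hypothetical 2-d embeddings (made up for illustration).
EMBED = {
    "yes": [1.0, 0.0],
    "no": [0.0, 1.0],
    "maybe": [0.5, 0.5],
}

def discrete_feedback(probs):
    """Collapse the distribution to the single most likely token's embedding."""
    top = max(probs, key=probs.get)
    return top, list(EMBED[top])

def continuous_feedback(probs):
    """Feed back the probability-weighted mixture of all token embeddings."""
    dim = len(next(iter(EMBED.values())))
    mix = [0.0] * dim
    for tok, p in probs.items():
        for i, x in enumerate(EMBED[tok]):
            mix[i] += p * x
    return mix

probs = {"yes": 0.5, "no": 0.25, "maybe": 0.25}
token, hard = discrete_feedback(probs)   # keeps only "yes"
soft = continuous_feedback(probs)        # still reflects "no" and "maybe"
print(token, hard, soft)
```

In this sketch the discrete path returns the embedding for "yes" alone, while the continuous path returns a vector that still carries weight from the less likely tokens, which is the extra information the question is pointing at.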
And if that was a much more effective way of thinking and reasoning, and then you have no stage where you're actually outputting a token that a human being can read, then wouldn't that be a force that would potentially make them much more opaque to us?
So you're saying, even if you are using continuous chain of thought, you can still go back and say, well, what if we had forced output tokens at all of these intermediate stages?
What would that probably have looked like?
Let's just take the most likely word or the most likely token at each step and then read that.
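The readout idea can be sketched in a few lines, assuming each intermediate step can be expressed as a probability distribution over tokens. The `readout` helper and the example distributions are hypothetical, not taken from any real model.

```python
def readout(latent_steps):
    """Return the most likely token at each intermediate step."""
    return [max(dist, key=dist.get) for dist in latent_steps]

# Hypothetical per-step distributions over a tiny vocabulary.
steps = [
    {"first": 0.75, "add": 0.125, "so": 0.125},
    {"add": 0.5, "two": 0.25, "first": 0.25},
    {"two": 0.625, "and": 0.25, "so": 0.125},
]
trace = readout(steps)
print(trace)
```

The resulting list is the human-readable trace: one forced token per step, even though the model itself never emitted them.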
I guess there might be a concern that it could hide a second kind of track of thought in the tail of the probability distribution, things that you wouldn't be likely to read.
But you're saying it seems like it kind of only has actually one train of thought here.
There's not like a second hidden chain of thought that you wouldn't be able to inspect.
Well, I guess even setting aside whether it performs as well according to some benchmark, you could at least see... You're saying keep just the top word or the top few tokens, the few most probable ones, throw out everything else.
You could see whether it does a different thing.
Does it lead to a different recommendation or a different outcome?
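A minimal sketch of that comparison, assuming the downstream outcome can be modelled as a function of the token distribution. The `decide` step, its scores, and the example distribution are all invented for illustration; only the "keep the top few, discard the tail, compare outcomes" structure comes from the discussion.

```python
def keep_top_k(dist, k):
    """Keep the k most probable tokens and renormalize; discard the tail."""
    top = sorted(dist.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {tok: p / total for tok, p in top}

def decide(dist, scores):
    """Hypothetical downstream step: the sign of the expected score picks the outcome."""
    expected = sum(p * scores[tok] for tok, p in dist.items())
    return "approve" if expected >= 0 else "reject"

dist = {"safe": 0.5, "fine": 0.25, "risky": 0.125, "unsafe": 0.125}
scores = {"safe": 1.0, "fine": 0.5, "risky": -1.0, "unsafe": -2.0}

full = decide(dist, scores)                    # uses the whole distribution
truncated = decide(keep_top_k(dist, 2), scores)  # uses only the top two tokens
same_outcome = full == truncated
print(full, truncated, same_outcome)
```

If `same_outcome` came out false on real data, that would suggest the discarded tail was doing work the readable tokens do not show.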