Rob Wiblin
And then in a second stage, reinforcement learning trains those models to produce the kinds of responses that we're most likely to say that we like, that we want, rather than just the responses that were most probable in the full corpus of all human-generated text.
Now, Joshua's alternative is to build an AI model oriented not around predicting what a human would be likely to say or what they would prefer to hear, but around modeling what's actually true in the world, by developing hypotheses and assigning probabilities to them with the goal of best explaining all of the data that it's exposed to during its training process.
Joshua argues that you'd be able to train a model of this type while porting over most of the methods we use to train ordinary LLMs today, benefiting from the same neural net architectures, training techniques, scaling improvements, all of that.
And you'd also be able to train it on roughly the same body of raw text that we use for all other AIs.
But you could structure that data a bit differently, giving it what AI researchers call a different syntax.
First, all of the things that people said or wrote, they get tagged as communication acts.
We know someone said these things and we know where they said it, but we don't know whether they're true.
And second, a small number of statements that we have strong independent grounds for, verified mathematical proofs and some scientific measurements, they get tagged as verified factual claims about the world.
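To make the two categories concrete, here is a minimal sketch of what tagging might look like as a data structure. The names `Tag`, `Record`, and `tag_record` are hypothetical and chosen only for illustration; nothing here is taken from Joshua's papers or from any real training pipeline.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Tag(Enum):
    # Someone said or wrote this; we do not know whether it is true.
    COMMUNICATION_ACT = "communication_act"
    # A claim about the world with strong independent grounds.
    VERIFIED_FACT = "verified_fact"


@dataclass(frozen=True)
class Record:
    tag: Tag
    text: str
    source: Optional[str] = None  # who said it and where (communication acts only)


def tag_record(text: str, source: Optional[str] = None, verified: bool = False) -> Record:
    """Wrap raw text in one of the two tags (hypothetical helper)."""
    if verified:
        return Record(Tag.VERIFIED_FACT, text)
    return Record(Tag.COMMUNICATION_ACT, text, source)


corpus = [
    tag_record("The moon is made of cheese.", source="forum post"),
    tag_record("There are infinitely many primes.", verified=True),
]
```

The point of the sketch is only that the tag is attached to every record from the start, so a statement someone made is never stored in the same form as a verified claim.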
The model is then trained to find the combination of possible underlying facts about the world that would best explain everything that it sees in aggregate, both the things people said and the verified facts that it's been given as ground truth.
These hypothesized facts about the world, they're what AI researchers call latent variables, meaning variables that the AI can't directly observe that it's going to have to infer indirectly instead.
What the model will ultimately be able to give us is its estimated probability that any given statement in natural human language is true, as well as how much the model trusts its own answer on that, how confident it is that it has a good grip on that question.
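As a toy illustration of inferring a latent fact from what people said, the sketch below applies Bayes' rule to a single true-or-false claim. It assumes each speaker has a known reliability and that reports are independent given the truth. Both are simplifying assumptions made here for illustration, the function name `posterior_true` is hypothetical, and this is not the training method described in the episode.

```python
def posterior_true(prior: float, reports: list[tuple[bool, float]]) -> float:
    """P(claim is true | reports), assuming reports are independent given the claim.

    Each report is (asserts_claim, reliability), where reliability is the assumed
    probability that the speaker's statement matches the truth (strictly between 0 and 1).
    """
    like_true = prior         # running value of P(true) * P(reports so far | true)
    like_false = 1.0 - prior  # running value of P(false) * P(reports so far | false)
    for asserts_claim, reliability in reports:
        # If the claim is true, a speaker asserts it with probability `reliability`.
        like_true *= reliability if asserts_claim else 1.0 - reliability
        # If the claim is false, a speaker asserts it with probability 1 - reliability.
        like_false *= 1.0 - reliability if asserts_claim else reliability
    return like_true / (like_true + like_false)


# Two speakers assert the claim, one denies it; the reliabilities are made up.
reports = [(True, 0.8), (True, 0.6), (False, 0.7)]
p = posterior_true(0.5, reports)
```

With these made-up numbers the posterior comes out at 0.72. The latent variable here is the truth of the claim, which is never observed directly and is inferred only from the communication acts.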
Crucially, Joshua says that by tagging all text into these two categories from the very beginning, things someone said versus factual statements, you can then use the factual statement tag to ask the model questions as though you're asking about reality, not about communication acts.
And because these two categories have been there from the very beginning, the model knows the difference and it won't blur the line between the two.
That's something you don't get with AI models today.
And Joshua also argues, using various mathematical theorems in his papers, that unlike ordinary LLMs, a model trained in this way would be honest by design.
And furthermore, that such an AI model would by itself have no goals and no preferences about the state of the world.
It would be what Joshua calls just a pure predictor.
Now, there are two main uses for this.
Near-term, as a sort of stopgap solution, you bolt the predictor onto existing AI agents as a sort of guardrail, an independent filter that sits between the agent and the world, checking over its proposed actions and rejecting those that it predicts will be harmful.
But as he'll explain in a minute, Joshua thinks we can ultimately do much better than this.
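The near-term guardrail use described above can be sketched as a filter that sits between an agent and the world. This is a hedged illustration only: `guardrail`, `toy_predictor`, the lookup table of scores, and the 0.01 threshold are all assumptions invented for the example, and the stand-in predictor is a dictionary rather than a trained model.

```python
from typing import Callable, Iterable


def guardrail(
    proposed_actions: Iterable[str],
    harm_probability: Callable[[str], float],
    threshold: float = 0.01,
) -> list[str]:
    """Keep only actions whose predicted probability of harm is below the threshold."""
    return [a for a in proposed_actions if harm_probability(a) < threshold]


# Stand-in for the predictor: a lookup table of made-up harm probabilities.
toy_scores = {
    "send summary email": 0.001,
    "delete production database": 0.9,
}


def toy_predictor(action: str) -> float:
    # Actions the table does not know about are treated as harmful.
    return toy_scores.get(action, 1.0)


allowed = guardrail(toy_scores, toy_predictor)
```

In this arrangement the predictor never acts itself. It only scores what the agent proposes, and anything at or above the threshold is rejected.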