Grant Harvey
So supervised fine-tuning is giving a model loads of really high-quality examples of what its answers should look like.
That's taking your model to the library and saying, here's some textbooks to read.
It's going to read the textbooks, and they'll tell it what's true.
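That "textbook" step can be sketched in a few lines. This is a toy stand-in, not a real model: everything here is hypothetical, just to show the shape of supervised fine-tuning, curated examples in, next-token habits out.

```python
from collections import defaultdict

class TinyModel:
    """Toy 'model' that learns next-token statistics from curated examples."""

    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))

    def fine_tune(self, examples):
        # Each example is a known-good text; the model absorbs it,
        # like reading a textbook that tells it what's true.
        for text in examples:
            tokens = text.split()
            for prev, nxt in zip(tokens, tokens[1:]):
                self.counts[prev][nxt] += 1

    def predict_next(self, token):
        # Predict the token it has most often seen follow `token`.
        options = self.counts.get(token)
        if not options:
            return None
        return max(options, key=options.get)

model = TinyModel()
model.fine_tune(["the sky is blue", "the grass is green", "the sky is vast"])
print(model.predict_next("sky"))  # "is"
```

A real fine-tuning run does the same thing at vastly larger scale with gradient updates instead of counts, but the principle is identical: the quality of the curated examples is what the model ends up imitating.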
Reinforcement learning is, okay, you're going to give the model some questions.
It'll give some answers, and you're going to say if those answers are good or not.
Like you might ask the model to write you a poem about Russia, and then you'll check that poem against some rubric for whether it's good or bad, and you'll give it a score, and it'll learn from that.
So that's reinforcement learning, or you can call it reward modeling: you're basically changing the way you reward your model for different types of answers.
And then evaluation is building the test the model has to take, to understand if it's good, because companies will release loads of different versions of models and they've got to understand if each one is better or worse.
Like, I mean, you've seen the news about ChatGPT, GPT-5? GPT-5, they released it.
It was obviously better in some metrics, but the audience wasn't happy.
And that's why human evaluation is so necessary, because people are not deterministic either.
People like to have opinions on things, and you can't just say, well, this was better on all our benchmarks.
If it feels different to someone and the user doesn't like it, it doesn't matter if it's better.
And that's evaluation.
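The reward loop and the evaluation "test" described above can be sketched together as a toy. All names and numbers here are made up for illustration, and a hard-coded reward function stands in for the human feedback.

```python
import random

random.seed(0)

candidates = ["rhyming poem", "plain prose", "empty answer"]
preference = {c: 0.0 for c in candidates}

def reward(answer):
    # Stand-in for human feedback: we pretend rhyming poems are "good".
    return 1.0 if answer == "rhyming poem" else 0.0

# Reward-modeling loop: sample answers, score them, shift preferences
# toward whatever earned reward (a toy bandit-style update).
for _ in range(200):
    answer = random.choice(candidates)
    preference[answer] += 0.1 * (reward(answer) - preference[answer])

def model():
    # After training, the model gives its highest-preference answer.
    return max(preference, key=preference.get)

# Evaluation: the "test" the model has to take, scored against references.
def evaluate(model_fn, test_cases):
    passed = sum(1 for expected in test_cases if model_fn() == expected)
    return passed / len(test_cases)

print(model())                                # "rhyming poem"
print(evaluate(model, ["rhyming poem"] * 3))  # 1.0
```

The point of the split is the one made above: the reward loop shapes behavior, but the evaluation is a separate, held-out test, and a model can score well on one while disappointing real users on the other.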
So we're kind of, we're the teachers behind the models.
Like, if every model's just learning off the raw internet, we're in trouble.
Ultimately, a model is going to look at its huge data set, and then you're going to have an impossibly large set of parameters that get tuned to predict the best next token, based on the data it's looked at.
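That "best next token" idea, in miniature, with a toy vocabulary and made-up logits: the model's parameters assign a score to each candidate token, and a softmax turns the scores into probabilities.

```python
import math

# Toy next-token prediction (hypothetical numbers, not a real model).
vocab_logits = {"blue": 2.0, "green": 1.0, "loud": -1.0}  # learned scores

# Softmax: exponentiate each score and normalize, so they sum to 1.
total = sum(math.exp(v) for v in vocab_logits.values())
probs = {tok: math.exp(v) / total for tok, v in vocab_logits.items()}

# The highest-probability token is the model's prediction.
next_token = max(probs, key=probs.get)
print(next_token)  # "blue"
```

Training is then the process of nudging those scores, across billions of parameters, so the highest-probability token is more often the right one.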
But ultimately, you've either got to improve the underlying data set, which is hard.
Like the data sets are huge.
They run to petabytes of information.
Or you've got to do what's called post-training where you use sort of,