Dylan Patel
And the important thing to say is that no matter how you want the model to behave, these RLHF and preference-tuning techniques also improve performance on things like math evals and code evals. There is something innate to these, what are called, contrastive loss functions. We could start to get into RL here.
We don't really need to, but RLHF also boosts performance on anything from a chat task to a math problem to a code problem, so it is becoming a much more useful tool for these labs. That kind of takes us through the arc: we've talked about pre-training, where things are hard to get rid of, and we've talked about post-training and how, with post-training, you can mess it up.
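To make the "contrastive loss" point concrete, here is a minimal sketch, in PyTorch, of the pairwise loss that reward models and DPO-style preference tuning build on: the model is pushed to score the human-preferred response above the rejected one. The function name and toy numbers are illustrative, not any lab's actual code.

```python
# Minimal sketch of a pairwise ("contrastive") preference loss, as used for
# reward modeling and DPO-style tuning. Names and numbers are illustrative.
import torch
import torch.nn.functional as F

def pairwise_preference_loss(chosen_scores: torch.Tensor,
                             rejected_scores: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style loss: -log sigmoid(score_chosen - score_rejected).

    The scores are scalars per comparison, e.g. reward-model outputs or
    (in DPO) scaled log-probability differences.
    """
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# Toy example: three pairwise comparisons; the loss shrinks as the
# chosen responses are scored increasingly above the rejected ones.
chosen = torch.tensor([1.2, 0.3, 2.0])
rejected = torch.tensor([0.4, 0.9, -0.5])
print(pairwise_preference_loss(chosen, rejected))
```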
It's a complex, multifaceted optimization with 10-to-100-person teams converging on one artifact, and it's really easy to not do it perfectly. And then there's the third case, which is what we talked about with Gemini. The thing about Gemini is that it was a served product: Google has their internal model weights, and they've done all these processes that we talked about.
And in the served product, what came out afterwards was that they had a prompt rewriting user queries to boost diversity, or something like that, and it made the outputs just blatantly wrong. It was some sort of organizational failure that put that prompt in that position. I think Google executives have probably owned this; I didn't pay attention to that detail.
But it was just a mess-up in execution that led to this ridiculous thing. At the system level, the model weights might have been fine.
It was something like the system prompt, or what in industry is called prompt rewriting. Especially for image models, if you're using DALL-E, or ChatGPT, which can generate you an image, you'll say, "Draw me a beautiful car." These leading image models benefit from highly descriptive prompts.
So what would happen is, if you do that in ChatGPT, a language model behind the scenes will rewrite the prompt, essentially being told "make this more descriptive," and then the rewritten prompt is passed to the image model. So prompt rewriting is something that is used at multiple levels in industry; it's used effectively for image models, and the Gemini example is just a failed execution of it.
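As a rough sketch of that pipeline (not OpenAI's or Google's actual implementation): the user's terse request goes to a language model with a rewrite instruction, and only the rewritten, more descriptive prompt ever reaches the image model. The function names and the rewrite instruction below are hypothetical stand-ins so the example runs end to end.

```python
# Minimal sketch of the prompt-rewriting pattern for image generation.
# `call_language_model` and `call_image_model` are hypothetical stand-ins
# for whatever text and image APIs a product actually uses; the rewrite
# instruction is illustrative, not any lab's real system prompt.

REWRITE_INSTRUCTION = (
    "Rewrite the user's image request as one highly descriptive prompt. "
    "Add concrete details about subject, style, lighting, and composition, "
    "but do not change the user's intent."
)

def call_language_model(system: str, user: str) -> str:
    # Stand-in for a real chat-model API call; here it just fakes an
    # expansion so the example is runnable.
    return (f"{user}, photorealistic, glossy paint, golden-hour lighting, "
            f"shallow depth of field, 35mm photograph")

def call_image_model(prompt: str) -> str:
    # Stand-in for a real image-generation API call.
    return f"<image generated from prompt: {prompt!r}>"

def generate_image(user_request: str) -> str:
    # Step 1: a language model behind the scenes expands the terse request.
    detailed_prompt = call_language_model(REWRITE_INSTRUCTION, user_request)
    # Step 2: the image model only ever sees the rewritten prompt.
    return call_image_model(detailed_prompt)

print(generate_image("draw me a beautiful car"))
```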
For the past few years, the highest-cost human data has been in these preferences, which is comparing: I would say highest cost and highest total usage. So a lot of money has gone to these pairwise comparisons, where you have two model outputs and a human is comparing between the two of them. In earlier years, there was a lot of this instruction-tuning data.
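To give a sense of what one of those pairwise comparisons looks like as data, here is a hedged sketch of a single preference record. The field names are made up for illustration; real labeling pipelines differ and carry far more metadata (annotator IDs, rubric scores, and so on).

```python
from dataclasses import dataclass

# Illustrative schema for one pairwise preference annotation. Field names
# are hypothetical, not any lab's actual format.
@dataclass
class PreferencePair:
    prompt: str       # the instruction or query shown with both outputs
    response_a: str   # output from one model or sampling run
    response_b: str   # output from another
    preferred: str    # "a", "b", or "tie", as judged by a human annotator

example = PreferencePair(
    prompt="Explain why the sky is blue in two sentences.",
    response_a="Sunlight scatters off air molecules, and blue light "
               "scatters the most, so the sky looks blue.",
    response_b="Because the ocean reflects into the atmosphere.",
    preferred="a",
)
print(example)
```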