Arvind Narayanan
Podcast Appearances
So your training cost increases, your inference cost decreases. But because it's the inference cost that dominates, the total cost is probably going to come down. If you have the same workload and you have a smaller model doing it, then the total cost is going to come down.
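To make the arithmetic concrete, here is a toy sketch of that trade-off. Every number in it is invented for illustration; the only point is the structure of the claim: training is a one-time cost, inference scales with the workload, so once inference dominates, a smaller, cheaper-to-run model wins even if it costs more to train.

```python
# Toy cost model (all numbers are made up for illustration):
# total = one-time training cost + per-query inference cost * workload size.

def total_cost(training, per_query, queries):
    return training + per_query * queries

QUERIES = 1_000_000_000_000  # the same fixed workload for both models

big   = total_cost(training=50e6,  per_query=0.002,  queries=QUERIES)
small = total_cost(training=100e6, per_query=0.0005, queries=QUERIES)

print(f"big model:   ${big:,.0f}")    # $2,050,000,000 -- inference dominates
print(f"small model: ${small:,.0f}")  # $600,000,000 -- training doubled, total far lower
```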
Sure. I think we are still in a period where, you know, these models have not yet quite become commoditized. There's obviously a lot of progress and there's a lot of demand on hardware as well. Hardware cycles are also improving rapidly. But, you know, there's the saying that every exponential is a sigmoid in disguise. So a sigmoid curve is one that looks like an exponential at the beginning.
So imagine the shape of the letter S. But then after a while, it has to taper off, the way every exponential eventually has to. So I think that's going to happen both with models as well as with these hardware cycles. We are, I think, going to get to a world where models do get commoditized.
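That saying has a precise mathematical form: for small t, a logistic curve L / (1 + e^(-k(t - t0))) is almost exactly the exponential L e^(-k t0) e^(kt), and it only bends once it nears its ceiling L. A quick sketch with arbitrary parameters:

```python
import math

L, K, T0 = 1000.0, 1.0, 10.0  # ceiling, growth rate, midpoint (all arbitrary)

def sigmoid(t):
    """Logistic curve: looks exponential early, saturates near L later."""
    return L / (1 + math.exp(-K * (t - T0)))

def exponential(t):
    """Pure exponential, scaled to match the sigmoid at t = 0."""
    return (L / (1 + math.exp(K * T0))) * math.exp(K * t)

for t in range(0, 16, 3):
    print(f"t={t:2d}  sigmoid={sigmoid(t):8.2f}  exponential={exponential(t):10.2f}")
```

Up to about t = 6 the two columns are nearly identical; by t = 15 the sigmoid has flattened near 1000 while the exponential has run past 148,000. From inside the early part of the curve, the two are indistinguishable.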
A big part of it is this issue of vibes, right? So you evaluate LLMs on these benchmarks, and a model seems to perform really well on them, but then the vibes are off. In other words, you start using it and somehow it doesn't feel adequate. It makes a lot of mistakes in ways that are not captured in the benchmark.
And the reason for that is simply that when there is so much pressure to do well on these benchmarks, developers are intentionally or unintentionally optimizing these models in ways that look good on the benchmarks, but don't look good in real world evaluation.
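One way to see how this can happen even without bad faith: if you select among many candidate models using one fixed benchmark, the winner's score is inflated by the selection itself. A toy simulation, in which every "model" is an identical coin-flipper, so any spread in benchmark scores is pure noise:

```python
import random

random.seed(0)
N_MODELS, N_QUESTIONS = 100, 50

def evaluate(n_questions):
    """Every model answers every question correctly with probability 0.5."""
    return sum(random.random() < 0.5 for _ in range(n_questions)) / n_questions

benchmark_scores = [evaluate(N_QUESTIONS) for _ in range(N_MODELS)]

print("true skill of every model:  0.50")
print(f"best score on benchmark:    {max(benchmark_scores):.2f}")  # typically ~0.65+
print(f"that model on fresh data:   {evaluate(N_QUESTIONS):.2f}")  # back near 0.50
```

Merely reporting the benchmark winner overstates what you get on new questions; deliberately tuning toward the benchmark only amplifies the effect.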
So when GPT-4 came out and OpenAI claimed that it passed the bar exam and the medical licensing exam, people were very excited slash scared about what this means for doctors and lawyers. And the answer turned out to be approximately nothing. Because it's not like a lawyer's job is to answer bar exam questions all day.
These benchmarks that models are being tested on don't really capture what we would use them for in the real world. So that's one reason why LLM evaluation is a minefield. And there's also just a very simple factor of contamination. Maybe the model has already trained on the answers that it's being evaluated on in the benchmark. And so if you ask it new questions, it's going to struggle.
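Contamination checks are often done with something like word n-gram overlap between benchmark items and the training corpus. A minimal sketch of the idea; the legal-sounding strings are invented, and real checks run at vastly larger scale:

```python
def ngrams(text, n=8):
    """Set of word n-grams in a string."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def looks_contaminated(question, corpus_ngrams, n=8, threshold=0.5):
    """Flag a benchmark question if most of its n-grams appear in the corpus."""
    q = ngrams(question, n)
    return bool(q) and len(q & corpus_ngrams) / len(q) >= threshold

# Stand-in training corpus and benchmark question (invented for illustration).
corpus = ("the defendant may raise the statute of limitations "
          "as an affirmative defense in a civil action")
question = ("may the defendant raise the statute of limitations "
            "as an affirmative defense in a civil action")

print(looks_contaminated(question, ngrams(corpus)))  # True: question leaked into training
```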