Dario Amodei
Podcast Appearances
Uh, and when, you know, in some ways it was fortunate. I was kind of, you know, you can have almost beginner's luck, right? I was like a newcomer to the field. And, you know, I looked at the neural net that we were using for speech, the recurrent neural networks. And I said, I don't know, what if you make them bigger and give them more layers?
And what if you scale up the data along with this, right? I just saw these as like independent dials that you could turn. And I noticed that the models started to do better and better as you gave them more data, as you made the models larger, as you trained them for longer.
And I didn't measure things precisely in those days, but along with colleagues, we very much got the informal sense that the more data and the more compute and the more training you put into these models, the better they perform. And so initially my thinking was, hey, maybe that is just true for speech recognition systems, right? Maybe that's just one particular quirk, one particular area.
I think it wasn't until 2017, when I first saw the results from GPT-1, that it clicked for me that language is probably the area in which we can do this. We can get trillions of words of language data. We can train on them. And the models we trained in those days were tiny.
You could train them on one to eight GPUs, whereas, you know, now we train jobs on tens of thousands, soon going to hundreds of thousands of GPUs. And so when I saw those two things together, and, you know, there were a few people like Ilya Sutskever, who you've interviewed, who had somewhat similar views, right?
He might have been the first one, although I think a few people came to similar views around the same time, right? There was, you know, Rich Sutton's Bitter Lesson. There was, Gwern wrote about the scaling hypothesis. But I think somewhere between 2014 and 2017 was when it really clicked for me, when I really got conviction that, hey, we're going to be able to do these
incredibly wide cognitive tasks if we just scale up the models. And at every stage of scaling, there are always arguments. And when I first heard them, honestly, I thought, probably I'm the one who's wrong. And all these experts in the field are right. They know the situation better than I do. There's the Chomsky argument about you can get syntactics, but you can't get semantics.