Noam Shazeer
Podcast Appearances
I'm like, this is great.
When are we going to launch it?
And he's like, oh, well, we can't launch this.
It's not really very practical because it takes 12 hours to translate a sentence.
I'm like, well, that seems like a long time.
How could we fix that?
So it turned out they had not really designed it for high throughput, obviously.
And so it was doing something like 100,000 disk seeks into a large language model that they'd computed statistics over. I wouldn't say trained, really.
And it was doing that for each word that it wanted to translate.
So, like, obviously doing 100,000 disk seeks is not super speedy.
But I said, okay, well, let's dive into this.
And so I spent about two or three months with them designing an in-memory compressed representation of n-gram data.
An n-gram model is basically statistics for how often every n-word sequence occurs in a large corpus.
In this case, we had something like two trillion words.
And most n-gram models of the day were using two-grams or maybe three-grams, but we decided we would use five-grams.
So: how often every five-word sequence occurs in basically as much of the web as we could process at the time.
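As a rough illustration of what those statistics are, here is a minimal sketch of five-gram counting in Python. The tiny corpus and whitespace tokenization are just placeholders; the real counts came from a distributed job over roughly two trillion words of web text, not a single loop.

```python
from collections import Counter

def count_ngrams(tokens, n=5):
    """Count how often every n-word sequence occurs in a token stream."""
    counts = Counter()
    for i in range(len(tokens) - n + 1):
        counts[tuple(tokens[i:i + n])] += 1
    return counts

# Toy corpus; the real statistics came from ~2 trillion words of web text.
tokens = "i really like this restaurant and i really like this menu".split()
for ngram, count in count_ngrams(tokens, n=5).most_common(3):
    print(" ".join(ngram), count)
```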
And then you have a data structure that says, okay, "I really like this restaurant" occurs 17 times on the web, or something.
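The interview doesn't spell out how the compressed representation worked, but one plausible sketch is to store sorted 64-bit fingerprints of each n-gram alongside 8-bit log-quantized counts, and binary-search on lookup. The fingerprinting and quantization choices below are assumptions for illustration, not the actual design.

```python
import bisect
import hashlib
import math

def fingerprint(ngram):
    # 64-bit hash of the n-gram text; wide enough that collisions are rare.
    # (Illustrative choice, not the actual hashing scheme.)
    digest = hashlib.sha1(" ".join(ngram).encode()).digest()
    return int.from_bytes(digest[:8], "big")

class CompressedNgramTable:
    """Sorted 64-bit fingerprints plus 8-bit log-quantized counts."""

    def __init__(self, counts):
        pairs = sorted((fingerprint(ng), min(255, round(math.log2(c + 1))))
                       for ng, c in counts.items())
        self.keys = [k for k, _ in pairs]        # sorted, for binary search
        self.codes = bytes(q for _, q in pairs)  # one byte per n-gram

    def count(self, ngram):
        fp = fingerprint(ngram)
        i = bisect.bisect_left(self.keys, fp)
        if i < len(self.keys) and self.keys[i] == fp:
            return 2 ** self.codes[i] - 1        # approximate dequantized count
        return 0

table = CompressedNgramTable({("i", "really", "like", "this", "restaurant"): 17})
print(table.count(("i", "really", "like", "this", "restaurant")))  # ~15, quantized
```

Quantizing counts to a byte trades a little precision for memory, which is the kind of trade that makes two trillion words of statistics fit in RAM at all.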
And so I built a data structure that would let you store all of those in memory across 200 machines, with a sort of batched API where you could say: here are the 100,000 things I need to look up in this round for this word, and it would give them all back in parallel.
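Here is a hedged sketch of the batched, sharded lookup pattern being described: each n-gram hashes to one of 200 shards, a round's queries are grouped per shard, and each shard answers its whole batch in parallel. The RPC layer is faked with in-process dictionaries, and all names are illustrative.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

NUM_SHARDS = 200  # one shard per machine in the description above

def shard_of(ngram):
    # Consistent within one process; a real system would use a stable hash.
    return hash(ngram) % NUM_SHARDS

class ShardedNgramService:
    def __init__(self, shard_tables):
        # shard id -> {ngram: count}; stands in for 200 remote servers.
        self.shard_tables = shard_tables

    def batch_lookup(self, ngrams):
        # Group this round's queries by the shard that owns each n-gram.
        per_shard = defaultdict(list)
        for ng in ngrams:
            per_shard[shard_of(ng)].append(ng)

        def query(shard, batch):
            table = self.shard_tables.get(shard, {})
            return {ng: table.get(ng, 0) for ng in batch}

        # One batched request per shard, all issued in parallel.
        results = {}
        with ThreadPoolExecutor(max_workers=NUM_SHARDS) as pool:
            for answer in pool.map(lambda item: query(*item), per_shard.items()):
                results.update(answer)
        return results

ng = ("i", "really", "like", "this", "restaurant")
service = ShardedNgramService({shard_of(ng): {ng: 17}})
print(service.batch_lookup([ng, ("some", "other", "five", "word", "phrase")]))
```

The batching is the point: instead of 100,000 individual lookups per word, each shard sees one request per round, which is what turns all those disk seeks into fast in-memory reads.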