Noam Shazeer
But we sort of had some early evidence that seemed like it might be possible.
So we're like, great, let's build the whole chip around that.
And then over time, I think you've seen people able to use much lower precision for training as well.
But also the inference precision has, you know, gone down.
People are now using INT4 or FP4, which, if you had told a supercomputing floating-point person 20 years ago, "we're going to use FP4," they'd be like, what?
That's crazy.
We like 64 bits in our floats.
Or even below that, some people are quantizing models to two bits or one bit.
And I think that's a trend to definitely pay attention to.
Yeah, just a zero or one.
And then you have like a sign bit for a group of bits or something.
Then you're like, yes, quantization is irritating, but your model is going to be three times faster, so you're going to have to deal.
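For readers following along, here's a minimal sketch of the kind of group-wise low-bit quantization being described: symmetric INT4 with one shared scale per group of weights. The group size, rounding scheme, and function names are illustrative assumptions, not any particular production recipe.

```python
# Minimal sketch: group-wise symmetric INT4 quantization (illustrative only).
import numpy as np

def quantize_int4_groupwise(weights: np.ndarray, group_size: int = 32):
    """Quantize a 1-D weight vector to INT4 codes with one scale per group."""
    assert weights.size % group_size == 0
    groups = weights.reshape(-1, group_size)
    # One scale per group: map the largest magnitude in the group to 7 (INT4 max).
    scales = np.abs(groups).max(axis=1, keepdims=True) / 7.0
    scales = np.where(scales == 0, 1.0, scales)  # avoid divide-by-zero for all-zero groups
    q = np.clip(np.round(groups / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Recover approximate weights from INT4 codes and per-group scales."""
    return (q.astype(np.float32) * scales).reshape(-1)

w = np.random.randn(1024).astype(np.float32)
q, s = quantize_int4_groupwise(w)
w_hat = dequantize(q, s)
print("max abs error:", np.abs(w - w_hat).max())
```

The rough trade-off is the one described above: the codes take a quarter of the memory of FP16 weights (plus a small per-group scale), at the cost of some reconstruction error.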
Yeah, so, I mean, let me start with the undergrad thesis.
So I kind of got introduced to neural nets in one section of one class on parallel computing that I was taking in my senior year.
And I needed to do a thesis to graduate, like an honors thesis.
And so I approached the professor and I said, oh, it would be really fun to do something around neural nets.
So he and I decided I would sort of implement a couple of different ways of parallelizing backpropagation training for neural nets in 1990.
And I called it something funny in my thesis, like pattern partitioning or something.
But really I implemented model parallelism and data parallelism on a 32-processor hypercube machine.
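As a rough illustration of the two partitionings mentioned here, the sketch below contrasts data parallelism (each worker holds a full copy of the weights and a slice of the batch) with model parallelism (each worker holds the full batch and a slice of the weights) for a single linear layer. The NumPy loop merely simulates workers; it is not a reconstruction of the original 1990 hypercube implementation.

```python
# Minimal sketch: data parallelism vs. model parallelism for y = x @ W,
# with 4 simulated workers standing in for real processors (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16))   # batch of 8 examples, 16 features
W = rng.standard_normal((16, 4))   # weight matrix, 4 outputs
workers = 4

# Data parallelism: split the batch across workers, replicate W.
y_data = np.concatenate(
    [x_shard @ W for x_shard in np.split(x, workers, axis=0)], axis=0)

# Model parallelism: split W's output columns across workers, replicate the batch.
y_model = np.concatenate(
    [x @ W_shard for W_shard in np.split(W, workers, axis=1)], axis=1)

# Both partitionings reproduce the unpartitioned computation.
y_full = x @ W
print(np.allclose(y_data, y_full), np.allclose(y_model, y_full))
```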