Noam Shazeer
How does that flow work when you need a bit more information, and then you want to put it back in the background for it to continue, you know, finding the hotels in Berlin or whatever? I think it's going to be pretty interesting, and inference will be useful. I mean, there's also a compute-efficiency thing in inference that you don't have in training.
Yeah, a good example of an algorithmic improvement is the use of drafter models.
So you have a really small language model that you run one token at a time when you're decoding, and it predicts, say, four tokens.
Then you give those to the big model and say, okay, here are the four tokens the little model came up with; check which ones you agree with.
If you agree with the first three, then you just advance, and you've basically done a four-token step with one parallel computation instead of four serial one-token steps in the big model. Those are the kinds of things people are looking at to improve inference efficiency.
So you don't have this single-token decode bottleneck.
So the little model proposes, "Hello, how are you?" and the big model goes, "That sounds great to me. I'm going to advance past that."
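A minimal sketch of that accept-or-advance loop, assuming hypothetical draft_model and big_model callables (the names and signatures are illustrative, not any real API):

```python
# Sketch of decoding with a drafter model (speculative decoding).
# `draft_model` emits one token per call; `big_model` scores a whole
# block of positions in a single parallel forward pass. Both are
# hypothetical stand-ins, not a real library API.

def speculative_decode_step(prefix, draft_model, big_model, k=4):
    """Try to advance the sequence by up to k tokens with one big-model pass."""
    # 1. The small drafter proposes k tokens autoregressively (cheap, serial).
    draft = []
    for _ in range(k):
        draft.append(draft_model(prefix + draft))

    # 2. The big model checks all k positions at once, returning the token
    #    it would itself have produced at each position.
    big_tokens = big_model(prefix, draft)

    # 3. Accept the longest prefix where the two models agree; at the first
    #    disagreement, substitute the big model's token and stop.
    accepted = []
    for proposed, actual in zip(draft, big_tokens):
        if proposed == actual:
            accepted.append(proposed)
        else:
            accepted.append(actual)
            break
    return prefix + accepted
```

The design point is exactly the one in the transcript: the big model's single parallel pass replaces several of its serial decode steps whenever the cheap drafter guesses right.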
I mean, we're already doing it.
So we're pro multi-data-center training.
I think in the Gemini 1.5 tech report, we said we used multiple metro areas, trained with some of the compute in each place, and had a pretty long-latency but high-bandwidth connection between those data centers.
And that works fine.
It's great.
Actually, training is kind of interesting because each step in the training process for a large model usually takes at least a few seconds.
So a latency of, you know, 50 milliseconds between sites doesn't matter that much.
Just the bandwidth.
Yeah, just bandwidth.
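A quick back-of-envelope version of that latency-versus-bandwidth point; the step time, model size, and link speed below are illustrative assumptions, not figures from the conversation:

```python
# Why cross-metro latency is negligible for training: a 50 ms hop is a tiny
# fraction of a multi-second training step, while the gradient traffic each
# step scales with model size, so bandwidth is the binding constraint.
# All numbers are illustrative assumptions.

step_time_s = 2.0    # assumed per-step time for a large model
latency_s = 0.050    # 50 ms between metro areas
print(f"latency overhead: {latency_s / step_time_s:.1%}")   # ~2.5% of a step

params = 100e9       # assumed model size
grad_bytes = params * 2        # fp16 gradients: ~200 GB per step
link_bps = 800e9               # assumed inter-DC link: 800 Gb/s
print(f"gradient exchange: {grad_bytes * 8 / link_bps:.1f} s per step")  # ~2 s
```

Under these assumptions the latency adds a few percent per step, while moving the gradients takes on the order of the step itself, which is why only the bandwidth of the inter-data-center link really matters.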