Noam Shazeer
And then we have a bunch of work from even the early Brain days, when we were using CPU machines and they were really slow.
So we needed to do asynchronous training to help scale: each copy of the model would kind of do some local computation and then send gradient updates to a centralized system, which would apply them asynchronously, while another copy of the model was doing the same thing.
You know, it makes your model parameters kind of wiggle around a bit and it makes people uncomfortable with the theoretical guarantees, but it actually seems to work in practice.
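To make that concrete, here is a minimal sketch of the asynchronous pattern being described, in JAX with Python threads standing in for model replicas. This is an illustration, not the actual Brain/DistBelief code; the toy linear model, `LR`, and `worker` are all assumed for the example. Each replica computes a gradient against a possibly stale snapshot of the shared parameters and applies it without waiting for the others:

```python
import threading
import jax
import jax.numpy as jnp

LR = 0.1                          # assumed learning rate for the toy example
params = jnp.zeros(4)             # shared parameters (the "centralized system")
lock = threading.Lock()           # serializes only the apply step

def loss(p, x, y):
    return jnp.mean((x @ p - y) ** 2)

grad_fn = jax.grad(loss)

def worker(batches):
    """One model replica: compute locally, then send updates asynchronously."""
    global params
    for x, y in batches:
        snapshot = params                # may be stale: no global barrier
        g = grad_fn(snapshot, x, y)      # local computation on this replica
        with lock:
            params = params - LR * g     # applied whenever it arrives

key = jax.random.PRNGKey(0)
xs = jax.random.normal(key, (8, 16, 4))
ys = xs @ jnp.array([1.0, -2.0, 0.5, 3.0])
batches = list(zip(xs, ys))

# Two replicas training concurrently on disjoint batch streams.
threads = [threading.Thread(target=worker, args=(batches[i::2],)) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The staleness is the "wiggle": by the time a replica's gradient lands, the parameters it was computed against may already have moved.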
So one way to do that is you effectively record the sequence of operations: which gradient update happened when, and on which batch of data.
You don't necessarily record the actual gradient values in the log, but you could replay that log of operations so that you get repeatability.
Then I think you'd be happier.
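Here is a minimal sketch of that replay idea, under the same toy-model assumptions as above: log only the order in which updates were applied (here, batch ids), not the gradients themselves, and re-run the log to get identical parameters:

```python
import random
import jax
import jax.numpy as jnp

LR = 0.1

def loss(p, x, y):
    return jnp.mean((x @ p - y) ** 2)

grad_fn = jax.grad(loss)

def run(params, batches, schedule, log):
    # Apply one gradient update per schedule entry, recording the order.
    for batch_id in schedule:
        x, y = batches[batch_id]
        params = params - LR * grad_fn(params, x, y)
        log.append(batch_id)              # the "log of operations"
    return params

key = jax.random.PRNGKey(0)
xs = jax.random.normal(key, (6, 8, 4))
ys = xs @ jnp.array([1.0, -2.0, 0.5, 3.0])
batches = list(zip(xs, ys))

# Original run: a shuffle stands in for the nondeterministic order in which
# an async system happened to apply the updates; we record it as it happens.
order = list(range(len(batches)))
random.shuffle(order)
log = []
p1 = run(jnp.zeros(4), batches, order, log)

# Replay the recorded log: same order, so identical parameters.
p2 = run(jnp.zeros(4), batches, log, [])
assert jnp.allclose(p1, p2)
```

This assumes each individual update is itself deterministic; the log only pins down the interleaving.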
I mean, the thing that let us go from asynchronous training on CPUs to fully synchronous training is the fact that we have these super-fast TPU hardware chips, and then pods, which have incredible amounts of bandwidth between the chips in a pod.
And then, scaling beyond that, we have really good data center networks and even cross-metro-area networks that enable us to scale to, you know, many, many pods in multiple metro areas for our largest training runs.
And we can do that fully synchronously, as Noam said, as long as the gradient accumulation and communication of the parameters across metro areas happens, you know, fast enough relative to the step time, you're golden.
You don't really care.
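Here is what that fully synchronous pattern looks like as a minimal JAX sketch, using `jax.pmap` with a `jax.lax.pmean` all-reduce. This illustrates the lockstep gradient averaging, not Google's actual training stack, and the model and `LR` are again toy assumptions:

```python
from functools import partial
import jax
import jax.numpy as jnp

LR = 0.1

def loss(p, x, y):
    return jnp.mean((x @ p - y) ** 2)

@partial(jax.pmap, axis_name='replicas')   # one lockstep program per device
def sync_step(params, x, y):
    g = jax.grad(loss)(params, x, y)
    # Synchronous all-reduce: average gradients across every replica before
    # anyone steps, so all copies of the parameters stay bit-identical.
    g = jax.lax.pmean(g, axis_name='replicas')
    return params - LR * g

n = jax.local_device_count()
key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (n, 8, 4))       # one data shard per device
y = x @ jnp.array([1.0, -2.0, 0.5, 3.0])
params = jnp.zeros((n, 4))                  # parameters replicated per device
params = sync_step(params, x, y)
```

The `pmean` is the step that has to cross chips, pods, and metro areas; as long as that all-reduce stays small relative to the step time, the synchrony is essentially free.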
But I think as you scale up, there may be a push to have a bit more asynchrony in our system than we have now.
Because, like, we can make it work.
You know, our ML researchers have been really happy with how far we've been able to push synchronous training, because it's an easier mental model to understand.
You just have your algorithm sort of fighting you, rather than the asynchrony and the algorithm both battling you.
Maybe it's your adversarial machine MUQQ17 that is, like, setting the seventh bit of the exponent in all your gradients or something.