They don't go off and have lots of different branches for mathy things that don't merge back together with the CAD-image part of the model.
And I think we should probably have a more organic structure in these things.
I also would like it if the pieces of the model could be developed a little bit independently.
Yeah.
Like right now, I think we have this issue where we're going to train a model.
So we do a bunch of preparation work, deciding on the most awesome algorithms and the most awesome data mix we can come up with.
But there's always trade-offs there.
Like, we'd love to include more multilingual data, but that might come at the expense of coding data. And so the model is less good at coding but better at multilingual tasks, or vice versa.
And I think it would be really great if we could have a small set of people who care about a particular subset of languages go off and create really good training data, train a modular piece of a model, and then hook that piece up to a larger model to improve its capability in, say, Southeast Asian languages, or in reasoning about Haskell code or something. Then you also get a nice software engineering benefit: you've decomposed the problem, compared to what we do today, which is a whole bunch of people working, followed by this kind of monolithic process of kicking off pre-training on the model. If we could do that, you could have 100 teams around Google, or people all around the world, working to improve the languages or the particular problems they care about, all collectively working to improve the model.
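A minimal sketch of that kind of plug-in modularity, in Python. The BaseModel and DomainModule classes, the residual update, and the domain routing are all hypothetical illustrations of the idea being described, not any actual Google system:

```python
# Hypothetical sketch: a frozen shared trunk plus independently developed
# domain modules that teams can train and register separately.
import numpy as np

class DomainModule:
    """A small piece of the model owned by one team, trained independently."""
    def __init__(self, name: str, dim: int, seed: int):
        rng = np.random.default_rng(seed)
        self.name = name
        # Stand-in for the module's learned parameters (small init).
        self.weights = rng.standard_normal((dim, dim)) * 0.01

    def forward(self, h: np.ndarray) -> np.ndarray:
        # Residual update, so a newly plugged-in module starts near identity.
        return h + np.tanh(h @ self.weights)

class BaseModel:
    """A frozen shared trunk that modules hook into."""
    def __init__(self, dim: int):
        self.dim = dim
        self.modules: dict[str, DomainModule] = {}

    def register(self, module: DomainModule) -> None:
        # Registering a module by name lets a team ship updates independently.
        self.modules[module.name] = module

    def forward(self, h: np.ndarray, domains: list) -> np.ndarray:
        # Only the modules relevant to this input run; the rest are untouched.
        for name in domains:
            if name in self.modules:
                h = self.modules[name].forward(h)
        return h

base = BaseModel(dim=16)
# Separate teams develop their pieces independently, then register them.
base.register(DomainModule("southeast_asian_langs", dim=16, seed=0))
base.register(DomainModule("haskell", dim=16, seed=1))

x = np.zeros(16)
print(base.forward(x, domains=["haskell"]))
```

The residual form is the design point worth noting: a freshly registered module starts out as a small perturbation of the frozen trunk, so one team's new piece can't silently break another team's domain.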
And that's kind of a form of continual learning.
Yeah, I think there may be ways to get a lot of the benefits of that with a kind of versioned modularity.
Like I have a frozen version of my model.
And then I include a different variant of some particular module and I want to compare its performance or train it a bit more.
And then I compare it to the baseline of this thing with version N-prime of this particular module that does Haskell interpretation swapped in.
And it's also more parallelizable, I think.
Yeah.
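Continuing the hypothetical BaseModel/DomainModule sketch above, the versioned workflow being described might look like the following: the trunk and every other module stay frozen, only the Haskell module is swapped between version N and a candidate N-prime, and both are scored on the same eval set. The version tags and the toy metric are assumptions for illustration:

```python
# Continues the sketch above; everything here is illustrative, not a real API.
import numpy as np

def evaluate(model: BaseModel, eval_set: list, domain: str) -> float:
    # Toy stand-in metric (mean output norm); a real eval would score task accuracy.
    return float(np.mean([np.linalg.norm(model.forward(x, [domain]))
                          for x in eval_set]))

# Keep every trained variant of a module under a version tag.
haskell_versions = {
    "v_N": DomainModule("haskell", dim=16, seed=1),        # frozen baseline version
    "v_N_prime": DomainModule("haskell", dim=16, seed=2),  # candidate, trained a bit more
}

eval_set = [np.random.default_rng(i).standard_normal(16) for i in range(8)]

scores = {}
for tag, module in haskell_versions.items():
    # Swap in just this one module; the trunk and all other modules stay frozen.
    base.register(module)
    scores[tag] = evaluate(base, eval_set, "haskell")

print(scores)  # adopt v_N_prime only if it beats the v_N baseline
```

Because each comparison touches only one module against the same frozen baseline, many such experiments can run side by side, which is the parallelizability point above.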