Noam Shazeer
And that should vary by factors of 10,000 between really easy things and really hard things, maybe even a million.
And it might be iterative.
Like you might make a pass through the model, get some stuff back, and then decide you now need to call on some other parts of the model for another aspect of it.
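A minimal sketch of what that kind of iterative, input-dependent compute could look like, assuming a hypothetical refinement block and a learned halting head; none of these names or choices come from the conversation:

```python
# Illustrative sketch only: keep refining a representation until a learned
# halting head decides the input is "easy enough", so compute per example
# can vary by a large factor between easy and hard inputs.
import torch
import torch.nn as nn

class IterativeRefiner(nn.Module):
    def __init__(self, d_model: int, max_steps: int = 8):
        super().__init__()
        self.base_block = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.halting_head = nn.Linear(d_model, 1)  # predicts "am I done?"
        self.max_steps = max_steps

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq, d_model). Easy inputs halt after one pass;
        # hard ones may take up to max_steps passes through the block.
        for _ in range(self.max_steps):
            h = self.base_block(h)
            p_halt = torch.sigmoid(self.halting_head(h.mean(dim=1)))
            if bool((p_halt > 0.5).all()):
                break
        return h
```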
The other thing I would say is that this sounds super complicated to deploy, because it's this weird, constantly evolving thing with maybe not super optimized ways of communicating between the pieces.
But you can always distill from that, right?
So if you say, this is the kind of task I really care about, let me distill from this giant, organic-y thing into something that I know can be served really efficiently.
And you could do that distillation process, you know, whenever you want, once a day, once an hour.
And that seems like it'd be kind of good.
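A minimal sketch of that periodic distillation idea, assuming hypothetical `teacher`, `student`, `batch`, and `optimizer` objects rather than anything described in the conversation:

```python
# Illustrative sketch only: periodically compress a large, evolving "teacher"
# into a small student that is cheap to serve.
import torch
import torch.nn.functional as F

def distill_step(teacher, student, batch, optimizer, temperature: float = 2.0):
    with torch.no_grad():
        teacher_logits = teacher(batch)          # soft targets from the big model
    student_logits = student(batch)
    loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Re-run this loop "whenever you want, once a day, once an hour" over fresh
# data to refresh the servable student from the latest teacher.
```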
Yeah.
A related thing is I feel like we need interesting learning techniques during pre-training.
Like I'm not sure we're extracting the maximal value from every token we look at with the current training objective, right?
Like maybe we should think a lot harder about some tokens.
You know, when you get to "the answer is," maybe the model should do a lot more work at training time than when it gets to "the."
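One illustrative way to spend more training signal on hard tokens is to weight each token's loss by how surprising it was; this is a hedged sketch of the general idea, not something specified in the conversation:

```python
# Illustrative sketch only: upweight high-loss (hard) tokens so "the"
# contributes little and the token after "the answer is" contributes a lot.
import torch
import torch.nn.functional as F

def weighted_lm_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    # logits: (batch, seq, vocab), targets: (batch, seq)
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1), reduction="none"
    ).reshape(targets.shape)
    # Detach the weights so they act as constants, and cap them for stability.
    weights = (per_token.detach() / per_token.detach().mean()).clamp(max=10.0)
    return (weights * per_token).mean()
```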
Yeah, every which way.
Hide some stuff this way, hide some stuff that way.
Make it infer from partial information.
These kinds of things.
I think people have been doing this in vision models for a while.
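A rough sketch of that kind of masked, partial-information objective, in the spirit of masked prediction in vision models; `encoder` and `decoder` are assumed placeholders, and the shapes are illustrative:

```python
# Illustrative sketch only: randomly hide parts of the input and train the
# model to reconstruct them from partial information. Assumes encoder/decoder
# preserve the (batch, num_patches, dim) shape.
import torch
import torch.nn.functional as F

def masked_reconstruction_loss(encoder, decoder, x: torch.Tensor, mask_ratio: float = 0.5):
    # x: (batch, num_patches, dim); pick a random subset of patches to hide.
    mask = torch.rand(x.shape[:2], device=x.device) < mask_ratio  # True = hidden
    visible = x * (~mask).unsqueeze(-1)                           # zero out hidden patches
    recon = decoder(encoder(visible))                             # infer from partial info
    # Only score the reconstruction on the patches the model never saw.
    return F.mse_loss(recon[mask], x[mask])
```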