Aman Sanger
Podcast Appearances
I think context length is another obvious one. So if you care, like, let's say you care about the two things of inference compute and then context window, maybe the thing you want to train is some kind of SSM because they're much, much cheaper and faster at super, super long context.
And even if maybe it is 10X worse scaling properties during training, meaning you have to spend 10X more compute to train the thing to get the same level of capabilities, it's worth it because you care most about that inference budget for really long context windows. So it'll be interesting to see how people kind of play with all these dimensions.
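The trade-off described above can be made concrete with a small break-even calculation. This is only a sketch: the function name `break_even_units` and all cost figures are hypothetical, chosen for illustration, and are not measurements of any real SSM or transformer.

```python
def break_even_units(train_a, infer_a, train_b, infer_b):
    """Units of inference work after which model B becomes cheaper overall.

    Model A: cheaper to train, more expensive per unit of inference.
    Model B: more expensive to train (e.g. 10x), cheaper per unit of inference.
    All costs are in the same arbitrary compute unit (hypothetical values).
    """
    if infer_a <= infer_b:
        raise ValueError("model B must be cheaper at inference to ever break even")
    # Solve train_a + n * infer_a == train_b + n * infer_b for n.
    return (train_b - train_a) / (infer_a - infer_b)


# Hypothetical numbers: B costs 10x more to train but 5x less per long-context request.
n = break_even_units(train_a=100, infer_a=5, train_b=1000, infer_b=1)
print(n)  # 225.0
```

Past that break-even point every additional long-context request favors the model that was more expensive to train, which is the sense in which a 10X training penalty can still be "worth it".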
I mean, I think bigger is certainly better for just raw performance.
And raw intelligence. I think the path that people might take is... I'm particularly bullish on distillation. And, yeah, how many knobs can you turn so that, if we spend a ton, ton of money on training, we get the most capable cheap model?
Like, really, really caring as much as you can. Because the naive version of caring as much as you can about inference-time compute is what people have already done with the Llama models: just overtraining the shit out of 7B models on way, way, way more tokens than is Chinchilla-optimal, right? But if you really care about it, maybe the thing to do is what Gemma did, which is: let's not just train on tokens, let's literally train on
minimizing the KL divergence with the distribution of Gemma 27B, right? So knowledge distillation there. And you're spending the compute of literally training this 27-billion-parameter model on all these tokens just to get out this, I don't know, smaller model.
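To put rough numbers on the overtraining and teacher-training costs mentioned above, here is a back-of-the-envelope sketch. It assumes two common rules of thumb that the speaker does not state: training compute of roughly 6 FLOPs per parameter per token, and a compute-optimal budget of roughly 20 tokens per parameter. The 2-trillion-token figure is an illustrative choice, and the helper names are hypothetical.

```python
def training_flops(n_params, n_tokens):
    """Approximate training compute, assuming ~6 FLOPs per parameter per token."""
    return 6 * n_params * n_tokens


def compute_optimal_tokens(n_params, tokens_per_param=20):
    """Rough compute-optimal token budget (assumed ~20 tokens per parameter)."""
    return tokens_per_param * n_params


small = 7_000_000_000        # a 7B-parameter model
teacher = 27_000_000_000     # a 27B-parameter teacher
tokens = 2_000_000_000_000   # illustrative 2T-token training run

optimal = compute_optimal_tokens(small)
print(optimal)                                   # 140000000000 tokens
print(tokens / optimal)                          # how far past compute-optimal the 7B run is
print(training_flops(teacher, tokens) / training_flops(small, tokens))  # teacher vs. small-model cost
```

Under these assumptions the 7B model is trained on roughly 14 times its compute-optimal token budget, and training the 27B teacher on the same tokens costs a bit under 4 times as much as training the 7B model directly.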
Yeah, distillation in theory is, I think, getting out more signal from the data that you're training on. And it's perhaps another way of, not completely getting over, but partially helping with the data wall, where you only have so much data to train on.
Let's train this really, really big model on all these tokens and we'll distill it into a smaller one. And maybe we can get more signal per token for this much smaller model than we would have originally if we trained it.
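One way to see where the extra "signal per token" could come from: instead of a single correct next token, the student is trained to match the teacher's whole next-token distribution. The following is a minimal sketch of that KL-divergence objective in plain Python, for a single position over a toy vocabulary. The function names and logits are hypothetical, and this is not the training code of any particular model.

```python
import math


def softmax(logits):
    """Convert raw logits into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(teacher_logits, student_logits):
    """Per-position loss: how far the student's distribution is from the teacher's."""
    return kl_divergence(softmax(teacher_logits), softmax(student_logits))


# Toy 3-token vocabulary with hypothetical logits.
teacher = [1.0, 2.0, 3.0]
print(distillation_loss(teacher, [1.0, 2.0, 3.0]))  # 0.0 when the student matches exactly
print(distillation_loss(teacher, [3.0, 2.0, 1.0]))  # positive when the distributions differ
```

With a one-hot label the student only learns which token was correct; with the teacher's distribution it also learns how plausible every other token was, which is the sense in which each training token carries more information.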