Aman Sanger
Podcast Appearances
I feel like I have much more to do there. It felt like the path to get to IMO was a little bit more clear, because it already could get a few IMO problems. And there was a bunch of low-hanging fruit, given the literature at the time, in terms of what tactics people could take. I think, one, I'm much less versed in the space of theorem proving now.
And two, yeah, less intuition about how close we are to solving these really, really hard open problems.
I think we might get a Fields Medal before AGI.
I think it's interesting. The original scaling laws paper by OpenAI was slightly wrong because, I think, of some issues with their learning rate schedules. And then Chinchilla showed a more correct version.
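As a rough illustration of what "compute optimal" means here, the sketch below splits a training budget between parameters and tokens. It is not taken from the conversation: it assumes the common approximation that training costs about 6 FLOPs per parameter per token, and the Chinchilla-style rule of thumb of roughly 20 training tokens per parameter. Both constants and the function name `compute_optimal_allocation` are assumptions made for illustration.

```python
import math

# Assumption: training cost C is approximately 6 * N * D FLOPs
# (N = parameters, D = training tokens).
FLOPS_PER_PARAM_TOKEN = 6.0
# Assumption: Chinchilla-style rule of thumb, about 20 tokens per parameter.
TOKENS_PER_PARAM = 20.0


def compute_optimal_allocation(train_flops: float) -> tuple[float, float]:
    """Return (params, tokens) satisfying C = 6*N*D and D = 20*N."""
    # Substituting D = 20*N into C = 6*N*D gives C = 120 * N**2.
    params = math.sqrt(train_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    tokens = TOKENS_PER_PARAM * params
    return params, tokens


if __name__ == "__main__":
    for budget in (1e21, 1e23, 1e25):
        n, d = compute_optimal_allocation(budget)
        print(f"{budget:.0e} FLOPs -> {n:.2e} params, {d:.2e} tokens")
```

Under these assumed constants, parameters and tokens both grow with the square root of the compute budget, so neither is scaled up much faster than the other.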
And then since then, people have again kind of deviated from doing the compute-optimal thing, because people now optimize more for making the thing work really well given an inference budget. And I think there are a lot more dimensions to these curves than what we originally used, which was just compute, number of parameters, and data. Inference compute is the obvious one.
I think context length is another obvious one. So let's say you care about two things, inference compute and context window. Maybe the thing you want to train is some kind of SSM, because they're much, much cheaper and faster at super, super long context.
And even if it maybe has 10x worse scaling properties during training, meaning you have to spend 10x more compute to train the thing to get the same level of capabilities, it's worth it because you care most about that inference budget for really long context windows. So it'll be interesting to see how people kind of play with all these dimensions.
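The trade-off described above can be made concrete with a toy break-even calculation. It assumes lifetime cost is simply training compute plus a fixed per-token inference cost times the number of tokens served. The function names and every number in the usage section are hypothetical and chosen only to show the shape of the argument. They are not measurements of any real transformer or SSM.

```python
import math


def lifetime_cost(train_flops: float, infer_flops_per_token: float,
                  served_tokens: float) -> float:
    """Toy model: total compute = training + per-token inference * tokens served."""
    return train_flops + infer_flops_per_token * served_tokens


def break_even_tokens(base_train: float, base_infer: float,
                      alt_train: float, alt_infer: float) -> float:
    """Served tokens beyond which the alternative model is cheaper overall.

    Returns math.inf if the alternative is not cheaper per served token.
    """
    saving_per_token = base_infer - alt_infer
    if saving_per_token <= 0:
        return math.inf
    return max(0.0, (alt_train - base_train) / saving_per_token)


if __name__ == "__main__":
    # Hypothetical numbers: the alternative costs 10x more to train but is
    # 10x cheaper per served token at long context.
    base_train, base_infer = 1e24, 1e13
    alt_train, alt_infer = 1e25, 1e12
    t = break_even_tokens(base_train, base_infer, alt_train, alt_infer)
    print(f"break-even after about {t:.2e} served tokens")
```

In this toy model the 10x training penalty pays for itself once enough long-context tokens have been served. Where the break-even actually falls depends entirely on the real per-token costs, which this sketch does not estimate.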
I mean, I think bigger is certainly better for just raw performance.