So data on pre-training.
This is not well-cited, but... You want me to not remove that?
No, it's fine.
And I think the number of active params could often be in the range of, like, 100 billion, something like that.
Yeah.
Maybe a bit larger.
So I'm assuming active params are about 100 billion.
And so multiply by 20 to get the Chinchilla token count.
So the Chinchilla-optimal count would be around 2 trillion tokens.
And yeah, we see we're at something like 100 times larger than that.
Actually, what does "the Chinchilla token count" actually mean?
Like, the token count for pre-training that the Chinchilla scaling law would recommend, I guess.
Got it.
So, yeah, the ratio of this 200 trillion or 100 trillion tokens over the potential optimal of 2 trillion, that's the amount of overtraining, which is like a factor of 100, perhaps.
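A minimal back-of-envelope sketch in Python of the arithmetic above; the 100-billion active-parameter count, the 20-tokens-per-parameter Chinchilla rule of thumb, and the ~200-trillion-token pre-training figure are all rough assumptions from the conversation, with wide error bars, not measured values:

```python
# Back-of-envelope sketch of the overtraining estimate discussed above.
# All inputs are rough assumptions from the conversation, not measured values.

active_params = 100e9    # assumed ~100 billion active parameters
tokens_per_param = 20    # Chinchilla rule of thumb: ~20 tokens per parameter

chinchilla_tokens = tokens_per_param * active_params
print(f"Chinchilla-optimal token count: {chinchilla_tokens:.0e}")  # ~2e+12 (2 trillion)

actual_tokens = 200e12   # assumed ~200 trillion pre-training tokens
overtraining_factor = actual_tokens / chinchilla_tokens
print(f"Overtraining factor: {overtraining_factor:.0f}x")          # ~100x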
I mean, this is why you should just approximate everywhere, because there are such big error bars on this.
But yeah, it's kind of empowering to just set A equal to B and figure it out.
Yeah, yeah.
That's super cool.
Yeah.
So, I mean, why specifically 50%?