Reiner Pope
When you look at some of the announcements, API providers will sometimes brag about how much traffic they have.
The numbers I remember from some of the Gemini announcements last year were in the hundreds of millions of tokens per second worldwide.
So this is about one thousandth of that.
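A quick back-of-envelope sketch of that comparison; the worldwide figure is a rough recollection from those announcements, so every number here is purely illustrative:

```python
# Back-of-envelope throughput comparison. All figures are illustrative
# recollections from the discussion above, not official numbers.
worldwide_tokens_per_sec = 300e6             # "hundreds of millions" worldwide
one_thousandth = worldwide_tokens_per_sec / 1000  # "one thousandth of that"
print(f"{one_thousandth:,.0f} tokens/sec")   # -> 300,000 tokens/sec
```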
Yeah, so the quality of the model, rather than the speed of the model.
Yeah.
So unfortunately, we're not able to answer that analytically.
That is an empirical question of model quality.
The best I can do is pull up a paper and answer it empirically.
So this paper is Unified Scaling Laws for Routed Language Models.
It's a somewhat old paper by this stage, but one of the things they did was look at: if I keep increasing sparsity, what is the impact on model quality?
The answer is very sensitive to the actual choice of Mixture of Experts technique.
Mixture of Experts has been around for a really long time, going back to at least 2017 I think, but the techniques have changed a lot.
DeepSeek's Mixture of Experts was a big change in how it worked.
Before that, there were older papers like GShard and Switch Transformer.
So the actual empirical results are going to depend on all of that.
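For readers who want the mechanism pinned down, here is a minimal sketch of the top-k routing that these Mixture of Experts variants share; the plain-numpy implementation, shapes, and sizes are illustrative assumptions, not the method of any one paper mentioned above:

```python
import numpy as np

def moe_layer(x, gate_w, experts, top_k=2):
    """Route each token to its top_k experts and mix their outputs.

    x:       (num_tokens, d_model) token activations
    gate_w:  (d_model, num_experts) router weights
    experts: list of (d_model, d_model) weight matrices, one per expert
    """
    logits = x @ gate_w                            # (tokens, experts)
    top = np.argsort(logits, axis=-1)[:, -top_k:]  # indices of top_k experts
    # Softmax over only the selected experts' logits.
    sel = np.take_along_axis(logits, top, axis=-1)
    weights = np.exp(sel - sel.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)

    out = np.zeros_like(x)
    for i, token in enumerate(x):
        for j, e in enumerate(top[i]):
            # Only top_k experts run per token; the rest are skipped entirely.
            out[i] += weights[i, j] * (token @ experts[e])
    return out

d_model, num_experts = 16, 8
rng = np.random.default_rng(0)
x = rng.normal(size=(4, d_model))
gate_w = rng.normal(size=(d_model, num_experts))
experts = [rng.normal(size=(d_model, d_model)) for _ in range(num_experts)]
y = moe_layer(x, gate_w, experts)  # 2 of 8 experts active per token
```

The key property is that per-token compute depends only on top_k, not on the total expert count, which is what lets sparsity grow while active parameters stay fixed.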
But for one of the older techniques shown here, you can see that if I hold the number of active parameters constant at a certain size and then increase the sparsity, which they call expert count here, the quality keeps increasing.
And then if you imagine drawing a horizontal line from
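To make the setup behind that figure concrete, here is a hedged sketch of the parameter accounting; the expert size and top_k are hypothetical, chosen only to show how total parameters grow with expert count while active parameters stay fixed:

```python
# Parameter accounting for the experiment described above. Sizes are
# illustrative assumptions, not numbers from the paper.
params_per_expert = 100_000_000  # hypothetical expert size
top_k = 2                        # experts active per token

for num_experts in (8, 32, 128):
    total = num_experts * params_per_expert
    active = top_k * params_per_expert  # constant as expert count grows
    print(f"{num_experts:4d} experts: total={total/1e9:5.1f}B, "
          f"active={active/1e9:.1f}B")
```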