Sholto Douglas
And you see a similar thing with their approach to sparsity, where they're iteratively working out the best way to do this over multiple papers.
And the part that I like is that it's simple.
A big failure mode that a lot of ML researchers have is doing these overly complicated things that don't think hard enough about the hardware systems you have in mind.
Whereas in the first DeepSeek sparse MoE solution, they designed these rack- and node-level load balancing losses.
So you can see them being like, OK, we have to perfectly balance on this.
And then they actually come up with a much better solution later on, where they don't have to have the auxiliary loss, where they just have these bias terms that they put in.
And it's cool.
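To make that contrast concrete, here is a minimal sketch of the two routing styles being described. This is not DeepSeek's actual code; the function names, shapes, and the coefficient alpha are illustrative assumptions.

```python
import numpy as np

def route_with_aux_loss(logits, k, alpha=0.01):
    """Top-k routing plus an auxiliary load-balancing loss term.
    alpha is the coefficient you have to tune (the 'weighting' mentioned below)."""
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)              # softmax over experts
    topk = np.argsort(-logits, axis=-1)[:, :k]         # chosen experts per token
    num_experts = logits.shape[-1]
    # Fraction of routed tokens per expert vs. mean router probability per expert.
    load = np.bincount(topk.ravel(), minlength=num_experts) / topk.size
    importance = probs.mean(axis=0)
    aux_loss = alpha * num_experts * np.sum(load * importance)
    return topk, aux_loss                               # aux_loss is added to the training loss

def route_with_bias(logits, bias, k):
    """Auxiliary-loss-free style: a per-expert bias nudges the top-k selection,
    but there is no extra loss term competing with the language-modeling objective."""
    topk = np.argsort(-(logits + bias), axis=-1)[:, :k]
    return topk
```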
JONATHAN HOLMES, But balancing auxiliary losses is annoying.
Like, you're making the model trade off this thing.
And with auxiliary losses, you have to control the coefficient and the weighting.
The bias is cleaner in some respects.
MARK MANDELMANN, Interesting.
They did have to change it during training.
It depends on what your architecture is.
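For the "change it during training" part, here is a rough sketch of what an online bias update could look like, assuming a simple sign-based rule and a made-up step size gamma rather than any paper's exact procedure.

```python
import numpy as np

def update_bias(bias, topk, num_experts, gamma=0.001):
    # Fraction of routed tokens that landed on each expert in this batch.
    load = np.bincount(topk.ravel(), minlength=num_experts) / topk.size
    target = 1.0 / num_experts
    # Overloaded experts get their bias nudged down, underloaded ones up.
    return bias - gamma * np.sign(load - target)
```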
But I just thought it was cute that you can see them running up against this very hardware-level constraint and trying to go, what do we wish we could express algorithmically?
What can we express under our constraints?
And iteratively solving to get better constraints.