Kari Briski
And I think that's why, you know, you kind of introduced us and we have many sizes of Nemotron models, because you do need larger models to inform and train the smaller models.
And larger models are a little bit more robust; they can generalize a little bit more, even within a domain, and they have a bigger capacity to learn a new domain.
So I think that there's room for all kinds of models.
Yeah, so it seems kind of natural that we did a small, medium, and large, but there was a little bit more thought that went into it.
For our nano, we really wanted it to fit into smaller GPU formats, right?
So for anyone who's either renting, let's say, an older hardware architecture in the cloud, or running on a laptop that has a GPU in it, we wanted them to be able to run it with the memory they have and the efficiency they need, and still have a really great reasoning model.
And so that was the Nano.
And so it's a 9 billion parameter model.
At floating point 16, you really only need about 18 gigs of memory, right?
And then again, you can use our tools to quantize it down even further and make it even smaller and more efficient.
Yeah, so that's kind of what we did for the Nano.
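As a rough back-of-the-envelope check on that 18 gig figure, here's a minimal sketch of the weight-memory math; the helper name and the quantized bit widths are illustrative assumptions, not anything specific to the Nemotron tooling, and real deployments also need room for KV cache and activations.

```python
def weight_memory_gb(params_billions: float, bits_per_param: int) -> float:
    """Rough memory needed just to hold the weights
    (ignores KV cache, activations, and framework overhead)."""
    # billions of params * bytes per param = gigabytes
    return params_billions * bits_per_param / 8

print(weight_memory_gb(9, 16))  # ~18.0 GB at FP16, matching the figure above
print(weight_memory_gb(9, 8))   # ~9.0 GB with 8-bit quantization
print(weight_memory_gb(9, 4))   # ~4.5 GB with 4-bit quantization
```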
For the Super, we were thinking we wanted to keep it within a single Hopper GPU, a data center GPU.
And you can easily do that; you can run it in FP8.
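To make the "fits on one Hopper GPU at FP8" point concrete, here's a hedged sketch of a simple fit check; the 80 GB capacity is the H100's headline memory, and the example parameter count is a placeholder assumption, not the Super's actual size.

```python
HOPPER_MEMORY_GB = 80  # e.g. an H100 80 GB part

def fits_on_one_gpu(params_billions: float, bits_per_param: int,
                    gpu_memory_gb: float = HOPPER_MEMORY_GB) -> bool:
    """Very rough check: do the weights alone fit in one GPU's memory?
    Real serving also needs headroom for KV cache and activations."""
    weights_gb = params_billions * bits_per_param / 8
    return weights_gb <= gpu_memory_gb

# Hypothetical mid-size model: FP8 halves the footprint versus FP16.
print(fits_on_one_gpu(50, 16))  # False: ~100 GB of weights at FP16
print(fits_on_one_gpu(50, 8))   # True:  ~50 GB of weights at FP8
```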
For Ultra, we actually had a debate with our engineering team, because, again, a lot of the really great large models are larger than a node.
And let me just explain what a node is.
If you don't know, it's eight GPUs in a box.
So most of these really large models are multi-node.
So you have to take up more than eight GPUs just to run one model.
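Following that node arithmetic (a node being eight GPUs), here's a minimal sketch of how you might estimate GPU and node counts for a model whose weights don't fit on one GPU; the per-GPU memory and the example size are illustrative assumptions, and it ignores KV cache and parallelism overhead.

```python
import math

GPUS_PER_NODE = 8

def gpus_and_nodes_needed(weights_gb: float, gpu_memory_gb: float = 80) -> tuple[int, int]:
    """Minimum GPUs to hold the weights (sharded evenly across GPUs),
    and the number of eight-GPU nodes that implies."""
    gpus = math.ceil(weights_gb / gpu_memory_gb)
    nodes = math.ceil(gpus / GPUS_PER_NODE)
    return gpus, nodes

# A hypothetical very large model with ~800 GB of weights needs more than
# eight GPUs, i.e. more than one node, just to serve a single copy.
print(gpus_and_nodes_needed(800))  # (10, 2)
```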