Kwasi Ankomah
Already there are like six things that have happened there, and each of those calls is inference. So once that adds up, our inference speed starts to make a big difference.

Yeah, that's right. That's probably where we've heard the biggest praise from our clients: something that was running at, say, 150 tokens per second on NVIDIA, we're running at like 700, 800 tokens per second.

As a guy who's running local models on a laptop, I just can't...
Yeah.
And you can go on SambaNova Cloud and see our token speeds, and that makes a huge difference.
We just did one of our partnerships, and they actually showed a side-by-side video of us versus another setup. They just couldn't believe the speed, because in those real-time applications it makes a big difference.
So that's one place where I think we're able to outperform GPUs.
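To make the compounding effect concrete, here is a back-of-the-envelope sketch. The six-call count and the token rates come from the conversation above; the 500-output-tokens-per-call figure is purely an assumption for illustration.

```python
# Why per-call token speed compounds across an agent pipeline.

CALLS = 6              # sequential inference calls in one agent task
TOKENS_PER_CALL = 500  # assumed average output tokens per call

for label, tokens_per_sec in [("GPU at ~150 tok/s", 150),
                              ("SambaNova at ~700 tok/s", 700)]:
    seconds = CALLS * TOKENS_PER_CALL / tokens_per_sec
    print(f"{label}: {seconds:.1f} s of pure generation time")

# GPU at ~150 tok/s: 20.0 s of pure generation time
# SambaNova at ~700 tok/s: 4.3 s of pure generation time
```

The point is that a per-call speedup multiplies across every step of the chain, which is why it dominates perceived latency in real-time agent workloads.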
The second is around model coordination and model bundling. And what I mean by that is: you don't always need the same model, or a huge model, for every task.
To give you an example, let's stay with the coding agent. On a GPU or other architectures, you might use the frontier model, which is super expensive and huge, for all of those tasks.
Now, that isn't super efficient, right? And if you wanted to swap to a different model, you'd need another piece of infrastructure, because you can't really do model swapping due to the memory limitations of the chip.
Now, because we let you swap out models on the fly on the same hardware, the efficiency is a lot better, and the total cost of ownership, especially when you have the rack, is a lot cheaper.
So what I mean by that is: let's say you're using that coding agent, and we want our top-level agent to use the frontier model because it's doing all the planning. But the model that actually just goes and reads the code and does some note-taking can be a much smaller model.
So to give you an example, we have clients who have done this.
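As a rough illustration of that routing idea (this is not SambaNova's actual API; the model names and task kinds here are made up), a coordinator might look something like this:

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical model names, purely for illustration.
PLANNER_MODEL = "frontier-large"  # big, expensive: planning only
WORKER_MODEL = "small-fast"       # much smaller: reading code, note-taking

@dataclass
class Task:
    kind: str    # "plan", "read_code", "take_notes", ...
    prompt: str

def route(task: Task) -> str:
    """Send only planning to the frontier model; everything else
    goes to the smaller, cheaper model."""
    return PLANNER_MODEL if task.kind == "plan" else WORKER_MODEL

def run(task: Task, call_model: Callable[[str, str], str]) -> str:
    # call_model(model_name, prompt) stands in for your inference client.
    # On hardware that can swap models on the fly, both models share the
    # same rack instead of each needing its own deployment.
    return call_model(route(task), task.prompt)

# Example with a stub client:
if __name__ == "__main__":
    stub = lambda model, prompt: f"[{model}] handled: {prompt}"
    print(run(Task("plan", "design the refactor"), stub))
    print(run(Task("read_code", "summarize utils.py"), stub))
```

The routing logic itself is trivial; the economics come from the fact that, on hardware with fast model swapping, both models can live behind the same deployment rather than each requiring dedicated infrastructure.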