Kwasi Ankomah
Podcast Appearances
And that's a really interesting area as well.
So, yeah, that's what we're trying to focus on at the moment.
Yeah.
Yeah, so inference is coming from the word to infer.
So it's the model going along and then making a prediction of some sort.
So it's taking your input and then it's basically doing the thing that large language models do, which is predicting the next token.
And that is the actual process of inference.
It goes in, it runs through the model, and we get an output.
And that output keeps going. Essentially, all that you're doing is: we have a model that has already been trained and put on some sort of architecture, and we are basically giving you the answer, the next token. And then, of course, as you see that stream, the next token goes back in and we get the prediction based on that token as well. So that, in a nutshell, is it: we have a model that's already been trained, and we are just giving you the output of that model. Yeah.
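To make that concrete, here is a minimal sketch of the autoregressive loop being described, using Hugging Face transformers with GPT-2 purely as a small illustrative stand-in; the speaker does not name a specific model or framework.

```python
# Minimal sketch of autoregressive inference: one forward pass per new token,
# with each predicted token appended to the input and fed back in.
# GPT-2 and the prompt are assumptions; the transcript names no model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(10):                               # generate ten tokens
        logits = model(input_ids).logits              # forward pass through the trained model
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # most likely next token
        input_ids = torch.cat([input_ids, next_token], dim=-1)      # feed it back in

print(tokenizer.decode(input_ids[0]))
```

In practice you would call something like model.generate with a sampling strategy, but the loop above is the shape of what happens under the hood each time a token streams back.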
Yeah, so I think inference speed kind of directly dictates how the user interacts with the application, right?
So we've all been there on, you know, your favorite chat application, be that what it may.
And when you press that button to run inference, right?
So as I've talked about with inference, you're making a pass through the model and you're getting output at the end.
Now that pass can take a long time depending on the size of the model, the number of parameters, and the hardware that it's running on.
Now, if there is a big latency with a real-time application, that does become an issue.
And we've seen that: if you try to run certain models on certain architectures, you can have a time to first token (the time it takes for the first token to arrive on the user's screen) of between 20 and 30 seconds.
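As a rough illustration of how you might measure that time to first token, here is a sketch using the transformers streaming interface; the model and prompt are placeholders, not anything mentioned in the conversation.

```python
# Rough sketch: time how long the first streamed token takes to arrive.
# GPT-2 and the prompt are illustrative placeholders only.
import threading
import time

from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Explain inference in one sentence:", return_tensors="pt")
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True)

start = time.perf_counter()
generation = threading.Thread(
    target=model.generate,
    kwargs=dict(**inputs, max_new_tokens=50, streamer=streamer),
)
generation.start()

first_chunk = next(iter(streamer))       # blocks until the first decoded token arrives
ttft = time.perf_counter() - start
print(f"time to first token: {ttft:.2f}s (first chunk: {first_chunk!r})")

for _ in streamer:                       # drain the rest of the stream
    pass
generation.join()
```

The same timing idea applies to any serving stack: start the clock when the request goes in, stop it when the first token reaches the user.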
You know, if you're in a real-world production system,