Damien Tanner
So we filter out things like that.
And then if you need some more intelligence, you can actually just ship off the partial transcripts to an LLM in real time.
So let's say the user's speaking and starts interrupting the agent.
Every kind of word you get, or every few words, you fire off a request to Gemini Flash and you say,
Here's the previous thing that the user said.
Here's what the agent said.
Here's what the user just said.
Respond yes or no:
Do you think they're interrupting the agent?
And you get that back in about 250, 300 milliseconds.
And as you get new transcripts, you just cancel the old ones.
You just constantly try and make that request until the user stops speaking.
Then you get the response from that.
And then you can kind of make quite an intelligent decision.
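That cancel-and-resend loop can be sketched roughly like this. The classifier function is a hypothetical stand-in for the actual Gemini Flash request described above (the names, the simulated latencies, and the keyword heuristic are all assumptions for illustration), but the structure — one in-flight request per partial transcript, cancelled as soon as newer words arrive — matches the approach:

```python
import asyncio

async def classify_interruption(agent_text: str, user_text: str) -> bool:
    """Hypothetical stand-in for the Gemini Flash call: a real version
    would send the previous turns plus the partial transcript and parse
    a yes/no answer back in ~250-300 ms."""
    await asyncio.sleep(0.05)  # simulate model latency
    # Toy heuristic in place of the model's judgment:
    return user_text.strip().lower().startswith(("stop", "no", "wait"))

async def interruption_monitor(transcript_stream, agent_text: str):
    """Fire one classification request per partial transcript,
    cancelling any in-flight request when newer words arrive."""
    pending = None
    async for partial in transcript_stream:
        if pending and not pending.done():
            pending.cancel()  # newer transcript supersedes the old request
        pending = asyncio.create_task(
            classify_interruption(agent_text, partial))
    # User stopped speaking: the last request covers the full utterance.
    return await pending if pending else False

async def fake_stream(parts):
    """Simulated partial transcripts arriving faster than the model replies."""
    for p in parts:
        yield p
        await asyncio.sleep(0.01)

decision = asyncio.run(
    interruption_monitor(fake_stream(["sto", "stop", "stop talking"]), "..."))
print(decision)  # prints True: only the final request survives cancellation
```

The key design point is that stale requests are cancelled rather than awaited, so only the decision based on the most complete transcript ever reaches the agent.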
These things feel very hacky, but they actually work very well.
I think smaller LLMs can do that.
Gemini is just incredibly fast.
I think because of their TPU infrastructure, they've got an incredibly low TTFT (time to first token), which is the most important thing.
But I agree that there are smaller LLMs.