Damien Tanner
Yeah, and everything's streaming.
And so it's a very interesting problem to solve, because the whole system has to run in real time.
So the whole thing, we call it a pipeline.
I don't know if that's a great name for it, because it's not like an ETL loading pipeline or something.
But we call it a pipeline.
But the real-time agent system, our back end, when you start a new session, it runs on Cloudflare Workers.
So it's running right near the user who clicked to chat with your agent with voice.
And then from that point on, everything is streaming.
So the microphone input from the user's browser streaming in, that is then getting streamed to the transcription model in real time.
The transcription model is spitting out partial transcripts.
We send that partial transcript back to you so you can show the user what they're saying, if you want to show them that.
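The partial-transcript flow described above can be sketched as a small state machine: interim results overwrite each other rather than appending, and only final segments get committed. The event shape here is an assumption for illustration, not any specific transcription vendor's API.

```typescript
// Hypothetical shape of events from a streaming transcription model.
// Field names are assumptions, not a particular vendor's API.
interface TranscriptEvent {
  text: string;
  isFinal: boolean;
}

// Tracks the transcript shown to the user: committed (final) segments
// plus the latest interim guess, which is replaced on every update.
class LiveTranscript {
  private committed: string[] = [];
  private interim = "";

  apply(ev: TranscriptEvent): void {
    if (ev.isFinal) {
      this.committed.push(ev.text);
      this.interim = "";
    } else {
      this.interim = ev.text; // interim results overwrite, not append
    }
  }

  // What the client would render at this moment.
  render(): string {
    return [...this.committed, this.interim].filter(Boolean).join(" ");
  }
}
```

The key design point is that interim text is provisional: the model may revise it, so the UI keeps replacing it until a final segment arrives.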
And then the hardest bit in this whole thing is working out when the user is finished speaking.
It's so difficult, because we pause, we make sounds, we pause and then we start again. Conversation is such a dynamic thing, it's almost like a game, right?
Yeah.
So we have to do some clever things, use some other AI models to help detect when the user has finished speaking.
And when we have enough confidence (there's no certainty here, but enough confidence), we decide the user has finished their thought.
Then we finalize that transcript, finish transcribing that last word, and ship you the whole user utterance, whether it's a word, a sentence, or a paragraph the user has spoken.
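The endpointing decision described here can be sketched as a toy heuristic: accumulate evidence of silence and only finalize once it crosses a threshold. The real system uses extra AI models, and all the numbers below are made-up assumptions, but the shape of the decision is similar.

```typescript
// Toy end-of-speech detector. speechProb would come from a VAD-style model;
// the frame duration and silence threshold are illustrative guesses.
class Endpointer {
  private silentMs = 0;

  constructor(
    private readonly frameMs = 20,       // audio frame duration
    private readonly minSilenceMs = 700, // how long a pause must last
  ) {}

  // Returns true once we are confident enough that the user has
  // finished their thought.
  pushFrame(speechProb: number): boolean {
    if (speechProb < 0.5) {
      this.silentMs += this.frameMs;
    } else {
      this.silentMs = 0; // user resumed speaking: reset the pause timer
    }
    return this.silentMs >= this.minSilenceMs;
  }
}
```

Note how any resumed speech resets the timer, which is exactly why mid-sentence pauses and filler sounds make this so hard: a fixed silence threshold is always a trade-off between cutting the user off and adding latency.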
The reason we can't stream at that point, the reason we have to bundle up this user utterance and choose an end, is that LLMs don't take a streaming input.
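That bundling step can be sketched as a small buffer: finalized transcript segments accumulate until the endpointing decision fires, and only then is the single complete utterance handed to the LLM call. The `send` callback is a placeholder for whatever invokes the model, not a real API.

```typescript
// Collects finalized transcript segments and, once the user is judged to
// have finished speaking, flushes them as one complete utterance.
// SendFn stands in for the actual LLM invocation.
type SendFn = (utterance: string) => void;

class UtteranceBuffer {
  private parts: string[] = [];

  constructor(private readonly send: SendFn) {}

  addFinalSegment(text: string): void {
    this.parts.push(text);
  }

  // Called once the endpointing logic decides the turn is over.
  flush(): void {
    const utterance = this.parts.join(" ").trim();
    this.parts = [];
    if (utterance) this.send(utterance); // one complete prompt per turn
  }
}
```

This is the point where the pipeline switches from streaming to request/response: everything upstream is incremental, but the LLM sees exactly one prompt per user turn.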