Damien Tanner
We didn't touch on it, but interruptions are this other really difficult, dynamic part where, whilst the agent is speaking its response to you, if the user starts speaking again, you then need to decide in real time whether the user is interrupting the agent.
Or are they just going, mm-hmm, yeah, and agreeing with the agent?
Oh, gosh, yes.
Or are they trying to say, ooh, stop?
I bet that's a hard problem to solve.
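The backchannel-versus-interruption decision described above could be sketched as a simple heuristic: short utterances made up entirely of acknowledgement words let the agent keep talking, anything else triggers a barge-in. This is a toy illustration, not the actual classifier being discussed; real systems would also weigh acoustic features and confidence.

```python
# Hypothetical sketch: classify speech heard while the agent is talking.
BACKCHANNELS = {"mm-hmm", "uh-huh", "mhm", "yeah", "yep", "right", "okay", "ok"}

def classify_user_speech(transcript: str, duration_ms: int) -> str:
    """Return "backchannel" (agent keeps speaking) or "interruption"
    (tear down the current response and listen).

    Toy rule: a short utterance made only of acknowledgement words is
    a backchannel; everything else, including a short "stop", interrupts.
    """
    words = [w.strip(".,!?") for w in transcript.lower().split()]
    if duration_ms < 1200 and words and all(w in BACKCHANNELS for w in words):
        return "backchannel"
    return "interruption"
```

Note that "stop" is exactly as short as "yeah", which is why duration alone isn't enough and the content of the transcript has to be checked too.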
We still have to be transcribing audio even while the user's hearing the agent's response.
And we've got to deal with background noise and everything.
And then, when we're confident the user is trying to interrupt the agent, we've got to do this whole state change where we tear down the in-flight LLM request and the in-flight voice-generation request, and then, as quickly as possible, start focusing on the user's new question.
And especially if their interruption is really short, like "stop", suddenly you've got to tear down all the old stuff, transcribe that word "stop", ship it as a new LLM request to the backend, generate the response, and then get the agent speaking back as quickly as possible.
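The tear-down being described maps naturally onto cancelling a group of concurrent tasks. A minimal sketch with `asyncio`, assuming the LLM call and voice generation run as separate tasks per agent turn (the class and names are illustrative, not the actual pipeline's API):

```python
import asyncio

class AgentTurn:
    """Tracks the in-flight work for one agent response so that an
    interruption can cancel all of it at once (illustrative sketch)."""

    def __init__(self) -> None:
        self.tasks: list[asyncio.Task] = []

    def start(self, *coros) -> None:
        # e.g. the in-flight LLM request and the voice-generation request.
        self.tasks = [asyncio.create_task(c) for c in coros]

    async def cancel(self) -> None:
        # Tear everything down, then wait for the cancellations to land
        # before starting work on the user's new question.
        for t in self.tasks:
            t.cancel()
        await asyncio.gather(*self.tasks, return_exceptions=True)
        self.tasks = []
```

On a confirmed interruption you'd `await turn.cancel()` first, then feed the freshly transcribed "stop" into a brand-new turn, so no stale work can race with the new response.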
It's all happening down one pipe, as it were, at the end of the day.
It's like audio from the browser microphone coming in, and then audio playing back out.
And we would have bugs like: you'd interrupt the agent, but then, when it started replying, there'd still be a few chunks of 20-millisecond audio from the old response snuck in there.
Or the old audio would be interleaved with the new audio coming back from the agent.
And you're kind of in the, you know, audacity or something, some audio editor trying to work out like, what's going, why does it sound like this?
And you're like rearranging bits of audio going, ah, okay.
The responses are taking turns every 20 milliseconds.
It had interleaved the two responses, and you're pulling them apart to try and work out what's going on.
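The interleaving bug above is the classic symptom of stale chunks surviving a tear-down. One common fix is to tag every audio chunk with a generation counter and bump the counter on interruption, so chunks from the cancelled response get dropped at the playback boundary. A minimal sketch, with hypothetical names, assuming 20 ms PCM chunks:

```python
from dataclasses import dataclass

@dataclass
class AudioChunk:
    generation: int  # which agent response produced this 20 ms chunk
    pcm: bytes

class Playback:
    """Drops chunks from a cancelled response instead of interleaving
    them with the new reply (illustrative sketch, not the real pipeline)."""

    def __init__(self) -> None:
        self.current_generation = 0
        self.played: list[bytes] = []

    def interrupt(self) -> None:
        # Everything produced before this point is now stale.
        self.current_generation += 1

    def enqueue(self, chunk: AudioChunk) -> None:
        if chunk.generation == self.current_generation:
            self.played.append(chunk.pcm)
        # else: late chunk from the torn-down response; silently drop it
```

The design point is that cancellation alone isn't enough: chunks already in flight down the pipe will still arrive after the tear-down, so the consumer end has to be able to recognise and discard them.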