Damien Tanner
Most of the LLMs we use right now, the ones in coding agents, are optimized for intelligence, not really speed.
And when the LLM labs do optimize for speed, they tend to optimize for raw token throughput.
Very few people optimize for time to first token.
And that's all that matters in voice: I give you the user utterance,
how long is the user gonna have to wait before I can start playing back an agent response to them?
That's time to first token:
how long before I get the first word or two that I can turn into voice, so they can start hearing something?
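The latency the speaker is describing can be sketched in a few lines. This is a minimal illustration, not any particular vendor's API: `fake_model` is a stand-in for a streaming LLM endpoint, and `stream_with_ttft` just times how long the first `next()` call blocks.

```python
import time

def stream_with_ttft(token_iter):
    """Pull the first token, recording time-to-first-token (TTFT),
    then yield that token followed by the rest of the stream."""
    start = time.monotonic()
    first = next(token_iter)          # blocks until the model emits anything
    ttft = time.monotonic() - start   # this is the delay the user feels

    def tokens():
        yield first
        yield from token_iter

    return ttft, tokens()

# Hypothetical model: waits 50 ms before its first token, then streams fast.
def fake_model():
    time.sleep(0.05)
    for tok in ["Hello", " there", "!"]:
        yield tok

ttft, tokens = stream_with_ttft(fake_model())
# ttft is ~0.05 s here; total generation time after that barely matters for voice.
```

The point of the sketch: a model with huge throughput but a slow first token still feels sluggish in voice, because playback can't start until that first chunk arrives.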
The only major LLM lab that actually optimizes for this, that maintains a low TTFT, is Google with Gemini Flash.
Most voice agents doing it this way are using either GPT-4o or Gemini Flash,
but the OpenAI endpoints have some annoying inconsistencies in latency.
And that's kind of the killer in voice, right?
It's a bad user experience when the first few turns of the conversation are fast and then suddenly the next turn the agent takes three seconds to respond, and you're like...
Is the agent wrong?
Is the agent broken?
But once you get that first token back, you're good, because then you can start streaming text to us, and we can start turning it into full sentences.
And then again, we get to this batching problem.
The voice models that do text-to-voice, again, they don't stream in the input.
They require a full sentence of input before they can start generating any output.
Because again, how you speak, how things are pronounced depends on what comes later.
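The sentence-batching step described above can be sketched as a small buffer that accumulates streamed tokens and releases text only at sentence boundaries. This is an illustrative sketch assuming plain-text tokens and simple punctuation-based boundaries; real systems use more robust segmentation.

```python
import re

# A sentence ends at . ! or ? followed by whitespace (a deliberately
# simple rule for illustration; abbreviations etc. would break it).
SENTENCE_END = re.compile(r'([.!?])\s')

def sentences_from_tokens(token_iter):
    """Accumulate streamed LLM tokens and yield only complete sentences,
    since the TTS model needs a full sentence before it can start
    generating audio (pronunciation depends on what comes later)."""
    buf = ""
    for tok in token_iter:
        buf += tok
        while True:
            m = SENTENCE_END.search(buf)
            if not m:
                break
            yield buf[:m.end(1)]      # emit up to and including the punctuation
            buf = buf[m.end():]       # keep the remainder for the next sentence
    if buf.strip():
        yield buf.strip()             # flush whatever remains at end of stream

# Tokens as an LLM might stream them:
toks = ["Hi", " there", ". How", " are", " you?", " Good."]
result = list(sentences_from_tokens(toks))
# result == ["Hi there.", "How are you?", "Good."]
```

Each yielded sentence is what would be handed to the voice model, so audio for "Hi there." can start playing while the LLM is still generating the rest of the turn.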