Jesse Zhang
And ultimately, I think the prevailing view is that, whatever the final experience is, if you really want to make it indistinguishable from a human, you have to do voice-to-voice, or at least take the voice into account.
The issue with voice-to-voice, though, is that fundamentally, because voice has a lot more dimensions to it, the number of tokens you generate per sentence is just a lot higher than when you generate text.
The more tokens you have, the easier it is for something to go wrong.
And so the hallucination rate has so far just been a lot higher.
How much higher?
Give us a sense of how far we are from these being really good.
I think probably like 8x higher or something like that.
Wow.
It is quite a bit higher.
Of course, you want to leverage that technology.
So now maybe there's creative ways to make a hybrid of the two.
Maybe you can have a text model generate the content, but take into account the audio from before as well. And that makes something very realistic.
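A minimal sketch of that hybrid idea: a text model produces the reply content, while features extracted from the caller's earlier audio condition how it is rendered by TTS. Every function here is a hypothetical placeholder, not a real speech API; the feature extraction and model calls are faked for illustration.

```python
# Hypothetical hybrid pipeline: text model for content, prior audio for style.
# All functions are placeholder stubs, not a real library or API.

def extract_audio_features(audio_frames):
    """Placeholder: summarize prosody from the caller's earlier audio.
    A real system would run a speech encoder; here we fake a rate estimate."""
    return {"speaking_rate": len(audio_frames) / 10.0}

def text_model_reply(transcript, audio_features):
    """Placeholder text-LLM call: content comes from the transcript,
    delivery style is conditioned on the audio features."""
    style = "brisk" if audio_features["speaking_rate"] > 1.0 else "relaxed"
    return f"[{style}] Sure, let me check on that order for you."

def synthesize(reply_text):
    """Placeholder TTS step: render the text reply as audio."""
    return f"<audio:{reply_text}>"

prior_audio = list(range(15))  # stand-in for 15 frames of caller audio
features = extract_audio_features(prior_audio)
reply = text_model_reply("Where is my order?", features)
print(synthesize(reply))
```

The point of the split is that hallucination risk stays at text-model levels, while the audio context still shapes how the response sounds.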
But at the same time, latency is still the hard problem, because in the enterprise setting you're doing a lot of work before you can start responding.
You have to figure out: what are they asking about? What materials do I need to collect? Do I need to hit any APIs and get that data back?
And so you have to do it in a way that feels very natural.
And sometimes, if you think about how a human does it, you might have to say something like, "Give me a sec to look that up," because it genuinely takes 10 seconds for the API to come back.
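That human-style pattern can be sketched as: start the slow lookup, and if it hasn't returned within a short threshold, say a natural filler line before delivering the answer. This is a hedged sketch with a simulated API call and made-up timings, not a real voice-agent framework.

```python
import asyncio

# Sketch: cover a slow backend call with a natural filler utterance.
# slow_api_lookup and the timings are simulated, not a real integration.

FILLER_THRESHOLD = 0.2  # seconds of silence to tolerate before filling

async def slow_api_lookup():
    await asyncio.sleep(0.5)  # stands in for a genuinely slow backend call
    return "Your order ships Friday."

async def respond(say):
    lookup = asyncio.create_task(slow_api_lookup())
    try:
        # If the data comes back fast, answer directly with no filler.
        answer = await asyncio.wait_for(asyncio.shield(lookup), FILLER_THRESHOLD)
    except asyncio.TimeoutError:
        say("Give me a sec to look that up.")  # cover the wait naturally
        answer = await lookup  # shield kept the task alive; now await it
    say(answer)

lines = []
asyncio.run(respond(lines.append))
print(lines)
```

`asyncio.shield` matters here: it lets the timeout fire without cancelling the in-flight lookup, so the agent can speak the filler and still use the result when it arrives.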