Jesse Zhang
I would say ChatGPT's voice mode, for example, or Sesame, or these voice-to-voice models in general: they're starting to feel very impressive.
But if you talk to one long enough, you can definitely tell it's not a human.
So there's that element of it.
But then even for enterprise use cases like ours, there are still a ton of hurdles to cross, because even though those models are good, the hallucination rate is really high.
So you can't really use them as-is in current systems.
So what a lot of people do now is go from voice into text and then back to voice.
And then you can run a lot more checks there to make sure that things are accurate.
So there are a lot of cool ideas to explore there: how do you make it both human-like and accurate, and how do you tie everything together?
So that's where most of the work is going to these days.
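The cascade described here, going from voice to text, running checks on the text, then going back to voice, can be sketched roughly as follows. All function names below are illustrative stubs, not any real ASR/TTS API; the point is only where the accuracy checks sit in the pipeline.

```python
# Hypothetical sketch of the voice -> text -> voice cascade described above.
# transcribe/generate_reply/synthesize are stand-in stubs, not a real API.

def transcribe(audio: bytes) -> str:
    """Speech-to-text step (stub standing in for an ASR model)."""
    return "what is my order status"

def generate_reply(text: str) -> str:
    """LLM step operating on text, where checks are easiest to run."""
    return "Let me check the status of your order."

def passes_checks(reply: str, allowed_topics: set[str]) -> bool:
    """Accuracy/hallucination checks run on text before synthesis."""
    return any(topic in reply.lower() for topic in allowed_topics)

def synthesize(text: str) -> bytes:
    """Text-to-speech step (stub)."""
    return text.encode("utf-8")

def handle_turn(audio: bytes) -> bytes:
    text = transcribe(audio)
    reply = generate_reply(text)
    if not passes_checks(reply, {"order", "refund", "shipping"}):
        reply = "Sorry, could you rephrase that?"  # safe fallback
    return synthesize(reply)
```

Because the intermediate representation is text, you can gate or rewrite the reply before any audio is produced; the trade-off, as discussed next, is extra latency and the loss of the caller's tone and cadence.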
So the fundamental difference is if you're just going to text, then no matter what, the final audio is just a narration of the text.
Voice-to-voice is powerful because it takes in the entire audio of what you said.
So it knows cadence and maybe how upset you are and the tone and everything.
Latency is a lot less as well because you're going straight from voice-to-voice.
And latency matters so much when we're talking.
When we're talking right now, our brains are constantly going like, okay, when's he done talking?
When should I start talking?
If someone interrupts someone else, they do it in a polite way.
People adjust very naturally.
So that's the biggest thing, and why I'm a proponent of voice-to-voice.