Mohammad Norouzi
Podcast Appearances
So throughout training, we always measure text accuracy, and we make very detailed changes to the model and data and see how that results in performance.
A lot of it is really listing all the possible changes, very carefully tuning each element of the model, and seeing what happens.
Obviously, we try to gather as much data as possible.
One of the standard recipes in the industry is that we take images and turn them into text using visual language models.
The very first models we were training three, four years ago would be based on the alt text that you can find on the internet.
That is, each image on the internet may have an alt text field associated with it, which describes what's in the image.
But the problem is the alt text is often very short or inaccurate.
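[Editor's sketch] To make the alt-text idea concrete, here is a minimal example of pulling the alt text field out of a page's image tags. It uses only Python's standard `html.parser`; the class name `AltTextCollector` and the helper `collect_alt_text` are hypothetical names chosen for illustration, and this is not the speaker's actual data pipeline.

```python
from html.parser import HTMLParser


class AltTextCollector(HTMLParser):
    """Collects (src, alt) pairs from <img> tags; alt is None when the field is missing."""

    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) tuples.
        if tag == "img":
            attr_map = dict(attrs)
            self.pairs.append((attr_map.get("src"), attr_map.get("alt")))


def collect_alt_text(html):
    """Return every (image source, alt text) pair found in an HTML string."""
    parser = AltTextCollector()
    parser.feed(html)
    parser.close()
    return parser.pairs


if __name__ == "__main__":
    page = '<img src="a.jpg" alt="A red bicycle"><img src="b.jpg">'
    for src, alt in collect_alt_text(page):
        print(src, "->", alt)
```

The second image in the usage example has no alt text at all, which is the gap described above: many images come with a missing, very short, or inaccurate description.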
And what we do now is we train models to go from image to text.
And in this case, image to text with detailed bounding box information and detailed element information.
We care about text, and we really want to make sure all the text in the image is correctly described.
And then we go backwards, from text to image.
It's kind of interesting.
We gather all the images from the internet.
Some of them may have alt text, some of them may not have alt text.
And then we use AI to go from image to text.
And then we train another AI model to go from text to image.
So that's one of the key recipes that results in very good models.
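[Editor's sketch] The recipe described here (caption every image with a model, then train a text-to-image model on those captions) can be outlined as a small data-preparation step. Everything below is an assumption made for illustration: the `Element`, `ImageRecord`, and `DetailedCaption` types, the prompt format, and `toy_captioner`, which stands in for a real image-to-text model. It only shows how the pieces could fit together, not how the speaker's team implements it.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Element:
    label: str                       # an object name, or text rendered in the image
    box: Tuple[int, int, int, int]   # assumed layout: (x_min, y_min, x_max, y_max) in pixels


@dataclass
class ImageRecord:
    image_id: str
    alt_text: Optional[str] = None   # may be missing, short, or inaccurate


@dataclass
class DetailedCaption:
    description: str
    elements: List[Element] = field(default_factory=list)

    def to_prompt(self) -> str:
        # Flatten the description and per-element boxes into one training prompt.
        parts = [self.description]
        for element in self.elements:
            parts.append(f"{element.label} at {element.box}")
        return "; ".join(parts)


def build_training_pairs(
    records: List[ImageRecord],
    captioner: Callable[[ImageRecord], DetailedCaption],
) -> List[Tuple[str, str]]:
    """Caption every image (ignoring alt text) and return (prompt, image_id) pairs."""
    return [(captioner(record).to_prompt(), record.image_id) for record in records]


def toy_captioner(record: ImageRecord) -> DetailedCaption:
    # Placeholder for an image-to-text model; a real one would look at the pixels.
    return DetailedCaption(
        description=f"detailed description of {record.image_id}",
        elements=[Element(label='text "OPEN"', box=(10, 20, 110, 60))],
    )


if __name__ == "__main__":
    pairs = build_training_pairs([ImageRecord("img-001")], toy_captioner)
    print(pairs)
```

The resulting (prompt, image) pairs would then serve as the training set for the text-to-image model, so the quality of the captioner directly bounds the quality of the supervision.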
It's a very good question.