Mohammad Norouzi
And language models, as of now, they aren't very good with continuous output.
They aren't very good with kind of pixel values or very high dimensional vector representations.
So I guess the constraint here is that the representation has to be, you know, tokens.
I mean, depending on your language model's power, maybe it can be a million tokens.
Maybe that's too extreme.
For us, it's about 4,000 tokens.
And that's where we still use natural language because these large language models are trained with natural language.
So they are very good with natural language.
But it may become closer to HTML, for example.
That's okay because, again, large language models are trained with HTML and they know the tokens.
It's actually kind of alluding to the editable model that I'm talking about, and we've had a lot of back and forth.
Are we going to have our own JSON for different text elements and buttons and stuff, or are we going to use HTML?
And it seems like HTML makes more sense just because these large language models have already been trained on HTML as opposed to us introducing a new JSON structure.
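To make the trade-off concrete, here is a minimal sketch of the same two-element layout written both ways. The JSON schema, the field names, and the `parse_layout` helper are all hypothetical illustrations, not the team's actual format; only the Python standard library is used.

```python
import json
from html.parser import HTMLParser

# Hypothetical custom JSON schema for a text element and a button.
custom_json = json.dumps({
    "elements": [
        {"type": "text", "content": "Summer Sale"},
        {"type": "button", "label": "Shop now"},
    ]
})

# The same layout as HTML, a format language models have already seen in training.
html_layout = "<div><h1>Summer Sale</h1><button>Shop now</button></div>"


class ElementCollector(HTMLParser):
    """Collects (tag, text) pairs for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self._stack = []    # currently open tags
        self.elements = []  # (innermost tag, text) pairs

    def handle_starttag(self, tag, attrs):
        self._stack.append(tag)

    def handle_endtag(self, tag):
        if self._stack:
            self._stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if self._stack and text:
            self.elements.append((self._stack[-1], text))


def parse_layout(html: str):
    """Return the (tag, text) pairs found in an HTML layout string."""
    collector = ElementCollector()
    collector.feed(html)
    collector.close()
    return collector.elements
```

Calling `parse_layout(html_layout)` recovers the heading and the button with an off-the-shelf parser, whereas the JSON variant would need its schema defined and taught from scratch.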
But I would say, to answer your question, that representation needs to be easy for the language model with the particular design that we have right now, which is a language model does some expansion of the ideas.
And then the image model takes those expanded descriptions and turns them into images.
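The two-stage design described here can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual system: `expand` and `render` are hypothetical callables standing in for the language model and the image model, and the whitespace-based token count is a crude substitute for a real tokenizer. Only the 4,000-token figure comes from the discussion.

```python
from typing import Callable

MAX_TOKENS = 4000  # budget mentioned in the talk; how tokens are counted below is an assumption


def count_tokens(text: str) -> int:
    """Crude whitespace token count; a real tokenizer would give different numbers."""
    return len(text.split())


def truncate_to_budget(text: str, max_tokens: int = MAX_TOKENS) -> str:
    """Keep at most max_tokens whitespace-separated tokens."""
    return " ".join(text.split()[:max_tokens])


def generate(
    idea: str,
    expand: Callable[[str], str],    # hypothetical stand-in for the language model
    render: Callable[[str], bytes],  # hypothetical stand-in for the image model
    max_tokens: int = MAX_TOKENS,
) -> bytes:
    """Stage 1 expands the idea into a token description; stage 2 renders it."""
    description = truncate_to_budget(expand(idea), max_tokens)
    return render(description)


# Toy stand-ins so the sketch runs end to end.
def toy_expand(idea: str) -> str:
    return f"<div><h1>{idea}</h1><button>Learn more</button></div>"


def toy_render(description: str) -> bytes:
    return description.encode("utf-8")
```

The point of the sketch is the interface: whatever sits between the two stages has to be a bounded sequence of tokens the language model can produce comfortably, which is why natural language or HTML are the candidates.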
First of all, we would love to work with more engineers, like cracked engineers.
We have a very tiny team.
You see what we were able to produce with such a tiny team.
And if you want high agency, you know, if you want your work to matter and you want to be part of the academic and open source ecosystem, then this is the perfect time to join us.