Gaurav Misra
And a lot of that comes through video. So video being funneled directly into video generation models gives a significant advantage. That's potentially a way a future business model could be set up. It's actually kind of familiar, by the way.
It seems to me similar to the Facebook or Google business model, where you have a mass consumer free product, basically, and the data is used to power essentially a B2B paid product.
It's interesting to think about, because the models that we train are diffusion models. They work by starting from noise. It starts from literal noise, like the static you see on TV. At every step, based on the text that's provided, it looks at the noise and tries to predict a layer of clarity in that noise. It says "man wearing blue shirt."
So it starts to draw a little bit of "man wearing blue shirt" out of the noise, and with every pass it takes through it, it discovers a little bit more of the man wearing the blue shirt. That's the text conditioning helping it decide how to reach the destination of what "man wearing blue shirt" looks like.
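The denoising loop described above can be sketched as a toy, with made-up shapes and a stand-in `predict_noise` function in place of the real trained network (all names and numbers here are illustrative, not any specific model's API):

```python
import numpy as np

rng = np.random.default_rng(0)

def predict_noise(noisy_image, text_embedding, step):
    # Stand-in for the trained denoiser network. In a real diffusion model
    # this is a large neural net conditioned on the prompt ("man wearing
    # blue shirt"); here we just treat the encoded prompt as the target.
    target = text_embedding
    return noisy_image - target

def generate(text_embedding, steps=50):
    # Start from literal noise, like TV static.
    image = rng.standard_normal(text_embedding.shape)
    for step in range(steps):
        # Each pass "discovers" a little more of the prompt in the noise.
        noise_estimate = predict_noise(image, text_embedding, step)
        image = image - noise_estimate / (steps - step)
    return image

prompt = np.full((8, 8), 0.5)  # toy stand-in for an encoded text prompt
result = generate(prompt)
```

With this toy denoiser the loop converges exactly to the target by the last step; the point is only the shape of the process: noise in, repeated noise-prediction passes, image out.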
So that's how diffusion models work, which is slightly different from how a next-token prediction model like GPT works. That one is, just as you might think about it, predicting the next word based on all the previous words that have been spoken, which are considered the context. So these models are different, and we are still earlier on in the diffusion model training path.
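The next-token loop can be sketched with a tiny word-count "model" standing in for the neural net; real GPT-style models predict subword tokens with a transformer, but the autoregressive loop, where the whole history is the context, is the same shape:

```python
from collections import defaultdict, Counter

# Tiny corpus to build a toy bigram "model" from word counts.
corpus = "the cat sat on the mat the cat ran".split()

follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1  # count which word follows which

def predict_next(context):
    # Predict the next word from the most recent word in the context.
    last = context[-1]
    if not follows[last]:
        return None
    return follows[last].most_common(1)[0][0]

def generate(prompt, max_tokens=5):
    tokens = prompt.split()
    for _ in range(max_tokens):
        nxt = predict_next(tokens)
        if nxt is None:
            break
        tokens.append(nxt)  # everything generated so far becomes context
    return " ".join(tokens)

out = generate("the")
```

Each generated word is appended to the sequence and the loop repeats, which is exactly the "predict the next word from all the previous words" behavior described above.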
We're still in that 10 billion, 20 billion, 30 billion parameter range. Meta's Movie Gen was, I believe, 30 billion parameters. People haven't really scaled this up. We actually don't know how big OpenAI's Sora is; I don't think they released that information. But a lot of the work is going to go into scaling these things up. Video obviously is really heavy. That's what makes it different from text.
It consumes a ton of space, a ton of processing. For us, even if we were to just download all of our training videos online, it would cost us a million dollars to download them. That's a whole different regime than text. It brings different types of challenges to training these models, basically.
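A rough back-of-envelope comparison shows why video is such a different regime. The bitrate and word-rate figures below are assumptions for illustration, not numbers from the conversation:

```python
# One hour of compressed 1080p video vs. one hour of transcribed speech.
video_bitrate_mbps = 5            # assumed typical 1080p streaming bitrate
seconds_per_hour = 3600
video_bytes = video_bitrate_mbps * 1e6 / 8 * seconds_per_hour

words_per_hour = 9000             # ~150 words per minute of speech
bytes_per_word = 6                # rough average, including spaces
text_bytes = words_per_hour * bytes_per_word

ratio = video_bytes / text_bytes
print(f"video: {video_bytes/1e9:.2f} GB, text: {text_bytes/1e3:.0f} KB, "
      f"ratio ~{ratio:,.0f}x")
```

Under these assumptions an hour of video is roughly 2.25 GB while its transcript is about 54 KB, a gap of tens of thousands of times, which is why storage, bandwidth, and processing dominate the cost of video training data.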
You never know. But honestly, I think what will save us on the video model side is actually the fact that it's an easier problem than the text problem. The text problem is intelligence, as we were talking about. The video problem is more rendering. We already know how much rendering costs. We already know it's GPU intensive.
If you were to literally CGI-render a scene out, yeah, it will spend some time on the GPU, there's no doubt. Can we be more efficient than that? It's possible. It may not be the most efficient today; maybe there are better ways of doing it. Maybe AI will be cheaper and faster than regular rendering. And if that's the case, that's a good thing.
But I think we know that it shouldn't be worse than that. We should be able to solve it with fewer resources than that, potentially, or at least the same. We generally understand where it's going to fall. It's still early. On the training side, we're still scaling up these models, and it's still, oh, it's 10 billion parameters, 20 billion parameters, whatever.