Jaeden Schaefer
And we also need to optimize the tech stack for people that are generating stuff.
So I think while training oftentimes gets a lot of the headlines, and people talk about it a lot because it's basically this massive upfront compute demand, right?
Like in order to train one of these models, you're spending millions and millions of dollars.
I think inference is quietly becoming a really dominant cost center for a lot of these AI companies because their models are getting deployed to millions of users.
And then if you look at Google, that's like all of the search tools.
You have co-pilots from Microsoft and a bunch of others and a lot of the enterprise software.
So every query, every autocomplete, every generated paragraph, all of that is consuming compute power and cooling.
Even a very small efficiency gain at the chip level can translate into some really big cost savings at cloud scale.
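To make that "small gain, big savings" point concrete, here's a rough back-of-envelope sketch in Python. All of the numbers here are made up for illustration; the query volume, per-query cost, and efficiency gain are assumptions, not figures from the episode or from Microsoft.

```python
# Back-of-envelope: how a small per-chip efficiency gain scales up.
# Every number below is a hypothetical assumption, not a real figure.
QUERIES_PER_DAY = 1_000_000_000   # assumed daily inference requests across a fleet
COST_PER_QUERY = 0.0004           # assumed blended compute + cooling cost in USD
EFFICIENCY_GAIN = 0.03            # a "very small" 3% chip-level efficiency gain

daily_cost = QUERIES_PER_DAY * COST_PER_QUERY
daily_savings = daily_cost * EFFICIENCY_GAIN
annual_savings = daily_savings * 365

print(f"Daily inference cost: ${daily_cost:,.0f}")
print(f"Annual savings at 3%: ${annual_savings:,.0f}")
```

With these invented inputs, a 3% gain on a $400,000/day inference bill compounds to millions of dollars a year, which is why per-chip efficiency matters so much at cloud scale.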
So it's interesting because this is obviously something Microsoft's concerned about, but every other AI company should be, and is, concerned about this as well, because they need to find those cost savings not just when they're training the model, but when they're actually generating stuff.
Microsoft right now is betting that this new Maia 200 is going to be a really big shift in that financial equation.
They said that the chip is going to be designed to essentially run today's largest frontier models.
So you can imagine the models from partners like OpenAI, and the chip is going to be able to run those on a single node while leaving enough headroom to accommodate larger and more demanding architectures in the future, which is kind of interesting, right?
They're not just asking what OpenAI's models, or their own AI models, need today.
They're looking at what those models are going to need in the future.
I think because of that, the design is very forward-looking.
I think this matters because model sizes are continuing to grow.
And I think as companies are increasingly expecting lower latency, they're expecting this always-on AI service rather than a batch-style workload, right?