Tim Davis
that inference is, in many ways, almost more complex than training at scale now.
Increasingly, at the Kubernetes-style layer of the stack, you have this idea of splitting out the prefill and decode components of the transformer architecture.
You have different ways of managing caching and prompt caching, and of doing that across all sorts of different models and hardware.
And so what you end up needing is essentially a solution there too.
Today we call it Mammoth, but fundamentally that's going to become part of our cloud offering.
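To make the prefill/decode split concrete, here is a minimal, purely illustrative Python sketch of disaggregated serving: a compute-bound prefill pool builds the KV cache once per prompt, and a separate, bandwidth-bound decode pool reuses it to generate tokens. The class and function names here are hypothetical assumptions for illustration, not Mammoth's actual API.

```python
from dataclasses import dataclass

# Hypothetical sketch of disaggregated serving: prefill and decode run as
# separate worker pools, and the KV cache produced by prefill is handed off
# to decode. Names (PrefillWorker, DecodeWorker, KVCache) are illustrative.

@dataclass
class KVCache:
    prompt: str
    # In a real system this would hold per-layer key/value tensors;
    # here a token count stands in for the cached state.
    tokens: int

class PrefillWorker:
    """Compute-bound: ingests the whole prompt once and builds the KV cache."""
    def prefill(self, prompt: str) -> KVCache:
        return KVCache(prompt=prompt, tokens=len(prompt.split()))

class DecodeWorker:
    """Memory-bandwidth-bound: generates tokens one at a time from the cache."""
    def decode(self, cache: KVCache, max_new_tokens: int) -> str:
        # Placeholder generation loop; a real decoder would run the model
        # autoregressively, reusing and extending the KV cache each step.
        return " ".join(f"<tok{i}>" for i in range(max_new_tokens))

def serve(prompt: str, max_new_tokens: int = 8) -> str:
    # Because the two phases are split, a router can scale the prefill and
    # decode pools independently and place them on different hardware.
    cache = PrefillWorker().prefill(prompt)
    return DecodeWorker().decode(cache, max_new_tokens)

if __name__ == "__main__":
    print(serve("Explain disaggregated prefill and decode serving."))
```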
And for many developers, our goal is just, look, we can serve across hardware.
Talking about analogies, Corey, we love to frame the unified compute model as a hypervisor for compute, right?
If you could log into a platform and just say: here's my throughput, here's my latency, here's how much accuracy I'm willing to give up through methods like quantization, and here's my cost target, just make it work.
Like when I log into Snowflake,
I don't execute a query and then go, oh, well, what CPU machine did Snowflake choose to execute my query?
I actually don't know, right?
Well, AI should be similar, certainly at least in the cloud to start.
You should be able to log into a platform, just give your requirements and expect the best TCO for your workload.
Like the total cost should be highly optimized to whatever those parameters are.
And software should figure it out.
It should be that simple for developers then to go off and build applications.
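As a rough illustration of that "state your requirements and let software figure it out" idea, here is a small hypothetical Python sketch: the developer declares throughput, latency, accuracy-loss tolerance, and a cost ceiling, and a scheduler picks the cheapest hardware/quantization configuration that satisfies all of them. The ServingRequirements and Candidate types and the sample numbers are assumptions made up for illustration, not part of any Modular product.

```python
from dataclasses import dataclass

# Hypothetical sketch of the "hypervisor for compute" idea: the developer
# states requirements and the platform picks the configuration with the best
# TCO. All names and figures below are illustrative assumptions.

@dataclass
class ServingRequirements:
    min_throughput_tps: float      # tokens per second
    max_latency_ms: float          # per-token latency budget
    max_accuracy_drop_pct: float   # tolerance for quantization-induced loss
    max_cost_per_hour: float       # dollar budget

@dataclass
class Candidate:
    name: str
    throughput_tps: float
    latency_ms: float
    accuracy_drop_pct: float
    cost_per_hour: float

def pick_best(req: ServingRequirements, candidates: list[Candidate]) -> Candidate | None:
    """Keep only configurations that meet every requirement, then minimize cost."""
    feasible = [
        c for c in candidates
        if c.throughput_tps >= req.min_throughput_tps
        and c.latency_ms <= req.max_latency_ms
        and c.accuracy_drop_pct <= req.max_accuracy_drop_pct
        and c.cost_per_hour <= req.max_cost_per_hour
    ]
    return min(feasible, key=lambda c: c.cost_per_hour, default=None)

if __name__ == "__main__":
    req = ServingRequirements(5000, 50, 1.0, 20.0)
    options = [
        Candidate("gpu-a-bf16", 6000, 35, 0.0, 30.0),
        Candidate("gpu-b-int8", 5200, 48, 0.8, 14.0),
        Candidate("gpu-c-fp8", 5800, 40, 0.5, 16.0),
    ]
    print(pick_best(req, options))  # cheapest option that meets every target
```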
Grant, hopefully that answers the question of what we're building at Modular.
That is very much what we're building.
Yeah.