Azeem Azhar
And earlier this week, Brookfield, which is an asset manager, came out with figures that lined up with our estimates that in a few years, about 70 to 75% of compute cycles will be used on inference.
So that's going to be a tension, right?
Do we pay bills today or do we build the next big thing in some different way?
And you see the labs.
I mean, I think the contrast between Anthropic and OpenAI is most marked in how they approach that, right?
Anthropic appears to be rather more focused in thinking through the economics of that particular trade-off between training and inference.
There are levers to address that.
Efficiency gains are one obvious approach.
We are starting to see more and more companies routing requests within their applications to cheaper models, meaning models that use less electricity and cost less.
So you put in the query, and the router figures out, oh, maybe I should send this to DeepSeek rather than to an OpenAI model to get the result.
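The routing idea above can be sketched in a few lines. This is a hypothetical illustration, not any real router's implementation: the model names, per-token prices, and the length-based complexity proxy are all assumptions for the sake of the example.

```python
# Minimal sketch of cost-aware model routing.
# Model names and prices are illustrative assumptions, not real quotes;
# production routers also weigh latency, quality, and context limits.

MODELS = [
    # (name, assumed cost per 1M tokens in USD)
    ("deepseek-chat", 0.27),
    ("gpt-4o", 2.50),
]

def route(query: str, complexity_threshold: int = 200) -> str:
    """Send short, simple queries to the cheaper model, harder ones upward."""
    # Crude complexity proxy: query length. Real systems would use a
    # learned classifier or a small model to score the request.
    if len(query) < complexity_threshold:
        return min(MODELS, key=lambda m: m[1])[0]  # cheapest model
    return max(MODELS, key=lambda m: m[1])[0]      # most capable (here: priciest)

print(route("What is 2+2?"))                                  # -> deepseek-chat
print(route("Summarize this ten-page legal brief ... " * 20))  # -> gpt-4o
```

The design choice here is that cost savings come purely from triage: most traffic is simple, so even a crude router shifts the bulk of token volume onto the cheap model while reserving the expensive one for the long tail.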
And we've made real technical progress both across the algorithms and across the chips that serve them over the last few years.
If you look at a GPT-4-level class of inferencing back in 2022, it would take one watt hour to generate 50 tokens.
So what is a watt hour?
If you've got a 10 watt LED bulb, one watt hour is sort of leaving that on for six minutes to get your 50 tokens.
Today with the latest NVIDIA chips and more efficient optimized language models, we're getting to about 600 tokens per watt hour.
So that's a 12X improvement just over four years.
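A quick back-of-envelope check of the figures quoted above, using only the numbers already given (50 tokens per watt hour in 2022, 600 today, a 10 watt LED bulb):

```python
# Sanity-check the efficiency figures from the passage.

tokens_per_wh_2022 = 50    # ~GPT-4-class inference, 2022
tokens_per_wh_now = 600    # latest chips + optimized models

improvement = tokens_per_wh_now / tokens_per_wh_2022
print(f"Improvement: {improvement:.0f}x")  # -> Improvement: 12x

# The LED analogy: a 10 W bulb draws 1 Wh in 1/10 of an hour = 6 minutes.
bulb_watts = 10
minutes_per_wh = 60 / bulb_watts
print(f"Minutes of LED light per watt hour: {minutes_per_wh:.0f}")  # -> 6
```

The arithmetic confirms the six-minute LED analogy and shows the stated token figures imply roughly a 12-fold efficiency gain.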
Of course, the amount of tokens we want has increased significantly.