Azeem Azhar
We are starting to see more and more companies routing requests to cheaper models, meaning models that use less electricity and cost less, within their applications.
So you put in the query, and the router figures out, oh, maybe I should send this to DeepSeek rather than to an OpenAI model to get the result.
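A minimal sketch of what this kind of cost-aware routing might look like. The model names, per-token prices, and the toy difficulty heuristic are all illustrative assumptions, not anything described in the conversation:

```python
# Minimal sketch of cost-aware model routing. Model names, prices,
# and the difficulty heuristic are hypothetical placeholders.
from dataclasses import dataclass

@dataclass
class Model:
    name: str
    cost_per_1k_tokens: float  # USD, illustrative only
    quality: int               # rough capability tier

MODELS = [
    Model("cheap-model", 0.0002, 1),
    Model("mid-model", 0.002, 2),
    Model("frontier-model", 0.02, 3),
]

def route(query: str) -> Model:
    """Pick the cheapest model whose tier covers the query's
    estimated difficulty (a stand-in for a learned classifier)."""
    needed = 3 if len(query) > 500 else (2 if "explain" in query.lower() else 1)
    candidates = [m for m in MODELS if m.quality >= needed]
    return min(candidates, key=lambda m: m.cost_per_1k_tokens)

print(route("What is 2 + 2?").name)  # -> cheap-model
```

In a production system the difficulty estimate would typically come from a small classifier rather than string heuristics, but the cost-minimizing selection step is the same idea.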
And we've made real technical progress, both in the algorithms and in the chips that serve them, over the last few years.
If you look at GPT-4-class inference back in 2022, it would take one watt-hour to generate about 50 tokens.
So what is a watt-hour?
If you've got a 10-watt LED bulb, one watt-hour is roughly leaving it on for six minutes to get your 50 tokens.
Today, with the latest NVIDIA chips and more efficient, optimized language models, we're getting to about 600 tokens per watt-hour.
So that's a 12x improvement in just four years.
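As a quick check on those figures, here is the arithmetic written out. The bulb wattage and the 2022/today token rates are the ones quoted above; the script itself is just illustrative:

```python
# Back-of-the-envelope check of the efficiency figures quoted above.
tokens_per_wh_2022 = 50    # GPT-4-class inference, 2022
tokens_per_wh_now = 600    # latest chips + optimized models

# A 10 W LED bulb uses 1 Wh in 1/10 of an hour = 6 minutes.
bulb_watts = 10
minutes_per_wh = 60 / bulb_watts   # 6.0

improvement = tokens_per_wh_now / tokens_per_wh_2022
print(f"1 Wh ~= {minutes_per_wh:.0f} min of a {bulb_watts} W bulb")
print(f"Efficiency improvement: {improvement:.0f}x")   # 12x
```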
Of course, the number of tokens we want has increased significantly.
So on the other hand, you have these new reasoning models that might burn 10 to 100 times more tokens per query.
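To see how those two trends net out, here is an illustrative calculation. The 12x efficiency gain and the 10 to 100x token multipliers come from the figures above; the per-query framing is an assumption:

```python
# Illustrative: efficiency gains vs. reasoning-model token growth.
efficiency_gain = 600 / 50          # 12x more tokens per Wh
for token_multiplier in (10, 100):  # reasoning models burn 10-100x tokens
    # Energy per query scales as tokens used / efficiency gain.
    net_energy_change = token_multiplier / efficiency_gain
    print(f"{token_multiplier}x tokens -> "
          f"{net_energy_change:.1f}x energy per query vs. 2022")
# Output: 10x tokens -> 0.8x ; 100x tokens -> 8.3x
```

In other words, at the high end of the reasoning-token range, efficiency gains alone don't offset the growth in tokens per query.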
So what you have is these firms, the hyperscalers, who are really hungry for compute, not just for AI but for other workloads as well.
And they're hungry for that because we as consumers and as businesses want those types of services.
So you've got that demand on one hand, while the grids can't keep up and GPUs are being rationed between training and serving.
My sense is that this is a short-term squeeze, and that as the industry matures, the trade-offs will become much more apparent.
We'll get through some of the blockages around providing power.
We often see this in markets: you get these squeezes, and ultimately industries take one or two years to reconfigure and deliver what is required.
It's just not going to happen tomorrow.
I think the third thing is about the economic engine and the question of whether we are going to see results from all of this.